Technological breakthroughs over the past decade have led to an explosive increase in molecular profiling capabilities, ushering in a new “data-rich era” for biomedical researchers. Indeed, the recent availability of vast compendia of biomedical “Big Data” offers unique opportunities to devise novel approaches to knowledge discovery. We have launched an innovative “Collective Data to Knowledge” (CD2K) platform at the Sidra Medical and Research Center, which provides a hands-on accelerated training path for young biomedical researchers. The originality of this approach stems from the fact that it does not rely on de novo generation of data but instead leverages large dataset collections available in public repositories for the discovery of novel scientific knowledge. It does, however, recapitulate all other steps involved in the transformation of data into novel biomedical knowledge, including knowledge gap assessment and prioritization, hypothesis generation and testing, and generation of reports for publication in peer-reviewed journals. Furthermore, besides providing accelerated hands-on training, it also potentially constitutes a highly efficient approach to the generation of intellectual capital in Qatar.


The approach that we have devised relies on a wide range of available bioinformatics tools and resources.

a) Data integration and dissemination: For data dissemination we rely on a custom web application – the gene expression browser – that is used for integration of heterogeneous data (e.g. molecular profiles, together with clinical information, sample information and results from ancillary assays) [1]. It provides users with seamless access to large and complex datasets that can be viewed in an interactive format. This tool has now been deployed at Sidra Medical and Research Center and is used to create curated themed dataset collections that will be described in peer-reviewed communications (manuscripts in preparation).

b) Knowledge gap assessment: Knowledge gaps are identified via profiling of the biomedical literature for sets of differentially expressed genes. For instance, knowledge gaps may be revealed among the hundreds of genes identified via transcriptome profiling as being induced by TNF-α, a host-derived pro-inflammatory cytokine. Many of those genes will be associated in the literature with pro-inflammatory responses, but literature profiling will show that a number of them have not been associated with inflammation. These genes constitute a knowledge gap, and in that gap lies the opportunity for discovery.

c) Hypothesis generation and in silico validation: The identification of knowledge gaps is the first step towards generation of novel knowledge. Next, we rely on tens of thousands of publicly available datasets to validate and extend the initial finding, often leading to the formulation of novel hypotheses that can be immediately tested by accessing other relevant datasets.

d) Information extraction: We also devised standardized approaches for extracting and structuring information. These methods are used when profiling the literature and dataset collections in order to prepare the background and results sections of the reports. This principled approach helps trainees with manuscript preparation, which often constitutes a hurdle for scientists early in their careers.

e) Knowledge dissemination: Trainees are then encouraged to submit their work to peer-reviewed scientific journals. This gives them the opportunity to learn about peer review, which is one of the cornerstones of the scientific discovery process.
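
Steps b) and c) above can be illustrated with a minimal Python sketch. It is not the CD2K implementation: the function names, the thresholds, the gene identifiers (GENE_A to GENE_D), the literature hit counts and the expression values are all hypothetical placeholders chosen for illustration. The sketch assumes that literature profiling has already produced, for each induced gene, a count of records linking it to the topic of interest (here, inflammation), and that each public dataset provides a control and a stimulated expression value per gene.

```python
from math import log2


def find_knowledge_gaps(induced_genes, topic_hits, max_hits=0):
    """Return induced genes with at most `max_hits` literature records
    linking them to the topic. Genes missing from `topic_hits` count as 0."""
    return [g for g in induced_genes if topic_hits.get(g, 0) <= max_hits]


def replicated_in(gene, datasets, min_log2_fc=1.0):
    """Return the names of the datasets in which `gene` is induced by at
    least `min_log2_fc` (log2 of stimulated over control expression)."""
    supporting = []
    for name, expression in datasets.items():
        if gene not in expression:
            continue  # gene not measured in this dataset
        control, stimulated = expression[gene]
        if log2(stimulated / control) >= min_log2_fc:
            supporting.append(name)
    return supporting


# Hypothetical placeholder inputs (not real genes, counts or measurements).
induced = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
hits = {"GENE_A": 412, "GENE_B": 0, "GENE_C": 37}  # GENE_D has no records

# Step b): genes induced by the stimulus but never linked to the topic.
gaps = find_knowledge_gaps(induced, hits)

# Step c): check each candidate in independent datasets,
# stored as gene -> (control, stimulated) expression values.
datasets = {
    "dataset_1": {"GENE_B": (10.0, 45.0), "GENE_D": (20.0, 22.0)},
    "dataset_2": {"GENE_B": (8.0, 30.0)},
    "dataset_3": {"GENE_B": (12.0, 13.0), "GENE_D": (5.0, 25.0)},
}

for gene in gaps:
    print(gene, replicated_in(gene, datasets))
```

With these placeholder inputs, GENE_B and GENE_D are flagged as candidate knowledge gaps, and the second function lists the datasets that support induction of each. In practice the hit counts would come from a literature search and the expression values from curated dataset collections, and a candidate supported by several independent datasets would be prioritized for follow-up.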


Workshops carried out in Qatar and in several countries around the world have been instrumental in the development of a CD2K training curriculum. In addition, a proof of principle of the effectiveness of the CD2K platform has been established with the identification, in a short amount of time, of new findings with potential for high impact.

1) CD2K Training Workshops

We have conducted hands-on training workshops this year for 6 organizations. Each workshop lasted between 1 and 3 days, and together they involved more than 100 participants.

An introductory CD2K training workshop was organized at the Sidra Medical and Research Center. Training material was further developed by conducting CD2K workshops in a wide range of settings: in academic research institutes, in the United States at the Jackson Laboratory for Genomic Medicine and in Singapore at the A-star Institute; in a university in Thailand (Chulalongkorn University, Bangkok); in a research hospital in France (Hopital Europeen, Marseille); and in a large pharmaceutical company in the United States (MedImmune, a subsidiary of AstraZeneca, in Gaithersburg, Maryland).

These workshops were instrumental in establishing a robust training curriculum consisting of the following learning objectives:

a) Collective biomedical data profiling

b) Literature profiling

c) Identification and prioritization of knowledge gaps

d) Hypothesis generation and in silico validation

e) Information extraction

f) Knowledge dissemination

2) CD2K Proof of principle

A post-doctoral research fellow has been assigned to piloting the CD2K platform at Sidra, with the objective of identifying and prioritizing potential knowledge gaps and, within 12 months, submitting reports for publication in peer-reviewed journals on three novel findings with high potential for translation. At the time of submission of this abstract, 10 months into this pilot, two manuscripts have been submitted and a third is being finalized. The first two articles appeared online, prior to peer review, in March and September of this year in the journal “F1000Research”:

a) The first article reports the identification of “ADAM metallopeptidase 9” (ADAM9) as a candidate biomarker for the specific assessment of tissue damage caused by infection, independently of pathogen-driven inflammatory processes [2]. This work revealed a new potential role for ADAM9 in immunological homeostasis and pathogenesis. The abundance of ADAM9 transcripts in the blood was increased in patients with acute infection but changed very little after in vitro exposure to a wide range of pathogen-associated molecular patterns (PAMPs). Furthermore, it was found to increase significantly in subjects as a result of tissue injury or tissue remodeling, in the absence of infectious processes. This marker could therefore potentially be used in the emergency room as a triage tool for patients presenting with symptoms of infection, who may or may not require hospitalization.

b) The second article reports the identification of blood molecular signatures that correlate with protection, or lack thereof, conferred by the RTS,S malaria vaccine [3]. This finding is important because this vaccine, which was licensed this year by European regulatory authorities, only protects about 40% of vaccinated individuals. Understanding the mechanisms that undermine the efficacy of this vaccine could lead to the development of a universally protective prophylactic modality for a disease that affects about 200 million people and causes 500,000 deaths each year worldwide.

c) A third report, which is being finalized, investigates the role of Aquaporin 9 (AQP9), a water-selective membrane channel protein. This molecule is regulated during infection and appears to play a role in maintaining the elevated metabolism associated with inflammatory responses. It may also inadvertently promote pathogen growth, with adverse consequences in conditions such as pregnancy, where elevated baseline metabolic states may contribute to enhanced disease severity. Finally, our observations also suggest a role for AQP9 in mediating pathogen clearance via phagocytosis.


The approach that we have devised recapitulates all the steps involved in the scientific discovery process, from data interpretation to knowledge dissemination. It allows screening, identification and prioritization of potential knowledge gaps, followed by in silico validation and hypothesis generation and testing, finally resulting in the preparation and publication of reports in peer-reviewed journals. Its effectiveness as a training platform stems from the fact that it does not rely on de novo generation of data for discovery and validation. Instead, it leverages the vast amounts of available biomedical data, which will allow for accelerated and highly efficient hands-on training of aspiring biomedical researchers.


[1] Speake C, Presnell S, Domico K, Zeitner B, Bjork A, Anderson D et al. An interactive web application for the dissemination of human systems immunology data. J Transl Med. 2015;13:196. doi:10.1186/s12967-015-0541-x

[2] Rinchai D, Kewcharoenwong C, Kessler B et al. Abundance of ADAM9 transcripts increases in the blood in response to tissue damage [version 1; referees: 3 approved with reservations] F1000Research 2015, 4:89 (doi: 10.12688/f1000research.6241.1)

[3] Rinchai D, Presnell S and Chaussabel D. Blood Interferon Signatures Putatively Link Lack of Protection Conferred by the RTS,S Recombinant Malaria Vaccine to an Antigen-specific IgE Response [version 1; referees: awaiting peer review] F1000Research 2015, 4:919 (doi: 10.12688/f1000research.7093.1)

