Self-updating map
Navigating this entangled web of linked data becomes even more challenging in the context of personalizing care, which requires (i) identifying the reference sources most relevant to a given patient and (ii) delivering computational tools for personalized analysis of a subset of the data within the same computational environment used for data discovery. A recent use of data resources in the TCGA for morphological analysis (Cooper et al., 2012) underscores the increasing use of the TCGA as a universal cancer reference, not just for genomics information, but for full patient profiles.
The need to navigate through data portals, such as the one reported in Zhang et al. (2011a), or to download bulk data prevents programmatic exploration of the contents of the TCGA, which hampers the use of this wealth of data in point-of-care scenarios.
Conversely, the ability to programmatically identify which of the ½ million data files in the TCGA are relevant to a particular problem would enable not only large-scale, comprehensive study of cancer genomes, but also the creation of tools capable of real-time, on-the-fly analysis and presentation (for an example, see
This report details the creation and study of a road map of the TCGA’s HTTP repository, created to enable the use of this unprecedented biomolecular data resource in the creation of Web 3.0 applications, and enhance the reproducibility of biomolecular research delivered as elements of a computational ecosystem.
In a previous report (Deus et al., 2010), the authors identified a Resource Description Framework (RDF) data model describing the contents of the TCGA file repository.
The resulting RDF map of the TCGA contents is available (rdf.s3db.googlecode.com/hg/TCGA.rdf) and can be efficiently traversed by a SPARQL engine to quickly discover which files document results that satisfy any number of the constraints recognized by the model.
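The kind of constraint-based traversal described above can be sketched with a small triple-pattern matcher. This is a minimal illustration in plain Python, not a SPARQL engine and not the authors' implementation. The triples, the predicate names (`fromCenter`, `dataType`) and the file identifiers are hypothetical and do not reflect the vocabulary of the published TCGA RDF model.

```python
# Hypothetical triples (subject, predicate, object). The predicate names and
# file identifiers are illustrative assumptions, not the TCGA RDF vocabulary.
TRIPLES = [
    ("file1", "fromCenter", "centerA"),
    ("file1", "dataType", "copy-number"),
    ("file2", "fromCenter", "centerA"),
    ("file2", "dataType", "expression"),
    ("file3", "fromCenter", "centerB"),
    ("file3", "dataType", "copy-number"),
]


def is_var(term):
    """Terms starting with '?' are variables, as in SPARQL."""
    return term.startswith("?")


def match(triples, patterns, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for want, have in zip(first, triple):
            if is_var(want):
                # Bind the variable, or check it agrees with an earlier binding.
                if candidate.setdefault(want, have) != have:
                    ok = False
                    break
            elif want != have:
                ok = False
                break
        if ok:
            yield from match(triples, rest, candidate)


# Roughly equivalent to a SPARQL basic graph pattern with two constraints
# on the same ?file variable.
QUERY = [
    ("?file", "fromCenter", "centerA"),
    ("?file", "dataType", "copy-number"),
]

files = sorted(b["?file"] for b in match(TRIPLES, QUERY))
print(files)
```

Each additional constraint is one more pattern in `QUERY`, which mirrors how a SPARQL query narrows the set of matching files by adding triple patterns.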
For example, as illustrated in a webcast accompanying that manuscript ( GU4), one could identify which files describe patients from a specific cancer center that provided samples that were profiled for DNA copy number variation.
Indeed, as the TCGA and other collaborative initiatives of this scope evolve and expand, it is not reasonable to expect that they will conform to a narrowly defined format or structure for the
The Author(s) 2013. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
As a consequence, an attempt to use the 2010 RDF road map linked above to traverse the current contents of the TCGA initiative is likely to produce a significant number of unresolvable links to data files. Therefore, achieving persistent interoperability of the TCGA initiative requires a different data modeling approach, one that relies on a data model with a versioned data file road map engine.
Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF) and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory.
However, to realize this possibility, a continually updated road map of files in the TCGA is required.
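The idea of indexing an open HTTP directory into versioned triples can be sketched as follows. The engine described in this report is written in JavaScript, so this Python sketch is only a stand-in for the general technique under stated assumptions. The directory listing, the base URL, the `seenInSnapshot` predicate and the snapshot label are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML directory listing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def index_listing(base_url, html, snapshot):
    """Turn one directory listing into (subject, predicate, object) triples.

    Each entry is typed as a file or directory and tagged with the snapshot
    label, so that successive crawls can be compared (hypothetical scheme).
    """
    parser = LinkCollector()
    parser.feed(html)
    triples = []
    for href in parser.hrefs:
        # Skip column-sort links and links that leave the current directory.
        if href.startswith(("?", "/", "..")):
            continue
        url = urljoin(base_url, href)
        kind = "Directory" if href.endswith("/") else "File"
        triples.append((url, "rdf:type", kind))
        triples.append((url, "seenInSnapshot", snapshot))
    return triples


# Hypothetical listing in the style of a web server's auto-generated index.
LISTING = """
<a href="?C=N;O=D">Name</a>
<a href="/parent/">Parent Directory</a>
<a href="batch1/">batch1/</a>
<a href="README.txt">README.txt</a>
"""

triples = index_listing("http://example.org/data/", LISTING, "2013-01-01")
for t in triples:
    print(t)
```

Tagging every entry with a snapshot label is one simple way to keep the road map versioned: links that no longer resolve show up as entries missing from the latest snapshot rather than as silent failures.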