TCGA2BED: Extracting, extending, integrating, and querying The Cancer Genome Atlas
Background: Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomic and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas (TCGA), a comprehensive archive of tumor data containing the results of high-throughput experiments, mainly Next Generation Sequencing, for more than 30 cancer types. Results: We propose TCGA2BED, a software tool to search and retrieve TCGA data and convert them into the structured BED format for their seamless use and integration. Additionally, it supports conversion into the CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). We also provide and maintain an automatically updated data repository with publicly available Copy Number Variation, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1, V2) experimental data of TCGA converted into the BED format, and their associated clinical and biospecimen metadata in attribute-value text format. Conclusions: The availability of the valuable TCGA data in BED format reduces the time needed to take advantage of them: it becomes possible to deal efficiently and effectively with huge amounts of cancer genomic data in an integrated way, and to search, retrieve, and extend them with additional information. The BED format helps investigators perform knowledge discovery analyses on all tumor types in TCGA, with the final aim of understanding pathological mechanisms and aiding cancer treatments.
Procedure for determining residual stresses in longitudinally welded tubes
TCGA2BED: converting and querying The Cancer Genome Atlas.
Motivation
Thanks to the great advances in biomedical technologies, we are faced with huge amounts of genomic and clinical data. A striking example is The Cancer Genome Atlas (TCGA), one of the largest public repositories of genomic and clinical data about cancer. TCGA contains more than 15 TB of genomic and clinical data, whose analysis and interpretation are posing great challenges to the bioinformatics community.
In this work, we focus on data retrieval, conversion, integration and querying of Next Generation Sequencing (NGS) data and their clinical information extracted from TCGA. In particular, we focus on all publicly available Copy Number Variation (CNV), DNA-methylation, DNA-sequencing (DNA-seq), Gene Expression (RNA-seq V1 and V2), microRNA sequencing (miRNA-seq), and meta (clinical and biospecimen) data.
Methods
We propose TCGA2BED (http://bioinf.iasi.cnr.it/tcga2bed/), a software tool able to retrieve genomic and clinical data from TCGA and convert them into the tab-delimited BED format. Additionally, it integrates them with external data (e.g., gene coordinates) from other state-of-the-art biological databases and services such as UCSC Genome Browser, HUGO Gene Nomenclature Committee (HGNC), NCBI Gene, and miRBase. TCGA2BED is available with a graphical user interface and includes three main components:
• the controller, which reads and executes the user’s requests (i.e., data download and conversion) through the graphical user interface or an XML configuration file;
• the retrieval system, which handles the search and retrieval of the public genomic and clinical data available from TCGA by building ad-hoc queries and sending them to the REST service of TCGA;
• the BioParser, which converts all TCGA genomic data types (i.e., CNV, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq V1 and V2) into the tab-delimited BED format, and all their related clinical metadata into a tab-delimited attribute-value text format.
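As a rough illustration of the BioParser's two output shapes, the following Python sketch formats one genomic region as a tab-delimited BED-style line and one sample's clinical data as tab-delimited attribute-value lines. The function names, the column order (chromosome, start, end, strand, then data-specific fields), and the example attributes are assumptions made for illustration; they are not taken from TCGA2BED itself.

```python
def to_bed_line(chrom, start, end, strand, *extra):
    """Format one genomic region as a tab-delimited BED-style line.

    The column order used here is an assumption, not TCGA2BED's schema.
    """
    if start < 0 or end < start:
        raise ValueError("invalid coordinates")
    fields = [chrom, str(start), str(end), strand, *map(str, extra)]
    return "\t".join(fields)


def to_meta_lines(attributes):
    """Format clinical metadata as tab-delimited attribute-value lines."""
    return [f"{key}\t{value}" for key, value in sorted(attributes.items())]


if __name__ == "__main__":
    # Hypothetical region and metadata, for demonstration only.
    print(to_bed_line("chr17", 7565097, 7590856, "-", "TP53", 12.5))
    for line in to_meta_lines({"gender": "FEMALE", "disease": "BRCA"}):
        print(line)
```

Keeping the genomic regions and the metadata in separate, line-oriented files means each can be read with ordinary tab-splitting code.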
Results
Using TCGA2BED, we downloaded and converted all publicly available CNV, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq V1 and V2 experimental data and metadata from TCGA. For each patient sample, cancer type, and experiment type in TCGA, we create (i) a .bed file, containing the genomic data of the sample converted into BED format, and (ii) a .meta file, including the clinical data of the sample. Additionally, we provide (iii) a header.schema file in XML format that describes the structure of the .bed data files, and (iv) a .txt metadata dictionary file that lists every metadata attribute together with all the values that it assumes in the metadata.
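The metadata dictionary described in (iv) can be pictured as a map from each attribute to the set of values it takes across all .meta files. The Python sketch below builds such a map, assuming each .meta file holds one tab-separated attribute-value pair per line. That layout and the function names are illustrative assumptions, not the tool's actual implementation.

```python
from collections import defaultdict


def parse_meta(text):
    """Parse attribute-value text (one 'attribute<TAB>value' pair per line)."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        pairs.append((key, value))
    return pairs


def build_dictionary(meta_texts):
    """Collect, for every attribute, all distinct values seen in any sample."""
    values = defaultdict(set)
    for text in meta_texts:
        for key, value in parse_meta(text):
            values[key].add(value)
    return {key: sorted(vals) for key, vals in sorted(values.items())}


if __name__ == "__main__":
    # Two hypothetical samples, for demonstration only.
    sample_a = "gender\tMALE\ndisease\tBRCA\n"
    sample_b = "gender\tFEMALE\ndisease\tBRCA\n"
    for attribute, seen in build_dictionary([sample_a, sample_b]).items():
        print(attribute, seen)
```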
The converted TCGA data can be easily processed and analysed with widespread bioinformatics tools, including the GenoMetric Query Language (GMQL), available at http://www.bioinformatics.deib.polimi.it/GMQL/, a key instrument for the integrative querying of genomic and clinical big data from heterogeneous sources. Here we report an example GMQL query that integrates DNA-seq and RNA-seq data; for each tumor sample of each patient, it searches for and returns the DNA mutations that are closest to expressed genes:
DNA = SELECT(*) DNAseq;
RNA = SELECT(*) RNAseq;
JoinDnaToRna = JOIN(left->bcr_sample_barcode == right->bcr_sample_barcode,
MINDISTANCE(1), left) DNA RNA;
MATERIALIZE JoinDnaToRna;
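To make the intent of this query concrete, here is a simplified Python sketch of the same idea: DNA-seq and RNA-seq samples are matched on their barcode, and each mutation is paired with the nearest gene region on the same chromosome. It is not GMQL's implementation, and it ignores details such as strand and tie handling. The data layout (a dictionary from barcode to a list of `(chrom, start, end)` tuples) and the function names are assumptions.

```python
def distance(a, b):
    """Gap between two (start, end) intervals; 0 if they overlap or touch."""
    return max(0, a[0] - b[1], b[0] - a[1])


def closest_join(dna_samples, rna_samples):
    """Pair each mutation with its nearest gene region in the matching sample.

    Both arguments map a sample barcode to a list of (chrom, start, end).
    """
    result = {}
    for barcode, mutations in dna_samples.items():
        genes = rna_samples.get(barcode)
        if not genes:
            continue  # no RNA-seq sample with the same barcode
        pairs = []
        for m in mutations:
            same_chrom = [g for g in genes if g[0] == m[0]]
            if not same_chrom:
                continue  # distance is undefined across chromosomes
            nearest = min(same_chrom, key=lambda g: distance(m[1:], g[1:]))
            pairs.append((m, nearest))
        result[barcode] = pairs
    return result


if __name__ == "__main__":
    # Hypothetical barcodes and coordinates, for demonstration only.
    dna = {"S1": [("chr1", 100, 101)], "S2": [("chr2", 5, 6)]}
    rna = {"S1": [("chr1", 150, 300), ("chr1", 1000, 2000)]}
    print(closest_join(dna, rna))
```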
The use of the BED format reduces the time spent in managing and analyzing the valuable TCGA data: it is possible to deal efficiently with huge amounts of cancer data, and to easily integrate and query them using GMQL. The BED format also helps investigators perform knowledge discovery analyses aimed at aiding cancer treatments. For example, the TCGA data in BED format can be straightforwardly analyzed with CAMUR, a tool that uses a supervised approach to extract a large amount of knowledge by computing many rule-based classification models, and is therefore able to identify most of the clinical and genomic features related to the predicted cancer type.
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
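The difference between the two counting schemes can be sketched in a few lines of Python. Here each citing paper is represented by its reference list, and each reference by its authors in byline order; two authors are counted as co-cited once for every citing paper whose references credit both of them. The data layout, the function name, and the choice to also count co-authors of the same cited work as co-cited are assumptions made for illustration, not the study's exact procedure.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations, crediting the first `max_authors` authors.

    max_authors=1 gives traditional first-author counting;
    max_authors=5 credits the first five authors of each cited work.
    """
    counts = Counter()
    for references in citing_papers:
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Every unordered pair of credited authors is co-cited once per paper.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Two hypothetical citing papers with placeholder author names.
    papers = [
        [["A", "B"], ["C"]],
        [["A"], ["C", "D"]],
    ]
    print(cocitation_counts(papers, max_authors=1))
    print(cocitation_counts(papers, max_authors=5))
```

With first-author counting only the pair (A, C) appears, whereas crediting up to five authors also brings B and D into the co-citation data.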
