
    String-matching and alignment algorithms for finding motifs in NGS data

    The development of high-throughput Next Generation Sequencing (NGS) technologies makes it possible to extract, at low cost, an extremely large number of biological sequences in the form of reads, i.e., short fragments of an organism’s genome. The advent of NGS poses new challenges for computer scientists and bioinformaticians, leading to the design of algorithms for aligning and merging reads in order to reconstruct the genome efficiently and effectively. In this chapter, we focus on methods that can quickly and precisely establish whether two reads are similar, and that make it possible to analyze biological sequences extracted with NGS technologies. In particular, the most widespread string-matching, alignment-based, and alignment-free algorithms are summarized and discussed.
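    As a rough illustration of the two families of read-comparison methods named in this abstract, the sketch below contrasts an alignment-free measure (Jaccard similarity over k-mer sets) with an alignment-based one (Levenshtein edit distance). These are generic textbook techniques chosen for illustration, not necessarily the ones the chapter covers; the function names and the default k = 3 are assumptions.

```python
def kmer_set(read: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Alignment-free similarity: shared k-mers divided by all distinct k-mers."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 1.0  # two reads shorter than k are treated as identical
    return len(ka & kb) / len(ka | kb)


def edit_distance(a: str, b: str) -> int:
    """Alignment-based distance: Levenshtein dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, or match/substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

    The k-mer measure never builds an alignment, so its cost grows roughly linearly with read length, whereas the dynamic-programming distance is quadratic but reports the exact number of edits.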

    TCGA2BED: Extracting, extending, integrating, and querying The Cancer Genome Atlas

    Background: Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomic and clinical data that are increasingly available. In this work, we focus on The Cancer Genome Atlas (TCGA), a comprehensive archive of tumoral data containing the results of high-throughput experiments, mainly Next Generation Sequencing, for more than 30 cancer types. Results: We propose TCGA2BED, a software tool to search and retrieve TCGA data and convert them into the structured BED format for their seamless use and integration. Additionally, it supports conversion into the CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). We also provide and maintain an automatically updated data repository with publicly available Copy Number Variation, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1, V2) experimental data of TCGA converted into the BED format, together with their associated clinical and biospecimen metadata in attribute-value text format. Conclusions: The availability of the valuable TCGA data in BED format reduces the time needed to take advantage of them: it becomes possible to deal efficiently and effectively with huge amounts of cancer genomic data in an integrated way, and to search, retrieve, and extend them with additional information. The BED format helps investigators carry out knowledge discovery analyses on all tumor types in TCGA, with the final aim of understanding pathological mechanisms and aiding cancer treatments.
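    To make the conversion step concrete, here is a minimal sketch of formatting a genomic record as a six-column BED line. The `GeneRecord` type, its field names, and the toy values are hypothetical and are not taken from TCGA2BED; the only facts relied on are the standard BED conventions (tab-delimited columns, 0-based start, exclusive end).

```python
from typing import NamedTuple


class GeneRecord(NamedTuple):
    """Hypothetical input record with 1-based, inclusive coordinates."""
    chrom: str
    start: int
    end: int
    strand: str
    symbol: str


def to_bed_line(rec: GeneRecord) -> str:
    """Format a record as chrom, chromStart, chromEnd, name, score, strand."""
    chrom_start = rec.start - 1  # BED starts are 0-based
    chrom_end = rec.end          # BED ends are exclusive, so a 1-based inclusive end is unchanged
    fields = [rec.chrom, str(chrom_start), str(chrom_end), rec.symbol, "0", rec.strand]
    return "\t".join(fields)


# Toy record, not real annotation data.
line = to_bed_line(GeneRecord("chr1", 101, 200, "+", "GENE_A"))
```

    Keeping the coordinate shift in one place is the main design point: off-by-one errors between 1-based and 0-based conventions are a common source of bugs when mixing annotation sources.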

    Clustering and Classification Techniques for Gene Expression Profile Pattern Analysis

    The analysis of gene expression profiles from microarray and RNA sequencing (RNA-Seq) experimental samples demands new, efficient methods from statistics and computer science. This chapter considers two main types of gene expression data analysis: gene clustering and experiment classification. It introduces transcriptome analysis, highlighting the most widespread approaches to it, and provides an overview of the microarray and RNA-Seq technologies. In addition, the integrated software packages GenePattern, Gene Expression Logic Analyzer (GELA), and the TM4 software suite, along with other common analysis tools, are illustrated. For gene expression profile pattern discovery and experiment classification, the software packages are tested on four real case studies: Alzheimer's disease versus healthy mice; multiple sclerosis samples; psoriasis tissues; and breast cancer patients. The experiments performed and the techniques described provide an effective overview of the field of gene expression profile classification and clustering through pattern analysis.

    TCGA2BED: converting and querying The Cancer Genome Atlas.

    Motivation: Thanks to the great advances in biomedical technologies, we are faced with huge amounts of genomic and clinical data. A striking example is The Cancer Genome Atlas (TCGA), one of the largest public repositories of genomic and clinical data about cancer. TCGA contains more than 15 TB of genomic and clinical data, whose analysis and interpretation pose great challenges to the bioinformatics community. In this work, we focus on data retrieval, conversion, integration, and querying of Next Generation Sequencing (NGS) data and their clinical information extracted from TCGA. In particular, we focus on all publicly available Copy Number Variation (CNV), DNA-methylation, DNA-sequencing (DNA-seq), Gene Expression (RNA-seq V1 and V2), microRNA sequencing (miRNA-seq), and meta (clinical and biospecimen) data.
    Methods: We propose TCGA2BED (http://bioinf.iasi.cnr.it/tcga2bed/), a software tool able to retrieve genomic and clinical data from TCGA and convert them into the tab-delimited BED format. Additionally, it integrates them with external data (e.g., gene coordinates) from other state-of-the-art biological databases and services such as UCSC Genome Browser, HUGO Gene Nomenclature Committee (HGNC), NCBI Gene, and miRBase. TCGA2BED is available with a graphical user interface and includes three main components:
    • the controller, which reads and executes the user’s requests (i.e., data download and conversion) through the graphical user interface or an XML configuration file;
    • the retrieval system, which handles the search and retrieval of the public genomic and clinical data available from TCGA by building ad-hoc queries and sending them to the REST service of TCGA;
    • the BioParser, which converts all TCGA genomic data types (i.e., CNV, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq V1 and V2) into the tab-delimited BED format, and all their related clinical metadata into a tab-delimited attribute-value text format.
    Results: Using TCGA2BED, we downloaded and converted all publicly available CNV, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq V1 and V2 experimental and meta data from TCGA. For each patient sample, cancer type, and experiment type in TCGA, we create (i) a .bed file, containing the genomic data of the sample converted into BED format, and (ii) a .meta file, including the clinical data of the sample; additionally, we create (iii) a header.schema file in XML format that describes the structure of the .bed data files, and (iv) a .txt metadata dictionary file that contains all metadata attributes with all the values that each attribute assumes in the metadata. The converted TCGA data can be easily processed and analysed with widespread bioinformatics tools, including the GenoMetric Query Language (GMQL), available at http://www.bioinformatics.deib.polimi.it/GMQL/, a key instrument for the integrative querying of genomic and clinical big data from heterogeneous sources. Here we report an example GMQL query that integrates DNA-seq and RNA-seq data; for each tumor sample of each patient, it searches for and returns the DNA mutations that are closest to expressed genes:
        DNA = SELECT(*) DNAseq;
        RNA = SELECT(*) RNAseq;
        JoinDnaToRna = JOIN(left->bcr_sample_barcode == right->bcr_sample_barcode, MINDISTANCE(1), left) DNA RNA;
        MATERIALIZE JoinDnaToRna;
    The use of the BED format reduces the time spent managing and analyzing the valuable TCGA data: it is possible to deal efficiently with huge amounts of cancer data, and to easily integrate and query them using GMQL. The BED format makes it easier for investigators to perform knowledge discovery analyses aimed at aiding cancer treatments. For example, the TCGA data in BED format can be straightforwardly analyzed with CAMUR, a tool that uses a supervised approach to elicit a large amount of knowledge by computing many rule-based classification models, and is therefore able to identify most of the clinical and genomic features related to the predicted cancer type.
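    The intent of the GMQL query in this abstract (pair each mutation with the nearest expressed gene of the same sample) can be approximated in plain Python as follows. This is only a sketch of the idea, not GMQL's implementation or its exact distance semantics: the `Interval` type, the `gap` definition (0 for overlapping or adjacent regions), and the toy data are all assumptions, and the `sample` field merely stands in for `bcr_sample_barcode`.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    sample: str  # stands in for bcr_sample_barcode
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # exclusive
    label: str


def gap(a: Interval, b: Interval) -> Optional[int]:
    """Bases separating two regions; 0 if they overlap or touch, None across chromosomes."""
    if a.chrom != b.chrom:
        return None
    return max(0, a.start - b.end, b.start - a.end)


def closest_genes(mutations: list[Interval], genes: list[Interval]):
    """For each mutation, return (mutation, nearest gene of the same sample, gap)."""
    result = []
    for m in mutations:
        best, best_gap = None, None
        for g in genes:
            if g.sample != m.sample:
                continue  # join condition: same sample only
            d = gap(m, g)
            if d is not None and (best_gap is None or d < best_gap):
                best, best_gap = g, d
        if best is not None:  # mutations without any candidate gene are dropped
            result.append((m, best, best_gap))
    return result


# Toy data, invented for illustration.
mutations = [Interval("S1", "chr1", 100, 101, "mutA"),
             Interval("S2", "chr1", 100, 101, "mutB")]
genes = [Interval("S1", "chr1", 150, 300, "G1"),
         Interval("S1", "chr1", 0, 90, "G2"),
         Interval("S1", "chr2", 100, 200, "G3")]
pairs = closest_genes(mutations, genes)
```

    The nested loop is quadratic in the number of regions; it is meant to show the join semantics, not to scale to full TCGA data.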

    TCGA2BED and CAMUR for cancer NGS data processing

    In this work, we focus on data retrieval, conversion, integration, and querying of Next Generation Sequencing (NGS) data and their clinical information extracted from TCGA. In particular, we focus on all publicly available Copy Number Variation (CNV), DNA-methylation, DNA-sequencing (DNA-seq), Gene Expression (RNA-seq V1 and V2), microRNA sequencing (miRNA-seq), and meta (clinical and biospecimen) data. We propose TCGA2BED (http://bioinf.iasi.cnr.it/tcga2bed/), a software tool able to retrieve genomic and clinical data from TCGA and convert them into the tab-delimited BED format. Additionally, it integrates them with external data (e.g., gene coordinates) from other state-of-the-art biological databases and services such as UCSC Genome Browser, HUGO Gene Nomenclature Committee (HGNC), NCBI Gene, and miRBase. Using TCGA2BED, we downloaded and converted all publicly available CNV, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq V1 and V2 experimental and meta data from TCGA. The converted TCGA data can be easily processed and analysed with widespread bioinformatics tools, including the GenoMetric Query Language (GMQL), a key instrument for the integrative querying of genomic and clinical big data from heterogeneous sources. The use of the BED format reduces the time spent managing and analyzing the valuable TCGA data: it is possible to deal efficiently with huge amounts of cancer data, and to easily integrate and query them using GMQL. The BED format makes it easier for investigators to perform knowledge discovery analyses aimed at aiding cancer treatments. For example, the TCGA data in BED format can be straightforwardly analyzed with CAMUR, a tool that uses a supervised approach to elicit a large amount of knowledge by computing many rule-based classification models, and is therefore able to identify most of the clinical and genomic features related to the predicted cancer type.

    Integer programming models for feature selection: new extensions and a randomized solution algorithm

    Feature selection methods are used in machine learning and data analysis to select a subset of features that can be used successfully to construct a model for the data. These methods are applied under the assumption that many of the available features are often redundant for the purpose of the analysis. In this paper, we focus on a particular method for feature selection in supervised learning problems, based on a linear programming model with integer variables. To solve the optimization problem associated with this approach, we propose a novel, robust metaheuristic algorithm that relies on a Greedy Randomized Adaptive Search Procedure, extended with the adoption of short memory and a local search strategy. The performance of our heuristic algorithm compares favorably with that of well-established feature selection methods, on both simulated data and real data from biological applications. The results obtained suggest that our method is particularly suited to problems with a very large number of binary or categorical features.
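    The general shape of a Greedy Randomized Adaptive Search Procedure (GRASP) for feature selection can be sketched as below. The abstract does not spell out the integer model, so this sketch assumes a set-covering formulation: choose as few features as possible so that every pair of samples from different classes differs on at least one chosen feature. The memory extension mentioned in the abstract is not included, and all function names and parameters (`alpha`, `iterations`, `seed`) are illustrative.

```python
import random
from itertools import combinations


def pairs_covered_by(X, y):
    """Map each feature index to the different-class sample pairs it separates."""
    n_features = len(X[0]) if X else 0
    cover = {f: set() for f in range(n_features)}
    for i, j in combinations(range(len(X)), 2):
        if y[i] == y[j]:
            continue
        for f in range(n_features):
            if X[i][f] != X[j][f]:
                cover[f].add((i, j))
    return cover


def construct(cover, universe, alpha, rng):
    """Greedy randomized construction using a restricted candidate list (RCL)."""
    selected, uncovered = [], set(universe)
    while uncovered:
        gains = {}
        for f, pairs in cover.items():
            gain = len(pairs & uncovered)
            if gain > 0:
                gains[f] = gain
        best, worst = max(gains.values()), min(gains.values())
        threshold = best - alpha * (best - worst)  # alpha=0 is pure greedy
        rcl = sorted(f for f, g in gains.items() if g >= threshold)
        choice = rng.choice(rcl)
        selected.append(choice)
        uncovered -= cover[choice]
    return selected


def prune(selected, cover, universe):
    """Local search: drop any feature whose removal keeps every pair covered."""
    current = list(selected)
    for f in sorted(current, key=lambda f: len(cover[f])):
        rest = [g for g in current if g != f]
        if set().union(*(cover[g] for g in rest)) >= universe:
            current = rest
    return current


def grasp_feature_selection(X, y, iterations=50, alpha=0.3, seed=0):
    """Repeat construction plus local search and keep the smallest feature set."""
    rng = random.Random(seed)
    cover = pairs_covered_by(X, y)
    # Pairs that no feature can separate are left out of the universe.
    universe = set().union(*cover.values())
    best = None
    for _ in range(iterations):
        candidate = prune(construct(cover, universe, alpha, rng), cover, universe)
        if best is None or len(candidate) < len(best):
            best = candidate
    return sorted(best)
```

    For instance, `grasp_feature_selection([[0, 1], [1, 0], [1, 1]], [0, 1, 2])` has to keep both features, because each one separates a class pair that the other cannot.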