Bioinformatics Algorithms for Knowledge Extraction in Biomedical Data
Recent advances in biomedical technologies have led to the availability of biomedical data. Many open access repositories offer researchers a large amount of heterogeneous data that have to be properly characterized. Bioinformatics, an interdisciplinary field that integrates biology and computer science, arises for organizing and understanding the information contained in genomic and biomedical data. This Ph.D. thesis focuses on the study of structures of biological macromolecules, with an emphasis on the analysis of primary (i.e., nucleotide sequence) and secondary (i.e., folding of the sequence) structures, by developing ad-hoc algorithms and computational procedures for knowledge extraction. On the one hand, this work deals with classification analysis applied to primary structures, by means of supervised machine learning algorithms, jointly with feature selection approaches to search for relevant subset(s) of nucleotides (chapters 2 and 3), as well as to address case-control studies (chapter 4). On the other hand, the work addresses the challenging topic of folding and comparing secondary structures of small and long RNAs, looking for structural motifs shared among them (chapter 5). The adopted approach focuses on the development of ad-hoc algorithms and then on applying them to relevant biomedical problems. The results discussed in this dissertation were achieved during my Ph.D. program at the Department of Computer, Control, and Management Engineering (DIAG) of Sapienza University of Rome, jointly with the Institute for Systems Analysis and Computer Science "A. Ruberti" (IASI) of the National Research Council (CNR) of Rome. They are consolidated by several journal publications and international conference proceedings, which are referenced in the text (and listed at the end) and constitute the research output of this dissertation.
Supervised DNA Barcode species classification: analysis, comparison and results
BACKGROUND:
Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms.
METHODS:
In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods.
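As a rough illustration of how one of the classifier families named above can treat aligned Barcode sequences, the following is a minimal pure-Python sketch of a categorical Naïve Bayes classifier in which every alignment position is a nominal feature. It is not the Weka implementation used in the work; the function names, the four-letter alphabet, and the Laplace smoothing are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # assumption: gaps and ambiguity codes are ignored in this sketch


def train_naive_bayes(sequences, labels):
    """Count nucleotides per species and per alignment position."""
    class_counts = Counter(labels)
    # counts[species][position][nucleotide] -> number of training specimens
    counts = defaultdict(lambda: defaultdict(Counter))
    for seq, label in zip(sequences, labels):
        for pos, nucleotide in enumerate(seq):
            counts[label][pos][nucleotide] += 1
    return class_counts, counts


def predict(model, seq):
    """Return the species with the highest posterior log-probability."""
    class_counts, counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for pos, nucleotide in enumerate(seq):
            # Laplace smoothing so unseen nucleotides do not zero the product
            seen = counts[label][pos][nucleotide]
            score += math.log((seen + 1) / (n + len(ALPHABET)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train_naive_bayes(
        ["ACGT", "ACGA", "TTGT", "TTGA"],
        ["species_a", "species_a", "species_b", "species_b"],
    )
    print(predict(model, "ACGG"))
```

The resulting model is a table of per-position probabilities rather than a set of readable rules, so it does not by itself point to species-specific positions.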
RESULTS:
Software that converts DNA Barcode FASTA sequences to the Weka format is released, to accommodate different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human-interpretable classification model. Rule-based methods have slightly inferior classification performance, but deliver the species-specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performance with respect to the traditional DNA Barcode classification methods. On empirical data their classification performance is comparable to that of the other methods.
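The conversion step mentioned above can be pictured with a small sketch that turns aligned FASTA records into Weka's ARFF text format. This is not the released converter. The header layout `specimen_id|species`, the attribute names `pos1`, `pos2`, and so on, and the mapping of every non-ACGT symbol to the ARFF missing-value marker `?` are all assumptions made for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_arff(text, relation="barcode"):
    """Build ARFF text with one nominal attribute per alignment position."""
    records = []
    for header, seq in parse_fasta(text):
        # Hypothetical header layout: ">specimen_id|species name"
        species = header.split("|", 1)[1].strip().replace(" ", "_")
        records.append((species, seq))
    if not records:
        raise ValueError("no FASTA records found")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to the same length")

    species_names = sorted({species for species, _ in records})
    lines = [f"@relation {relation}", ""]
    for i in range(1, length + 1):
        lines.append(f"@attribute pos{i} {{A,C,G,T}}")
    lines.append(f"@attribute species {{{','.join(species_names)}}}")
    lines += ["", "@data"]
    for species, seq in records:
        # Gaps and ambiguity codes become '?', the ARFF missing-value marker
        values = [nt if nt in "ACGT" else "?" for nt in seq]
        lines.append(",".join(values + [species]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = ">s1|Homo sapiens\nACGT\n>s2|Pan troglodytes\nAC\nNA\n"
    print(fasta_to_arff(sample))
```

Placing the species as the last attribute follows the common Weka convention of treating the final attribute as the class.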
CONCLUSIONS:
The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performance. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community.
Network medicine and systems pharmacology approaches to predicting adverse drug effects
Identifying and understanding the relationships between drug intake and adverse effects that can occur due to inadvertent molecular interactions between drugs and targets is a difficult task, especially considering the numerous variables that can influence the onset of such events. The ability to predict these side effects in advance would help physicians develop strategies to avoid or counteract them. In this article, we review the main computational methods for predicting side effects caused by drug molecules, highlighting their performance, limitations and application cases. Furthermore, we provide an overview of resources, such as databases and tools, useful for building side effect prediction analyses.
A new procedure to analyze RNA non-branching structures
RNA structure prediction and structural motif analysis are challenging tasks in the investigation of RNA function. We propose a novel procedure to detect structural motifs shared between two RNAs (a reference and a target). In particular, we developed two core modules: (i) nbRSSP_extractor, to assign a unique structure to the reference RNA encoded by a set of non-branching structures; (ii) SSD_finder, to detect structural motifs that the target RNA shares with the reference, by means of a new score function that rewards the relative distance of the target non-branching structures compared to the reference ones. We integrated these algorithms with already existing software to obtain a coherent pipeline able to perform the following two main tasks: prediction of RNA structures (integration of RNALfold and nbRSSP_extractor) and search for chains of matches (integration of Structator and SSD_finder).
CAMUR: Knowledge extraction from RNA-seq cancer data through equivalent classification rules
Nowadays, knowledge extraction methods for Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State-of-the-art algorithms compute a single classification model that contains few features (genes). In contrast, our goal is to elicit a larger amount of knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class.
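One way to picture the idea of computing many classification models instead of one is the following minimal sketch. It is not the CAMUR procedure: it assumes a toy rule learner that fits a single-gene threshold rule, then removes the gene it used and refits, for as long as the new rule stays above an accuracy cutoff. The function names, the `min_accuracy` parameter, and the rule form are all hypothetical.

```python
def best_threshold_rule(samples, labels, genes):
    """Return (accuracy, gene, threshold, case_if_above) for the best one-gene rule."""
    best = None
    for gene in genes:
        values = sorted({sample[gene] for sample in samples})
        # Candidate thresholds are midpoints between consecutive observed values
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for case_if_above in (True, False):
                hits = sum(
                    ((sample[gene] > threshold) == case_if_above) == (label == "case")
                    for sample, label in zip(samples, labels)
                )
                accuracy = hits / len(samples)
                if best is None or accuracy > best[0]:
                    best = (accuracy, gene, threshold, case_if_above)
    return best


def equivalent_rules(samples, labels, min_accuracy=0.9):
    """Collect rules on different genes that all reach the accuracy cutoff."""
    remaining = set(samples[0])
    rules = []
    while remaining:
        best = best_threshold_rule(samples, labels, sorted(remaining))
        if best is None or best[0] < min_accuracy:
            break
        rules.append(best)
        remaining.discard(best[1])  # force the next model to use another gene
    return rules


if __name__ == "__main__":
    samples = [
        {"G1": 9.0, "G2": 1.0, "G3": 5.0},
        {"G1": 8.0, "G2": 2.0, "G3": 1.0},
        {"G1": 1.0, "G2": 8.0, "G3": 5.5},
        {"G1": 2.0, "G2": 9.0, "G3": 1.5},
    ]
    labels = ["case", "case", "control", "control"]
    for accuracy, gene, threshold, case_if_above in equivalent_rules(samples, labels):
        side = ">" if case_if_above else "<="
        print(f"{gene} {side} {threshold} => case (accuracy {accuracy:.2f})")
```

In this toy dataset both G1 and G2 separate cases from controls, so a single-model learner would report only one of them, while the iterative loop surfaces both.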
Analysis of microarray and RNA-sequencing gene expression profiles through clustering and classification techniques
MONSTER v1.1: a tool to extract and search for RNA non-branching structures
Background: Detection of RNA structure similarities is still one of the major computational problems in the discovery of RNA functions. A case in point is the study of the newly appreciated long non-coding RNAs (lncRNAs), emerging as new players involved in many cellular processes and molecular interactions. Among several mechanisms of action, some lncRNAs show specific substructures that are likely to be instrumental for their functioning. For instance, it has been reported in the literature that some lncRNAs have a guiding or scaffolding role by binding chromatin-modifying protein complexes. Thus, a functionally characterized lncRNA (reference) can be used to infer the function of others that are functionally unknown (target), based on shared structural motifs. Methods: In our previous work we presented a tool, MONSTER v1.0, able to identify structural motifs shared between two full-length RNAs. Our procedure is mainly composed of two ad-hoc developed algorithms: nbRSSP_extractor, for characterizing the folding of an RNA sequence by means of a sequence-structure descriptor (i.e., an array of non-overlapping substructures located on the RNA sequence and coded in dot-bracket notation); and SSD_finder, an effective search engine for groups of matches (i.e., chains) common to the reference and target RNA, based on a dynamic programming approach with a new score function. Here, we present an updated version (MONSTER v1.1) that accounts for the peculiar feature of lncRNAs, which are not expected to have a unique fold but appear to fluctuate among a large number of equally stable folds. In particular, we improved our SSD_finder algorithm in order to take into account all the alternative equally stable structures. Results: We present an application of MONSTER v1.1 to lincRNAs, a specific class of lncRNAs located in genomic regions that do not overlap protein-coding genes. In particular, we provide reliable predictions of the shared chains between HOTAIR, ANRIL and COLDAIR. These lincRNAs interact with the same protein complexes of the Polycomb group and hence are expected to share structural motifs. Software availability: the software package is provided as additional file 1 ("archive_updated.zip").
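To make the notion of a sequence-structure descriptor more concrete, the sketch below extracts non-overlapping stem-loops from a dot-bracket string. It is not nbRSSP_extractor. It assumes that a "non-branching structure" is a maximal base-paired region enclosing exactly one hairpin loop (so it may contain bulges and internal loops but no multiloop), and the function name and return format are hypothetical.

```python
def non_branching_structures(dot_bracket):
    """Return (start, end, substructure) for each maximal single-hairpin stem-loop."""
    stack = []      # positions of '(' still waiting for their partner
    pending = {}    # open position -> pairs nested directly inside it so far
    kids = {}       # closed pair -> pairs nested directly inside it
    hairpins = {}   # closed pair -> number of hairpin loops it encloses
    top_level = []  # pairs not enclosed by any other pair

    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
            pending[j] = []
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket string")
            i = stack.pop()
            nested = pending.pop(i)
            kids[(i, j)] = nested
            # A pair with nothing paired inside closes a hairpin loop
            hairpins[(i, j)] = sum(hairpins[p] for p in nested) if nested else 1
            if stack:
                pending[stack[-1]].append((i, j))
            else:
                top_level.append((i, j))
    if stack:
        raise ValueError("unbalanced dot-bracket string")

    # Walk down from the outermost pairs and stop at the first pair that
    # encloses a single hairpin: that pair delimits a non-branching structure.
    found, todo = [], list(top_level)
    while todo:
        i, j = todo.pop()
        if hairpins[(i, j)] == 1:
            found.append((i, j, dot_bracket[i:j + 1]))
        else:
            todo.extend(kids[(i, j)])
    return sorted(found)


if __name__ == "__main__":
    for start, end, sub in non_branching_structures("((..((...))..((...))..))"):
        print(start, end, sub)
```

Positions are 0-based and inclusive. In the multiloop example the outer helix is skipped because it encloses two hairpins, and only the two inner stem-loops are reported.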
