
    Towards a novel method for small indels detection using Illumina/Solexa data

    Motivation: While it was long assumed that most genomic variation within species is due to single nucleotide polymorphisms (SNPs), the importance of genomic rearrangements such as insertions, deletions, inversions and duplications has recently become clear. Traditional sequencing-based methods for the discovery of genomic structural variants (SV) rely on the mapping of paired-end reads (PEM). In this approach, paired-end sequence reads are generated from a library of genomic DNA with a narrow range of insert sizes. The reads are mapped to a reference genome, and pairs mapping at a distance that is substantially different from the expected length, or with anomalous orientation, suggest structural variants. While earlier PEM-based methods used low-coverage Sanger sequencing, in the last few years the advent of Next Generation Sequencing has accelerated the characterization of genomic structural variation, but has also required the development of new bioinformatics approaches. In this abstract we present a novel method for the detection of small indels using Illumina/Solexa data. Methods: First we map paired-end reads from the donor genome to the reference genome and estimate the general distribution of distances (GDD) between the mapped mate pairs. We consider only mate pairs that map to the reference genome at approximately the expected distance (80-300 bp) with unique mapping solutions. Then, for every position in the genome, we calculate the mean and the variance of the distance between every read spanning that position and its mate, in a strand-specific fashion. At this point we use a two-tailed Welch's t-test to assess whether the population of distances for a particular genomic position and strand has a mean that differs significantly from the mean of the GDD. We then cluster neighbouring anomalous genomic positions into anomalous windows (AW).
    The presence of an indel is also expected to produce a peak of unpaired mapped tags upstream and downstream of the indel, at a distance approximately equal to the mean of the GDD, because reads covering the junctions of rearrangement events will not map to the reference genome. Given sufficient sequencing depth, an AW that is a genuine hallmark of an indel should therefore be linked to a peak of unpaired mapped tags. We use a one-tailed Welch's t-test to verify that the mean coverage from unpaired reads is significantly higher than the genomic mean at the expected distance upstream or downstream of any given AW. Finally, we join the validated AW from both strands to reconstruct the indel events. Results: To assess the performance of our approach we generated an artificial dataset consisting of 2000 indels ranging in size from 10 to 250 bp (10, 20, 30, 40, 60, 80, 120, 150, 180 and 250 bp) on the mitochondrial genome of V. vinifera strain PN40040 at different coverage levels. Preliminary results suggest that our approach performs better than other methods for the detection of short insertions and deletions, with an average recovery rate of 96.4% (insertions and deletions combined) at high coverage (>40X) for indels 30 to 80 bp long, with a false positive rate of 0.1%. Remarkably, our method performs well in detecting this category of indels even at moderate to low sequencing coverage (recovery rate of 83% at 10X coverage for deletions of 40 bp with a false positive rate of 1.5%; recovery rate of 93% with no false positives for deletions of 10 to 20 bp when the coverage is >= 30X). However, detection of very small insertions remains less tractable, probably due to the asymmetric distribution of distances between paired-end reads. High recovery and low false positive rates are also observed in the detection of larger deletions, while larger insertions are inherently difficult to detect when the mean of the GDD is small.
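
    The per-position test and the clustering step described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `welch_t` and `anomalous_windows`, the `max_gap` parameter and the toy distances are all assumptions made for the example. The code computes only Welch's t statistic and the Welch-Satterthwaite degrees of freedom; turning these into a p-value requires the t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b.

    Here `a` stands for the mate-pair distances observed at one position and
    strand, and `b` for a sample from the genome-wide distribution (GDD).
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the difference
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def anomalous_windows(positions, max_gap=1):
    """Merge anomalous positions at most `max_gap` apart into (start, end) windows.

    `max_gap` is a hypothetical parameter; the abstract does not state how
    neighbouring positions are defined.
    """
    windows = []
    for pos in sorted(positions):
        if windows and pos - windows[-1][1] <= max_gap:
            windows[-1] = (windows[-1][0], pos)  # extend the current window
        else:
            windows.append((pos, pos))           # start a new window
    return windows


if __name__ == "__main__":
    local = [200, 210, 190, 205, 195]   # toy distances at one position
    gdd = [150, 160, 140, 155, 145]     # toy sample standing in for the GDD
    print(welch_t(local, gdd))
    print(anomalous_windows([100, 101, 102, 250, 251, 400]))
```

    If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and also returns a p-value, which is probably the more practical choice outside a sketch.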

    Characterizing Structural Variation in Genomes (from humans to crops)

    Several bioinformatics methods have been proposed for the detection and characterization of genomic structural variation from ultra-high throughput genome resequencing data. Although some of these methods demonstrate reasonably high specificity, the sensitivity of available approaches is rather low. We propose a novel method for the identification of genomic structural variation from high throughput paired-end genome resequencing data. While utilizing deviations from expected library insert sizes, our approach employs additional information from local patterns of read mapping and supervised learning to predict the position and nature of structural variants. We show that our method achieves notably increased sensitivity at no cost in specificity with respect to existing insert size-based tools in the identification of structural variants in the human genome. Furthermore, we show that the additional information incorporated in our approach allows us to make reliable predictions of very short insertions and deletions that are otherwise only recovered by approaches based on the split mapping of resequencing reads.

    WebVar: a resource for the rapid estimation of relative site variability from multiple sequence alignments

    WebVar is an online resource that provides estimates of relative site variability from multiple alignments of homologous protein or nucleic acid sequences. WebVar provides a variety of graphic and textual representations of estimates, designed to assist in phylogenetic analysis

    Improved detection of intra-specific genomic structural variation using paired end high throughput resequencing data and Support Vector Machine

    Several bioinformatics methods have been proposed for the detection and characterization of genomic structural variation (SV) from ultra-high throughput genome resequencing data. Recent surveys show that comprehensive detection of SV events of different types between an individual resequenced genome and a reference sequence is best achieved through the combination of methods based on different principles (split mapping, reassembly, read depth, insert size, etc.). The improvement of individual predictors is thus an important objective. Here we propose a new method that combines deviations from expected library insert sizes and additional information from local patterns of read mapping and uses supervised learning to predict the position and nature of structural variants. We show that our approach provides greatly increased sensitivity with respect to other tools based on paired-end read mapping at no cost in specificity, and it makes reliable predictions of very short insertions and deletions in repetitive and low-complexity genomic contexts that can confound tools based on split mapping of reads.

    SVM2 : an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data

    Several bioinformatics methods have been proposed for the detection and characterization of genomic structural variation (SV) from ultra high-throughput genome resequencing data. Recent surveys show that comprehensive detection of SV events of different types between an individual resequenced genome and a reference sequence is best achieved through the combination of methods based on different principles (split mapping, reassembly, read depth, insert size, etc.). The improvement of individual predictors is thus an important objective. In this study, we propose a new method that combines deviations from expected library insert sizes and additional information from local patterns of read mapping and uses supervised learning to predict the position and nature of structural variants. We show that our approach provides greatly increased sensitivity with respect to other tools based on paired end read mapping at no cost in specificity, and it makes reliable predictions of very short insertions and deletions in repetitive and low-complexity genomic contexts that can confound tools based on split mapping of reads

    Phylogenetic analyses: a brief introduction to methods and their applications

    Phylogenetic analysis of molecular sequence data plays an increasingly important role in clinical medicine, both in the emerging field of molecular epidemiology and in the rational design of new therapeutic agents. The aims of this review are to introduce some of the methods used to construct phylogenetic trees, to illustrate some of the pitfalls that can introduce artifactual results and to speculate on the long-term importance of this area of computational biology in clinical medicine.

    Definition plant microRNA primary transcripts and their splicing patterns using RNAseq

    Motivation. The prediction of conserved mature microRNAs and their precursor hairpins has been addressed through several computational tools, while the detection of novel and lineage-specific microRNAs is typically approached through deep sequencing of small RNA species. However, a meaningful understanding of both the regulation of miRNA transcription and the potential roles of alternative splicing in post-transcriptional regulation of microRNA biogenesis requires accurate, high-throughput methods to describe primary microRNA transcript structure. Methods. Given that most primary miRNAs in plants are believed to be transcribed by RNA polymerase II, we reasoned that, despite the expected short physiological half-life of such species, ultra high-throughput sequencing of cDNA should provide evidence of primary miRNA transcripts and of their splicing. We tested this hypothesis using Illumina RNAseq data from the grapevine Vitis vinifera. Reads were mapped to the genome sequence and "islands" of transcription including known miRNA precursors were analysed in detail. All possible canonical splice junctions within such islands were generated computationally and used as targets for mapping of RNAseq reads that did not map to the genome sequence (reads potentially covering splice junctions). Results. We show that for many microRNA precursors, convincing estimates of primary transcript coordinates can be obtained from RNAseq data. Furthermore, estimates of splicing events obtained from our approach can often be validated experimentally. Our data suggest that splicing and alternative splicing of primary miRNAs may be widespread, at least in the grapevine, and that alternative splicing may represent a mechanism of post-transcriptional regulation of miRNA biogenesis.
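
    The computational generation of candidate splice junctions mentioned in this abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: "canonical" is taken to mean GT-AG introns on the forward strand only, and the function name `canonical_junctions` and the `flank`, `min_intron` and `max_intron` parameters (with their default values) are hypothetical. Each junction target is the sequence immediately upstream of the donor joined to the sequence immediately downstream of the acceptor, so that a read spanning the splice site can align to it.

```python
def canonical_junctions(seq, flank=30, min_intron=20, max_intron=None):
    """Enumerate candidate GT-AG junctions in one transcription island.

    Returns a list of (intron_start, intron_end, target) tuples, where the
    intron is seq[intron_start:intron_end] and `target` is the spliced
    sequence of `flank` bases on each side of the junction.
    """
    seq = seq.upper()
    # Start positions of every GT (donor) and AG (acceptor) dinucleotide
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]

    junctions = []
    for d in donors:
        for a in acceptors:
            intron_end = a + 2              # intron includes the AG itself
            intron_len = intron_end - d
            if intron_len < min_intron:
                continue                    # acceptor upstream of or too close to donor
            if max_intron is not None and intron_len > max_intron:
                continue
            if d < flank or intron_end + flank > len(seq):
                continue                    # not enough flanking sequence in the island
            target = seq[d - flank:d] + seq[intron_end:intron_end + flank]
            junctions.append((d, intron_end, target))
    return junctions


if __name__ == "__main__":
    island = "AAAAAGTCCCCAGTTTTT"           # toy island with a single 8-bp GT...AG intron
    print(canonical_junctions(island, flank=5, min_intron=4))
```

    The number of candidates grows with the product of donor and acceptor counts, so on real islands an upper bound on intron length is likely needed to keep the target set manageable.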

    Accurate discrimination of conserved coding and non-coding regions through multiple indicators of evolutionary dynamics

    Background: The conservation of sequences between related genomes has long been recognised as an indication of functional significance, and recognition of sequence homology is one of the principal approaches used in the annotation of newly sequenced genomes. In the context of recent findings that the number of non-coding transcripts in higher organisms is likely to be much higher than previously imagined, discrimination between conserved coding and non-coding sequences is a topic of considerable interest. Additionally, it is desirable to discriminate between coding and non-coding conserved sequences without recourse to sequence similarity searches of protein databases, as such approaches exclude the identification of novel conserved proteins without characterized homologs and may be influenced by the presence in databases of sequences that are erroneously annotated as coding. Results: Here we present a machine learning-based approach for the discrimination of conserved coding sequences. Our method calculates various statistics related to the evolutionary dynamics of two aligned sequences. These features are considered by a Support Vector Machine which designates the alignment coding or non-coding with an associated probability score. Conclusion: We show that our approach is both sensitive and accurate with respect to comparable methods and illustrate several situations in which it may be applied, including the identification of conserved coding regions in genome sequences and the discrimination of coding from non-coding cDNA sequences.

    Phylogenetic analyses suggest multiple changes of substrate specificity within the Glycosyl hydrolase 20 family

    Background: Beta-N-acetylhexosaminidases belonging to the glycosyl hydrolase 20 (GH20) family are involved in the removal of terminal β-glycosidically linked N-acetylhexosamine residues. These enzymes, widely distributed in microorganisms, animals and plants, are involved in many important physiological and pathological processes, such as cell structural integrity, energy storage, pathogen defence, viral penetration, cellular signalling, fertilization, development of carcinomas, inflammatory events and lysosomal storage diseases. Nevertheless, only limited analyses of phylogenetic relationships between GH20 genes have been performed until now. Results: Careful phylogenetic analyses of 233 inferred protein sequences from eukaryotes and prokaryotes reveal a complex history for the GH20 family. In bacteria, multiple gene duplications and lineage-specific gene loss (and/or horizontal gene transfer) are required to explain the observed taxonomic distribution. The last common ancestor of extant eukaryotes is likely to have possessed at least one GH20 family member. At least one gene duplication before the divergence of animals, plants and fungi, as well as other lineage-specific duplication events, have given rise to multiple paralogous subfamilies in eukaryotes. Phylogenetic analyses also suggest that a second, divergent subfamily of GH20 family genes present in animals derives from an independent prokaryotic source. Our data suggest multiple convergent changes of functional roles of GH20 family members in eukaryotes. Conclusion: This study represents the first detailed evolutionary analysis of the glycosyl hydrolase GH20 family. Mapping of data concerning physiological function of GH20 family members onto the phylogenetic tree reveals that apparently convergent and highly lineage-specific changes in substrate specificity have occurred in multiple GH20 subfamilies.