Using long reads to improve haplotype phasing, genome assembly, and gene annotation
Despite their accuracy, next-generation DNA sequencing technologies have limited utility in analyzing ambiguous and repetitive parts of the genome due to the short length of reads. Third-generation long-read DNA sequencing technologies, such as those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), allow us to explore much more of the genome and perform more comprehensive genomic analyses. However, new software must be developed for these analyses in order to take advantage of the increased read lengths while mitigating errors from base-level inaccuracies. In this thesis, I explore the advantages of long reads for haplotype phasing and genome assembly. I then use genome assemblies created from long reads to perform comparative genomics analyses, focusing on gene annotation of new, high-quality assemblies of primates and humans, including annotating the first fully complete human genome and a human pangenome containing over 90 distinct haplotypes.
Improving sequence alignment and variant calling through the process of population and pedigree-based graph alignment
In current sequencing methodology, a linear genome reference is used to detect genetic variants based on collections of sequence reads. The linear reference can cause misalignment of reads that do not exactly match the reference, or of reads from sequences whose copy number in the reference does not match the sample. This is known as reference bias. In the field of clinical genetics for rare diseases, the resulting reduction in genotyping accuracy in some regions has likely prevented the resolution of some cases. Pangenome graphs embed population variation into a reference structure to reduce reference bias, and further performance improvements are possible with the aid of pedigree information. In this dissertation I present my research on the methods developed to build programs that apply pangenome graphs to solve these problems. First, I share the work I’ve contributed towards streamlining a single-sample pangenome software workflow and the accuracy enhancements I’ve contributed within the pangenome effort. Next, I share my methods for incorporating pedigree information within the pangenome framework and show how performance is improved over standard pangenomes. I describe an extension of this work to demonstrate the clinical application of this workflow. Finally, I cover various projects I’ve contributed to that catalogue detected variants and use them for classifying deleterious variants.
Accurate genome analysis with nanopore sequencing using deep neural networks.
Nanopore sequencing, commercialized by Oxford Nanopore Technologies (ONT), is a high-throughput genome sequencing platform. Unlike traditional sequencing-by-synthesis methods, nanopore sequencing uses measured current signals to sense the nucleotide sequence flowing through the pore. The signal-to-base conversion process introduces unique error patterns, making it challenging to design methods that rely on hand-crafted features. Deep learning uses multiple layers to progressively learn complex patterns in the input data, making it suitable for genome analysis. In this dissertation research, I present methods I developed based on deep neural networks to improve genome inference with nanopore sequencing. First, I introduce the haplotype-aware variant calling pipeline PEPPER-Margin-DeepVariant, which produces state-of-the-art results for nanopore long reads. Next, I demonstrate a pipeline to perform de novo assembly of eleven human genomes in nine days. Then I show the application of the methods to validate and correct errors in the first complete human genome assembly. Finally, I demonstrate the utility of PEPPER-Margin-DeepVariant paired with highly multiplexed nanopore sequencing for rapidly identifying disease-causing variants.
Modification Detection using Nanopore Sequencing
Both DNA and RNA modifications play critical roles in cell regulation. Traditionally, a chemical selection process that alters base pairing or sequencing coverage is used to sequence modified nucleotides. Therefore, a new chemical labeling process needs to be created for each modification. Currently, we do not have methods for sequencing the majority of the over 150 RNA and over 40 DNA modifications. However, with nanopore sequencing, we can directly detect modifications on native DNA or RNA reads without any selection or chemical labeling techniques. Nanopore sequencing measures the change in current across a nanopore as a polynucleotide threads through the nanopore, and we can use this signal to identify modifications. In chapter 1, we present a framework for the unsupervised determination of the number of nucleotide modifications from nanopore sequencing readouts. We demonstrate that the approach can effectively recapitulate the number of modifications, the corresponding ionic current signal levels, and the mixing proportions in both DNA and RNA contexts. We further show that, by integrating information from multiple detected modification regions, the modification status of DNA and RNA molecules can be inferred. This method forms a key step in the de novo characterization of nucleotide modifications. In chapter 2, we present a graph convolutional network-based deep learning framework for predicting the mean of k-mer distributions from the corresponding chemical structures. We show that such a framework can generalize the chemical information of the 5-methyl group from thymine to cytosine by correctly predicting 5-methylcytosine-containing DNA 6-mers. In chapter 3, using a combination of yeast genetics and nanopore direct RNA sequencing, we have developed a reliable method to track the modification status of single rRNA molecules at 37 sites in 18S rRNA and 73 sites in 25S rRNA. We use our method to characterize patterns of modification heterogeneity and identify concerted modification of nucleotides found near functional centers of the ribosome. Distinct undermodified subpopulations of rRNAs accumulate when ribosome biogenesis is compromised by loss of Dbp3 or Prp43-related RNA helicase function. Modification profiles are surprisingly resistant to change in response to many genetic and environmental conditions that affect translation, ribosome biogenesis, and pre-mRNA splicing. The ability to capture complete modification profiles for RNAs at single-molecule resolution will provide new insights into the roles of nucleotide modifications in RNA function.
Methodological advancements for genome reconstruction by haplotyping long read sequence data
Second-generation sequencing technology and accompanying analyses resulted in a deluge of information about variation in human populations, enabling large-scale association studies and precision medicine. However, there are genomic contexts which cannot be analyzed using these technologies. With the advent of long-read sequencing, previously unmappable regions of the genome have become accessible, paving the way for more comprehensive analyses of the human genome. However, new methods are required to leverage the increased length of these data as well as mitigate the poor sequence accuracy. In this work, I present an accurate and efficient application, "Margin", which uses a Hidden Markov Model to separate read and variant data into haplotypes. I describe work to validate the method and show its applicability in variant calling; I demonstrate ways to overcome systematic errors in nanopore sequence data and correct assembled sequence; and I document the tool's use in a state-of-the-art variant caller for Oxford Nanopore and PacBio HiFi data used to generate reference materials and make medical diagnoses.
Data structures and algorithms for read mapping to pangenome graphs
The human reference genome is one of the most important foundational resources in biological research, but its utility as a reference for all people is limited due to its lack of diversity. Pangenomes are an alternative representation of genomes that incorporate genetic variations from a population of individuals.
Using a pangenome as a reference can mitigate the bias incurred by using the current standard reference genome, but because of the increased size and complexity of pangenomes, tools that use them tend to be slower and less reliable than tools that use standard references.
Mapping sequencing reads to a reference, the first step in many genomic pipelines, is a particularly challenging problem in a pangenome context.
In this dissertation, I present my work developing data structures and algorithms to support read mapping to pangenome graphs.
The pangenomic read mapping tools that I helped develop over the course of my PhD are as efficient as linear mappers and improve variant calling and genotyping results compared to standard tools.
They are among the first practical pangenome mappers and are paving the way for the emerging field of pangenomics.
Phasing Genome Assemblies and Phased Long Read Methylation Analysis
In this work, I first describe methods for ONT assembly phasing using parental information. Second, I outline the differences between long-read methylation calling technologies with regard to how they may still be used for differential analysis across technologies. In the third chapter, I outline a methylation analysis framework as a part of a nanopore-only pipeline for phased methylation calls, assembly, and small and structural variants. Finally, I present an analysis of the impact of structural variants on gene expression and methylation in a dataset of hundreds of prefrontal cortex brain tissue samples from the National Institutes of Health's Center for Alzheimer's and Related Dementias (NIH CARD).
Graph Methods for Computational Pangenomics
In most sequencing experiments, sequencing reads are mapped to a reference genome assembly in order to identify the genomic elements that the reads originated from. The mapping process becomes less accurate when the sample's genome differs from the reference genome. This introduces a pervasive reference bias in which genomics analyses are systematically less accurate for non-reference alleles. In the field of pangenomics, it has been proposed that more general reference structures could mitigate reference bias. The fundamental idea is to incorporate population variation into the reference itself. The result is naturally expressible as a sequence graph. This dissertation presents the research I performed to develop methods for graph-based pangenomic analyses. First, I describe a read mapping and inference pipeline to perform haplotype-resolved transcriptomic analyses using pangenomics techniques. Next, I describe several contributions I have made to the ecosystem of pangenomic software: an interface to conventional reference methods, a software library of pangenome graph data structures, and a usable interface for indexing pangenome graphs. Finally, I describe some applications of graph theory to pangenome graphs to perform practical pangenomics tasks: identifying sites of variation and converting overlapped sequence graphs to blunt ones.
Overcoming data privacy and data gravity challenges in bioinformatics research
Next-generation sequencing technologies have generated a massive amount of DNA, RNA, and protein sequences since their inception. However, data privacy policies often restrict sharing such data because of the risk of re-identifying the individuals from whom the sequences were generated. Even when all the data from a sequencing experiment is available, it is often insufficient for adequate statistical power or for training machine learning models. Ironically, despite this lack of data, the data sets are sometimes too large to share realistically with researchers. In this thesis, I explore methods to overcome challenges of data privacy and data gravity in bioinformatics research. In collaboration with QIMR Berghofer and the RIKEN Center for Integrative Medical Sciences, we used federated methods to analyze genomic data from the BioBank Japan in situ to classify variants of uncertain significance while preserving privacy. With the Department of Laboratory Medicine and Pathology at the University of Washington, we developed a statistical model that demonstrates how using responsibly shared clinical evidence alone can classify variants of uncertain significance which occur at the rate of 1 in 100,000 people within just a few years. With researchers from McGill University, we reviewed the state of the art in federated computing technologies and how well they satisfy the privacy restrictions from the General Data Protection Regulation. With researchers from NASA, Amazon, and Intel, we developed a federated learning framework to run between terrestrial and space-borne compute infrastructure, laying the groundwork for subsequent experiments which preclude the need to transfer large datasets across astronomical distances. Finally, at NASA, we used a causal inference machine learning ensemble to infer a robust correlation between mouse liver gene expression and a corresponding lipid density phenotype in space-flown mice.
Methods for Nanopore Genome Assembly and Phasing
In this work, I present methods which aim to facilitate the generation of diverse genomic assembly data, using primarily nanopore sequencing and the Shasta assembler. First, I describe methods for improving consensus quality in nanopore assemblies. Second, I present a method for transforming graph representations of assemblies. Finally, I describe a graph-based phasing method which enables accurate, de novo, chromosome-scale phasing with a total of 2 flow cells of nanopore sequence.
