Bayesian selection of nucleotide substitution models and their site assignments
Probabilistic inference of a phylogenetic tree from molecular sequence data is predicated on a substitution model describing the relative rates of change between character states along the tree for each site in the multiple sequence alignment. Commonly, one assumes that the substitution model is homogeneous across sites within large partitions of the alignment, assigns these partitions a priori, and then fixes their underlying substitution model to the best-fitting model from a hierarchy of named models. Here, we introduce an automatic model selection and model averaging approach within a Bayesian framework that simultaneously estimates the number of partitions, the assignment of sites to partitions, the substitution model for each partition, and the uncertainty in these selections. This new approach is implemented as an add-on to the BEAST 2 software platform. We find that this approach dramatically improves the fit of the nucleotide substitution model compared with existing approaches, and we show, using a number of example data sets, that as many as nine partitions are required to explain the heterogeneity in nucleotide substitution process across sites in a single gene analysis. In some instances, this improved modeling of the substitution process can have a measurable effect on downstream inference, including the estimated phylogeny, relative divergence times, and effective population size histories
Phenotypic Bayesian phylodynamics : hierarchical graph models, antigenic clustering and latent liabilities
Combining models for phenotypic and molecular evolution can lead to powerful inference tools. Under the flexible framework of Bayesian phylogenetics, I develop statistical methods to address phylodynamic problems in this intersection. First, I present a hierarchical phylogeographic method that combines information across multiple datasets to draw inference on a common geographical spread process. Each dataset represents a parallel realization of this geographic process on a different group of taxa, and the method shares information between these realizations through a hierarchical graph structure. Additionally, I develop a multivariate latent liability model for assessing phenotypic correlation among sets of traits, while controlling for shared evolutionary history. This method can efficiently estimate correlations between multiple continuous traits, binary traits and discrete traits with many ordered or unordered outcomes. Finally, I present a method that uses phylogenetic information to study the evolution of antigenic clusters in influenza. The method builds an antigenic cartography map informed by the assignment of each influenza strain to one of the antigenic clusters
Big Bayesian Phylogenetic Comparative Methods
Phylogenetic comparative methods seek to untangle the complex web of selective pressures driving biological evolution. These methods seek to identify associations between different biological traits over evolutionary history. Statistical models of phenotypic evolution need to account for the shared evolutionary history between different species, and accounting for this non-independence poses computational challenges. These challenges are compounded by missing observations, high-dimensional traits and highly structured data. Here, I develop computational and modeling approaches that dramatically improve the computational efficiency and scalability of these models to enable Bayesian phylogenetic comparative analysis of unprecedentedly large data sets. First, I develop an algorithm that analytically marginalizes missing observations in a (relatively) simple model of phenotypic evolution. This algorithm is broadly applicable beyond this simple model and allows scalable inference under a variety of model extensions. These extensions include models that accommodate residual variance, allowing measurement of phylogenetic heritability, and linear dimension reduction, allowing phylogenetic comparative analyses for high-dimensional traits. I combine this work into a generalizable modeling framework that allows researchers to build flexible, highly structured models that remain scalable for both large numbers of taxa and many observations per taxon. This work achieves increases in computation speed by more than two orders of magnitude across several contexts, bringing computation time down from weeks or months to minutes or hours in multiple real-world applications
General birth-death processes: probabilities, inference, and applications
A birth-death process is a continuous-time Markov chain that counts the number of particles in a system over time. Each particle can give birth to another particle or die, and the rate of births and deaths at any given time depends on how many extant particles there are. Birth-death processes are popular modeling tools in evolution, population biology, genetics, epidemiology, and ecology. Despite the widespread interest in birth-death models, no efficient method exists to evaluate the finite-time transition probabilities in a process with arbitrary birth and death rates. Statistical inference of the instantaneous particle birth and death rates also remains largely limited to continuously-observed processes in which per-particle birth and death rates are constant. The lack of theoretical progress in developing statistical tools for dealing with data from birth-death processes has hindered their adoption by applied researchers, and represents a major research frontier in statistical inference for stochastic processes. In this dissertation, I seek to fill this apparent void in three ways. First, I develop mathematical theory and computational tools for computing transition probabilities for general birth-death processes. Second, I develop algorithms for maximum likelihood estimation of rate parameters in discretely observed processes. Third, I derive probability distributions for characteristics of certain birth-death models that are fundamental in macroevolutionary studies. In each case, I give practical applications of the methodology, and show how unsolved problems can be attacked using these techniques
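A minimal sketch of the process described above, assuming nothing about the dissertation's own algorithms: it simulates a general birth-death chain with the standard Gillespie method and estimates a finite-time transition probability by Monte Carlo. The function names, the rate callables, and the example rates are hypothetical, and the death rate is assumed to be zero when no particles remain.

```python
import random


def simulate_birth_death(n0, t_end, birth_rate, death_rate, rng):
    """Gillespie simulation: return the particle count at time t_end.

    birth_rate(n) and death_rate(n) give the total birth and death rates
    when n particles are present; death_rate(0) is assumed to be 0.
    """
    n, t = n0, 0.0
    while True:
        lam, mu = birth_rate(n), death_rate(n)
        total = lam + mu
        if total <= 0.0:
            return n  # absorbing state: no further events can occur
        t += rng.expovariate(total)  # exponential waiting time to next event
        if t > t_end:
            return n  # the next event falls after the observation time
        # Pick a birth or a death in proportion to the two rates.
        n += 1 if rng.random() < lam / total else -1


def estimate_transition_prob(n0, n1, t_end, birth_rate, death_rate,
                             reps=20000, seed=1):
    """Monte Carlo estimate of P(X(t_end) = n1 | X(0) = n0)."""
    rng = random.Random(seed)
    hits = sum(
        simulate_birth_death(n0, t_end, birth_rate, death_rate, rng) == n1
        for _ in range(reps)
    )
    return hits / reps


if __name__ == "__main__":
    # Illustrative linear rates: each particle splits at 0.5 and dies at 0.3.
    p = estimate_transition_prob(1, 2, 1.0, lambda n: 0.5 * n, lambda n: 0.3 * n)
    print(f"estimated P(1 -> 2 by t = 1): {p:.3f}")
```

Simulation of this kind only approximates the transition probabilities and becomes expensive for rare transitions, which is one reason exact numerical methods are of interest.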
Scalable Inference in Bayesian Phylogenetics
Phylogenetic models with lineage-specific parameter characterizations provide a flexible framework to model ancestral changes in diffusion and evolution processes. However, increased taxonomic sampling challenges inference under these models as the number of unknown parameters grows with the number of taxa. To solve this problem, I develop scalable inference machinery as well as scalable models to permit the study of increasingly massive trees within a Bayesian phylogenetic framework. First, I introduce a method to compute the gradient of the trait data log-likelihood of the popular relaxed random walk model of trait diffusion with computational complexity that is linear in the number of tips in the tree. I use this gradient to build an efficient Hamiltonian Monte Carlo (HMC) sampler that simultaneously samples all branch-specific model parameters with high acceptance probability. Next, I propose a new, auto-correlated molecular clock rate model together with scalable inference methods. My approach permits estimating both the presence and location of local clocks without a priori knowledge of their placement and avoids inordinately shrinking clock rates. Finally, I develop a shrinkage-based adaptive shift model that automatically detects the number and placement of shifts in adaptive trait optima along a tree. Leveraging recent fast closed-form gradient calculations, I build an efficient HMC sampler that scales inference under this new model. I demonstrate the speed and utility of each method via a range of applications, including the study of viral evolution and phenotypic trait data
Bayesian Modeling of Viral Phylodynamics
Viral phylodynamics is the study of how immunodynamics, epidemiology, and evolutionary processes act and interact to shape viral phylogenies. We build upon the foundation of Bayesian phylogenetic inference to develop statistical tools to address phylodynamic problems. First, we present a flexible nonparametric Bayesian framework to infer the effective population size as a function of time directly from molecular sequence data. The effective population size is an abstract quantity that characterizes a population's genetic diversity, and it is of fundamental interest in population genetics, conservation biology, and infectious disease epidemiology. Our model is based on the coalescent, a stochastic process that relates phylogenies to population dynamics. We enforce temporal smoothing of inferred trajectories via a Gaussian Markov random field prior. Notably, our framework incorporates data from multiple genetic loci to achieve improved inference of population dynamics. Next, we turn to phylogenetic trait evolution. Modeling the processes giving rise to nonsequence traits associated with molecular sequence data is crucial in comparative studies of phenotypic traits as well as in phylogeographic analyses that reconstruct the spatiotemporal spread of viruses. A popular, yet restrictive approach to modeling such processes is Brownian diffusion along a phylogeny. We relax a major restriction by introducing a nontrivial estimable drift vector into the Brownian diffusion. Importantly, we implement a relaxed drift process that permits the drift vector to vary along the phylogeny. We showcase improved trait evolutionary inference in three viral examples. Finally, we return to effective population size inference and extend our framework to include covariates, enabling modeling of associations between past population dynamics and external factors. We apply our model to four examples. 
We reconstruct the demographic history of raccoon rabies in North America and find a significant association with the spatiotemporal spread of the outbreak. Next, we examine the effective population size trajectory of the DENV-4 virus in Puerto Rico along with viral isolate count data and find similar cyclic patterns. We compare the population history of the HIV-1 CRF02_AG clade in Cameroon with HIV incidence and prevalence data and find that the effective population size is more reflective of incidence rate. Finally, we explore the hypothesis that the population dynamics of musk ox during the Late Quaternary period were related to climate change. Incorporating covariates into the demographic inference framework enables the modeling of associations between the effective population size and covariates while accounting for uncertainty in population histories. Furthermore, it can lead to more precise estimates of population dynamics
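The Brownian diffusion with drift mentioned in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: the toy tree, the node names, and the function `simulate_trait` are hypothetical, the trait is one-dimensional, and the drift is held constant across branches rather than relaxed as in the abstract. Along a branch of length t, the child's trait is drawn from a normal distribution with mean equal to the parent's value plus drift times t and variance sigma squared times t.

```python
import random

# Hypothetical toy tree: child -> (parent, branch_length).
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 0.5),
    "C": ("root", 1.5),
}


def simulate_trait(tree, root_value, drift, sigma, rng):
    """Simulate one scalar trait down the tree under Brownian motion with drift."""
    values = {"root": root_value}

    def value_of(node):
        # Recursively simulate the parent first, then the node itself.
        if node not in values:
            parent, length = tree[node]
            mean = value_of(parent) + drift * length
            values[node] = rng.gauss(mean, sigma * length ** 0.5)
        return values[node]

    for node in tree:
        value_of(node)
    return values


if __name__ == "__main__":
    traits = simulate_trait(TREE, root_value=0.0, drift=2.0, sigma=1.0,
                            rng=random.Random(42))
    for name in ("A", "B", "C"):
        print(name, round(traits[name], 3))
```

Setting `drift` to zero recovers plain Brownian diffusion, the more restrictive model that the abstract sets out to relax.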
Large-scale Inference of Correlation between Complex Biological Traits
Inferring dependencies between complex biological traits while accounting for evolutionary relationships among specimens is of great scientific interest, yet remains infeasible when trait and specimen counts grow large. I aim to develop a scalable Bayesian inference framework to assess correlation between complex traits along the evolutionary tree relating the specimens and informed by molecular sequences. To accommodate discrete and continuous traits, I posit a phylogenetic multivariate probit model that uses a latent variable framework. Posterior computation under this model requires integrating many latent variables, or equivalently making many computationally expensive draws from a high-dimensional multivariate truncated normal distribution (MTN). To tackle this challenge, I propose an inference scheme that exploits 1) representative cutting-edge Markov chain Monte Carlo (MCMC) methods including the bouncy particle sampler (BPS), the Markovian Zigzag sampler (ZZ), and the Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) that can simultaneously sample all truncated normal dimensions, and 2) novel dynamic programming strategies that reduce the cost of likelihood and gradient evaluations for all three samplers to linear in sample size. Compared to the previous best practices that employ multiple-try rejection sampling, my approach achieves an order-of-magnitude speedup, allowing us to tackle previously unworkable large-scale problems. In an application with 535 HIV-1 viruses and 24 traits that necessitates sampling from an 11,235-dimensional MTN, my method makes it possible to examine the conditional dependencies between 21 immune escape mutations and 3 virulence measurements. In a second application I study the evolution of influenza H1N1 glycosylations using around 900 viruses. Lastly, I extend the phylogenetic probit model to incorporate categorical traits and demonstrate its use to investigate Aquilegia flower and pollinator coevolution. 
In summary, the contribution of this dissertation is two-fold. First, I develop a state-of-the-art solution for the long-standing problem in Bayesian phylogenetics: learning correlation among complex biological traits with joint tree modeling. Second, further empirical and theoretical investigation of BPS, ZZ, and Zigzag-HMC yields insight into the differences and similarities between these recently developed MCMC samplers. As Zigzag-HMC outperforms the other two on MTNs, I also implement this approach in a standalone R package, aiming to provide a general efficient tool for high-dimensional MTN simulation
Translated consent documents rarely used in non-industry sponsored studies
Patients from historically underrepresented racial and ethnic groups are enrolled in cancer clinical trials at disproportionately low rates in the United States [1-3]. As these patients often have limited English proficiency [4-7], we hypothesized that one barrier to their inclusion is the cost to investigators of translating consent documents. To test this hypothesis, we evaluated more than twelve thousand consent events at a large Cancer Center and assessed whether patients requiring translated consent documents would sign consent documents less frequently in studies lacking industry sponsorship (for which the principal investigator pays translation costs) than in industry-sponsored studies (for which this cost is covered by the sponsor). Here, we show that the proportion of consent events for patients with limited English proficiency in studies not sponsored by industry was approximately half of that seen in industry-sponsored studies. We also show that among those signing consent documents, the proportion of consent documents translated into the patient’s primary language in studies without industry sponsorship was approximately half of that seen in industry-sponsored studies. Our results suggest that the cost of consent document translation in trials not sponsored by industry is a potentially modifiable barrier to the inclusion of patients with limited English proficiency
Phylogenetic Factor Analysis and Natural Extensions
Frequently in evolutionary biology we are interested in how different quantitative traits of an organism evolve together over time. In order to properly understand these relationships, we need to adjust for the shared evolutionary history of these organisms. Previous methods rely on modeling quantitative traits as undergoing a high-dimensional, correlated multivariate Brownian diffusion (MBD) down a phylogenetic tree. In order to present a more nuanced approach to understanding these trait relationships, we develop a phylogenetic factor analysis (PFA) model on these quantitative traits by assuming that the relatively low-dimensional factors, rather than the traits themselves, undergo independent Brownian diffusion down a phylogenetic tree. Additionally, we develop a novel method for inferring the marginal likelihood estimates of probit models which allows for accurate model selection in the presence of discrete data. We demonstrate using Bayes factors that this PFA model is a more probable model than the MBD model. We then continue to develop this PFA method by relying on a shrinkage prior on the loadings matrix. This shrinkage prior consists of a normal prior with a global and local standard deviation component, and a half-Cauchy prior on these standard deviation components. With this we can distinguish trait relationships which would otherwise remain hidden using a standard normal prior on the loadings. Lastly, when we wish to incorporate a large number of taxa in our MBD and PFA models, obtaining a complete suite of measurements is difficult. These missing measurements make these analyses relatively inefficient and difficult to use for larger problems. To rectify this, we develop a method by which we can evaluate the likelihood of an MBD model by analytically integrating out missing values, and then apply similar principles to integrate out the factors in a PFA model. These innovations allow for a massive speedup in our inference
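One plausible way to write down the model described above, offered as a sketch rather than the authors' exact specification. All notation is assumed here: Y is the N × P trait matrix for N taxa, F the N × K factor matrix with K < P, L the K × P loadings matrix, Ψ the N × N covariance among taxa implied by shared branch lengths on the tree, γ_j a residual precision for trait j, and τ and λ_kj the global and local standard deviations of the shrinkage prior. A zero root mean and unit-scale half-Cauchy priors are further assumptions.

```latex
\begin{aligned}
Y &= F L + \varepsilon,
  & \varepsilon_{ij} &\sim \mathcal{N}\!\left(0, \gamma_j^{-1}\right), \\
F_{\cdot k} &\sim \mathcal{N}(0, \Psi)
  & &\text{independently for } k = 1, \dots, K, \\
L_{kj} \mid \tau, \lambda_{kj} &\sim \mathcal{N}\!\left(0, \tau^{2} \lambda_{kj}^{2}\right),
  & \tau,\ \lambda_{kj} &\sim \mathrm{C}^{+}(0, 1).
\end{aligned}
```

Under this reading, the MBD model places the tree-structured covariance on all P traits jointly, whereas PFA places it on only K independent factor columns and lets the loadings carry the trait relationships.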
Timely Treatment of Severe Maternal Hypertension and Reduction in Severe Maternal Morbidity
Objective: To determine if timely treatment within 60 minutes of confirmed diagnosis of severe maternal hypertension with antihypertensive medications was associated with reduction in severe maternal morbidity. Methods: Medical records of women with severe hypertension (at least two severe blood pressures, systolic ≥160 mmHg and/or diastolic ≥110 mmHg, within 60 minutes) were accessed for timing of severe blood pressures, timing of treatment, and blood pressure response to treatment. Severe maternal morbidity was confirmed by multidisciplinary case review. We compared the incidence of severe maternal morbidity between women who received timely (within 60 minutes of diagnosis) vs. not-timely treatment. Results: Of 465 women with severe hypertension, 29 (6.2%) experienced severe maternal morbidity. Fifty-six percent of women received timely treatment, of whom 1.9% had severe maternal morbidity, compared with 6.4% of women who did not receive timely treatment (p=0.02). Timely treatment was associated with a 72% reduction in relative odds of severe maternal morbidity (p=0.02). No significant difference was seen in median pre-treatment systolic pressures (p=0.20) between the groups. Conclusion: Antihypertensive treatment within 60 minutes of confirmed diagnosis of severe hypertension was associated with reduction in severe maternal morbidity. Our findings support current recommendations to treat all women with severe hypertension with antihypertensive medications in a timely fashion
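The reported reduction in relative odds can be checked approximately from the two rounded percentages in the Results. This is a back-of-the-envelope sketch, assuming an unadjusted odds ratio; the study's own estimate may come from exact counts or an adjusted model, and the variable names are illustrative.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Rounded morbidity proportions quoted in the abstract.
timely = 0.019      # severe maternal morbidity with timely treatment
not_timely = 0.064  # severe maternal morbidity without timely treatment

odds_ratio = odds(timely) / odds(not_timely)
reduction = 1.0 - odds_ratio  # relative reduction in odds

if __name__ == "__main__":
    print(f"odds ratio: {odds_ratio:.3f}")
    print(f"reduction in relative odds: {reduction:.1%}")
```

With these inputs the odds ratio is roughly 0.28, which is consistent with the 72% reduction stated in the abstract.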
