1,721,096 research outputs found
Fast and accurate methods for phylogenomic analyses
Species phylogenies are not estimated directly, but rather through phylogenetic analyses of different gene datasets. However, true gene trees can differ from the true species tree (and hence from one another) due to biological processes such as horizontal gene transfer, incomplete lineage sorting, and gene duplication and loss, so that no single gene tree is a reliable estimate of the species tree. Several methods have been developed to estimate species trees from estimated gene trees, differing according to the specific algorithmic technique used and the biological model used to explain differences between species and gene trees. Relatively little is known about the relative performance of these methods. Results: We report on a study evaluating several different methods for estimating species trees from sequence datasets, simulating sequence evolution under a complex model including indels (insertions and deletions), substitutions, and incomplete lineage sorting. The most important finding of our study is that some fast and simple methods are nearly as accurate as the most accurate methods, which employ sophisticated statistical methods and are computationally quite intensive. We also observe that methods that explicitly consider errors in the estimated gene trees produce more accurate trees than methods that assume the estimated gene trees are correct. Conclusions: Our study shows that highly accurate estimations of species trees are achievable, even when gene trees differ from each other and from the species tree, and that these estimations can be obtained using fairly simple and computationally tractable methods.Computer Scienc
Disk covering methods improve phylogenomic analyses
Motivation: With the rapid growth rate of newly sequenced genomes, species tree inference from multiple genes has become a basic bioinformatics task in comparative and evolutionary biology. However, accurate species tree estimation is difficult in the presence of gene tree discordance, which is often due to incomplete lineage sorting (ILS), modelled by the multi-species coalescent. Several highly accurate coalescent-based species tree estimation methods have been developed over the last decade, including MP-EST. However, the running time for MP-EST increases rapidly as the number of species grows. Results: We present divide-and-conquer techniques that improve the scalability of MP-EST so that it can run efficiently on large datasets. Surprisingly, this technique also improves the accuracy of species trees estimated by MP-EST, as our study shows on a collection of simulated and biological datasets.NSF DEB 0733029, DBI 1062335Computer Scienc
BBCA: Improving the scalability of *BEAST using random binning
Species tree estimation can be challenging in the presence of gene tree conflict due to incomplete lineage sorting (ILS), which can occur when the time between speciation events is short relative to the population size. Of the many methods that have been developed to estimate species trees in the presence of ILS, *BEAST, a Bayesian method that co-estimates the species tree and gene trees given sequence alignments on multiple loci, has generally been shown to have the best accuracy. However, *BEAST is extremely computationally intensive so that it cannot be used with large numbers of loci; hence, *BEAST is not suitable for genome-scale analyses. Results: We present BBCA (boosted binned coalescent-based analysis), a method that can be used with *BEAST (and other such co-estimation methods) to improve scalability. BBCA partitions the loci randomly into subsets, uses *BEAST on each subset to co-estimate the gene trees and species tree for the subset, and then combines the newly estimated gene trees together using MP-EST, a popular coalescent-based summary method. We compare time-restricted versions of BBCA and *BEAST on simulated datasets, and show that BBCA is at least as accurate as *BEAST, and achieves better convergence rates for large numbers of loci. Conclusions: Phylogenomic analysis using *BEAST is currently limited to datasets with a small number of loci, and analyses with even just 100 loci can be computationally challenging. BBCA uses a very simple divide-and-conquer approach that makes it possible to use *BEAST on datasets containing hundreds of loci. This study shows that BBCA provides excellent accuracy and is highly scalable.Grant Agency of the Czech Republic P501-10-0208Academy of Sciences of the Czech Republic AVOZ50040507, AVOZ50040702, MSMT LC0604Ministry of Innovation and Science of Spain, MICINN CGL2007-64839-C02/BOSCSIC (Superior Council of Scientific InvestigationsJosé Castillejo Grant from the MICINN of the Spanish GovernmentComputer Scienc
Disjoint Tree Mergers for large-scale maximum-likelihood tree estimation
Embargo set by: Seth Robbins for item 118556
Lift date: 2023-09-17T02:34:57Z
Reason: Author requested U of Illinois access only (OA after 2yrs) in Vireo ETD systemAuthor requested U of Illinois access only (OA after 2yrs) in Vireo ETD systemU of I OnlyGene tree estimation is a biological problem that garners a lot of interest due to its ability to uncover evolutionary relationships in different genes which provides valuable insight into the hidden mechanisms of evolution. However, large-scale gene tree estimation has largely been unexplored, partially due to the limited available methods that can run on large datasets. Instead, a lot of effort has been focused on developing methods that are accurate on small to medium-sized datasets. We present a re-evaluation of divide-and-conquer pipelines on a variety of model conditions, including fragmentary sequences and heterogeneous evolution patterns, and show that our design of divide-and-conquer pipelines can consistently match or outperform FastTree and IQ-TREE with little overhead in runtime while matching or outperforming RAxML except on small datasets with very challenging model conditions. Furthermore, our divide-and-conquer pipeline is able to run on datasets that are too large for IQ-TREE or RAxML to handle.Submission published under a 24 month embargo labeled 'U of I Access', the embargo will last until 2023-05-01The student, Minhyuk Park, accepted the attached license on 2021-04-21 at 12:42.The student, Minhyuk Park, submitted this Thesis for approval on 2021-04-21 at 12:43.This Thesis was approved for publication on 2021-04-23 at 16:42.DSpace SAF Submission Ingestion Package generated from Vireo submission #16464 on 2021-09-16 at 17:04:17Made available in DSpace on 2021-09-17T02:34:41Z (GMT). No. of bitstreams: 2
PARK-THESIS-2021.pdf: 370186 bytes, checksum: 33203f5ea5bfbb0ce0f610d63924768f (MD5)
LICENSE.txt: 4209 bytes, checksum: 2128a1547ead52147cb6ec6de1f791fc (MD5)
Previous issue date: 2021-04-2
Recommended from our members
Large-scale phylogenetic analysis
textThe phylogeny problem is to reconstruct the phylogenetic tree in which
the leaves are labeled by the taxa we are interested in, and the internal nodes
are ancestral taxa. Recent advances in molecular biology and genomics have
provided biologists with molecular data at an unprecedented rate and scale;
in particular whole genome data for more and more species. First, the number
of possible phylogenetic trees grows superexponentially with the increase
of the number of species being studied. Second, detailed sequence data for
each species usually convey conflict. Third, more species usually means more
evolutionary events along the evolutionary tree. This usually leads to highly
saturated data, which are difficult to analyze in general.
In this thesis I present two possible approaches to solve this difficulty.
The first approach is to use genome rearrangement evolution, an evolutionary
process that has lower evolutionary rate than DNA sequence evolution. The
second approach is to process multiple trees returned by tree reconstruction
algorithms by applying clustering methods.Computer Scienc
Recommended from our members
Algorithmic techniques for improving the speed and accuracy of phylogenetic methods
textComputer Scienc
Recommended from our members
Phylogenetic networks
textPhylogenies, i.e., evolutionary histories, play a major role in representing
the relationships among groups of entities, such as species, genes, and
languages. Their pervasiveness has led scientists, mainly biologists, mathematicians,
and computer scientists, to develop methods and tools for their
accurate reconstruction. Most of these tools, however, assume that the underlying
model of speciation is a tree. While a good first approximation, trees fail
to model the evolutionary histories in the presence of complex evolutionary
events, such as lateral gene transfer and hybrid speciation among biological
entities, and borrowing of linguistic features among natural languages. These
events lead to “networks”, rather than trees, of relationships.
In this dissertation, we present two methodologies for reconstructing
phylogenetic networks. In the biological context, our method is based on the
observation that contained within the branches of a (species) phylogenetic
networks are phylogenetic trees that model the evolution of individual genes.
To study the accuracy of our new method, as well as existing methods, we
have developed a suite of simulation tools and error measures. Our simulation
studies show a clear outperformance of existing methods.
In historical linguistics, we extend the Ringe-Warnow model of language
evolution, to incorporate non-treelike evolutionary events; our new methodology
is called “perfect phylogenetic networks”. We have implemented a reconstruction
method, based on the new methodology, and analyzed a dataset of
Indo-European languages.Computer Scienc
Improving the accuracy of community detection methods using connectivity modifier
Submission published under a 24 month embargo labeled 'U of I Access', the embargo will last until 2025-12-01The student, Seyedeh Yasamin Tabatabaee, accepted the attached license on 2023-11-29 at 10:04.The student, Seyedeh Yasamin Tabatabaee, submitted this Thesis for approval on 2023-11-29 at 10:32.This Thesis was approved for publication on 2023-12-04 at 13:13.DSpace SAF Submission Ingestion Package generated from Vireo submission #20056 on 2024-03-01 at 13:31:33Community detection algorithms are commonly used to recover the community structure of complex networks. To evaluate the accuracy of these algorithms and compare them against each other, one would need to apply them to networks with known community structure. However, the community structure of real-world networks is usually not known, and hence synthetic networks with ground-truth communities are used for benchmarking these algorithms. The most widely adopted synthetic networks for the evaluation of community detection methods are the Lancichinetti-Fortunato-Radicchi (LFR) benchmark graphs. Here we develop a pipeline for creating LFR graphs that emulate the characteristics of given real-world networks and their clusterings. While our study shows that these LFR graphs almost perfectly match some characteristics of the real-world networks they attempt to emulate, there are striking differences among their other properties. We also evaluate the recently introduced Connectivity Modifier (CM) algorithm, a meta-method for ensuring well-connectedness of clusters outputted by community detection methods, on these empirical networks and LFR graphs. Our results show that while CM reduces node coverage, it improves the accuracy of Leiden algorithm optimizing modularity or the Constant Potts model (CPM) in many model conditions
Improving gene trees without more data
Species tree and gene tree estimation from sequence data are two steps in many biological analyses. Computational challenges and limited amount of data often make estimating highly accurate phylogenetic trees a difficult task. Moreover, gene alignments used to estimate trees on individual loci often have low phylogenetic signal (e.g., short alignment length), resulting in poorly estimated gene trees. Species tree estimation on the other hand is challenged by individual loci having different evolutionary histories caused by a biological phenomenon known as incomplete lineage sorting (ILS). In the presence of ILS, summary methods like MP-EST, ASTRAL2, and ASTRID are often used to estimate the species tree from gene trees. Summary methods operate by combining estimated gene trees and thus suffer in the presence of low phylogenetic signal. To tackle this problem the Statistical Binning and Weighted Statistical Binning pipelines were designed to improve gene tree estimation, which in turn can improve species tree estimation. Experimental studies of these pipelines revealed that they helped in improving gene tree and species tree estimation. However, these studies only tested the weighted statistical binning and statistical binning pipelines using multi-locus bootstrapping (MLBS) and not using BestML, where MLBS and BestML are different ways to run a phylogenetic pipeline. In this thesis, a novel phylogenetic pipeline named WSB+WQMC is proposed. This pipeline shares several design features with the weighted statistical binning pipeline (referred as WSB+CAML in this thesis) but has some other desirable properties. The WSB+WQMC pipeline is also shown to be statistically consistent under the GTR+MSC model when a slightly different version of WQMC is used.
In this study WSB+WQMC was evaluated and compared with the WSB+CAML pipeline on various simulated datasets using BestML analysis. Most of the trends seen in MLBS analyses were also observed for WSB+WQMC and WSB+CAML in BestML analyses with some important differences. It is shown that WSB+WQMC substantially improved the accuracy of gene tree and species tree estimation using ASTRAL2 and ASTRID on most datasets having low, medium, and moderately high levels of ILS. Compared to WSB+CAML, it was found that WSB+WQMC computed less accurate gene trees and species trees in certain model conditions having low and medium levels of ILS. However, WSB+WQMC was found to be better and at least as accurate as WSB+CAML in computing gene trees and species trees on all datasets having moderately high and high ILS levels. WSB+WQMC is also shown to be better in estimating gene trees on certain medium and low ILS datasets. Thus, WSB+WQMC is a potential alternative to WSB+CAML for gene tree and species tree estimation in the presence of low phylogenetic signal.Submission original under an indefinite embargo labeled 'Open Access'. The submission was exported from vireo on 2016-07-07 without embargo termsThe student, Ashu Gupta, accepted the attached license on 2016-04-27 at 23:07.The student, Ashu Gupta, submitted this Thesis for approval on 2016-04-27 at 23:16.This Thesis was approved for publication on 2016-04-28 at 10:16.DSpace SAF Submission Ingestion Package generated from Vireo submission #9565 on 2016-07-07 at 13:33:46Made available in DSpace on 2016-07-07T19:58:20Z (GMT). No. of bitstreams: 2
GUPTA-THESIS-2016.pdf: 1031663 bytes, checksum: da4eba5e321848e6fd3b0bbacdc98795 (MD5)
LICENSE.txt: 4207 bytes, checksum: 365ce749507402d69c5299ff78cbfc80 (MD5)
Previous issue date: 2016-04-2
Computing Robinson-Foulds supertree for two trees
Made available in DSpace on 2019-11-26T20:35:09Z (GMT). No. of bitstreams: 2
YU-THESIS-2019.pdf: 720940 bytes, checksum: 98f79d5c66de551f7b969ae6400fe54f (MD5)
LICENSE.txt: 4205 bytes, checksum: 82c2b9e463c4f1a6ca91c6e1f16d7416 (MD5)
Previous issue date: 2019-07-15Supertree problems are important in phylogeny estimation. Supertree construction takes in a set of input trees on subsets of species and aims to find a supertree containing all species subjective to some combinatorial or statistical criterion. As such, it can be used to combine trees estimated by different research projects, or to construct species trees from gene trees that may not contain all species, or to serve a part in divide-and-conquer pipelines that improve the scalability of large scale phylogeny estimation. Yet the most promising supertree methods, such as the popular Robinson-Foulds Supertree (RFS) methods, not only cannot guarantee an optimal solution but also are computationally intensive by themselves, as they are heuristics for NP-hard optimization problems.
We present the first polynomial time algorithm to exactly solve the RFS problem on two binary input trees, and prove that finding the Robinson-Foulds Supertree of three input trees is NP-hard. We present GreedyRFS, a greedy heuristic for the Robinson-Foulds Supertree problem that operates by using our exact algorithm for RFS on pairs of trees, until all the trees are merged into a single supertree. Our experiments show that GreedyRFS has better accuracy than FastRFS, the leading heuristic for RFS, when the number of input trees is small, which is the natural case for use within divide-and-conquer pipelines.Submission original under an indefinite embargo labeled 'Open Access'. The submission was exported from vireo on 2019-11-26 without embargo termsThe student, Xilin Yu, accepted the attached license on 2019-07-15 at 15:53.The student, Xilin Yu, submitted this Thesis for approval on 2019-07-15 at 15:59.This Thesis was approved for publication on 2019-07-15 at 16:49.DSpace SAF Submission Ingestion Package generated from Vireo submission #14324 on 2019-11-26 at 12:53:4
- …
