1,721,079 research outputs found
An asymmetric approach to preserve common intervals while sorting by reversals
Dias Vieira Braga M, Gautier C, Sagot M-F. An asymmetric approach to preserve common intervals while sorting by reversals. Algorithms for Molecular Biology. 2009;4(1):16.Background: The reversal distance and optimal sequences of reversals to transform a genome into another are useful tools to analyse evolutionary scenarios. However, the number of sequences is huge and some additional criteria should be used to obtain a more accurate analysis. One strategy is searching for sequences that respect constraints, such as the common intervals (clusters of co-localised genes). Another approach is to explore the whole space of sorting sequences, eventually grouping them into classes of equivalence. Recently both strategies started to be put together, to restrain the space to the sequences that respect constraints. In particular an algorithm has been proposed to list classes whose sorting sequences do not break the common intervals detected between the two inital genomes A and B. This approach may reduce the space of sequences and is symmetric (the result of the analysis sorting A into B can be obtained from the analysis sorting B into A). Results: We propose an alternative approach to restrain the space of sorting sequences, using progressive instead of initial detection of common intervals (the list of common intervals is updated after applying each reversal). This may reduce the space of sequences even more, but is shown to be asymmetric. Conclusions: We suggest that our method may be more realistic when the relation ancestor-descendant between the analysed genomes is clear and we apply it to do a better characterisation of the evolutionary scenario of the bacterium Rickettsia felis with respect to one of its ancestors
Further thoughts on the syntenic distance between genomes
The syntenic distance between two multi-chromosomal genomes is the minimum number of fusion (union), fission (split), and translocation (exchange) operations necessary to transform a genome into another. Genomes are therefore seen as unlabeled bags of genes and the syntenic distance as the number of moves needed to obtain the same bags. This paper introduces the following new results concerning syntenic distance: • Using a direct proof, we solve a conjecture by Liben-Nowell [1] for the syntenic diameter size, that is, for the length of the longest optimal move sequence for an instance of size n, showing that it is, in fact, not In -3 but 2n -4 (an independent and different proof of this was provided by Kleinberg and Liben-Nowell in [2]). The structure of the proof further enables us to characterize the minimum number of translocations an optimal sequence of moves must have. This has important consequences for algorithmical purposes. The result is then generalized to what we call bidimensional syntenic diameter. • When the initial instance has more than one component, we suggest a way of performing inter-component moves, something that was not done in previous algorithms. We show however that, even in simple cases with more than two components involved, an optimal solution for a general instance of the problem using such moves is NP-hard to obtain. • We introduce, we believe, a simpler, more general and efficient algorithm for obtaining a move sequence that is optimal for the cases that Liben-Nowell and Kleinberg called in [3] nested synteny, exact and linear (this last one is the case addressed in [3]). • We sketch an extension of the algorithm which we conjecture may optimally solve the connected nested synteny (without requiring linearity). • We prove an inclusion theorem for the general synteny that is an extension of the one shown for linear synteny in [3]
RISOTTO
Software per la Ricerca di Motivi Strutturati che rappresentano Binding Site di Fattori di Trascrizione su Sequenze di DN
Geometric medians in reconciliation spaces of phylogenetic trees
In evolutionary biology, it is common to study how various entities evolve together, for example, how parasites coevolve with their host, or genes with their species. Coevolution is commonly modelled by considering certain maps or reconciliations from one evolutionary tree P to another H, all of which induce the same map φ between the leaf-sets of P and H (corresponding to present-day associations). Recently, there has been much interest in studying spaces of reconciliations, which arise by defining some metric d on the set R(P,H,φ) of all possible reconciliations between P and H. In this paper, we study the following question: How do we compute a geometric median for a given subset Ψ of R(P,H,φ) relative to d, i.e. an element ψmed∈R(P,H,φ) such that ∑ψ′∈Ψd(ψmed,ψ′)≤∑ψ′∈Ψd(ψ,ψ′) holds for all ψ∈R(P,H,φ)? For a model where so-called host-switches or transfers are not allowed, and for a commonly used metric d called the edit-distance, we show that it is possible to compute a geometric median for a set Ψ in R(P,H,φ) in polynomial time. We expect that this result could open up new directions for computing a consensus for a set of reconciliations
Bases of Motifs for Generating Repeated Patterns with Wild Cards
Motif inference represents one of the most important areas of research in computational biology, and one of its oldest ones. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to get at a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus, smaller) than previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope to efficiently compute such bases unless the quorum is fixed
A Basis of Tiling Motifs for Generating Repeated Patterns and its Complexity for Higher Quorum
We investigate the problem of determining the basis of motifs (a form of repeated patterns with don't cares) in an input string. We give new upper and lower bounds on the problem, introducing a new notion of basis that is provably smaller than (and contained in) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all these bases grows exponentially with the quorum, the minimal number of times a motif must appear. We show that a polynomial-time algorithm exists only for fixed quorum
RISOTTO: Fast extraction of motifs with mismatches
We present in this paper an exact algorithm for motif extraction. Efficiency is achieved by means of an improvement in the algorithm and data structures that applies to the whole class of motif inference algorithms based on suffix trees. An average case complexity analysis shows a gain over the best known exact algorithm for motif extraction. A full implementation was developed and made available online. Experimental results show that the proposed algorithm is more than two times faster than the best known exact algorithm for motif extraction
Filtre exact de sélection de longues répétitions approchées utilisant le tableau des bi-facteurs
Lossless Filter for Finding Long Multiple Approximate Repetitions Using a New Data Structure, the Bi-Factor Array
Similarity search in texts, notably biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created in order to speed up the resolution of the problem. However, previous filters were made for speeding up pattern matching, or for finding repetitions between two sequences or occurring twice in the same sequence. In this paper, we present an algorithm called NIMBUS for filtering sequences prior to finding repetitions occurring more than twice in a sequence or in more than two sequences. NIMBUS uses gapped seeds that are indexed with a new data structure, called a bi-factor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: preprocessing with NIMBUS a data set where one wants to find functional elements using a multiple local alignment tool such as GLAM ([7]), the overall execution time can be reduced from 10 hours to 6 minutes while obtaining exactly the same results
- …
