1,721,012 research outputs found

    Monotone scoring of patterns with mismatches

    No full text
    We study the problem of extracting, from given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly or up to) k mismatches. Two related schemes are presented. In the first one, an O(nk) time preprocessing of x is developed that supports the following subsequent queries: for any substring w of x arbitrarily specified as input, the probability of occurrence of w in x within (either exactly or up to) k mismatches is reported in O(k(2)) time. In the second scheme, a length or length range is arbitrarily specified, and the above probabilities are computed for all substrings of x having length in that range, in overall O(nk) time. Further, monotonicity conditions are introduced and studied for probabilities and expected occurrences of a substring under unit increases in its length, allowed number of errors, or both. Over intervals of constant frequency count, these monotonicities translate to some of the scores in use, thereby reducing the size of tables at the outset and enhancing the process of discovery. These latter derivations extend to patterns with mismatches an analysis previously devoted to exact patterns

    On the Monotonicity of the String Correction Factor for Words with Mismatches

    Full text link
    The string correction factor is the term by which the probability of a word w needs to be multiplied in order to account for character changes or “errors” occurring in at most k arbitrary positions in that word. The behavior of this factor, as a function of k and of the word length, has implications on the number of candidates that need to be considered and weighted when looking for subwords of a sequence that present unusually recurrent replicas within some bounded number of mismatches. Specifically, it is seen that over intervals of mono- or bi- tonicity for the correction factor, only some of the candidates need be considered. This mitigates the computation and leads to tables of over- represented words that are more compact to represent and inspect. In recent work, expectation and score monotonicity has been established for a number of cases of interest, under i.i.d. probabilistic assumptions. The present paper reviews the cases of bi-tonic behavior for the correction factor, concentrating on the instance in which the question is still open

    Optimal discovery of subword associations in strings

    No full text
    Given a textstring x of n symbols and an integer constant d, we consider the problem of finding, for any pair (y, z) of subwords of x the number of times that y and z occur in tandem (i.e., with no intermediate occurrence of either one of them) within a distance of d symbols of x. Although in principle there might be n(4) distinct subword pairs in x, we show that it suffices to consider a family of only n(2) such pairs, with the property that for any neglected pair (y', z'), there is a corresponding pair (y, z) contained in our family and such that: (i) y' is a prefix of y and z' is a prefix of z, and (ii) the tandem index of (y', z') equals that of (y, z). We show that an algorithm for the construction of the table of all such tandem indices can be built to run in optimal O(n(2)) time and space

    Alignment Free Sequence Similarity with Bounded Hamming Distance

    No full text
    A growing number of measures of sequence similarity is being based on some underlying notion of relative compressibility. Within this paradigm, similar sequences are expected to share a large number of common substrings, or subsequences, or more complex patterns or motifs, and so on. The computational complexity of such measures varies, and it increases with the complexion of the patterns taken into account. At the low end of the spectrum, most measures based on the bags of shared substrings are typically afforded in linear time. This performance is no longer achievable as soon as some degree of distortion is accepted. In this paper, measures of sequence similarity are introduced and studied in which patterns in a pair are considered similar if they coincide up to a preset number of mismatches, that is, within a bounded Hamming distance. It is shown here that for some such measures bounds are achievable that are slightly better than O(n^2). Preliminary experiments demonstrate the potential applicability to phylogeny and classification of similarity measures that are rougher than previously adopted ones
    corecore