1,721,120 research outputs found

    Efficient algorithms for on-line symbol ranking compression

    No full text
    Symbol ranking compression algorithms are known to achieve a very good compression ratio. Off-line symbol ranking algorithms (e.g., bzip, szip) are currently the state of the art for lossless data compression because of their excellent compression/time trade-off. Some on-line symbol ranking algorithms have been proposed in the past. They compress well but their slowness make them impractical. In this paper we design some fast on-line symbol ranking algorithms by fine tuning two data structures (skip lists and ternary trees) which are well known for their simplicity and efficiency

    Space-conscious compression

    No full text
    Compression is most important when space is in short supply, so compression algorithms are often implemented in limited memory. Most analyses ignore memory constraints as an implementation detail, however, creating a gap between theory and practice. In this paper we consider the effect of memory limitations on compression algorithms. In the first part we assume the memory available is fixed and prove nearly tight upper and lower bounds on how much memory is needed to compress a string close to its k-th order entropy. In the second part we assume the memory available grows (slowly) as more and more characters are read. In this setting we show that the rate of growth of the available memory determines the speed at which the compression ratio approaches the entropy. In particular, we establish a relationship between the rate of growth of the sliding window in the LZ77 algorithm and its convergence rate

    Design and analysis of periodic multiple seeds

    No full text
    A wide class of approximate pattern matching algorithms are based on a filtration phase in which spaced seeds are used to discard regions where a match is not likely to occur. The problem of determining the “optimal” shape of a spaced seed in a particular setting is known to be a hard one: in practice spaced seeds are chosen using heuristics or considering a restricted family of seeds with “reasonably good” performances. In this paper we consider the family of spaced seeds with a periodic structure. Such seeds have been already proven valuable both as a theoretical tool and in bioinformatics applications. We show that known combinatorial objects, namely Difference Sets and Families and Steiner Systems, naturally lead to the design of lossless periodic seeds for approximate pattern matching with k=2 and k=3 mismatches. We analyze in depth the properties of the resulting seeds obtaining insights also on seeds without a periodic structure. The results of the analysis are then used to guide an experimental evaluation of the effectiveness of periodic seeds for pattern lengths of practical interest. Our results give a complete picture of strengths and limitations of periodic seeds, and can be used by practitioners for the design of effective approximate pattern matching algorithms

    Move-to-Front, Distance Coding, and Inversion Frequencies Revisited

    No full text
    Move-to-Front, Distance Coding and Inversion Frequencies are three simple and effective techniques used to process the output of the Burrows-Wheeler Transform. In this paper we provide the first complete comparative analyses of these techniques, establishing upper and lower bounds on their compression ratios. We describe simple variants of these three techniques that compress any string up to a constant factor of its kth-order empirical entropy for any k >= 0. At the same time we prove lower bounds for the compression of arbitrary strings which show that these variants are nearly optimal. The bounds we establish are "entropy-only" bounds in the sense that they do not involve non-constant overheads. Our analyses provide new insights into the inner workings of these techniques, partially explain their good behavior in practice, and suggest strategies for improving their performance

    Indexing Compressed Text

    No full text
    We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log^(1+epsilon) n) time for any chosen epsilon between 0 and 1. This data structure uses at most 5nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. The space usage is Theta(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p+occ) query time using O(nHk(T)log^(epsilon) n) + o(n) bits of storage for any chosen epsilon between 0 and 1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm

    Multiple seeds sensitivity using a single seed with threshold

    Full text link
    Spaced seeds are a fundamental tool for similarity search in biosequences. The best sensitivity/selectivity trade-offs are obtained using many seeds simultaneously: This is known as the multiple seed approach. Unfortunately, spaced seeds use a large amount of memory and the available RAM is a practical limit to the number of seeds one can use simultaneously. Inspired by some recent results on lossless seeds, we revisit the approach of using a single spaced seed and considering two regions homologous if the seed hits in at least t sufficiently close positions. We show that by choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, we derive single seeds that when used with a threshold t > 1 have competitive sensitivity/selectivity trade-offs, indeed close to the best multiple seeds known in the literature. In addition, the choice of the threshold t can be adjusted to modify sensitivity and selectivity a posteriori, thus enabling a more accurate search in the specific instance at issue. The seeds we propose also exhibit robustness and allow flexibility in usage

    Space-conscious compression

    No full text
    Compression is most important when space is in short supply, so compression algorithms are often implemented in limited memory. Most analyses ignore memory constraints as an implementation detail, however, creating a gap between theory and practice. In this paper we consider the effect of memory limitations on compression algorithms. In the first part we assume the memory available is fixed and prove nearly tight upper and lower bounds on how much memory is needed to compress a string close to its k-th order entropy. In the second part we assume the memory available grows (slowly) as more and more characters are read. In this setting we show that the rate of growth of the available memory determines the speed at which the compression ratio approaches the entropy. In particular, we establish a relationship between the rate of growth of the sliding window in the LZ77 algorithm and its convergence rate

    Spaced Seeds Design Using Perfect Rulers

    No full text
    We consider the problem of lossless spaced seed design for approximate pattern matching. We show that, using mathematical objects known as perfect rulers, we can derive a family of spaced seeds for matching with up to two errors. We analyze these seeds with respect to the trade-off they offer between seed weight and the minimum length of the pattern to be matched. We prove that for patterns of length up to a, few hundreds our seeds have a larger weight, hence a better filtration efficiency, than the ones known in the literature. In this context, we study in depth the specific case of Wichmann rulers and prove some preliminary results on the generalization of our approach to the larger class of unrestricted rulers

    Design and analysis of periodic multiple seeds

    No full text
    A wide class of approximate pattern matching algorithms are based on a filtration phase in which spaced seeds are used to discard regions where a match is not likely to occur. The problem of determining the “optimal” shape of a spaced seed in a particular setting is known to be a hard one: in practice spaced seeds are chosen using heuristics or considering a restricted family of seeds with “reasonably good” performances. In this paper we consider the family of spaced seeds with a periodic structure. Such seeds have been already proven valuable both as a theoretical tool and in bioinformatics applications. We show that known combinatorial objects, namely Difference Sets and Families and Steiner Systems, naturally lead to the design of lossless periodic seeds for approximate pattern matching with k=2 and k=3 mismatches. We analyze in depth the properties of the resulting seeds obtaining insights also on seeds without a periodic structure. The results of the analysis are then used to guide an experimental evaluation of the effectiveness of periodic seeds for pattern lengths of practical interest. Our results give a complete picture of strengths and limitations of periodic seeds, and can be used by practitioners for the design of effective approximate pattern matching algorithms
    corecore