1,721,039 research outputs found

    Optimal Substring Equality Queries with Applications to Sparse Text Indexing

    No full text
    We consider the problem of encoding a string of length n from an integer alphabet of size so access, substring equality, and Longest Common Extension (LCE) queries can be answered efficiently. We describe a new space-optimal data structure supporting logarithmic-time queries. Access and substring equality query times can furthermore be improved to the optimal O(1) if O(log n) additional precomputed words are allowed in the total space. Additionally, we provide in-place algorithms for converting between the string and our data structure. Using this new string representation, we obtain the first in-place subquadratic algorithms for several string-processing problems in the restore model: The input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to build a correct version of our data structure using small working space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in O(nlog n) time w.h.p. and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in O(n1.5 s log σ) time w.h.p

    Subpath Queries on Compressed Graphs: A Survey

    Full text link
    Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query’s length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: today, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately allowing to index regular languages and promising to shed new light on problems, such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages

    Preface to “Computation over Compressed Data” at DCC 2022

    No full text
    This is the fifth special issue of the “computation over compressed data” session at the IEEE Data Compression Conference (DCC), which took place in Snowbird on March 22-24, 2022. This special issue features six articles from the field of compressed data structures, covering compact graph data structures and efficient algorithms and indexes for several variants of the ubiquitous Burrows-Wheeler transform. In “Compact representations of spatial hierarchical structures with support for topological queries”, José Fuentes-Sepúlveda, Diego Gatica, Gonzalo Navarro, M. Andrea Rodríguez, and Diego Seco show how to compactly represent hierarchical bidimensional spatial regions (such as the administrative partition of a country) while supporting a number of useful queries on the data. Their representation builds upon compact planar graph embeddings. New compact data structures for graphs are presented also in “Succinct data structure for path graphs”. Here, the authors Girish Balakrishnan, Sankardeep Chakraborty, N.S. Narayanaswamy and Kunihiko Sadakane show how to support efficient degree, adjacency, and neighborhood queries on path graphs in a space matching the information-theoretic lower bound. The remaining four papers deal with variants of the Burrows-Wheeler transform (BWT), a fundamental tool in text compression and indexing whose introduction in 1994 by Michael Burrows and David J. Wheeler opened a prolific line of research in the fields of data compression and indexing. In “A new class of string transformations for compressed text indexing”, Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino generalize the BWT by using context-adaptive alphabet orderings. They show that, under certain conditions, on very repetitive text collections these more general transformations are indexable and more compressible than the original BWT. While the problem of building the BWT in linear time and space is well-understood to date, such repetitive collections pose the challenge of solving this task in compressed space: in those cases, the input is too large to be kept in memory in uncompressed format. This problem is tackled by Diego Díaz-Domínguez and Gonzalo Navarro in “Efficient construction of the BWT for repetitive text using string compression”. In this article, the authors show how to build a BWT variant (BCR BWT) with an algorithm keeping its intermediate results in grammar-compressed form. The resulting algorithm spectacularly achieves its goal, using a working space not exceeding 10% of the input’s size on real repetitive datasets. In “Constructing and indexing the bijective and extended Burrows–Wheeler transform”, Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Pia̧tkowski consider the problem of building another BWT variant: the bijective BWT. They describe the first linear-time construction algorithm for such a variant, and present the first self-index based on the bijective BWT. Last but not least, in “r-indexing the eBWT” the authors Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino show how to index yet another BWT variant: the eBWT, originally designed to compactly represent collections of circular strings. They describe an extension of the r-index for such collections, able to simultaneously achieve high compression rates on very repetitive data, and (for the first time) to index circular strings. The success of this special issue shows that the field of compressed data structures is still very active and flourishing. As a matter of fact, these six articles clearly indicate two of the directions that the community is taking: compressing and indexing graphs and very repetitive text collections. In March 2024 a new special session on “computation over compressed data” took place at DCC, featuring ten new articles in the field which may result in a new successful special issue

    Space-efficient construction of compressed suffix trees

    Full text link
    We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let nn be the text length and sigmasigma be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in O(nlogsigma)O(nlogsigma) time using just o(nlogsigma)o(nlogsigma) bits of working space on top of the input re-writable BWT. Using these algorithms as building blocks, for any parameter 0 < epsilon leq 1 we show how to build the PLCP bitvector and the balanced parentheses representation of the suffix tree topology in Oleft(n(logsigma+epsilon1cdotloglogn)ight)Oleft(n(logsigma + epsilon^{-1}cdot loglog n) ight) time using at most nlogsigmacdot(epsilon+o(1))nlogsigma cdot(epsilon + o(1)) bits of working space on top of the input re-writable BWT and the output. For example, we can build a compressed suffix tree from the BWT using just succinct working space (i.e. o(nlogsigma)o(nlogsigma) bits) and Thetaleft(nlogsigma+n(loglogn)1+deltaight)Thetaleft(nlogsigma + n(loglog n)^{1+delta} ight) time, for any constant delta>0. This improves the previous most space-efficient algorithms, which worked in O(n)O(n) bits and O(nlogn)O(nlog n) time. We also consider the problem of merging BWTs of string collections, and provide a solution running in O(nlogsigma)O(nlogsigma) time and using just o(nlogsigma)o(nlogsigma) bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as nn bits on top of a packed representation of the input/output and process data as fast as 2.922.92 megabases per second

    Indexing Compressed Text: A Tale of Time and Space (Invited Talk)

    No full text
    Text indexing is a classical algorithmic problem that has been studied for over four decades. The earliest optimal-time solution to the problem, the suffix tree [Weiner, 1973], dates back to 1973 and requires up to two orders of magnitude more space than the text to be stored. In the year 2000, two breakthrough works [Grossi and Vitter, 2000; Ferragina and Manzini, 2000] showed that this space overhead is not necessary: both the index and the text can be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: nowadays, the two most widely-used DNA aligners employ compressed indexes [Li and Durbin, 2009; Langmead et al., 2009]. In recent years, it became apparent that entropy had reached its limits: modern datasets (for example, collections of thousands of human genomes) are extremely large but very repetitive and, by its very definition, entropy cannot compress repetitive texts [S. Kreft and G. Navarro, 2013]. To overcome this problem, a new generation of indexes based on dictionary compressors (for example, LZ77 and run-length BWT) emerged [S. Kreft and G. Navarro, 2013; Gagie et al., 2020; F. Claude and G. Navarro, 2012], together with generalizations of the indexing problem to labeled graphs [Ferragina et al., 2009; Sirén et al., 2014; Travis Gagie et al., 2017]. This talk is a short and friendly survey of the landmarks of this fascinating path that took us from suffix trees to the most modern compressed indexes on labeled graphs

    Universal compressed text indexing

    Full text link
    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family of techniques that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform (BWT) and on the Compact Directed Acyclic Word Graph (CDAWG). The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let gamma be the size of a string attractor for a text of length n. From known reductions, gamma can be chosen to be asymptotically equal to any repetitiveness measure: number of runs in the BWT, size of the CDAWG, number of Lempel-Ziv phrases, number of rules in a grammar or collage system, size of a macro scheme. Our index takes O(gamma lg(n/gamma)) words of space and supports locating the occ occurrences of any pattern of length m in O(mlgn + occlg(epsilon)n) time, for any constant epsilon > 0. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries

    Optimal Rank and Select Queries on Dictionary-Compressed Text

    Full text link
    We study the problem of supporting queries on a string S of length n within a space bounded by the size gamma of a string attractor for S. In the paper introducing string attractors it was shown that random access on S can be supported in optimal O(log(n/gamma)/log log n) time within O(gamma polylog n) space. In this paper, we extend this result to rank and select queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a log log n time-factor in select queries. We also provide matching lower and upper bounds for partial sum and predecessor queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations

    A Framework of Dynamic Data Structures for String Processing

    Full text link
    In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are empirically tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of five recently-published compression algorithms implemented using DYNAMIC with those of state-of-the-art tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks

    In-place sparse suffix sorting

    No full text
    Suffix arrays encode the lexicographical order of all suffixes of a text and are often combined with the Longest Common Prefix array (LCP) to simulate navigational queries on the suffix tree in reduced space. In space-critical applications such as sparse and compressed text indexing, only information regarding the lexicographical order of a size-b subset of all n text suffixes is often needed. Such information can be stored space-efficiently (in b words) in the sparse suffix array (SSA). The SSA and its relative sparse LCP array (SLCP) can be used as a space-efficient substitute of the sparse suffix tree. Very recently, Gawrychowski and Kociumaka [11] showed that the sparse suffix tree (and therefore SSA and SLCP) can be built in asymptotically optimal O(b) space with a Monte Carlo algorithm running in O(n) time. The main reason for using the SSA and SLCP arrays in place of the sparse suffix tree is, however, their reduced space of b words each. This leads naturally to the quest for in-place algorithms building these arrays. Franceschini and Muthukrishnan [8] showed that the full suffix array can be built in-place and in optimal running time. On the other hand, finding sub-quadratic in-place algorithms for building the SSA and SLCP for general subsets of suffixes has been an elusive task for decades. In this paper, we give the first solution to this problem. We provide the first in-place algorithm building the full LCP array in O(n log n) expected time and the first Monte Carlo in-place algorithms building the SSA and SLCP in O(n + b log2 n) expected time. We moreover describe the first in-place solution for the suffix selection problem: to compute the i-th smallest text suffix. In order to achieve these results, we show that we can quickly overwrite the text with a reversible and implicit data structure supporting Longest Common Extension queries in polylogarithmic time and text extraction in optimal time: this structure is strictly more powerful than a plain text representation and is of independent interest
    corecore