1,721,075 research outputs found
Compressed data structures for strings: on searching and extracting strings from compressed textual data
A book on compressed data structures
Compressed cache-oblivious string B-tree
In this article, we study three variants of the well-known prefix-search problem for strings, and we design solutions for the cache-oblivious model which improve the best known results. Among these contributions, we close (asymptotically) the classic problem, which asks for the detection of the set of strings that share the longest common prefix with a queried pattern by providing an I/O-optimal solution that matches the space lower bound for tries up to a constant multiplicative factor of the form (1 + ∈), for ∈ > 0. Our solutions hinge upon a novel compressed storage scheme that adds the ability to decompress prefixes of the stored strings I/O-optimally to the elegant locality-preserving front coding (Bender et al. 2006) still preserving its space bounds
On a simple storage scheme for strings achieving entropy bounds
In this note we propose a storage scheme for a string S[1,n], drawn from an alphabet A, that requires space close to the k-th order empirical entropy of S and allows to retrieve any L-long substring of S in optimal O(1 + L log A/log n) time. This matches the best known bounds via the only use of binary encodings and tables
Space-Efficient Substring Occurrence Estimation
In this paper we study the problem of estimating the number of occurrences of substrings in textual data: A text T on some alphabet Σ=[σ] of length n is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form Count≈(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing providing an optimal data structure that, for a given additive error l≥1, requires Θnllogσ bits. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures
System and method for string processing and searching using a compressed permuterm index
An improved system and method for string processing and searching using a compressed permuterm index is provided. To build a compressed permuterm index for a string dictionary, an index builder constructs a unique string from a collection of strings of a dictionary sorted in lexicographic order and then builds a compressed permuterm index to support queries over the unique string. A dictionary query engine supports several types of wild-card queries over the string dictionary by performing a backward search modified with a CyclicLF operation over the compressed permuterm index. These queries may used to implement other queries including a membership query, a prefix query, a suffix query, a prefix-suffix query, a query for an exact or substring match, a rank query, a select query and so forth. String processing and searching tasks may accurately be performed for sophisticated queries in optimal time and compressed space
Compressed String Dictionary Search with Edit Distance One
In this paper we present different solutions for the problem of indexing a dictionary of strings in compressed space. Given a pattern (Formula presented.) , the index has to report all the strings in the dictionary having edit distance at most one with (Formula presented.). Our first solution is able to solve queries in (almost optimal) (Formula presented.) time where (Formula presented.) is the number of strings in the dictionary having edit distance at most one with (Formula presented.). The space complexity of this solution is bounded in terms of the (Formula presented.) th order entropy of the indexed dictionary. A second solution further improves this space complexity at the cost of increasing the query time. Finally, we propose randomized solutions (Monte Carlo and Las Vegas) which achieve simultaneously the time complexity of the first solution and the space complexity of the second one
Distribution-aware compressed full-text indexes
In this paper we address the problem of building a compressed self-index that, given a distribution for the pattern queries and a bound on the space occupancy, minimizes the expected query time within that index space bound. We solve this problem by exploiting a reduction to the problem of finding a minimum weight K-link path in a properly designed Directed Acyclic Graph. Interestingly enough, our solution can be used with any compressed index based on the Burrows-Wheeler transform. Our experiments compare this optimal strategy with several other known approaches, showing its effectiveness in practice
- …
