Search CORE

1,721,039 research outputs found

Optimal Substring Equality Queries with Applications to Sparse Text Indexing

Author: Prezza Nicola
Publication venue
Publication date: 01/01/2020
Field of study

We consider the problem of encoding a string of length n from an integer alphabet of size so access, substring equality, and Longest Common Extension (LCE) queries can be answered efficiently. We describe a new space-optimal data structure supporting logarithmic-time queries. Access and substring equality query times can furthermore be improved to the optimal O(1) if O(log n) additional precomputed words are allowed in the total space. Additionally, we provide in-place algorithms for converting between the string and our data structure. Using this new string representation, we obtain the first in-place subquadratic algorithms for several string-processing problems in the restore model: The input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to build a correct version of our data structure using small working space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in O(nlog n) time w.h.p. and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in O(n1.5 s log σ) time w.h.p

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Subpath Queries on Compressed Graphs: A Survey

Author: Nicola Prezza
Prezza Nicola
Publication venue
Publication date: 01/01/2021
Field of study

Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query’s length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: today, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately allowing to index regular languages and promising to shed new light on problems, such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages

Multidisciplinary Digital Publishing Institute

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

On Locating Paths in Compressed Tries

Author: Nicola Prezza
Prezza Nicola
Publication venue
Publication date: 01/01/2021
Field of study

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Preface to “Computation over Compressed Data” at DCC 2022

Author: Gagie Travis
Prezza Nicola
Publication venue
Publication date: 01/01/2025
Field of study

This is the fifth special issue of the “computation over compressed data” session at the IEEE Data Compression Conference (DCC), which took place in Snowbird on March 22-24, 2022. This special issue features six articles from the field of compressed data structures, covering compact graph data structures and efficient algorithms and indexes for several variants of the ubiquitous Burrows-Wheeler transform. In “Compact representations of spatial hierarchical structures with support for topological queries”, José Fuentes-Sepúlveda, Diego Gatica, Gonzalo Navarro, M. Andrea Rodríguez, and Diego Seco show how to compactly represent hierarchical bidimensional spatial regions (such as the administrative partition of a country) while supporting a number of useful queries on the data. Their representation builds upon compact planar graph embeddings. New compact data structures for graphs are presented also in “Succinct data structure for path graphs”. Here, the authors Girish Balakrishnan, Sankardeep Chakraborty, N.S. Narayanaswamy and Kunihiko Sadakane show how to support efficient degree, adjacency, and neighborhood queries on path graphs in a space matching the information-theoretic lower bound. The remaining four papers deal with variants of the Burrows-Wheeler transform (BWT), a fundamental tool in text compression and indexing whose introduction in 1994 by Michael Burrows and David J. Wheeler opened a prolific line of research in the fields of data compression and indexing. In “A new class of string transformations for compressed text indexing”, Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino generalize the BWT by using context-adaptive alphabet orderings. They show that, under certain conditions, on very repetitive text collections these more general transformations are indexable and more compressible than the original BWT. While the problem of building the BWT in linear time and space is well-understood to date, such repetitive collections pose the challenge of solving this task in compressed space: in those cases, the input is too large to be kept in memory in uncompressed format. This problem is tackled by Diego Díaz-Domínguez and Gonzalo Navarro in “Efficient construction of the BWT for repetitive text using string compression”. In this article, the authors show how to build a BWT variant (BCR BWT) with an algorithm keeping its intermediate results in grammar-compressed form. The resulting algorithm spectacularly achieves its goal, using a working space not exceeding 10% of the input’s size on real repetitive datasets. In “Constructing and indexing the bijective and extended Burrows–Wheeler transform”, Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Pia̧tkowski consider the problem of building another BWT variant: the bijective BWT. They describe the first linear-time construction algorithm for such a variant, and present the first self-index based on the bijective BWT. Last but not least, in “r-indexing the eBWT” the authors Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino show how to index yet another BWT variant: the eBWT, originally designed to compactly represent collections of circular strings. They describe an extension of the r-index for such collections, able to simultaneously achieve high compression rates on very repetitive data, and (for the first time) to index circular strings. The success of this special issue shows that the field of compressed data structures is still very active and flourishing. As a matter of fact, these six articles clearly indicate two of the directions that the community is taking: compressing and indexing graphs and very repetitive text collections. In March 2024 a new special session on “computation over compressed data” took place at DCC, featuring ten new articles in the field which may result in a new successful special issue

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Space-efficient construction of compressed suffix trees

Author: Prezza Nicola
Rosone Giovanna
Publication venue
Publication date: 01/01/2021
Field of study

We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let

n

be the text length and

sigma

be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in

O(nlogsigma)

time using just

o(nlogsigma)

bits of working space on top of the input re-writable BWT. Using these algorithms as building blocks, for any parameter 0 < epsilon leq 1 we show how to build the PLCP bitvector and the balanced parentheses representation of the suffix tree topology in

Oleft(n(logsigma + epsilon^{-1}cdot loglog n) ight)

time using at most

nlogsigma cdot(epsilon + o(1))

bits of working space on top of the input re-writable BWT and the output. For example, we can build a compressed suffix tree from the BWT using just succinct working space (i.e.

o(nlogsigma)

bits) and

Thetaleft(nlogsigma + n(loglog n)^{1+delta} ight)

time, for any constant delta>0. This improves the previous most space-efficient algorithms, which worked in

O(n)

bits and

O(nlog n)

time. We also consider the problem of merging BWTs of string collections, and provide a solution running in

O(nlogsigma)

time and using just

o(nlogsigma)

bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as

n

bits on top of a packed representation of the input/output and process data as fast as

2.92

megabases per second

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Indexing Compressed Text: A Tale of Time and Space (Invited Talk)

Author: Nicola Prezza
Prezza Nicola
Publication venue
Publication date: 01/01/2020
Field of study

Text indexing is a classical algorithmic problem that has been studied for over four decades. The earliest optimal-time solution to the problem, the suffix tree [Weiner, 1973], dates back to 1973 and requires up to two orders of magnitude more space than the text to be stored. In the year 2000, two breakthrough works [Grossi and Vitter, 2000; Ferragina and Manzini, 2000] showed that this space overhead is not necessary: both the index and the text can be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: nowadays, the two most widely-used DNA aligners employ compressed indexes [Li and Durbin, 2009; Langmead et al., 2009]. In recent years, it became apparent that entropy had reached its limits: modern datasets (for example, collections of thousands of human genomes) are extremely large but very repetitive and, by its very definition, entropy cannot compress repetitive texts [S. Kreft and G. Navarro, 2013]. To overcome this problem, a new generation of indexes based on dictionary compressors (for example, LZ77 and run-length BWT) emerged [S. Kreft and G. Navarro, 2013; Gagie et al., 2020; F. Claude and G. Navarro, 2012], together with generalizations of the indexing problem to labeled graphs [Ferragina et al., 2009; Sirén et al., 2014; Travis Gagie et al., 2017]. This talk is a short and friendly survey of the landmarks of this fascinating path that took us from suffix trees to the most modern compressed indexes on labeled graphs

DROPS Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Universal compressed text indexing

Author: Prezza Nicola
Navarro Gonzalo
Publication venue
Publication date: 01/01/2019
Field of study

The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family of techniques that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform (BWT) and on the Compact Directed Acyclic Word Graph (CDAWG). The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let gamma be the size of a string attractor for a text of length n. From known reductions, gamma can be chosen to be asymptotically equal to any repetitiveness measure: number of runs in the BWT, size of the CDAWG, number of Lempel-Ziv phrases, number of rules in a grammar or collage system, size of a macro scheme. Our index takes O(gamma lg(n/gamma)) words of space and supports locating the occ occurrences of any pattern of length m in O(mlgn + occlg(epsilon)n) time, for any constant epsilon > 0. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Repositorio Académico de la Universidad de Chile

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Optimal Rank and Select Queries on Dictionary-Compressed Text

Author: Prezza N.
Prezza Nicola
Publication venue
Publication date: 01/01/2019
Field of study

We study the problem of supporting queries on a string S of length n within a space bounded by the size gamma of a string attractor for S. In the paper introducing string attractors it was shown that random access on S can be supported in optimal O(log(n/gamma)/log log n) time within O(gamma polylog n) space. In this paper, we extend this result to rank and select queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a log log n time-factor in select queries. We also provide matching lower and upper bounds for partial sum and predecessor queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations

DROPS Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

A Framework of Dynamic Data Structures for String Processing

Author: Prezza N.
Prezza Nicola
Publication venue
Publication date: 01/01/2017
Field of study

In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are empirically tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of five recently-published compression algorithms implemented using DYNAMIC with those of state-of-the-art tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks

DROPS Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

In-place sparse suffix sorting

Author: Prezza N.
Nicola Prezza
Prezza Nicola
Publication venue
Publication date: 01/01/2018
Field of study

Suffix arrays encode the lexicographical order of all suffixes of a text and are often combined with the Longest Common Prefix array (LCP) to simulate navigational queries on the suffix tree in reduced space. In space-critical applications such as sparse and compressed text indexing, only information regarding the lexicographical order of a size-b subset of all n text suffixes is often needed. Such information can be stored space-efficiently (in b words) in the sparse suffix array (SSA). The SSA and its relative sparse LCP array (SLCP) can be used as a space-efficient substitute of the sparse suffix tree. Very recently, Gawrychowski and Kociumaka [11] showed that the sparse suffix tree (and therefore SSA and SLCP) can be built in asymptotically optimal O(b) space with a Monte Carlo algorithm running in O(n) time. The main reason for using the SSA and SLCP arrays in place of the sparse suffix tree is, however, their reduced space of b words each. This leads naturally to the quest for in-place algorithms building these arrays. Franceschini and Muthukrishnan [8] showed that the full suffix array can be built in-place and in optimal running time. On the other hand, finding sub-quadratic in-place algorithms for building the SSA and SLCP for general subsets of suffixes has been an elusive task for decades. In this paper, we give the first solution to this problem. We provide the first in-place algorithm building the full LCP array in O(n log n) expected time and the first Monte Carlo in-place algorithms building the SSA and SLCP in O(n + b log2 n) expected time. We moreover describe the first in-place solution for the suffix selection problem: to compute the i-th smallest text suffix. In order to achieve these results, we show that we can quickly overwrite the text with a reversible and implicit data structure supporting Longest Common Extension queries in polylogarithmic time and text extraction in optimal time: this structure is strictly more powerful than a plain text representation and is of independent interest

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology