1,721,036 research outputs found
Block Sorting-Based Transformations on Words: Beyond the Magic BWT
The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression and later results have contributed to make it a fundamental tool for the design of self-indexing compressed data structures. The Alternating Burrows-Wheeler Transform (ABWT) is a more recent transformation, studied in the context of Combinatorics on Words, that works in a similar way, using an alternating lexicographical order instead of the usual one. In this paper we study a more general class of block sorting-based transformations. The transformations in this new class prove to be interesting combinatorial tools that offer new research perspectives. In particular, we show that all the transformations in this class can be used as booster for memoryless compressors and we provide an upper bound on the number of equal-letter runs in their output. Moreover, we introduce the notion of rank-invertibility, a property related to the implementation of an efficient inversion procedure. We show that the BWT and the Alternating BWT are the only rank-invertible transformations in the class we have defined
A combinatorial view on string attractors
The notion of string attractor has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word w=w_1w_2...w_n is a subset Γ of the positions {1,2,...,n}, such that all distinct factors of w have an occurrence crossing at least one of the elements of Γ.
In this paper we explore the notion of string attractor by focusing on its combinatorial properties. In particular, we show how the size of the smallest string attractor of a word varies when combinatorial operations are applied and we deduce that such a measure is not monotone. Moreover, we introduce a circular variant of the notion of string attractor to provide a characterization of the conjugacy classes of standard Sturmian words
Boosting Textual Compression in Optimal Linear Time
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the "best possible" contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties.Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows-Wheeler Transform of a string s that shows a deep combinatorial relation with the kth order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost. Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows--Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process
Word assembly through minimal forbidden words
We give a linear-time algorithm to reconstruct a finite word w over a finite alphabet A of constant size starting from a finite set of factors of w verifying a suitable hypothesis. We use combinatorics techniques based on the minimal forbidden words, which have been introduced in previous papers. This improves a previous algorithm which worked under the assumption of stronger hypothesis
Generalization of Repetitiveness Measures for Two-Dimensional Strings
Detecting and measuring repetitiveness of strings is a problem that has been extensively studied in data compression and text indexing. When the data are structured in a non-linear way, as in two-dimensional strings, inherent redundancy offers a rich source for compression, yet systematic studies on repetitiveness measures are still lacking. In this paper, we extend to two dimensions the measures δ and γ, defined in terms of the submatrices of the input, as well as the measures g, g_rl, and b, which are based on copy-paste mechanisms. We study their properties and mutual relationships, and we show that the two classes of measures become incomparable when two-dimensional inputs are considered. We also compare our measures with the 2D Block Tree data structure [Brisaboa et al., Computer J., 2024], and provide some insights for the design of effective 2D compressors
r-Indexing the eBWT
The extended Burrows Wheeler Transform (eBWT) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the BWT to a collection of strings. In our prior work [SPIRE 2021], we give a linear-time algorithm for the eBWT that preserves the fundamental property of the original definition (i.e., the independence from the input order). The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix Free Parsing [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the eBWT, i.e., run-length encoded eBWT and SA samples of Gagie et al. [SODA 2018] can be constructed efficiently from the components of the PFP. Moreover, we show that finding maximal exact matches (MEMs) between a query string and the r-index of the eBWT can be efficiently supported
r-indexing the eBWT
The extended Burrows-Wheeler Transform (eBWT) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the BWT to a collection of strings. As opposed to other commonly used BWT-variants for string collections, the eBWT is independent of the input order of the strings in the collection, while preserving the full functionality and compressibility of the classic BWT. In our prior work [Boucher et al. Computing the original eBWT faster, simpler, and with less memory, SPIRE 2021], we presented a linear-time algorithm for constructing the eBWT. The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [Nong et al., IEEE Trans Comput 2011] with Prefix-Free Parsing [Boucher et al., Alg Mol Biol 2019; Kuhnle et. al., JCB 2020]. In this paper, we show how this construction can be extended for building the r-index of the eBWT, which we call extended r-index, an analogous data structure to the r-index of Gagie et al. [SODA 2018, JACM 2020], with the added value that circular pattern matching is also supported. Our data structure occupies O(r) words, where r is the number of runs of the eBWT of the input collection, and answers count and locate queries in time analogous to the original r-index. We also show how to efficiently support finding maximal exact matches (MEMs) using the extended r-index. We implemented the extended r-index and tested it on circular bacterial genomes and plasmids, comparing it to five state-of-the-art compressed text indexes, including the original r-index and several dictionary-based indexes. While our data structure maintains similar time and memory requirements for answering pattern matching queries as the original r-index, it is the only index in the literature that regards the strings as circular. So, it can naturally be used for both circular and linear input collections
Bit Catastrophes for the Burrows-Wheeler Transform
A bit catastrophe, loosely defined, is when a change in just one character of a string causes a significant change in the size of the compressed string. We study this phenomenon for the Burrows-Wheeler Transform (BWT), a string transform at the heart of several of the most popular compressors and aligners today. The parameter determining the size of the compressed data is the number of equal-letter runs of the BWT, commonly denoted .
We exhibit infinite families of strings in which insertion, deletion, resp.\ substitution of one character increases from constant to , where is the length of the string. These strings can be interpreted both as examples for an increase by a multiplicative or an additive -factor. As regards the multiplicative factor, they attain the upper bound given by Akagi, Funakoshi, and Inenaga [Inf \& Comput. 2023] of \Oh(\log n \log r), since here r=\Oh(1).
We then give examples of strings in which insertion, deletion, resp.\ substitution of a character increases by a additive factor. These strings significantly improve the best known lower bound for an additive factor of [Giuliani et al., SOFSEM 2021]
- …
