1,721,037 research outputs found
On balancing of a direct product
A direct product of two sequences is a naturally defined sequence on the alphabet of pairs of symbols.
By taking inspiration from [Pavel Salimov. On uniform recurrence of a direct product. In AutoMathA, 2009], where the author investigates the case of uniformly recurrent words, here, we study when the product of two balanced sequences on binary alphabet is also balanced
Space-efficient construction of compressed suffix trees
We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let be the text length and be the alphabet size.
We first provide two algorithms that enumerate all LCP values and suffix tree intervals in time using just bits of working space on top of the input re-writable BWT.
Using these algorithms as building blocks, for any parameter 0 < epsilon leq 1 we show how to build the PLCP bitvector and the balanced parentheses representation of the suffix tree topology in time using at most bits of working space on top of the input re-writable BWT and the output.
For example, we can build a compressed suffix tree from the BWT using just succinct working space (i.e. bits) and time, for any constant delta>0.
This improves the previous most space-efficient algorithms, which worked in bits and time.
We also consider the problem of merging BWTs of string collections, and provide a solution running in time and using just bits of working space.
An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as bits on top of a packed representation of the input/output and process data as fast as megabases per second
Faster Online Computation of the Succinct Longest Previous Factor Array
We consider the problem of computing online the Longest Previous Factor array LPF[1, n] of a text T of length n. For each, LPF[i] stores the length of the longest factor of T with at least two occurrences, one ending at i and the other at a previous position. We present an improvement over the previous solution by Okanohara and Sadakane (ESA 2008): our solution uses less space (compressed instead of succinct) and runs in time, thus being faster by a logarithmic factor. As a by-product, we also obtain the first online algorithm computing the Longest Common Suffix (LCS) array (that is, the LCP array of the reversed text) in time and compressed space. We also observe that the LPF array can be represented succinctly in 2n bits. Our online algorithm computes directly the succinct LPF and LCS arrays
Burrows-Wheeler transform and palindromic richness
The investigation of the extremal case of the Burrows-Wheeler transform leads to study the words
over an ordered alphabet , with , such that is of the form , for some non-negative integers . A characterization of these words in the case has been given in~\cite{MaReSc}, where it is proved that they correspond to the powers of conjugates of standard words.
The case has been settled in [Jamie Simpson, Simon J. Puglisi, Words with simple Burrows-Wheeler transforms, Electronic Journal of Combinatorics 15, (2008) article R83], which also contains some partial results for an arbitrary alphabet.
In the present paper we show that such words can be described in terms of the notion of ``palindromic richness'',
recently introduced in [Amy Glen, Jacques Justin, Steve Widmer, Luca Q. Zamboni, Palindromic richness, European
Journal of Combinatorics 30 (2) (2009) 510-531]. Our main result indeed states that a word such that has the form is strongly rich, i.~e. the word contains the maximum number of different palindromic factors
Balancing and clustering of words in the Burrows–Wheeler transform
Compression algorithms based on Burrows–Wheeler transform (BWT) take advantage of the fact that the word output of BWT shows a local similarity and then turns out to be highly compressible. The aim of the present paper is to study such “clustering effect” by using notions and methods from Combinatorics on Words. The notion of balance of a word plays a central role in our investigation. Empirical observations suggest that balance is actually the combinatorial property of input word that ensure optimal BWT compression. Moreover, it is reasonable to assume that the more balanced the input word is, the more local similarity we have after BWT (and therefore the better the compression is). This hypothesis is here corroborated by experiments on “real” text, by using local entropy as a measure of the degree of balance of a word. In the setting of Combinatorics on Words, a sound confirmation of previous hypothesis is given by a result of Mantaci et al. (2003) [27], which states that, in the case of a binary alphabet, there is an equivalence between circularly balanced words, words having a clusterized BWT, and the conjugates of standard words. In the case of alphabets of size greater than two, there is no more equivalence. The last section of the present paper is devoted to investigate the relationships between these notions, and other related ones (as, for instance, palindromic richness) in the case of a general alphabet
On the product of balanced sequences
The product w = u ⊗ v of two sequences u and v is a naturally defined sequence on the alphabet of pairs of symbols. Here, we study when the product w of two balanced sequences u,v is balanced too. In the case u and v are binary sequences, we prove, as a main result, that, if such a product w is balanced and deg(w) = 4, then w is an ultimately periodic sequence of a very special form. The case of arbitrary alphabets is approached in the last section. The partial results obtained and the problems proposed show the interest of the notion of product in the study of balanced sequences
Balanced Words Having Simple Burrows-Wheeler Transform
The investigation of the "clustering effect" of the Burrows-Wheeler transform (BWT) leads to study the words having simple BWT , i.e. words w over an ordered alphabet , with , such that is of the form , for some non-negative integers .
We remark that, in the case of binary alphabets, there is an equivalence between words having simple BWT, the family of (circular) balanced words and the conjugates of standard words. In the case of alphabets of size greater than two, there is no more equivalence between these notions. As a main result of this paper we prove that, under assumption of balancing, the following three conditions on a word w are equivalent: i) w has simple BWT, ii) w is a circularly rich word, and iii) w is a conjugate of a finite epistandard word
Sorting suffixes of a text via its Lyndon Factorization
The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. They are used for instance in the constructions of the Burrows-Wheeler transform and the suffix array, widely used in several fields of Computer Science. For this reason, several recent researches have been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization, so that the local suffixes inside factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can be adapted to different implementative scenarios
Block Sorting-Based Transformations on Words: Beyond the Magic BWT
The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression and later results have contributed to make it a fundamental tool for the design of self-indexing compressed data structures. The Alternating Burrows-Wheeler Transform (ABWT) is a more recent transformation, studied in the context of Combinatorics on Words, that works in a similar way, using an alternating lexicographical order instead of the usual one. In this paper we study a more general class of block sorting-based transformations. The transformations in this new class prove to be interesting combinatorial tools that offer new research perspectives. In particular, we show that all the transformations in this class can be used as booster for memoryless compressors and we provide an upper bound on the number of equal-letter runs in their output. Moreover, we introduce the notion of rank-invertibility, a property related to the implementation of an efficient inversion procedure. We show that the BWT and the Alternating BWT are the only rank-invertible transformations in the class we have defined
Methods and systems for data analysis using the Burrows Wheeler transform
The present disclosure provides computer implemented methods and systems for analyzing datasets, such as large data sets output from nucleic acid sequencing technologies. In particular, the present disclosure provides for data analysis comprising computing the BWT of a collection of strings in an incremental, character by character, manner. The present disclosure also provides compression boosting strategies resulting in a BWT of a reordered collection of data that is more compressible by second stage compression methods compared to non-reordered computational analysis
- …
