1,721,037 research outputs found

    On balancing of a direct product

    No full text
    A direct product of two sequences is a naturally defined sequence on the alphabet of pairs of symbols. By taking inspiration from [Pavel Salimov. On uniform recurrence of a direct product. In AutoMathA, 2009], where the author investigates the case of uniformly recurrent words, here, we study when the product of two balanced sequences on binary alphabet is also balanced

    Space-efficient construction of compressed suffix trees

    Full text link
    We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let nn be the text length and sigmasigma be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in O(nlogsigma)O(nlogsigma) time using just o(nlogsigma)o(nlogsigma) bits of working space on top of the input re-writable BWT. Using these algorithms as building blocks, for any parameter 0 < epsilon leq 1 we show how to build the PLCP bitvector and the balanced parentheses representation of the suffix tree topology in Oleft(n(logsigma+epsilon1cdotloglogn)ight)Oleft(n(logsigma + epsilon^{-1}cdot loglog n) ight) time using at most nlogsigmacdot(epsilon+o(1))nlogsigma cdot(epsilon + o(1)) bits of working space on top of the input re-writable BWT and the output. For example, we can build a compressed suffix tree from the BWT using just succinct working space (i.e. o(nlogsigma)o(nlogsigma) bits) and Thetaleft(nlogsigma+n(loglogn)1+deltaight)Thetaleft(nlogsigma + n(loglog n)^{1+delta} ight) time, for any constant delta>0. This improves the previous most space-efficient algorithms, which worked in O(n)O(n) bits and O(nlogn)O(nlog n) time. We also consider the problem of merging BWTs of string collections, and provide a solution running in O(nlogsigma)O(nlogsigma) time and using just o(nlogsigma)o(nlogsigma) bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as nn bits on top of a packed representation of the input/output and process data as fast as 2.922.92 megabases per second

    Faster Online Computation of the Succinct Longest Previous Factor Array

    No full text
    We consider the problem of computing online the Longest Previous Factor array LPF[1, n] of a text T of length n. For each, LPF[i] stores the length of the longest factor of T with at least two occurrences, one ending at i and the other at a previous position. We present an improvement over the previous solution by Okanohara and Sadakane (ESA 2008): our solution uses less space (compressed instead of succinct) and runs in time, thus being faster by a logarithmic factor. As a by-product, we also obtain the first online algorithm computing the Longest Common Suffix (LCS) array (that is, the LCP array of the reversed text) in time and compressed space. We also observe that the LPF array can be represented succinctly in 2n bits. Our online algorithm computes directly the succinct LPF and LCS arrays

    Burrows-Wheeler transform and palindromic richness

    No full text
    The investigation of the extremal case of the Burrows-Wheeler transform leads to study the words ww over an ordered alphabet A={a1,a2,,ak}A=\{a_1,a_2,\ldots,a_k\}, with a1<a2<<aka_1 < a_2 < \ldots <a_k, such that bwt(w)bwt(w) is of the form aknkak1nk1a2n2a1n1a_k^{n_k} a_{k-1}^{n_{k-1}} \cdots a_2^{n_2}a_1^{n_1}, for some non-negative integers n1,n2,,nkn_1, n_2, \ldots, n_k. A characterization of these words in the case A=2|A|=2 has been given in~\cite{MaReSc}, where it is proved that they correspond to the powers of conjugates of standard words. The case A=3|A|=3 has been settled in [Jamie Simpson, Simon J. Puglisi, Words with simple Burrows-Wheeler transforms, Electronic Journal of Combinatorics 15, (2008) article R83], which also contains some partial results for an arbitrary alphabet. In the present paper we show that such words can be described in terms of the notion of ``palindromic richness'', recently introduced in [Amy Glen, Jacques Justin, Steve Widmer, Luca Q. Zamboni, Palindromic richness, European Journal of Combinatorics 30 (2) (2009) 510-531]. Our main result indeed states that a word ww such that bwt(w)bwt(w) has the form aknkak1nk1a2n2a1n1a_k^{n_k} a_{k-1}^{n_{k-1}} \cdots a_2^{n_2}a_1^{n_1} is strongly rich, i.~e. the word w2w^2 contains the maximum number of different palindromic factors

    Balancing and clustering of words in the Burrows–Wheeler transform

    No full text
    Compression algorithms based on Burrows–Wheeler transform (BWT) take advantage of the fact that the word output of BWT shows a local similarity and then turns out to be highly compressible. The aim of the present paper is to study such “clustering effect” by using notions and methods from Combinatorics on Words. The notion of balance of a word plays a central role in our investigation. Empirical observations suggest that balance is actually the combinatorial property of input word that ensure optimal BWT compression. Moreover, it is reasonable to assume that the more balanced the input word is, the more local similarity we have after BWT (and therefore the better the compression is). This hypothesis is here corroborated by experiments on “real” text, by using local entropy as a measure of the degree of balance of a word. In the setting of Combinatorics on Words, a sound confirmation of previous hypothesis is given by a result of Mantaci et al. (2003) [27], which states that, in the case of a binary alphabet, there is an equivalence between circularly balanced words, words having a clusterized BWT, and the conjugates of standard words. In the case of alphabets of size greater than two, there is no more equivalence. The last section of the present paper is devoted to investigate the relationships between these notions, and other related ones (as, for instance, palindromic richness) in the case of a general alphabet

    On the product of balanced sequences

    Full text link
    The product w = u ⊗ v of two sequences u and v is a naturally defined sequence on the alphabet of pairs of symbols. Here, we study when the product w of two balanced sequences u,v is balanced too. In the case u and v are binary sequences, we prove, as a main result, that, if such a product w is balanced and deg(w) = 4, then w is an ultimately periodic sequence of a very special form. The case of arbitrary alphabets is approached in the last section. The partial results obtained and the problems proposed show the interest of the notion of product in the study of balanced sequences

    Balanced Words Having Simple Burrows-Wheeler Transform

    No full text
    The investigation of the "clustering effect" of the Burrows-Wheeler transform (BWT) leads to study the words having simple BWT , i.e. words w over an ordered alphabet A={a1,a2,,ak}A=\{a_1,a_2,\ldots,a_k\}, with a1<a2<<aka_1 < a_2 < \ldots <a_k, such that bwt(w)bwt(w) is of the form aknkak1nk1a1n1a_k^{n_k} a_{k-1}^{n_{k-1}} \cdots a_1^{n_1}, for some non-negative integers n1,n2,,nkn_1, n_2, \ldots, n_k. We remark that, in the case of binary alphabets, there is an equivalence between words having simple BWT, the family of (circular) balanced words and the conjugates of standard words. In the case of alphabets of size greater than two, there is no more equivalence between these notions. As a main result of this paper we prove that, under assumption of balancing, the following three conditions on a word w are equivalent: i) w has simple BWT, ii) w is a circularly rich word, and iii) w is a conjugate of a finite epistandard word

    Sorting suffixes of a text via its Lyndon Factorization

    No full text
    The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. They are used for instance in the constructions of the Burrows-Wheeler transform and the suffix array, widely used in several fields of Computer Science. For this reason, several recent researches have been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization, so that the local suffixes inside factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can be adapted to different implementative scenarios

    Block Sorting-Based Transformations on Words: Beyond the Magic BWT

    Full text link
    The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression and later results have contributed to make it a fundamental tool for the design of self-indexing compressed data structures. The Alternating Burrows-Wheeler Transform (ABWT) is a more recent transformation, studied in the context of Combinatorics on Words, that works in a similar way, using an alternating lexicographical order instead of the usual one. In this paper we study a more general class of block sorting-based transformations. The transformations in this new class prove to be interesting combinatorial tools that offer new research perspectives. In particular, we show that all the transformations in this class can be used as booster for memoryless compressors and we provide an upper bound on the number of equal-letter runs in their output. Moreover, we introduce the notion of rank-invertibility, a property related to the implementation of an efficient inversion procedure. We show that the BWT and the Alternating BWT are the only rank-invertible transformations in the class we have defined

    Methods and systems for data analysis using the Burrows Wheeler transform

    No full text
    The present disclosure provides computer implemented methods and systems for analyzing datasets, such as large data sets output from nucleic acid sequencing technologies. In particular, the present disclosure provides for data analysis comprising computing the BWT of a collection of strings in an incremental, character by character, manner. The present disclosure also provides compression boosting strategies resulting in a BWT of a reordered collection of data that is more compressible by second stage compression methods compared to non-reordered computational analysis
    corecore