325 research outputs found
Front Matter, Table of Contents, Preface, Conference Organization
Front Matter, Table of Contents, Preface, Conference Organizatio
CSC: Circular Strings Comparison
Description: Given two sequences x and y, CSC finds the cyclic rotation of x (or an approximation of it) that minimises the blockwise q-gram distance from y.
Installation: To compile CSC, please follow the instructions given in file INSTALL.
Citation:
Roberto Grossi, Costas S. Iliopoulos, Robert Mercas, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Fatima Vayani:
Circular Sequence Comparison with q-grams.
WABI 2015: 203-216.
License: GNU GPLv3 License; Copyright (C) 2015 Solon P. Pissis, Ahmad Retha and Fatima Vayani
Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line
An elastic-degenerate string is a sequence of n sets of strings of total length N. It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm^2+N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N * ceil[m/w]) with pre-processing time and space O(m * ceil[m/w]), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M. A straightforward generalization of the existing bit-vector algorithm would require time O(N * ceil[M/w]) with pre-processing time and space O(M * ceil[M/w]), which is prohibitive in practice. We present a new on-line O(N * ceil[M/w])-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population's variants are considered
LIPIcs, Volume 312, WABI 2024, Complete Volume
LIPIcs, Volume 312, WABI 2024, Complete Volum
Faster Approximate Elastic-Degenerate String Matching - Part A
An elastic-degenerate (ED) string is a sequence = [1] ⋯ [n] of n finite sets of strings. The cardinality m of is the total number of strings in [i], for all i ∈ [1..n]. The size N of is the total length of all m strings of . ED strings have been introduced to represent a set of closely-related DNA sequences. Let P = P[1..p] be a pattern of length p and k > 0 be an integer. We consider the problem of k-Approximate ED String Matching (EDSM): searching k-approximate occurrences of P in the language of . We call k-Approximate EDSM under the Hamming distance, k-Mismatch EDSM; and we call k-Approximate EDSM under edit distance, k-Edit EDSM.
Bernardini et al. (Theoretical Computer Science, 2020) showed a simple (k m p + kN)-time algorithm for k-Mismatch EDSM and an (k² m p + kN)-time algorithm for k-Edit EDSM. We improve the dependency on k in both results, obtaining an Õ(k^{2/3}mp+√kN)-time algorithm for k-Mismatch EDSM and an Õ(kmp+ kN)-time algorithm for k-Edit EDSM.
Bernardini et al. (Theory of Computing Systems, 2024) presented several algorithms for 1-Approximate EDSM working in Õ(np²+N) time. They have also left the possibility to generalize these solutions for k > 1 as an open problem. We improve the runtime of their solution for 1-Mismatch and 1-Edit EDSM from Õ(np²+N) to (np²+N). We further show algorithms for k-Approximate EDSM for the Hamming and edit distances working in Õ(np² + N) time, for any constant k > 0.
Finally, we show how our techniques can be applied to improve upon the complexity of the k-Approximate ED String Intersection and k-Approximate Doubly EDSM problems that were introduced very recently by Gabory et al. (Information and Computation, 2025)
Front Matter, Table of Contents, Preface, Conference Organization
Front Matter, Table of Contents, Preface, Conference Organizatio
Internal Shortest Absent Word Queries
Given a string T of length n over an alphabet Σ ⊂ {1,2,…,n^{(1)}} of size σ, we are to preprocess T so that given a range [i,j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i]⋯ T[j] of T. For any positive integer k ∈ [1,log log_σ n], we present an ((n/k)⋅ log log_σ n)-size data structure, which can be constructed in (nlog_σ n) time, and answers queries in time (log log_σ k)
Size-Constrained Weighted Ancestors with Applications
The weighted ancestor problem on a rooted node-weighted tree T is a generalization of the classic predecessor problem: construct a data structure for a set of integers that supports fast predecessor queries. Both problems are known to require Ω(log log n) time for queries provided (n poly log n) space is available, where n is the input size. The weighted ancestor problem has attracted a lot of attention by the combinatorial pattern matching community due to its direct application to suffix trees. In this formulation of the problem, the nodes are weighted by string depth. This research has culminated in a data structure for weighted ancestors in suffix trees with (1) query time and an (n)-time construction algorithm [Belazzougui et al., CPM 2021]. In this paper, we consider a different version of the weighted ancestor problem, where the nodes are weighted by any function weight that maps each node of T to a positive integer, such that weight(u) ≤ size(u) for any node u and weight(u₁) ≤ weight(u₂) if node u₁ is a descendant of node u₂, where size(u) is the number of nodes in the subtree rooted at u. In the size-constrained weighted ancestor (SWA) problem, for any node u of T and any integer k, we are asked to return the lowest ancestor w of u with weight at least k. We show that for any rooted tree with n nodes, we can locate node w in (1) time after (n)-time preprocessing. In particular, this implies a data structure for the SWA problem in suffix trees with (1) query time and (n)-time preprocessing, when the nodes are weighted by weight. We also show several string-processing applications of this result
Efficient Index for Weighted Sequences
The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold 1/z, we say that a pattern string P matches a weighted text at position i if the product of probabilities of the letters of P at positions i,...,i+|P|-1 in the text is at least 1/z. In this article, we present an O(nz)-time construction of an O(nz)-sized index that can answer pattern matching queries in a weighted text in optimal time improving upon the state of the art by a factor of z log z. Other applications of this data structure include an O(nz)-time construction of the weighted prefix table and an O(nz)-time computation of all covers of a weighted sequence, which improve upon the state of the art by the same factor
- …
