1,721,009 research outputs found
Searching for Jumbled Patterns in Strings
Cicalese F, Fici G, Lipták Z. Searching for Jumbled Patterns in Strings. In: Proc. of the Prague Stringology Conference 2009. 2009: 105-117
Hardness of MSA with Selective Column Scoring
Multiple Sequence Alignment (MSA for short) is a well known problem in the field of computational biology. In order to evaluate the quality of a solution, many different scoring functions have been introduced, the most widely used being the Sum-of-pairs score (SP-score). It is known that computing the best MSA under the SP-score measure is NP-hard. In this paper, we introduce a variant of the Column score (defined in Thompson et al. 1999), which we refer to as Selective Column score: Given a symbol a ∈ Σ, the score of the i-th column is one if and only if all symbols of the same column are a, and otherwise zero. The acolumn score of an alignment is then the number of columns made of only character a. We show that finding the optimal MSA under the Selective Column Score is NP-hard for all alphabets of size |Σ| ≥ 2 by reducing from MIN-2-SAT
A new strategy for querying priced information.
Cicalese F, Laber ES. A new strategy for querying priced information. In: Proc. 37th Annual ACM Symposium on Theory of Computing (STOC 2005). 2005: 674-683
A Note on Approximation of Uniform Distributions from Variable-to-Fixed Length Codes
Cicalese F, Gargano L, Vaccaro U. A Note on Approximation of Uniform Distributions from Variable-to-Fixed Length Codes. IEEE Transactions on Information Theory. 2006;52(8):3772-3777
Using Random Forest Distances for Outlier Detection
In recent years, a great variety of outlier detectors have been proposed in the literature, many of which are based on pairwise distances or derived concepts. However, in such methods, most of the efforts have been devoted to the outlier detection mechanisms, not paying attention to the distance measure - in most cases the basic Euclidean distance is used. Instead, in the clustering field, data-dependent measures have shown to be very useful, especially those based on Random Forests: actually, Random Forests are partitioners of the space able to naturally encode the relation between two objects. In the outlier detection field, these informative distances have received scarce attention. This manuscript is aimed at filling this gap, studying the suitability of these measures in the identification of outliers. In our scheme, we build an unsupervised Random Forest model, from which we extract pairwise distances; these distances are then input to an outlier detector. In particular, we study the impact of several Random Forest-based distances, including advanced and recent ones, on different outlier detectors. We evaluate thoroughly our methodology on nine benchmark datasets for outlier detection, focusing on different aspects of the pipeline, such as the parametrization of the forest, the type of distance-based outlier detector, and most importantly, the impact of the adopted distance
On Approximate Jumbled Pattern Matching in Strings
Burcsi P, Cicalese F, Fici G, Lipták Z. On Approximate Jumbled Pattern Matching in Strings. Theory of Computing Systems. 2012;50(1):35-51.Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of a Parikh vector q in the text s requires finding a substring t of s with p(t) = q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: Given a string s and two Parikh vectors u, v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u <= q <= v. This definition encompasses several natural versions of approximate Parikh vector search. We present an algorithm solving this problem in sub-linear expected time using a wavelet tree of s, which can be computed in time O(n) in a preprocessing phase. We then discuss a Scrabble-like variation of the problem, in which a weight function on the letters of s is given and one has to find all occurrences in s of a substring t with maximum weight having Parikh vector p(t) <= v. For the case of a binary alphabet, we present an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time
Efficient Reconstruction of RC-Equivalent Strings
Cicalese F, Erdős PLP, Lipták Z. Efficient Reconstruction of RC-Equivalent Strings. In: Proc. WOCA. Vol 6460. Berlin, Heidelberg: Springer Berlin Heidelberg; 2011: 349-362
- …
