1,721,088 research outputs found

    Of Maps Bigger than the Empire (Keynote Paper)

    No full text
    in a passage by J.LBorges on the "exactituide of Science", a fictituos author describes an Empire in which the art of Cartography "logro' tal perfeccions que el mapa de una sola Provicia ocupaba toda la Ciudad, y el mapa del imperio ocupaba toda la Provincia". With time, these huge maps woudn't be enough and the Colleges of Cartographers erected a map that equalled in width the Empire itself... This paper concerns itself with increasing cases of pattern discovery and data mining in which synopses, indeces and relationships thereof seem to grow faster and bigger than the phenomena they were meant to encapsulate. The paper then reviews specifics examples of algorithmic and combinatorial constructs that proved capable of alleviating such paradoxes in the author's recent work experience

    Of Lempel-Ziv-Welch Parses with Refillable Gaps.

    No full text
    We consider gapped variants of classical data compression paradigms (Ziv, J. and Lempel, A.,1977, 1978; Welch, T.A., 1984). In the original algorithm, phrases are identified and stored in a dictionary on-the-fly as the text-file is scanned. The entries in the dictionary are then matched against the incoming string, thereby determining the next codeword, and this gives the method an inherently linear-time implementation. In our variants, the phrases used in compression are selected among suitably chosen strings of intermittently solid and wild characters produced by the autocorrelation of the source-string, in a way that still preserves linearity of time. At the receiver, gaps can be filled back exactly, interpolated, or left blank, so that lossless, as well as lossy, implementations are possible. However, the focus of the paper is on lossy variants

    Fast gapped variants for Lempel-Ziv-Welch compression.

    No full text
    Variants of classical data compression paradigms by Ziv, Lempel, and Welch are proposed in which the phrases used in compression are selected among suitably chosen strings of intermittently solid and wild characters produced by the autocorrelation of the sourcestring. Adaptations and extensions of the classical ZL78 paradigm as implemented by Welch are developed along these lines, and they are easily seen to be susceptible of simple linear time implementation. Both lossy and lossless schemata are considered, and preliminary analyses of performance are attempted

    Pattern Discovery and the Algorithmics of Surprise (Invited Paper)

    No full text
    Many tasks of contemporary Molecular Biology rely increasingly on au- tomated techniques for the discovery of interesting patterns and associations among them, both in individual sequences and across sequence families. A number of computational models and tools have been set up in recent years in response to these needs. This paper concentrates on approaches based on discrete combinatorial algorithms. As the order of magnitude of sequences and sequence repositories scales up to genomic proportions, problems of massive pattern and association discovery pose daunting methodological and algorith- mic challenges. Several formulations of pattern discovery implicate tables and synopses which, far from being a fraction of the input size, grow non-linearly, even exponentially with it. Often, these phenomena are intrinsic to the problem formulations and are not easily reverted. In some cases, a prudent mixture of combinatorial, algorithmic and statistical properties enable us to control such paradoxes. The focus of the present paper is thus on such “space-conscious” approaches to pattern discovery. The paper reflects mostly recent experience and work by its author, and does not pretend to be exhaustive

    Pattern Discovery in the Crib of Procrustes

    No full text
    The study of physics purports to concise descriptors or theories, good at predicting a virtually unlimited set of replicas of a phenomenon of a certain nature. The discovery of patterns or structure in discrete objects pursues a similar goal, but it departs from the inference of physical laws in so far as the ensuing generation of unlimited replicas may be a curse rather than a blessing. Decades after the facts, an engineer turned computer scientist and still struggling with his math speculates about the origins of a physicist’s fascination with the essence of complexity and structure; and how they can be inferred from examples. Which led to several and still largely unanswered questions, but ultimately helped shaping many a quest for a lifetime

    Sequence Alignment in Molecular Biology

    No full text
    Molecular biology is becoming a computationally intense realm of contemporary science and faces some of the current grand scientific challenges. In its context, tools that identify, store, compare and analyze effectively large and growing numbers of bio-sequences are found of increasingly crucial importance. Biosequences are routinely compared or aligned, in a variety of ways, to infer common ancestry, to detect functional equivalence, or simply while searching for similar entries in a database. A considerable body of knowledge has accumulated on sequence alignment during the past few decades. Without pretending to be exhaustive, this paper attempts a survey of some criteria of wide use in sequence alignment and comparison problems, and of the corresponding solutions. The paper is based on presentations and literature given at the Workshop on Sequence Alignment held at Princeton, N.J., in November 1994, as part of the DIMACS Special Year on Mathematical Support for Molecular Biology

    Pattern Matching Algorithms

    No full text
    Oxford University Pres

    Scoring Unusual Words with Varying Mismatch Errors

    No full text
    Patterns consisting of strings with a bounded number of mismatches are central to coding theory and find multiple applications in text processing and computational biology. In this latter field, the presence of over-represented patterns of this kind has been linked, for instance, to modeling regulatory regions in biosequences. The study and computation of expected number of occurrences and related scores for these patterns is made difficult by the sheer explosion of the roster of candidates that need to be evaluated. In recent work, properties of pattern saturation and score monotonicity have proved capable to mitigate this problem. In such a context, expectation and score monotonicity has been established within the i.i.d. model for all cases of interest except that of a fixed word length with a varying number of mismatches. The present paper completes this investigation by showing that the expected number of occurrences in a textstring for such a word is bi-tonic, that is, behaves as a unimodal function of the number of errors. This extends to this case the time and space savings brought about by discovery algorithms based on pattern saturation
    corecore