1,720,992 research outputs found

    Static optimality theorem for external memory string access

    No full text
    Data warehouses are increasingly storing and managing large scale string data, and dealing with large volume of transactions that update and search string data. Motivated by this context, we initiate the study of self-adjusting data structures for string dictionary operations, that is, data structures that are designed to be efficient on an entire sequence rather than individual string operations. Furthermore, we study this problem in the external memory model where string data is too massive to be stored in internal memory and has to reside in disks; each access to a disk page fetches B items, and the cost of the operations is the number of pages accessed (I/Os). We show that given a dictionary of n strings S-1,...,S-n of total length N, a sequence of m string searches S-i1,Si-2,...,S-im takes O(Sigma(j=1)(m)((B)/(\Sij\)) + Sigma(i=1)(n)(n(i)log(Bni)/(m))) expected I/Os, where n(i) is the number of times S-i is queried. Inserting or deleting a string S takes O((\S\)(B) + log(B) n) expected amortized I/Os. The Sigma(j=1)(m) (\Sij\)(B) term is a lower bound for reading the input; the Sigma(i=1)(n) n(i) log(B) (ni)/(m) term is the entropy of the query sequence and is a standard information-theoretic lower bound. This is the Static Optimality Theorem for external-memory string access. The static optimality theorem was first formalized and proved by Tarjan and Sleator for numerical dictionary in their classic splay trees paper in 1985 [16]; they left open the B-tree case for numerical values (page 684), and a fortiori, the case of string data in an external-memory setting, that we settle here. We obtain our result not by using traditional "splay" operations on search trees as in [16], but by designing a novel and conceptually simple self-adjusting data structure based on the well-known skip lists

    A data structure for a sequence of string accesses in external memory

    No full text
    We introduce a new paradigm for querying strings in external memory, suited to the execution of sequences of operations. Formally, given a dictionary of n strings S1, . . . , Sn, we aim at supporting a search sequence for m not necessarily distinct strings T1, T2, . . . , Tm, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches B items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones. Our approach relies on a novel and conceptually simple self-adjusting data structure (SASL) based on skip lists, that is also interesting per se

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed

    On the mining of artful processes

    Full text link
    Artful processes are those processes in which the experience, intuition, and knowledge of the actors are the key factors in determining the decision making. These knowledge-intensive processes are typically carried out by the “knowledge workers”, such as professors, managers, researchers. They are often scarcely formalised or completely unknown a priori, and depend on the skills, experience, and judgment of the primary actors. Artful processes have goals and methods that change quickly over time, making them difficult to codify in the context of an enterprise application. Knowledge workers cannot be realistically expected to instruct the assistive system by modelling their artful processes: it would be time-consuming both in the initial definition and in the potential continuous revisions. To make things worse, time is the crucial resource that usually knowledge workers indeed lack. Despite the advent of structured case management tools, many enterprise processes are still “run” over emails. Thus, reverse engineering workflows of such processes and their integration with artefacts and other structured processes can accurately depict the enterprise’s process landscape. A system able to infer the models of the processes laying behind the email messages exchanged would be valuable and the result could materialise almost freely. This is the purpose of our approach, which is the core of this thesis and is named MailOfMine. Its investigation mainly resides in the Machine Learning area. More specifically, it relates to Information Retrieval (IR) and Process Mining (PM). We adopted well-known IR techniques in order to extract the activities out of the email messages. We propose a new algorithm for PM in order to discover the temporal rules that the activities adhere to: MINERful. The set of such rules, intended as temporal constraints, constitute the so called declarative modelling of workflows. Declarative models differ from the imperative in that they do not explicitly represent every possible execution that a process can be enacted through, i.e., there is no graph-like structure determining the whole evolution of a process instance, from the beginning to the end. They establish a set of constraints that must hold true, whatever the evolution of the process instance will be. What is not explicitly declared to be respected, is allowed. The reader can easily see that it is better suited to processes subject to frequent changes, with respect to the classical approach. From a more abstract perspective, this work challenges the problem of discovering highly flexible workflows (such as artful processes), out of semi-structured information (such as email messages)

    Variations on the Author

    Full text link
    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship

    Appropriate Similarity Measures for Author Cocitation Analysis

    Full text link
    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.information science;Pearson correlation;cosine;similarity measure;author cocitation analysis

    Opportunistic data structures with applications

    No full text
    There is an upsurging interest in designing succinct data structures for basic searching problems (see [Munro99] and references therein). The motivation has to be found in the exponential increase of electronic data nowadays available which is even surpassing the significant increase in memory and disk storage capacities of current computers. Space reduction is an attractive issue because it is also intimately related to performance improvements as noted by several authors (e.g. Knuth [knuth3], Bentley [bentley]). In designing these implicit data structures the goal is to reduce as much as possible the auxiliary information kept together with the input data without introducing a significant slowdown in the final query performance. Yet input data are represented in their entirety thus taking no advantage of possible repetitiveness into them. The importance of those issues is well known to programmers who typically use various tricks to squeeze data as much as possible and still achieve good query performance. Their approaches, thought, boil down to heuristics whose effectiveness is witnessed only by experimentation. In this paper, we address the issue of compressing and indexing data by studying it in a theoretical framework. We devise a novel data structure for indexing and searching whose space occupancy is a function of the entropy of the underlying data set. The novelty resides in the careful combination of a compression algorithm, proposed by Burrows-Wheeler [bw], with the structural properties of a well known indexing tool, the Suffix Array [MM93]. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible at no significant slowdown in the query performance and without any assumption on a particular fixed distribution. More precisely, its space occupancy is optimal in a information-content sense because a text T[1,u]T[1,u] is stored using O(k(T))+o(1)O(k(T)) + o(1) bits per input symbol, where k(T)k(T) is the kkth order entropy of TT (the bound holds for any fixed kk). Given an arbitrary string P[1,p]P[1,p], the opportunistic data structure allows to search for the occocc occurrences of PP in TT requiring O(p+occlogϵu)O(p + occ \log^\epsilon u) time complexity (for any fixed ϵ>0\epsilon >0). If data are non compressible, then we achieve the best space bound currently known [GV00]; otherwise our solution improves the succinct suffix array in [GV00] and the classical suffix tree and suffix array data structures either in space or in query time complexity or both. It was a belief [Witten:1999:MGC] that some space overhead should be paid to use full-text indices (i.e. suffix trees or suffix arrays) with respect to the word-based indices (i.e. inverted lists). The results in this paper show that no space overhead is needed at all, and as an application we improve space and query time complexity of the well-known Glimpse tool [glimpse]. We finally investigate the modifiability of our opportunistic data structure by studying how to choreograph its basic ideas with a dynamic setting thus achieving effective searching and updating time bounds

    Dispelling the Myths Behind First-author Citation Counts

    Full text link
    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods
    corecore