1,720,992 research outputs found
Static optimality theorem for external memory string access
Data warehouses are increasingly storing and managing large scale string data, and dealing with large volume of transactions that update and search string data. Motivated by this context, we initiate the study of self-adjusting data structures for string dictionary operations, that is, data structures that are designed to be efficient on an entire sequence rather than individual string operations. Furthermore, we study this problem in the external memory model where string data is too massive to be stored in internal memory and has to reside in disks; each access to a disk page fetches B items, and the cost of the operations is the number of pages accessed (I/Os). We show that given a dictionary of n strings S-1,...,S-n of total length N, a sequence of m string searches S-i1,Si-2,...,S-im takes
O(Sigma(j=1)(m)((B)/(\Sij\)) + Sigma(i=1)(n)(n(i)log(Bni)/(m))) expected I/Os, where n(i) is the number of times S-i is queried. Inserting or deleting a string S takes O((\S\)(B) + log(B) n) expected amortized I/Os. The Sigma(j=1)(m) (\Sij\)(B) term is a lower bound for reading the input; the Sigma(i=1)(n) n(i) log(B) (ni)/(m) term is the entropy of the query sequence and is a standard information-theoretic lower bound. This is the Static Optimality Theorem for external-memory string access. The static optimality theorem was first formalized and proved by Tarjan and Sleator for numerical dictionary in their classic splay trees paper in 1985 [16]; they left open the B-tree case for numerical values (page 684), and a fortiori, the case of string data in an external-memory setting, that we settle here. We obtain our result not by using traditional "splay" operations on search trees as in [16], but by designing a novel and conceptually simple self-adjusting data structure based on the well-known skip lists
A data structure for a sequence of string accesses in external memory
We introduce a new paradigm for querying strings in external memory, suited to the execution of sequences of operations. Formally, given a dictionary of n strings S1, . . . , Sn, we aim at supporting a search sequence for m not necessarily distinct strings T1, T2, . . . , Tm, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches B items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones.
Our approach relies on a novel and conceptually simple self-adjusting data structure (SASL) based on skip lists, that is also interesting per se
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
On the mining of artful processes
Artful processes are those processes in which the experience, intuition, and knowledge of the actors are the key factors in determining the decision making. These knowledge-intensive processes are typically carried out by the “knowledge workers”, such as professors, managers, researchers. They are often scarcely formalised or completely unknown a priori, and depend on the skills, experience, and judgment of the primary actors. Artful processes have goals and methods that change quickly over time, making them difficult to codify in the context of an enterprise application. Knowledge workers cannot be realistically expected to instruct the assistive system by modelling their artful processes: it would be time-consuming both in the initial definition and in the potential continuous revisions. To make things worse, time is the crucial resource that usually knowledge workers indeed lack.
Despite the advent of structured case management tools, many enterprise processes are still “run” over emails. Thus, reverse engineering workflows of such processes and their integration with artefacts and other structured processes can accurately depict the enterprise’s process landscape. A system able to infer the models of the processes laying behind the email messages exchanged would be valuable and the result could materialise almost freely. This is the purpose of our approach, which is the core of this thesis and is named MailOfMine. Its investigation mainly resides in the Machine Learning area. More specifically, it relates to Information Retrieval (IR) and Process Mining (PM). We adopted well-known IR techniques in order to extract the activities out of the email messages.
We propose a new algorithm for PM in order to discover the temporal rules that the activities adhere to: MINERful. The set of such rules, intended as temporal constraints, constitute the so called declarative modelling of workflows. Declarative models differ from the imperative in that they do not explicitly represent every possible execution that a process can be enacted through, i.e., there is no graph-like structure determining the whole evolution of a process instance, from the beginning to the end. They establish a set of constraints that must hold true, whatever the evolution of the process instance will be. What is not explicitly declared to be respected, is allowed. The reader can easily see that it is better suited to processes subject to frequent changes, with respect to the classical approach.
From a more abstract perspective, this work challenges the problem of discovering highly flexible workflows (such as artful processes), out of semi-structured information (such as email messages)
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.information science;Pearson correlation;cosine;similarity measure;author cocitation analysis
Opportunistic data structures with applications
There is an upsurging interest in designing succinct data structures
for basic searching problems (see [Munro99] and references
therein). The motivation has to be found in the exponential increase
of electronic data nowadays available which is even surpassing the
significant increase in memory and disk storage capacities of current
computers. Space reduction is an attractive issue because it is also
intimately related to performance improvements as noted by several
authors (e.g. Knuth [knuth3], Bentley [bentley]). In
designing these implicit data structures the goal is to reduce
as much as possible the auxiliary information kept together with the
input data without introducing a significant slowdown in the final
query performance. Yet input data are represented in their entirety
thus taking no advantage of possible repetitiveness into them. The
importance of those issues is well known to programmers who typically
use various tricks to squeeze data as much as possible and still
achieve good query performance. Their approaches, thought, boil down
to heuristics whose effectiveness is witnessed only by
experimentation.
In this paper, we address the issue of compressing and indexing data
by studying it in a theoretical framework. We devise a novel data
structure for indexing and searching whose space occupancy is a
function of the entropy of the underlying data set. The novelty
resides in the careful combination of a compression algorithm,
proposed by Burrows-Wheeler [bw], with the structural properties
of a well known indexing tool, the Suffix Array [MM93]. We call
the data structure opportunistic since its space occupancy is
decreased when the input is compressible at no significant slowdown in
the query performance and without any assumption on a particular fixed
distribution. More precisely, its space occupancy is optimal in a
information-content sense because a text is stored using
bits per input symbol, where is
the th order entropy of (the bound holds for any fixed
). Given an arbitrary string , the opportunistic data
structure allows to search for the occurrences of in
requiring time complexity (for any
fixed ). If data are non compressible, then we achieve
the best space bound currently known [GV00]; otherwise our
solution improves the succinct suffix array in [GV00] and the
classical suffix tree and suffix array data structures either in space
or in query time complexity or both.
It was a belief [Witten:1999:MGC] that some space overhead should
be paid to use full-text indices (i.e. suffix trees or suffix arrays)
with respect to the word-based indices (i.e. inverted lists). The
results in this paper show that no space overhead is needed at
all, and as an application we improve space and query time complexity
of the well-known Glimpse tool [glimpse].
We finally investigate the modifiability of our opportunistic data
structure by studying how to choreograph its basic ideas with a dynamic
setting thus achieving effective searching and updating time bounds
Dispelling the Myths Behind First-author Citation Counts
We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued
use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation
counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more
sophisticated methods
- …
