1,721,150 research outputs found
CTTP
We describe an optimal randomized MapReduce algorithm for the problem of triangle enumeration. This generalizes the well-known vertex partitioning approach proposed in (Suri and Vassilvitskii, 2011) to multiple rounds, significantly increasing the size of the graphs that can be handled on a given system. We also give new theoretical (high probability) bounds on the work needed in each reducer, addressing the "curse of the last reducer". Indeed, our work is the first to give guarantees on the maximum load of each reducer for an arbitrary input graph
Confirmation sampling for exact nearest neighbor search
Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC ’98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest neighbor problem, in practice LSH data structures with suitably chosen parameters are used to solve the exact nearest neighbor problem (with some error probability). Sublinear query time is often possible in practice even for exact nearest neighbor search, intuitively because the nearest neighbor tends to be significantly closer than other data points. However, theory offers little advice on how to choose LSH parameters outside of pre-specified worst-case settings.
We introduce the technique of confirmation sampling for solving the exact nearest neighbor problem using LSH. First, we give a general reduction that transforms a sequence of data structures that each find the nearest neighbor with a small, unknown probability, into a data structure that returns the nearest neighbor with probability 1−δ
, using as few queries as possible. Second, we present a new query algorithm for the LSH Forest data structure with L trees that is able to return the exact nearest neighbor of a query point within the same time bound as an LSH Forest of Ω(L) trees with internal parameters specifically tuned to the query and data
A Generic Framework for Engineering Graph Canonization Algorithms
The state-of-the-art tools for practical graph canonization are all based on the individualization-refinement paradigm, and their di erence is primarily in the choice of heuristics they include and in the actual tool implementation. It is thus not possible to make a direct comparison of how individual algorithmic ideas a ect the performance on di erent graph classes. We present an algorithmic software framework that facilitates implementation of heuristics as independent extensions to a common core algorithm. It therefore becomes easy to perform a detailed comparison of the performance and behaviour of di erent algorithmic ideas. Implementations are provided of a range of algorithms for tree traversal, target cell selection, and node invariant, including choices from the literature and new variations. The framework readily supports extraction and visualization of detailed data from separate algorithm executions for subsequent analysis and development of new heuristics. Using collections of di erent graph classes we investigate the e ect of varying the selections of heuristics, often revealing exactly which individual algorithmic choice is responsible for particularly good or bad performance. On several benchmark collections, including a newly proposed class of di cult instances, we additionally find that our implementation performs better than the current state-of-the-art tools.</p
Fast and Space-Efficient Perfect Hashing
From data analytics to machine learning, large amounts of input data become more and more important. The volume of collected data grows continuously. Compact data structures enable efficient access to this data and help with processing. Perfect hashing is a fundamental basis for many compact data structures and is used for example in databases, hash tables, retrieval data structures, and approximate membership data structures. A perfect hash function (PHF) maps a set of keys to the first integers without collisions, and is called minimal perfect (MPHF) if . It may map keys that are not in to an arbitrary value. This makes it possible to store the function without representing the set itself.
In this dissertation, we present three MPHF construction algorithms. With SIMDRecSplit, we focus on space efficiency, scratching the space lower bound of representing MPHFs. We enhance the existing construction RecSplit through parallelism on many levels - bits, vectors, multicores, and the GPU. As a base case, RecSplit uses brute-force search - here we achieve impressive speedups by replacing several retries with simple bit operations or even table lookups. With SicHash, we aim for a good balance between query and construction performance. SicHash is based on sampling a random graph where the nodes are the output hash values. Each key, hashed with different ordinary hash functions, gives a hyperedge connecting candidate output values. The perfect hash function is given by an orientation of the edges such that the indegree of each node is at most one. SicHash works close to the orientability threshold - the factor below which such an orientation likely exists. With ShockHash, we combine the ideas of SIMDRecSplit and SicHash. We try to orient a graph far above the orientability threshold, which requires retrying with many different graphs. Compared to plain brute-force, this reduces the number of retries massively, leading to more than times faster construction asymptotically. At the same time, it still maintains the asymptotically optimal space consumption.
Monotone minimal perfect hash functions (MMPHFs) retain the natural order of the input keys. After many years in which tree-based structures were predominant, our LeMonHash moves into a novel direction. It is based on learning regularities in the input data and is efficient to query because it uses a flat structure. Finally, as a relaxation of perfect hashing, we present PaCHash - a static external memory hash table for objects of variable size. PaCHash can fetch an object using a single contiguous access to the external memory. For objects of fixed size, this is similar to what is possible using -perfect hashing, where at most collisions are allowed. However, in some way, PaCHash breaks the space lower bounds of -perfect hashing at the cost of sometimes fetching one external memory block too much.
The approaches presented in this dissertation are a large step forward from the state of the art. Our approaches dominate the majority of the space-time trade-off, as shown by our detailed experimental evaluation, and inspire future research in the area of perfect hashing
Linear time canonicalization and enumeration of non-Isomorphic 1-Face embeddings
Antiparallel strong traces (ASTs) are a type of walks in graphs which use every edge exactly twice. They correspond to 1-face embeddings in orientable surfaces and can be used to design self-assembling protein or DNA strands. Based on a novel canonical form invariant for ASTs, gap vector, we provide a linear-time isomorphism test for ASTs and thus, also for orientable 1-face embeddings of graphs. Using the canonical form, we develop an algorithm for enumerating all pairwise non-isomorphic 1-face embeddings of graphs. We compare our algorithm with an independent implementation of a recent algebraic approach (Baši? et al., MATCH Commun. Math. Comput. Chem. 78 (3), 2017) on large data sets. Our results yield the first large-scale enumeration of non-isomorphic embeddings and investigation of their properties
Optimal prefix-suffix queries with applications
We revisit the classic border tree data structure [Gu, Farach, Beigel, SODA 1994] that answers the following prefix-suffix queries on a string T of length n over an integer alphabet Σ = [0, σ): for any i, j ∈ [0, n) return all occurrences of T in T[0.. i]T[j.. n−1]. The border tree of T can be constructed in O(n) time and answers prefix-suffix queries in O(log n+Occ) time, where Occ is the number of occurrences of T in T[0.. i]T[j.. n−1]. Our contribution here is the following. We present a completely different and remarkably simple data structure that can be constructed in the optimal O(n/logσ n) time and supports queries in the optimal O(1) time. Our result is based on a new structural lemma that lets us encode the output of any query in constant time and space. We also show a new direct application of our result in pattern matching on node-labeled graphs.</p
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
Über die Analyse von zwei fundamentalen randomisierten Algorithmen
Im ersten Teil der vorliegenden Arbeit werden
Multi-Pivot-Quicksort-Algorithmenbetrachtet. Die Idee mehr als ein
Pivotelement im Quicksort-Algorithmus zunutzen, erschien über viele Jahre
als unpraktikabel. Dies änderte sich, als imJahr 2009 ein
Dual-Pivot-Algorithmus von V. Yaroslavskiy zumStandard-Sortierverfahren in
Java 7 wurde. Die vorliegende Arbeit stellt eineStudie von
Multi-Pivot-Quicksort-Algorithmen dar, also Quicksort-Varianten, diemit
mehr als einem Pivotelement arbeiten. Sie beschreibt
dieKonstruktionsprinzipien von 2-Pivot-Algorithmen in Bezug auf die bei
derSortierung notwendigen Schlüsselvergleiche. Ein Ergebnis dieser
Untersuchungsind zwei optimale und leicht zu implementierende
2-Pivot-Algorithmen. DieVerallgemeinerung auf >= 3 Pivotelemente benötigt
nur kleine Anpassungen. DieseArbeit betrachtet außerdem die theoretische
Analyse von Kostenmaßen, die esermöglichen,
Multi-Pivot-Quicksort-Algorithmen hinsichtlich ihres Speicher-
undCacheverhaltens zu vergleichen. Sie schließt mit einer Laufzeitstudie
derbesprochenen Algorithmen.
Der zweite Teil der Arbeit beschäftigt sich mit dem Einsatz von
Hashfunktionenin Algorithmen und Datenstrukturen. Hashfunktionen bilden
eine Kernkomponente,z.B. beim Aufbau einer Hashtabelle oder bei
Lastbalancierung. Oft wird dabeieine unrealistische Annahme getätigt: Die
Hashwerte seien voll zufällig. DieSpeicherplatzkomplexität einer solchen
Funktion ist für den praktischen Einsatzfür unverhältnismäßig hoch.
Das Ziel ist, einfache Konstruktionen zu finden,deren Zufallseigenschaften
beweisbar gut sind. Diese Arbeit beschreibt einesolche einfache
Konstruktion von Hashfunktionen, die in einer Vielzahl vonAnwendungen
beweisbar gut ist. Zu diesen Anwendungen zählen Cuckoo Hashing miteinem
sogenannten Stash, die Konstruktion einer perfekten Hashfunktion,
dieSimulation einer uniformen Hashfunktion, verschiedene Algorithmen
zurLastbalancierung und verallgemeinertes Cuckoo Hashing in einer
leichtabgeschwächten Variante mit verschiedenen Einfügealgorithmen. Der
zentraleBeitrag dieser Dissertation ist ein einheitliches Analysekonzept.
Diesesermöglicht es, eine auf Hashfunktionen basierende Datenstruktur
oder einen aufHashfunktionen basierenden Algorithmus nur mit Mitteln der
Theorie vonZufallsgraphen zu analysieren, ohne Details der Hashfunktion
offenzulegen.The first part of this thesis considers multi-pivot quicksort algorithms.
Theidea of using more than one pivot element in quicksort was considered to
beimpractical. This changed when a dual-pivot algorithm replaced
thewell-engineered quicksort-based sorting algorithm in Java 7. The
thesispresents a detailed study of multi-pivot quicksort algorithms. It
explains theunderlying design choices for dual-pivot quicksort algorithms
with respect tothe comparisons they make when sorting the input. Moreover,
it describes twoeasy to implement optimal dual-pivot algorithms To analyze
quicksort algorithmsusing more than two pivots, only slight modifications
to the dual-pivot caseare necessary. This thesis also considers the
theoretical analysis of thememory and cache behavior of multi-pivot
quicksort algorithms. At the end, itreports on a large-scale study on the
empirical running time of multi-pivotquicksort algorithms.
The second part of this thesis considers the use of hash functions
inalgorithms and data structures. Hash functions are applied in many
differentsituations, e.g., when building a hash table or when distributing
jobs amongmachines in load balancing. Traditionally, the analysis of a
particularhashing-based algorithm or data structure assumes that a hash
function mapskeys independently and uniformly at random to a range. Such
functions areunrealistic, since their space complexity is huge.
Consequently, the task is toconstruct explicit hash functions providing
provable theoretical guarantees.The thesis describes such a construction.
It will provide sufficient randomnessfor running many different
applications, such as cuckoo hashing with a stash,the construction of a
perfect hash function, the simulation of a uniform hashfunction, load
balancing, and generalized cuckoo hashing in a sparse settingwith two
alternative insertion algorithms. The main contribution of this partof the
thesis is a unified framework based on the first moment method.
Thisframework makes it possible to analyze a hashing-based algorithm or
datastructure only using random graph theory, without exploiting details of
thehash function. The hash functions are easy to implement and turn out to
bepractical while providing strong randomness guarantees
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
- …
