PtrHash: Minimal Perfect Hashing at RAM Throughput

Author: Groot Koerkamp Ragnar
Publication date: 01/01/2025

Motivation. Given a set K of n keys, a minimal perfect hash function (MPHF) is a collision-free bijective map H_mphf from K to {0, …, n-1}. These functions are used in databases, search engines, and bioinformatics indexing tools such as Pufferfish (which uses BBHash) and Piscem (which uses PTHash). PTHash is also used in SSHash, a data structure on k-mers that supports membership queries. PTHash takes only around 5% of the total space of SSHash, so trading slightly more space for faster queries is beneficial. This work therefore presents a (minimal) perfect hash function that prioritizes query throughput first, while also allowing efficient construction for 10⁹ or more elements using 2.4 bits of memory per key.
Contributions. Both PTHash and PHOBIC first map all n keys to n/λ < n buckets. Each bucket then stores a pilot that controls the final hash value of the keys mapping to it. PtrHash builds on this by using 1) fixed-width (uncompressed) 8-bit pilots and 2) a construction algorithm similar to Cuckoo hashing to find suitable pilot values. Further, it partitions the keys, so that keys in each part map to their own set of slots. PtrHash 3) uses the same number of buckets and slots for each part, with 4) a single remap table to map intermediate positions ≥ n to < n, 5) encoded using per-cacheline Elias-Fano coding. Lastly, 6) PtrHash supports streaming queries, where prefetching is used to answer a stream of multiple queries more efficiently than one-by-one processing.
Results. With default parameters, PtrHash takes 2.4 bits per key. On 300 million string keys, PtrHash is as fast or faster to build than other MPHFs of a similar size, and at least 2.1× faster to query. When streaming multiple queries, this improves to a 3.3× speedup over the fastest alternative, while also being significantly faster to construct. When using 10⁹ integer keys instead, query times are as low as 12 ns/key when iterating in a for loop, or even down to 8 ns/key when using the streaming approach, just short of the 7.4 ns inverse throughput of random memory accesses.
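
The bucket, pilot, and remap ideas in the abstract above can be illustrated with a small toy sketch. This is not the PtrHash implementation: it searches pilots by brute force instead of the Cuckoo-style eviction the abstract mentions, stores the remap table as a plain list rather than Elias-Fano coded, and does no partitioning or prefetching. The class name `ToyPilotMphf`, the helper `_hash`, and the parameters `lam`, `alpha`, and `max_seeds` are all hypothetical names and illustrative values chosen for this example.

```python
import hashlib


def _hash(key: str, salt: str) -> int:
    """Deterministic 64-bit hash of (salt, key); a stand-in for a fast hash."""
    data = f"{salt}|{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


class ToyPilotMphf:
    """Toy minimal perfect hash: one 8-bit pilot per bucket plus a remap table."""

    def __init__(self, keys, lam=3.0, alpha=0.8, max_seeds=100):
        self.n = len(keys)
        self.num_buckets = max(1, int(self.n / lam))       # about n / lam buckets
        self.num_slots = max(self.n, int(self.n / alpha))  # slightly more than n slots
        for seed in range(max_seeds):                      # retry with a new seed on failure
            if self._try_build(keys, seed):
                return
        raise RuntimeError("construction failed; are all keys distinct?")

    def _bucket(self, key):
        return _hash(key, f"b{self.seed}") % self.num_buckets

    def _slot(self, key, pilot):
        return _hash(key, f"s{self.seed}:{pilot}") % self.num_slots

    def _try_build(self, keys, seed):
        self.seed = seed
        buckets = [[] for _ in range(self.num_buckets)]
        for key in keys:
            buckets[self._bucket(key)].append(key)
        self.pilots = bytearray(self.num_buckets)          # fixed-width 8-bit pilots
        taken = [False] * self.num_slots
        # Place the largest buckets first, while most slots are still free.
        for b in sorted(range(self.num_buckets), key=lambda i: -len(buckets[i])):
            for pilot in range(256):
                slots = [self._slot(key, pilot) for key in buckets[b]]
                if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                    for s in slots:
                        taken[s] = True
                    self.pilots[b] = pilot
                    break
            else:
                return False                               # no 8-bit pilot fits this bucket
        # Remap every used position >= n to a free position < n.
        free = [s for s in range(self.n) if not taken[s]]
        self.remap = [free.pop() if taken[p] else -1
                      for p in range(self.n, self.num_slots)]
        return True

    def index(self, key):
        p = self._slot(key, self.pilots[self._bucket(key)])
        return p if p < self.n else self.remap[p - self.n]
```

For a set of distinct keys, `ToyPilotMphf(keys).index(key)` returns a distinct value in {0, …, n-1} for each key; for a key outside the set the result is meaningless, as with any MPHF.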

DROPS Dagstuhl Research Online Publication Server

RagnarGrootKoerkamp/astar-pairwise-aligner

Author: Groot Koerkamp Ragnar
Publication date: 01/01/2024

DROPS Dagstuhl Research Online Publication Server

RagnarGrootKoerkamp/PtrHash

Author: Groot Koerkamp Ragnar
Publication date: 01/01/2025

DROPS Dagstuhl Research Online Publication Server

Exact global alignment using A* with chaining seed heuristic and match pruning

Author: Groot Koerkamp Ragnar
Ivanov Pesho
Publication date: 2024

ISSN: 1367-4803, ISSN: 1460-2059, ISSN: 1460-205

ETHzürich Repository for Publications and Research Data

Error Correction in Automatic Speech Recognition

Author: Groot Koerkamp Ragnar
Weisz Ágoston
Publication date: 2020

This disclosure describes techniques to correct errors in automatic speech recognition, e.g., as performed to recognize spoken queries from a user to a virtual assistant or other application. A machine learning model detects potentially misrecognized n-grams in the transcribed text, which are then underlined in a user interface. A user can tap on an underlined n-gram, or on another portion of the transcribed text, to activate a dropdown menu that presents alternatives to the transcribed text. The alternatives can be based on speech hypothesis scores. To correct an error in the transcribed text, the user picks an alternative from the dropdown menu or, in the absence of a suitable alternative, types in the correction. With user permission, the error and the corresponding correction are used as training data to improve model performance.

Technical Disclosure Commons

rust-seq/packed-seq

Author: Groot Koerkamp Ragnar
Martayan Igor
Publication date: 01/01/2025

DROPS Dagstuhl Research Online Publication Server

rust-seq/simd-minimizers

Author: Groot Koerkamp Ragnar
Martayan Igor
Publication date: 01/01/2025

DROPS Dagstuhl Research Online Publication Server

SimdMinimizers: Computing Random Minimizers, fast

Author: Groot Koerkamp Ragnar
Martayan Igor
Publication date: 01/01/2025

Motivation. Because of the rapidly growing amount of sequencing data, computing sketches of large textual datasets has become an essential preprocessing task. These sketches are typically much smaller than the input sequences, but preserve sufficient information for downstream analysis. Minimizers are an especially popular sketching technique and are used in a wide variety of applications. They sample at least one out of every w consecutive k-mers. As DNA sequencers are getting more accurate, some applications can afford to use a larger w and hence sparser and smaller sketches. And as sketches get smaller, their analysis becomes faster, so the time spent sketching the full-sized input becomes more of a bottleneck.
Methods. Our library simd-minimizers implements a random minimizer algorithm using SIMD instructions. It supports both AVX2 and NEON architectures. Its main novelty is two-fold. First, it splits the input into 8 chunks that are streamed over in parallel through all steps of the algorithm. This is enabled by using the completely deterministic two-stacks sliding window minimum algorithm, which seems not to have been used before for finding minimizers.
Results. Our library is up to 6.8× faster than a scalar implementation of the rescan method for small w = 5, and 3.4× faster for larger w = 19. Computing canonical minimizers is less than 50% slower than computing forward minimizers, and over 15× faster than the existing implementation in the minimizer-iter crate. Our library finds all (canonical) minimizers of a 3.2 Gbp human genome in 5.2 (resp. 6.7) seconds.
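
As a rough illustration of the two-stacks sliding window minimum mentioned in the abstract above, the following scalar sketch computes forward random minimizers of a string. It is not the simd-minimizers implementation: there is no SIMD, no splitting into 8 chunks, and no canonical k-mer handling. The function names `kmer_hash` and `minimizer_positions` are hypothetical, and hashing k-mers with BLAKE2b is an assumption made only to get a deterministic pseudo-random order.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic pseudo-random 64-bit hash of a k-mer (illustrative choice)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "little")


def minimizer_positions(seq: str, k: int, w: int) -> list:
    """Start positions of the minimizers of every window of w consecutive k-mers.

    Each window selects its k-mer with the smallest hash (leftmost on ties).
    The window is kept as a queue built from two stacks:
      front: prefix minima, with the top entry covering all older elements
      back:  newer elements, summarised by a single running minimum
    """
    front = []       # front[-1] is the minimum of all elements held in front
    back = []        # elements pushed since the last transfer
    back_min = None  # minimum of the elements in back
    out = []
    for i in range(len(seq) - k + 1):
        item = (kmer_hash(seq[i:i + k]), i)   # tuple order breaks ties by position
        back.append(item)
        back_min = item if back_min is None else min(back_min, item)
        if len(front) + len(back) > w:
            if not front:
                # Transfer: newest element goes to the bottom, oldest ends on top.
                for x in reversed(back):
                    front.append(x if not front else min(x, front[-1]))
                back.clear()
                back_min = None
            front.pop()                        # drop the oldest element
        if len(front) + len(back) == w:
            if not front:
                best = back_min
            elif back_min is None:
                best = front[-1]
            else:
                best = min(front[-1], back_min)
            if not out or out[-1] != best[1]:  # skip repeats from adjacent windows
                out.append(best[1])
    return out
```

Every element is pushed once and moved between the stacks at most once, so the work per k-mer is amortized constant and the control flow does not depend on the hash values, which is the property the abstract refers to as completely deterministic.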

ETHzürich Repository for Publications and Research Data

DROPS Dagstuhl Research Online Publication Server

On rainbow-free colourings of uniform hypergraphs

Author: Groot Koerkamp Ragnar
Zivny Stanislav
Publication date: 17/06/2021

We study rainbow-free colourings of k-uniform hypergraphs; that is, colourings that use k colours but with the property that no hyperedge attains all colours. We show that p* = (k−1)(ln n)/n is the threshold function for the existence of a rainbow-free colouring in a random k-uniform hypergraph.

Oxford University Research Archive

A*PA2: Up to 19× Faster Exact Global Alignment

Author: Groot Koerkamp Ragnar
Publication date: 01/01/2024

ISSN:1868-896

ETHzürich Repository for Publications and Research Data

DROPS Dagstuhl Research Online Publication Server