1,720,988 research outputs found
ML and System Approaches for Data Processing Optimization and Long Sequence Modeling
513 pagesModern data systems are integral to various industries, yet optimizing their performance remains complex. Traditional methods relying on cost models often fail due to unreliable assumptions about data distribution and the difficulty inestimating the size of intermediate results. To overcome these limitations, we propose a novel paradigm that abandons classical cost models in favor of a sequential decision-making framework inspired by reinforcement learning. Specifically, we utilize Monte Carlo Tree Search (MCTS) and runtime feedback from sampled or partial query executions to learn and converge to the optimal configuration. We introduce and apply this paradigm to several data system tuning problems, query optimization, SkinnerDB in Chapter 2, general parameter optimization, UDO in Chapter 3, and graph query optimization, ADPOT in Chapter 4, demonstrating that this approach effectively handles complex scenarios (e.g., skewed data and large queries) that traditional models struggle with. During this journey, we introduce new theoretical extensions of MCTS under noisy, delayed, and multi-fidelity feedback in Chapter 5 and a new gym environment in Chapter 6 for benchmarking various RL algorithms (e.g., SAC, DQN, TD3, PPO) and motives the development of safety-aware RL approaches. At the same time, data-intensive applications have increasingly leveraged large language models (LLMs) to integrate vector-based entity searches and manage modalities within large-scale data lakes. However, LLMs are typically trained using the transformer architecture, which faces fundamental challenges due to its quadratic dependency on sequence length. This limitation restricts their ability to handle and manage long data (e.g., large tables, videos, and audio). To enable LLMs ability to process long sequence inputs, we investigate whether linear Recurrent Neural Networks (RNNs) can provide a viable alternative while maintaining transformer-level performance. In Chapter 9, we introduce the first bidirectional linear complexity language model that matches BERT performance without using attention. In Chapter 10, we also demonstrate that linear RNNs outperform transformers in byte-level language modeling, which may enable the universal representation of different modalities and formats. Finally, since training LLMs from scratch is costly and resource-intensive, we consider distilling large transformers (e.g., Llama) into linear RNNs trained on very limited GPUs without sacrificing performance in Chapter 11. Additionally, we introduce a hardware-aware speculative decoding algorithm that accelerates inference speed, enabling more efficient deployment of these models. Our research opens new opportunities for LLMs to handle long contexts and to more efficiently model different long data modalities
Towards Generalized 3D Object Detection for Autonomous Driving: Enhancing Data Efficiency in Novel Domains
158 pagesRobust 3D object detection is fundamental to the advancement of robotics and autonomous driving, allowing autonomous systems to navigate complex environments safely and efficiently. Current 3D object detection algorithms rely on supervised learning using large-scale annotated datasets, which are meticulously designed to reflect diverse real-world scenarios and object variations. Creating and annotating such datasets is notoriously labor-intensive and time-consuming, requiring significant efforts to ensure precision and comprehensiveness.However, due to the rapid development and geographic differences in vehicle designs, deploying these carefully tuned systems in real-world environments still introduces domain gaps that compromise performance. These challenges are often addressed by continuously collecting and annotating additional data, which is unsustainable due to the high costs involved. Consequently, there is a pressing need for more generalized 3D object detection capabilities to ensure consistent and reliable performance across diverse and evolving domains. In this thesis, we identified mis-localization as the primary cause of reduced 3D object detection performance in novel domains. That is, 3D objects (e.g. cars, cyclists, and pedestrians) in novel domains can generally be correctly identified by unadapted 3D object detectors, but their precise sizes, locations, and orientations are often inaccurate. However, previous studies tend to view object detection in novel domains as entirely new tasks, overlooking the fact that objects of the same class often share similar shapes across domains, and that bounding boxes are consistently defined to be tight around the objects. We argue that addressing such mis-localization does not necessarily require extensive data collection and annotation. In fact, successful adaptation can be achieved with minimal or even no prior knowledge of the new domain. Notably, StatNorm significantly enhances domain adaptation performance by simulating new domains using only their object dimension statistics; DRIFT markedly improves both the efficiency and performance of unsupervised object discovery in novel domains by aligning 3D object detectors with heuristic reward functions; DiffuBox consistently boosts existing domain adaptation methods through domain-agnostic bounding box refinement that operates in the bounding-box-relative view. Through comprehensive experimental results, we demonstrate that incorporating shape knowledge and heuristic priors can effectively mitigate the domain gaps, achieving more robust, efficient, and generalized 3D object detection across diverse environments without the need for extensive domain-specific data collection and annotation. Overall, we offer a more practical and scalable solution for deploying autonomous driving systems in diverse real-world scenarios
Robust Querying for Data Analysis and Processing
171 pagesTraditionally, formal languages such as SQL have been used by users for data analysis. However, these interfaces are not easily accessible to lay users without an IT background. This has led to the emergence of novel interfaces such as visual and natural language query interfaces. While these interfaces democratize data access, they can introduce ambiguities in understanding the user's query intent in the frontend. Furthermore, efficiently executing these queries in the backend is not a straightforward task. Traditional systems typically optimize the query execution by selecting a plan based on analytical cost models. However, these models can lead to suboptimal choices due to statistical inaccuracies. This thesis focuses on developing a robust data analysis platform that addresses these issues by making multiple query and plan choices instead of using a single one. In this thesis, I introduce three systems that facilitate robust data analysis and processing. The first system, MUVE, enables natural language queries through typed or voice input. It provides users with alternative query interpretations and optimizes visual output to minimize the time required to identify the correct results. The second system, SkinnerMT, parallelizes adaptive query processing to improve efficiency and robustness. It utilizes different parallel methods, allocating threads for plan searching or execution on data partitions. The third system, ROME, strategically selects complementary plans for concurrent execution, increasing the likelihood of incorporating an optimal plan. These systems contribute to the robustness of interactive data analysis systems by optimally selecting queries and plans from both the frontend and backend
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
ADAPTIVE JOIN EXECUTION IN COMPILATION-BASED EXECUTION ENGINES OF DATABASES
65 pagesQuery compilation and adaptive query processing aim to improve the runtime and robustness of analytical databases. However, due to the high cost of compilation, standard methods for combining these involve shaping the adaptive optimization to allow reuse of a single program instead of recompiling. We combine recent developments in both of these areas to show that both compile-once and recompilation-based execution can be practical for adaptive join ordering. We first introduce a low-latency query compilation framework that manages the trade off between compile time and execution time at all stages. First, we describe abstractions to allow easily generating intermediate representation code. Next, we detail the intermediate representation, backend and optimizations that enable similar execution performance to LLVM while achieving much faster compilation time. We then integrate online join order learning that abandons any a-priori optimization into this framework. We propose two orthogonal approaches: a compile-once approach that uses indirection to permute the join order and a recompilation approach that generates code for each join order. We experimentally compare each approach against optimized, analytical databases (MonetDB, DuckDB) on the join order benchmark, TPC-H and JCC-H. Overall, we find that we are able to match or incur modest overheads on queries unfavorable to adaptive optimization while dominating in queries where optimizers are susceptible to choosing a disastrous query plan. We further show that our low latency compilation framework is able to improve both proposed methods across database sizes and in particular, is critical for practical recompilation-based execution
From Massive Parallelization to Quantum Computing: Seven Novel Approaches to Query Optimization
The goal of query optimization is to map a declarative query (describing data to generate) to a query plan (describing how to generate the data) with optimal execution cost. Query optimization is required to support declarative query interfaces. It is a core problem in the area of database systems and has received tremendous attention in the research community, starting with an initial publication in 1979. In this thesis, we revisit the query optimization problem. This visit is motivated by several developments that change the context of query optimization. That change is not reflected in prior literature. First, advances in query execution platforms and processing techniques have changed the context of query optimization. Novel provisioning models and processing techniques such as Cloud computing, crowdsourcing, or approximate processing allow to trade between different execution cost metrics (e.g., execution time versus monetary execution fees in case of Cloud computing). This makes it necessary to compare alternative execution plans according to multiple cost metrics in query optimization. While this is a common scenario nowadays, the literature on query optimization with multiple cost metrics (a generalization of the classical problem variant with one execution cost metric) is surprisingly sparse. While prior methods take hours to optimize even moderately sized queries when considering multiple cost metrics, we propose a multitude of approaches to make query optimization in such scenarios practical. A second development that we address in this thesis is the availability of novel software and hardware platforms that can be exploited for optimization. We will show that integer programming solvers, massively parallel clusters (which nowadays are commonly used for query execution), and adiabatic quantum annealers enable us to solve query optimization problem instances that are far beyond the capabilities of prior approaches. In summary, we propose seven novel approaches to query optimization that significantly increase the size of the problem instances that can be addressed (measured by the query size and by the number of considered execution cost metrics). Those novel approaches can be classified into three broad categories: moving query optimization before run time to relax constraints on optimization time, trading optimization time for relaxed optimality guarantees (leading to approximation schemes, incremental algorithms, and randomized algorithms for query optimization with multiple cost metrics), and reducing optimization time by leveraging novel software and hardware platforms (integer programming solvers, massively parallel clusters, and adiabatic quantum annealers). Those approaches are novel since they address novel problem variants of query optimization, introduced in this thesis, since they are novel for their respective problem variant (e.g., we propose the first randomized algorithm for query optimization with multiple cost metrics), or because they have never been used for optimization problems in the database domain (e.g., this is the first time that quantum computing is used to solve a database-specific optimization problem).DAT
- …
