1,791 research outputs found

    Message from the DOLAP 2024 Chairs

    No full text
    Presents the introductory welcome message from the workshop proceedings

    Beyond Macrobenchmarks: Microbenchmark-based Graph Database Evaluation

    Full text link
    Despite the increasing interest in graph databases, their requirements and specifications are not yet fully understood by everyone, leading to a great deal of variation in the supported functionalities and the achieved performance. In this work, we provide a comprehensive study of the existing graph database systems. We introduce a novel microbenchmarking framework that provides insights into their performance that go beyond what macrobenchmarks can offer. The framework includes the largest set of queries and operators considered so far. The graph database systems are evaluated on synthetic and real data, from different domains, and at scales much larger than in any previous work. The framework is materialized as an open-source suite and is easily extended to new datasets, systems, and queries.

    DBpedia RDF2Vec Graph Embeddings

    No full text
    DBpedia graph embeddings using RDF2Vec. RDF2Vec embedding generation code can be found here and is based on a publication by Portisch et al. [1]. The embeddings dataset consists of 200-dimensional vectors of DBpedia entities (from 1/9/2021). Generating Embeddings: The code for generating these embeddings can be found here. Run the run.sh script, which wraps all the necessary commands to generate embeddings (bash run.sh). The script downloads a set of DBpedia files, which are listed in dbpedia_files.txt. It then builds a Docker image and runs a container of that image that generates the embeddings for the DBpedia graph defined by the DBpedia files. A folder files is created containing all the downloaded DBpedia files, and a folder embeddings/dbpedia is created containing the embeddings in vectors.txt along with a set of random walk files. Run Time of Embeddings Generation: Generating embeddings can take more than a day, but it depends on the number of DBpedia files chosen to be downloaded. The following are some basic run time statistics when embeddings are generated on a 64 GB RAM, 8-core (AMD EPYC), 1 TB SSD, 1996.221 MHz machine. Total: 1 day, 8 hours, 52 minutes, 41 seconds. Walk generation: 0 days, 7 hours, 24 minutes, 36 seconds. Training: 1 day, 1 hour, 28 minutes, 5 seconds. Parameters Used: The parameters used to generate the embeddings provided here are as follows. Number of walks per entity: 100; Depth (hops) per walk: 4; Walk generation mode: RANDOM_WALKS_DUPLICATE_FREE; Threads: # of processors / 2; Training mode: sg; Embeddings vector dimension: 200; Minimum word2vec word count: 1; Sample rate: 0.0; Training window size: 5; Training epochs:
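    As a rough illustration of how the resulting vectors.txt might be consumed, the sketch below parses embedding lines into a dictionary. The file layout is an assumption, not something the description above confirms: each line is taken to hold an entity identifier followed by whitespace-separated floats (200 per entity here). The function name read_vectors and its dim parameter are hypothetical.

```python
def read_vectors(lines, dim=200):
    """Parse embedding lines into {entity: [floats]}.

    ASSUMPTION: each line is "<entity> <v1> ... <v_dim>", whitespace-separated.
    Lines with any other number of fields (blank lines, a possible header)
    are skipped rather than treated as errors.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # not an "<entity> + dim values" line under our assumption
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


if __name__ == "__main__":
    # Path taken from the description above; the format is still an assumption.
    with open("embeddings/dbpedia/vectors.txt", encoding="utf-8") as handle:
        embeddings = read_vectors(handle, dim=200)
    print(len(embeddings), "entities loaded")
```

    Passing an open file handle keeps memory use proportional to the parsed vectors only, since lines are read one at a time.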

    Estimating the extent of the effects of data quality through observations

    No full text
    Existing data quality works have so far focused on the computation of many data characteristics as a means of quantifying different quality dimensions, like freshness, consistency, accuracy, or completeness, that are all defined with respect to some ideal (clean) dataset. We claim that this approach falls short of providing a full specification of the quality of the data, since it takes into consideration neither the task for which the data is to be used nor any future instances of the dataset. We argue that, apart from the difference from the clean dataset, it is equally important to know the degree to which such a difference affects the results of the task at hand. Thus, we extend the existing data quality definition to include that degree. Our approach not only allows data quality to be considered in the context of the intended task, but can also provide useful information even in the absence of the clean dataset, and offer an understanding of the effect of data quality in future dataset instances. We describe a system and its implementation that computes this extended form of data quality through a principled approach of systematic noise generation and task result evaluation. We perform numerous experiments illustrating the effectiveness of the approach and how it allows contextualizing traditional data quality measures.

    The ESW of Wikidata: Exploratory search workflows on Knowledge Graphs

    No full text
    Exploratory search on Knowledge Graphs (KGs) arises when a user needs to understand and extract insights from an unfamiliar KG. In these exploratory sessions, the users issue a series of queries to identify relevant portions of the KG that can answer their questions, with each query answer informing the formulation of the next query. Despite the widespread adoption of KGs, the needs of current KG exploration use cases are not well understood. This work presents the “Exploratory Search Workflows” (ESW) collection focusing on real-world exploration sessions of an open-domain KG, Wikidata, conducted by 57 M.Sc. Computer Engineering students in two advanced Graph Database course editions. This resource includes 234 real exploratory workflows, each containing an average of 45 SPARQL queries, and reference workflows that serve as gold-standard solutions to the proposed tasks. The ESW collection is also available as an RDF graph and accessible via a public SPARQL endpoint. It allows for analysis of real user sessions and understanding of query evolution and complexity, and serves as the first query benchmark for KG management systems for exploratory search.

    A foundation for spatio-textual-temporal cube analytics

    Full text link
    Large amounts of spatial, textual, and temporal (STT) data are being produced daily. This is data containing an unstructured component (text), a spatial component (geographic position), and a time component (timestamp). Therefore, there is a need for a powerful and general way of analyzing STT data together. In this paper, we define and formalize the Spatio-Textual-Temporal Cube (STTCube) structure to enable combined effective and efficient analytical queries over STT data. Our novel data model over STT objects enables novel joint and integrated STT insights that are hard to obtain using existing methods. Moreover, we introduce the new concept of STT measures with associated novel STTOLAP operators. To allow for efficient large-scale analytics, we present a pre-aggregation framework for exact and approximate computation of STT measures. Our comprehensive experimental evaluation on a real-world Twitter dataset confirms that our proposed methods reduce query response time by 1-5 orders of magnitude compared to the No Materialization baseline and decrease storage cost between 97% and 99.9% compared to the Full Materialization baseline while adding only a negligible overhead in the STTCube construction time. Moreover, approximate computation achieves an accuracy between 90% and 100% while reducing query response time by 3-5 orders of magnitude compared to No Materialization.

    SHACL and ShEx in the Wild: A Community Survey on Validating Shapes Generation and Adoption

    No full text
    Knowledge Graphs (KGs) are widely used to represent heterogeneous domain knowledge on the Web and within organizations. Various methods exist to manage KGs and ensure the quality of their data. Among these, the Shapes Constraint Language (SHACL) and the Shapes Expression Language (ShEx) are the two state-of-the-art languages to define validating shapes for KGs. Since the usage of these constraint languages has recently increased, new needs have arisen. One such need is to enable the efficient generation of these shapes. Yet, since these languages are relatively new, we witness a lack of understanding of how they are effectively employed for existing KGs. Therefore, in this work, we answer the question: how are validating shapes being generated and adopted? Our contribution is threefold. First, we conducted a community survey to analyze the needs of users (both from industry and academia) generating validating shapes. Then, we cross-referenced our results with an extensive survey of the existing tools and their features. Finally, we investigated how existing automatic shape extraction approaches work in practice on real, large KGs. Our analysis shows the need for developing semi-automatic methods that can help users generate shapes from large KGs.

    The F4U system for understanding the effects of data quality

    No full text
    We demonstrate a system that enables a data-centric approach to understanding data quality. Instead of directly quantifying data quality as traditionally done, it disrupts the quality of the dataset and monitors the deviations in the output of an analytic task at hand. It computes the correlation factor between the disruption and the deviation and uses it as the quality metric. This allows users to understand not only the quality of their dataset but also the effect that present and future quality issues have on the intended analytic tasks. This is a novel data-centric approach aimed at complementing existing solutions. On top of the new information that it provides, and in contrast to existing techniques of data quality, it requires knowledge of neither the clean datasets nor the constraints with which the data should comply.
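    The disrupt-and-monitor idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the F4U implementation: the disruption (randomly dropping values), the stand-in analytic task (a sum), the use of Pearson correlation as the "correlation factor", and all function names are hypothetical choices made for the example. Note that the baseline is the task output on the dataset as given, so no clean dataset is needed.

```python
import math
import random


def disrupt(values, rate, rng):
    """Hypothetical disruption: drop each value with probability `rate`."""
    return [v for v in values if rng.random() >= rate]


def task(values):
    """Stand-in analytic task (an assumption): the sum of the values."""
    return sum(values)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def quality_effect(values, rates, seed=0):
    """Correlate disruption level with the deviation it causes in the task output."""
    rng = random.Random(seed)
    baseline = task(values)  # output on the dataset as given, not on a clean one
    deviations = [abs(task(disrupt(values, r, rng)) - baseline) for r in rates]
    return pearson(rates, deviations)


if __name__ == "__main__":
    data = list(range(1, 1001))
    print(quality_effect(data, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]))
```

    A value close to 1 suggests the task output degrades steadily as more disruption is injected, while a value near 0 suggests the task is largely insensitive to this kind of quality issue.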

    PlanRGCN: Predicting SPARQL Query Performance

    No full text
    Query Performance Prediction (QPP) is the task of predicting the query runtime performance prior to its execution. While QPP has been studied in relational database systems, it has received little attention for RDF stores, i.e., triplestores that are queried via the SPARQL query language. Existing methods predict the query performance based on the syntactic similarity between a given query and past queries in the query logs. This means that they are not able to generalize to unseen queries with unseen structures or characteristics. We propose a novel GCNN architecture, PlanRGCN, to generalize to unseen queries, fully exploit statistics on the stored KG, and offer more scalable pre-training than the state-of-the-art methods. Furthermore, our architecture is the first to support nontrivial SPARQL operators. In our experiments, we demonstrate both the superior robustness of our prediction method and its practical effect on two downstream tasks: (1) load balancing, achieving a throughput improvement of up to 207% on real-world query logs, and (2) execution control, processing up to 70% more queries.

    Reproducibility and Analysis of Scientific Dataset Recommendation Methods

    Full text link
    Datasets play a central role in scholarly communications. However, scholarly graphs are often incomplete, particularly due to the lack of connections between publications and datasets. Therefore, the importance of dataset recommendation (identifying relevant datasets for a scientific paper, an author, or a textual query) is increasing. Although various methods have been proposed for this task, their reproducibility remains unexplored, making it difficult to compare them with new approaches. We reviewed current recommendation methods for scientific datasets, focusing on the most recent and competitive approaches, including an SVM-based model, a bi-encoder retriever, a method leveraging co-authors and citation network embeddings, and a heterogeneous variational graph autoencoder. These approaches underwent a comprehensive analysis under consistent experimental conditions. Our reproducibility efforts show that three methods can be reproduced, while the graph variational autoencoder is challenging due to unavailable code and test datasets. Hence, we re-implemented this method and performed a component-based analysis to examine its strengths and limitations. Furthermore, our study indicated that three out of four considered methods produce subpar results when applied to real-world data instead of specialized datasets with ad-hoc features.