Search CORE

1,791 research outputs found

Message from the DOLAP 2024 Chairs

Author: Matteo Lissandrini
Enrico Gallinucci
Publication venue
Publication date: 01/01/2024
Field of study

Presents the introductory welcome message from the workshop proceedings

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Beyond Macrobenchmarks: Microbenchmark-based Graph Database Evaluation

Author: Martin Brugnara
Yannis Velegrakis
Matteo Lissandrini
Publication venue
Publication date: 01/01/2018
Field of study

Despite the increasing interest in graph databases, their requirements and specifications are not yet fully understood by everyone, leading to a great deal of variation in the supported functionalities and the achieved performances. In this work, we provide a comprehensive study of the existing graph database systems. We introduce a novel microbenchmarking framework that provides insights on their performance that go beyond what macro-benchmarks can offer. The framework includes the largest set of queries and operators so far considered. The graph database systems are evaluated on synthetic and real data, from different domains, and at scales much larger than any previous work. The framework is materialized as an open-source suite and is easily extended to new datasets, systems, and queries.

Catalogo dei prodotti della ricerca Università degli Studi di Verona

VBN (Videnbasen) Aalborg Universitets forskningsportal

DBpedia RDF2Vec Graph Embeddings

Author: Katja Hose
Martin Pekár Christensen
Matteo Lissandrini
Publication venue
Publication date: 01/01/2022
Field of study

DBpedia graph embeddings using RDF2Vec. RDF2Vec embedding generation code can be found here and is based on a publication by Portisch et al. [1]. The embeddings dataset consists of 200-dimensional vectors of DBpedia entities (from 1/9/2021).

Generating Embeddings
The code for generating these embeddings can be found here. Run the run.sh script, which wraps all the necessary commands to generate the embeddings:
bash run.sh
The script downloads a set of DBpedia files, which are listed in dbpedia_files.txt. It then builds a Docker image and runs a container of that image that generates the embeddings for the DBpedia graph defined by the DBpedia files. A folder files is created containing all the downloaded DBpedia files, and a folder embeddings/dbpedia is created containing the embeddings in vectors.txt along with a set of random walk files.

Run Time of Embeddings Generation
Generating embeddings can take more than a day, depending on the number of DBpedia files chosen for download. The following are basic run time statistics for embeddings generated on a machine with 64 GB RAM, 8 cores (AMD EPYC, 1996.221 MHz), and a 1 TB SSD.
Total: 1 day, 8 hours, 52 minutes, 41 seconds
Walk generation: 0 days, 7 hours, 24 minutes, 36 seconds
Training: 1 day, 1 hour, 28 minutes, 5 seconds

Parameters Used
The parameters used to generate the embeddings provided here are:
Number of walks per entity: 100
Depth (hops) per walk: 4
Walk generation mode: RANDOM_WALKS_DUPLICATE_FREE
Threads: # of processors / 2
Training mode: sg
Embeddings vector dimension: 200
Minimum word2vec word count: 1
Sample rate: 0.0
Training window size: 5
Training epochs:
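
As an illustration of how the resulting vectors.txt might be consumed, the following is a minimal sketch. It assumes the file uses the common word2vec text layout, one entity per line followed by its space-separated vector components; the entry above does not state the file format, so this layout, along with the helper names parse_vectors and cosine, is an assumption for illustration only.

```python
import math
from typing import Dict, Iterable, List


def parse_vectors(lines: Iterable[str], dim: int = 200) -> Dict[str, List[float]]:
    """Parse lines of the ASSUMED form '<entity> v1 v2 ... v_dim'.

    Lines that do not have exactly dim + 1 whitespace-separated fields
    (blank lines, headers, malformed rows) are skipped.
    """
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

A file object can be passed directly, for example `with open("embeddings/dbpedia/vectors.txt") as f: vecs = parse_vectors(f)`, after which `cosine` compares any two loaded entities. The default `dim=200` matches the vector dimension listed in the parameters.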

ZENODO

Catalogo dei prodotti della ricerca Università degli Studi di Verona

Estimating the extent of the effects of data quality through observations

Author: Daniele Foroni
Yannis Velegrakis
Matteo Lissandrini
Publication venue
Publication date: 01/01/2021
Field of study

Existing data quality works have so far focused on the computation of many data characteristics as a means of quantifying different quality dimensions, like freshness, consistency, accuracy, or completeness, that are all defined with respect to some ideal (clean) dataset. We claim that this approach falls short in providing a full specification of the quality of the data since it does not take into consideration the task for which the data is to be used, nor any future instances of the dataset. We argue that apart from the difference from the clean dataset, it is equally important to know the degree to which such difference affects the results of the task at hand. Thus, we extend the existing data quality definition to include that degree. Our approach not only allows data quality to be considered in the context of the intended task, but can also provide useful information even in the absence of the clean dataset, and offer an understanding of the effect of data quality in future dataset instances. We describe a system and its implementation that computes this extended form of data quality through a principled approach of systematic noise generation and task result evaluation. We perform numerous experiments illustrating the effectiveness of the approach and how this allows contextualizing traditional data quality measures.

Crossref

Catalogo dei prodotti della ricerca Università degli Studi di Verona

VBN (Videnbasen) Aalborg Universitets forskningsportal

The ESW of Wikidata: Exploratory search workflows on Knowledge Graphs

Author: Gianmaria Silvello
Matteo Lissandrini
Gianmarco Prando
Publication venue
Publication date: 01/01/2025
Field of study

Exploratory search on Knowledge Graphs (KGs) arises when a user needs to understand and extract insights from an unfamiliar KG. In these exploratory sessions, the users issue a series of queries to identify relevant portions of the KG that can answer their questions, with each query answer informing the formulation of the next query. Despite the widespread adoption of KGs, the needs of current KG exploration use cases are not well understood. This work presents the “Exploratory Search Workflows” (ESW) collection focusing on real-world exploration sessions of an open-domain KG, Wikidata, conducted by 57 M.Sc. Computer Engineering students in two advanced Graph Database course editions. This resource includes 234 real exploratory workflows, each containing an average of 45 SPARQL queries and reference workflows that serve as gold-standard solutions to the proposed tasks. The ESW collection is also available as an RDF graph and accessible via a public SPARQL endpoint. It allows for analysis of real user sessions, understanding query evolution and complexity, and serves as the first query benchmark for KG management systems for exploratory search

Directory of Open Access Journals

Catalogo dei prodotti della ricerca Università degli Studi di Verona

A foundation for spatio-textual-temporal cube analytics

Author: Matteo Lissandrini
Mohsin Iqbal
Torben Bach Pedersen
Publication venue
Publication date: 01/01/2021
Field of study

Large amounts of spatial, textual, and temporal (STT) data are being produced daily. This is data containing an unstructured component (text), a spatial component (geographic position), and a time component (timestamp). Therefore, there is a need for a powerful and general way of analyzing STT data together. In this paper, we define and formalize the Spatio-Textual-Temporal Cube (STTCube) structure to enable combined effective and efficient analytical queries over STT data. Our novel data model over STT objects enables novel joint and integrated STT insights that are hard to obtain using existing methods. Moreover, we introduce the new concept of STT measures with associated novel STTOLAP operators. To allow for efficient large-scale analytics, we present a pre-aggregation framework for exact and approximate computation of STT measures. Our comprehensive experimental evaluation on a real-world Twitter dataset confirms that our proposed methods reduce query response time by 1-5 orders of magnitude compared to the No Materialization baseline and decrease storage cost between 97% and 99.9% compared to the Full Materialization baseline while adding only a negligible overhead in the STTCube construction time. Moreover, approximate computation achieves an accuracy between 90% and 100% while reducing query response time by 3-5 orders of magnitude compared to No Materialization.

Catalogo dei prodotti della ricerca Università degli Studi di Verona

VBN (Videnbasen) Aalborg Universitets forskningsportal

SHACL and ShEx in the Wild: A Community Survey on Validating Shapes Generation and Adoption

Author: Kashif Rabbani
Matteo Lissandrini
Katja Hose
Publication venue
Publication date: 01/01/2022
Field of study

Knowledge Graphs (KGs) are widely used to represent heterogeneous domain knowledge on the Web and within organizations. Various methods exist to manage KGs and ensure the quality of their data. Among these, the Shapes Constraint Language (SHACL) and the Shapes Expression Language (ShEx) are the two state-of-the-art languages to define validating shapes for KGs. Since the usage of these constraint languages has recently increased, new needs arose. One such need is to enable the efficient generation of these shapes. Yet, since these languages are relatively new, we witness a lack of understanding of how they are effectively employed for existing KGs. Therefore, in this work, we answer the question: how are validating shapes being generated and adopted? Our contribution is threefold. First, we conducted a community survey to analyze the needs of users (both from industry and academia) generating validating shapes. Then, we cross-referenced our results with an extensive survey of the existing tools and their features. Finally, we investigated how existing automatic shape extraction approaches work in practice on real, large KGs. Our analysis shows the need for developing semi-automatic methods that can help users generate shapes from large KGs.

Crossref

Catalogo dei prodotti della ricerca Università degli Studi di Verona

VBN (Videnbasen) Aalborg Universitets forskningsportal

The F4U system for understanding the effects of data quality

Author: Daniele Foroni
Yannis Velegrakis
Matteo Lissandrini
Publication venue
Publication date: 01/01/2021
Field of study

We demonstrate a system that enables a data-centric approach in understanding data quality. Instead of directly quantifying data quality as traditionally done, it disrupts the quality of the dataset and monitors the deviations in the output of an analytic task at hand. It computes the correlation factor between the disruption and the deviation and uses it as the quality metric. This allows users to understand not only the quality of their dataset but also the effect that present and future quality issues have on the intended analytic tasks. This is a novel data-centric approach aimed at complementing existing solutions. On top of the new information that it provides, and in contrast to existing techniques of data quality, it requires knowledge neither of the clean datasets nor of the constraints with which the data should comply.
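
The disrupt-and-measure idea in this abstract can be sketched in a few lines. The example below is not the F4U implementation: the disruption (overwriting a fraction of values with an outlier), the stand-in analytic task (the mean), and all function names are hypothetical choices made only to illustrate correlating disruption levels with output deviations.

```python
import math
import random
from statistics import fmean
from typing import List, Sequence


def disrupt(values: Sequence[float], fraction: float, rng: random.Random,
            outlier: float) -> List[float]:
    """Hypothetical disruption: overwrite a random fraction of values with an outlier."""
    noisy = list(values)
    k = round(fraction * len(noisy))
    for i in rng.sample(range(len(noisy)), k):
        noisy[i] = outlier
    return noisy


def task(values: Sequence[float]) -> float:
    """Stand-in analytic task (an assumption): the mean of the data."""
    return fmean(values)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def disruption_deviation_correlation(values: Sequence[float],
                                     fractions: Sequence[float],
                                     outlier: float, seed: int = 0) -> float:
    """Correlate disruption levels with the deviation of the task output."""
    rng = random.Random(seed)
    baseline = task(values)
    deviations = [abs(task(disrupt(values, f, rng, outlier)) - baseline)
                  for f in fractions]
    return pearson(fractions, deviations)
```

Calling `disruption_deviation_correlation(list(range(1, 101)), [0.1, 0.2, 0.3, 0.4], outlier=1000.0)` yields a value close to 1, indicating that this particular task is highly sensitive to this particular kind of disruption. A task that is robust to the disruption would produce a correlation nearer to 0.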

Crossref

Catalogo dei prodotti della ricerca Università degli Studi di Verona

VBN (Videnbasen) Aalborg Universitets forskningsportal

PlanRGCN: Predicting SPARQL Query Performance

Author: Matteo Lissandrini
Abiram Mohanaraj
Katja Hose
Publication venue
Publication date: 01/01/2025
Field of study

Query Performance Prediction (QPP) is the task of predicting the query runtime performance prior to its execution. While QPP has been studied in relational database systems, it has received little attention for RDF stores, i.e., triplestores that are queried via the SPARQL query language. Existing methods predict the query performance based on the syntactic similarity between a given query and past queries in the query logs. This means that they are not able to generalize to unseen queries with unseen structures or characteristics. We propose a novel GCNN architecture, PlanRGCN, to generalize to unseen queries, fully exploit statistics on the stored KG, and offer more scalable pre-training than the state-of-the-art methods. Furthermore, our architecture is the first to support nontrivial SPARQL operators. In our experiments, we demonstrate both the superior robustness of our prediction method and its practical effect on two downstream tasks: (1) load balancing, achieving a throughput improvement of up to 207% on real-world query logs and (2) execution control, processing up to 70% more queries

Catalogo dei prodotti della ricerca Università degli Studi di Verona

Reproducibility and Analysis of Scientific Dataset Recommendation Methods

Author: Daniele Dell'Aglio
Ornella Irrera
Gianmaria Silvello
Matteo Lissandrini
Publication venue
Publication date: 01/01/2024
Field of study

Datasets play a central role in scholarly communications. However, scholarly graphs are often incomplete, particularly due to the lack of connections between publications and datasets. Therefore, the importance of dataset recommendation—identifying relevant datasets for a scientific paper, an author, or a textual query—is increasing. Although various methods have been proposed for this task, their reproducibility remains unexplored, making it difficult to compare them with new approaches. We reviewed current recommendation methods for scientific datasets, focusing on the most recent and competitive approaches, including an SVM-based model, a bi-encoder retriever, a method leveraging co-authors and citation network embeddings, and a heterogeneous variational graph autoencoder. These approaches underwent a comprehensive analysis under consistent experimental conditions. Our reproducibility efforts show that three methods can be reproduced, while the graph variational autoencoder is challenging due to unavailable code and test datasets. Hence, we re-implemented this method and performed a component-based analysis to examine its strengths and limitations. Furthermore, our study indicated that three out of four considered methods produce subpar results when applied to real-world data instead of specialized datasets with ad-hoc features

Catalogo dei prodotti della ricerca Università degli Studi di Verona

VBN (Videnbasen) Aalborg Universitets forskningsportal