    Can We Measure the Impact of a Database?

    Databases publish data. This is undoubtedly the case for scientific and statistical databases, which have largely replaced traditional reference works. Database and Web technologies have led to an explosion in the number of databases that support scientific research, for obvious reasons: Databases provide faster communication of knowledge, hold larger volumes of data, are more easily searched, and are both human- and machine-readable. Moreover, they can be developed rapidly and collaboratively by a mixture of researchers and curators. For example, more than 1,500 curated databases are relevant to molecular biology alone. The value of these databases lies not only in the data they present but also in how they organize that data. In the case of an author or journal, most bibliometric measures are obtained from citations to an associated set of publications. There are typically many ways of decomposing a database into publications, so we might use its organization to guide our choice of decompositions. We will show that when the database has a hierarchical structure, there is a natural extension of the h-index that works on this hierarchy
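
The abstract above builds on the h-index. As a reminder of the baseline measure, here is a minimal sketch of the standard h-index computed over a list of citation counts. The function name `h_index` and the sample counts are illustrative assumptions; the hierarchical extension the authors propose is not specified in the abstract and is not reproduced here.

```python
def h_index(citations):
    """Standard h-index: the largest h such that at least h items
    have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` items all have >= rank citations
        else:
            break      # counts only decrease from here on
    return h


# Hypothetical citation counts for the units of one decomposition of a database.
example_counts = [10, 8, 5, 4, 3]
example_h = h_index(example_counts)
```

Under one possible reading, each way of decomposing a database into citable units would yield its own list of counts and hence its own value of `h_index`.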

    The ESW of Wikidata: Exploratory search workflows on Knowledge Graphs

    Exploratory search on Knowledge Graphs (KGs) arises when a user needs to understand and extract insights from an unfamiliar KG. In these exploratory sessions, the users issue a series of queries to identify relevant portions of the KG that can answer their questions, with each query answer informing the formulation of the next query. Despite the widespread adoption of KGs, the needs of current KG exploration use cases are not well understood. This work presents the “Exploratory Search Workflows” (ESW) collection focusing on real-world exploration sessions of an open-domain KG, Wikidata, conducted by 57 M.Sc. Computer Engineering students in two advanced Graph Database course editions. This resource includes 234 real exploratory workflows, each containing an average of 45 SPARQL queries and reference workflows that serve as gold-standard solutions to the proposed tasks. The ESW collection is also available as an RDF graph and accessible via a public SPARQL endpoint. It allows for analysis of real user sessions, understanding query evolution and complexity, and serves as the first query benchmark for KG management systems for exploratory search

    Reproducibility and Analysis of Scientific Dataset Recommendation Methods

    Datasets play a central role in scholarly communications. However, scholarly graphs are often incomplete, particularly due to the lack of connections between publications and datasets. Therefore, the importance of dataset recommendation—identifying relevant datasets for a scientific paper, an author, or a textual query—is increasing. Although various methods have been proposed for this task, their reproducibility remains unexplored, making it difficult to compare them with new approaches. We reviewed current recommendation methods for scientific datasets, focusing on the most recent and competitive approaches, including an SVM-based model, a bi-encoder retriever, a method leveraging co-authors and citation network embeddings, and a heterogeneous variational graph autoencoder. These approaches underwent a comprehensive analysis under consistent experimental conditions. Our reproducibility efforts show that three methods can be reproduced, while the graph variational autoencoder is challenging due to unavailable code and test datasets. Hence, we re-implemented this method and performed a component-based analysis to examine its strengths and limitations. Furthermore, our study indicated that three out of four considered methods produce subpar results when applied to real-world data instead of specialized datasets with ad-hoc features

    Data citation and the citation graph

    The citation graph is a computational artifact that is widely used to represent the domain of published literature. It represents connections between published works, such as citations and authorship. Among other things, the graph supports the computation of bibliometric measures such as h-indexes and impact factors. There is now an increasing demand that we should treat the publication of data in the same way that we treat conventional publications. In particular, we should cite data for the same reasons that we cite other publications. In this paper we discuss what is needed for the citation graph to represent data citation. We identify two challenges: to model the evolution of credit appropriately (through references) over time and to model data citation not only to a data set treated as a single object but also to parts of it. We describe an extension of the current citation graph model that addresses these challenges. It is built on two central concepts: citable units and reference subsumption. We discuss how this extension would enable data citation to be represented within the citation graph and how it allows for improvements in current practices for bibliometric computations, both for scientific publications and for data

    Mining patterns in graphs with multiple weights

    Graph pattern mining aims at identifying structures that appear frequently in large graphs, under the assumption that frequency signifies importance. In real life, there are many graphs with weights on nodes and/or edges. For these graphs, it is fair that the importance (score) of a pattern is determined not only by the number of its appearances, but also by the weights on the nodes/edges of those appearances. Scoring functions based on the weights do not generally satisfy the apriori property, which guarantees that the number of appearances of a pattern cannot be larger than the frequency of any of its sub-patterns, and hence allows faster pruning. Therefore, existing approaches employ other, less efficient, pruning strategies. The problem becomes even more challenging in the case of multiple weighting functions that assign different weights to the same nodes/edges. In this work we propose a new family of scoring functions that respects the apriori property, and thus can rely on effective pruning strategies. We provide efficient and effective techniques for mining patterns in multi-weighted graphs, and we devise both an exact and an approximate solution. In addition, we propose a distributed version of our approach, which distributes the appearances of the patterns to examine among multiple workers. Extensive experiments on both real and synthetic datasets prove that the presence of edge weights and the choice of scoring function affect the patterns mined, and the quality of the results returned to the user. Moreover, we show that, even when the performance of the exact algorithm degrades because of an increasing number of weighting functions, the approximate algorithm performs well and with fairly good quality. Finally, the distributed algorithm proves to be the best choice for mining large and rich input graphs
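
The role of the apriori property in pruning can be illustrated with a deliberately simplified sketch. It mines frequent item sets rather than graph patterns, and it uses plain frequency rather than the weighted scoring functions of the paper; both are assumptions made to keep the example short. The function name `frequent_itemsets` and the sample data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search that relies on the apriori property:
    a set can only be frequent if all of its subsets are frequent."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the set.
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Candidates of size k are unions of two frequent sets of size k-1.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: drop a candidate without counting it
        # if any of its (k-1)-subsets is not frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical data: each transaction stands in for one small graph.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = frequent_itemsets(data, min_support=3)
```

A scoring function that lacks this property cannot discard a candidate just because one of its sub-patterns scored low, which is why the abstract stresses scoring functions that preserve it.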

    Exemplar queries: a new way of searching

    Modern search engines employ advanced techniques that go beyond the structures that strictly satisfy the query conditions in an effort to better capture the user intentions. In this work, we introduce a novel query paradigm that considers a user query as an example of the data in which the user is interested. We call these queries exemplar queries. We provide a formal specification of their semantics and show that they are fundamentally different from notions like queries by example, approximate queries and related queries. We provide an implementation of these semantics for knowledge graphs and present an exact solution with a number of optimizations that improve performance without compromising the result quality. We study two different congruence relations, isomorphism and strong simulation, for identifying the answers to an exemplar query. We also provide an approximate solution that prunes the search space and achieves considerably better time performance with minimal or no impact on effectiveness. The effectiveness and efficiency of these solutions with synthetic and real datasets are experimentally evaluated, and the importance of exemplar queries in practice is illustrated

    Beyond frequencies: Graph Pattern mining in multi-weighted graphs

    Graph pattern mining aims at identifying structures that appear frequently in large graphs, under the assumption that frequency signifies importance. Several measures of frequency have been proposed that respect the apriori property, essential for an efficient search of the patterns. This property states that the number of appearances of a pattern in a graph cannot be larger than the frequency of any of its sub-patterns. In real life, there are many graphs with weights on nodes and/or edges. For these graphs, it is fair that the importance (score) of a pattern is determined not only by the number of its appearances, but also by the weights on the nodes/edges of those appearances. Scoring functions based on the weights do not generally satisfy the apriori property, thus forcing many approaches to employ other, less efficient, pruning strategies to speed up the computation. The problem becomes even more challenging in the case of multiple weighting functions that assign different weights to the same nodes/edges. In this work, we provide efficient and effective techniques for mining patterns in multi-weight graphs. We devise both an exact and an approximate solution. The first is characterized by intelligent storage and computation of the pattern scores, while the second is based on the aggregation of similar weighting functions to allow scalability and avoid redundant computations. Both methods adopt a scoring function that respects the apriori property, and thus they can rely on effective pruning strategies. Extensive experiments under different parameter settings prove that the presence of edge weights and the choice of scoring function affect the patterns mined, and hence the quality of the results returned to the user. Finally, experiments on datasets of different sizes and increasing numbers of weighting functions show that, even when the performance of the exact algorithm degrades, the approximate algorithm performs well and with quite good quality

    Extraction of Validating Shapes from very large Knowledge Graphs

    Knowledge Graphs (KGs) represent heterogeneous domain knowledge on the Web and within organizations. Shape constraint languages exist for defining validating shapes that ensure the quality of the data in KGs. Existing techniques to extract validating shapes often fail to extract complete shapes, are not scalable, and are prone to produce spurious shapes. To address these shortcomings, we propose the QUALITY SHAPES EXTRACTION (QSE) approach to extract validating shapes in very large graphs, for which we devise both an exact and an approximate solution. QSE provides information about the reliability of shape constraints by computing their confidence and support within a KG, and in doing so makes it possible to identify the shapes that are most informative and least likely to be affected by incomplete or incorrect data. To the best of our knowledge, QSE is the first approach to extract a complete set of validating shapes from Wikidata. Moreover, QSE provides a 12x reduction in extraction time compared to existing approaches, while managing to filter out up to 93% of the invalid and spurious shapes, resulting in a reduction of up to 2 orders of magnitude in the number of constraints presented to the user, e.g., from 11,916 to 809 on DBpedia
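
The abstract does not define support and confidence, so the following is only a sketch under an assumed reading: the support of a (class, property) pair is the number of instances of the class that use the property at least once, and the confidence is that count divided by the number of instances of the class. The function name `shape_statistics`, the `rdf:type` string, and the sample triples are all hypothetical and do not reflect the QSE implementation.

```python
from collections import Counter, defaultdict


def shape_statistics(triples, type_predicate="rdf:type"):
    """Return {(class, property): (support, confidence)} under the
    assumed definitions stated in the accompanying text."""
    instances = defaultdict(set)    # class -> entities typed with it
    properties = defaultdict(set)   # entity -> distinct properties it uses
    for s, p, o in triples:
        if p == type_predicate:
            instances[o].add(s)
        else:
            properties[s].add(p)

    stats = {}
    for cls, entities in instances.items():
        # Count, per property, how many instances of the class use it.
        counts = Counter(p for e in entities for p in properties.get(e, ()))
        for prop, support in counts.items():
            stats[(cls, prop)] = (support, support / len(entities))
    return stats


# Hypothetical toy graph: four persons, all named, only one with an email.
toy = [
    ("alice", "rdf:type", "Person"), ("alice", "name", "Alice"),
    ("bob", "rdf:type", "Person"), ("bob", "name", "Bob"),
    ("carol", "rdf:type", "Person"), ("carol", "name", "Carol"),
    ("dave", "rdf:type", "Person"), ("dave", "name", "Dave"),
    ("dave", "email", "dave@example.org"),
]
toy_stats = shape_statistics(toy)
```

With thresholds on these two values, a low-support or low-confidence constraint such as the `email` one in the toy graph could be filtered out as likely spurious.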

    Multi-Example Search in Rich Information Graphs

    In rich information spaces, it is often hard for users to formally specify the characteristics of the desired answers, either due to the complexity of the schema or of the query language, or even because they do not know exactly what they are looking for. Exemplar queries constitute a query paradigm that overcomes those problems, by allowing users to provide examples of the elements of interest in place of the query specification. In this paper, we propose a general approach where the user-provided example can comprise several partial specification fragments, where each fragment describes only one part of the desired result. We provide a formal definition of the problem, which generalizes existing formulations for both the relational and the graph model. We then describe exact algorithms for its solution for the case of information graphs, as well as top-k algorithms. Experiments on large real datasets demonstrate the effectiveness and efficiency of the proposed approach