Web Data Commons – Extracting Structured Data from Two Large Web Corpora
More and more websites embed structured data describing for instance products,
people, organizations, places, events, resumes, and cooking recipes into their
HTML pages using encoding standards such as Microformats, Microdata and RDFa.
The Web Data Commons project extracts all Microformat, Microdata and RDFa data
from the Common Crawl web corpus, the largest and most up-to-date web corpus
that is currently available to the public, and provides the extracted data for
download in the form of RDF-quads. In this paper, we give an overview of the
project and present statistics about the popularity of the different encoding
standards as well as the kinds of data that are published using each format.
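
As a rough illustration of what embedded structured data looks like to an extractor, the sketch below collects Microdata `itemprop` name/value pairs from an HTML snippet using only the Python standard library. It is not the Web Data Commons pipeline: the function name `extract_itemprops`, the sample markup, and the simplification of ignoring `itemscope` nesting are all assumptions made for this example, and it stops at property/value pairs rather than producing RDF quads.

```python
from html.parser import HTMLParser


class _ItempropCollector(HTMLParser):
    """Collects (itemprop, value) pairs; nesting of items is ignored (assumption)."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # itemprop name waiting for its text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemprop" not in attributes:
            return
        if "content" in attributes:
            # e.g. <meta itemprop="..." content="..."> carries its value inline
            self.pairs.append((attributes["itemprop"], attributes["content"]))
        else:
            self._pending = attributes["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._pending is not None and text:
            self.pairs.append((self._pending, text))
            self._pending = None


def extract_itemprops(html):
    """Hypothetical helper: return the itemprop pairs found in an HTML string."""
    collector = _ItempropCollector()
    collector.feed(html)
    return collector.pairs


# Illustrative markup, not taken from the corpus
sample = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Jane Doe</span>'
    '<meta itemprop="jobTitle" content="Editor">'
    "</div>"
)
pairs = extract_itemprops(sample)
```

A real extractor would also need to track item boundaries and types, and to handle Microformats and RDFa, which use different attributes.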
Linked Data - the story so far
The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions: the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
Towards automatic topical classification of LOD datasets
The datasets that are part of the Linking Open Data cloud
diagram (LOD cloud) are classified into the following topical
categories: media, government, publications, life sciences,
geographic, social networking, user-generated content,
and cross-domain. The topical categories were manually
assigned to the datasets. In this paper, we investigate to
what extent the topical classification of new LOD datasets
can be automated using machine learning techniques and the
existing annotations as supervision. We conducted experiments
with different classification techniques and different
feature sets. The best classification technique/feature set
combination reaches an accuracy of 81.62% on the task of
assigning one out of the eight classes to a given LOD dataset.
A deeper inspection of the classification errors reveals problems
with the manual classification of datasets in the current
LOD cloud.
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
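
The difference between the two counting schemes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the study's procedure: the function `cocitation_counts`, the toy author names, and the choice to count each author pair at most once per citing paper are all illustrative. Note also that with more than one author per cited work, co-authors of the same work end up co-cited with each other, which is a design choice of this sketch.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations.

    citing_papers: list of reference lists; each reference is a list of
    author names in byline order.
    max_authors: how many leading authors of each cited work to consider
    (1 = first-author counting, 5 = first-five-authors counting).
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered author pair is counted once per citing paper (assumption)
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Toy data: two citing papers and the works they cite (hypothetical names)
papers = [
    [["Ada", "Ben"], ["Cy", "Dee"]],
    [["Cy", "Dee"], ["Dee"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In this toy data, first-author counting never pairs Ben with anyone, because Ben appears only as a second author; the first-five scheme brings such authors into the co-citation matrix.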
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.
Benchmarking the Performance of Linked Data Translation Systems
Linked Data sources on the Web use a wide range of different
vocabularies to represent data describing the same type
of entity. For some types of entities, like people or bibliographic
records, common vocabularies have emerged that are
used by multiple data sources. But even for representing
data of these common types, different user communities use
different competing common vocabularies. Linked Data applications
that want to understand as much data from the
Web as possible thus need to overcome vocabulary heterogeneity
and translate the original data into a single target
vocabulary. To support application developers with this integration
task, several Linked Data translation systems have
been developed. These systems provide languages to express
declarative mappings that are used to translate heterogeneous
Web data into a single target vocabulary. In this paper,
we present a benchmark for comparing the expressivity
as well as the runtime performance of data translation systems.
Based on a set of examples from the LOD Cloud, we
developed a catalog of fifteen data translation patterns and
survey how often these patterns occur in the example set.
Based on these statistics, we designed the LODIB (Linked
Open Data Integration Benchmark) that aims to reflect the
real-world heterogeneities that exist on the Web of Data.
We apply the benchmark to test the performance of two
data translation systems, Mosto and LDIF, and compare
the performance of the systems with the SPARQL 1.1 CONSTRUCT
query performance of the Jena TDB RDF store.
Funding: Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; European Community FP7-256975 (LATC …
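
To make the idea of translating heterogeneous data into a single target vocabulary concrete, here is a minimal Python sketch of a declarative mapping that renames properties. It is an assumption-laden illustration, not how Mosto, LDIF, or the benchmark work: the abstract does not list the fifteen translation patterns, and the `PROPERTY_MAPPING` table, the `translate` function, and the example triples are hypothetical.

```python
# Hypothetical declarative mapping: source property URI -> target property URI
PROPERTY_MAPPING = {
    "http://xmlns.com/foaf/0.1/name": "http://schema.org/name",
    "http://www.w3.org/2006/vcard/ns#fn": "http://schema.org/name",
}


def translate(triples, mapping):
    """Rewrite the predicate of each (subject, predicate, object) triple.

    Predicates without a mapping entry are passed through unchanged.
    """
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


# Two sources describing people with different vocabularies (illustrative data)
source_triples = [
    ("http://example.org/a", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/b", "http://www.w3.org/2006/vcard/ns#fn", "Bob"),
    ("http://example.org/b", "http://example.org/vocab#age", "42"),
]
translated = translate(source_triples, PROPERTY_MAPPING)
```

Renaming a property is presumably the simplest kind of mapping; realistic translations also restructure values and resources, which is why dedicated mapping languages and SPARQL 1.1 CONSTRUCT queries are used in practice.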
