1,721,018 research outputs found
A unifying framework for l0-sampling algorithms
The problem of building an l 0-sampler is to sample near-uniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry and graph algorithms. In this paper, we abstract a set of steps for building an l 0-sampler, based on sampling, recovery and selection. We analyze the implementation of an l 0-sampler within this framework, and show how prior constructions of l 0-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of l 0-samplers. © 2013 Springer Science+Business Media New York
The (not so) critical nodes of criminal networks
One of the most basic question in the analysis of social networks is to find nodes that are of particular relevance in the network. The answer that emerged in the recent literature is that the importance, or centrality, of a node x is proportional to the number of nodes that get disconnected from the network when node x is removed. We show that while in social networks such important nodes lie in their cores (i.e., maximal subgraphs in which all nodes have degree higher than a certain value), this is not necessarily the case in criminal networks. This shows that nodes whose removal affects large portions of the criminal network prefer to operate from network peripheries, thus confirming the intuition of Baker and Faulkner [4]. Our results also highlight structural differences between criminal networks and other social networks, suggesting that classical definitions of importance (or centrality) in a network fail to capture the concept of key players in criminal networks
Online entity resolution using an oracle
Entity resolution (ER) is the task of identifying all records in adatabase that refer to the same underlying entity. This is an expensivetask, and can take a significant amount of money and time; theend-user may want to take decisions during the process, rather thanwaiting for the task to be completed. We formalize an online versionof the entity resolution task, and use an oracle which correctlylabels matching and non-matching pairs through queries. In thissetting, we design algorithms that seek to maximize progressive recall,and develop a novel analysis framework for prior proposalson entity resolution with an oracle, beyond their worst case guarantees.Finally, we provide both theoretical and experimental analysisof the proposed algorithms
On the Meaningfulness of “Big Data Quality” (Invited Paper)
In this paper, we discuss the application of concept of data quality to big data by highlighting how much complex is to define it in a general way. Already data quality is a multidimensional concept, difficult to characterize in precise definitions even in the case of well-structured data. Big data add two further dimensions of complexity: (i) being “very” source specific, and for this we adopt the interesting UNECE classification, and (ii) being highly unstructured and schema-less, often without golden standards to refer to or very difficult to access. After providing a tutorial on data quality in traditional contexts, we analyze big data by providing insights into the UNECE classification, and then, for each type of data source, we choose a specific instance of such a type (notably deep Web data, sensor-generated data, and Twitters/short texts) and discuss how quality dimensions can be defined in these cases. The overall aim of the paper is therefore to identify further research directions in the area of big data quality, by providing at the same time an up-to-date state of the art on data quality. © 2015, The Author(s)
Cliques are Too Strict for Representing Communities: Finding Large k-plexes in Real Networks
k-plexes are a formal yet flexible way of defining communities in networks. They generalize the notion of cliques and are more appropriate in most real cases: while a node of a clique C is connected to all other nodes of C, a node of a k-plex may miss up to k connections. Unfortunately, computing all maximal k-plexes is a gruesome task and state-of-the-art algorithms can only process small-size networks. In this paper we propose a new approach for enumerating large k-plexes in networks that speeds up the search by several orders of magnitude, leveraging on efficient techniques for the computation of maximal cliques
In Codice Ratio: Machine Transcription of Medieval Manuscripts
Our project, In Codice Ratio, is an interdisciplinary research
initiative for analyzing content of historical documents conserved in the
Vatican Secret Archives (VSA). As most of such documents are digitized
as images, Machine Transcription is both an enabler to the application
of Knowledge Discovery techniques, as well as a useful tool to the
paleographer for speeding up the transcription process. Our approach
involves a convolutional neural network to recognize characters, statistical
language models to compose and rank word transcriptions, and
crowdsourcing for scalable training data collection. We have conducted
experiments on pages from the medieval manuscript collection known as
the Vatican Registers. Our results show that almost all the considered
words can be transcribed without significant spelling errors
Computer Science Foundations for Digital Libraries: Algorithms, Systems, and Applications
Digital libraries face challenges in quality, accessibility, and usage of resources. This issue presents seven papers offering computational and technical solutions to these problems: data quality through validation and monitoring, AI evaluation of information systems, and enhanced content discoverability. Research also covers knowledge representation with new provenance models, deep learning for bibliographic control, metadata-driven access to underrepresented languages, and computational methods for restoring historical documents. These papers showcase how modern techniques like machine learning, semantic web technologies, knowledge graphs, and image processing tackle digital library challenges, improving resource quality and accessibility. These papers were selected from the 21st Italian Research Conference on Digital Libraries (IRCDL 2025), held in Udine, Italy, on 20–21 February 2025, which has served since 2005 as a key annual forum bringing together researchers from academia, government, and industry to address topics spanning computer science, digital humanities, information science, librarianship, archival science, museum studies, and cultural heritage
Biconnectivity of social and web graphs
Network connectivity is a basic notion for big data analysis, with many practical applications. In particular, the analysis of the largest strongly connected component is at the heart of the famous bow-tie structure of the Web. In this paper, we go beyond connected components and characterize the properties of the higher-connectivity components that lie inside real-world networks. We combine and extend recently developed techniques for scaling up to real-world datasets with up to 23.1 billion edges. Our experiments point out a striking difference between structural properties in social and web graphs, and highlight deep properties of social networks which may be helpful in devising theoretical models that can better capture their structure
On unifying the space of l0-sampling algorithms
The problem of building an l0-sampler is to sample nearuniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry and graph algorithms. In this paper, we abstract a set of steps for building an l0-sampler, based on sampling, recovery and selection. We analyze the implementation of an l0-sampler within this framework, and show how prior constructions of l0-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of l0-samplers
Real-time anomalies detection and analysis of network structure, with application to the Autonomous System network
The structural analysis is the very basic tool for understanding the properties of a network. In this paper we present a (customizable) tool, able to compute in real-time the most important connectivity properties of a network, modeled as an undirected graph: connected and biconnected components, articulation points and bridges. The algorithm underlying the tool has been theoretically analyzed in the (semi-)streaming model, and has been tested with graphs up to hundreds of millions nodes and billions edges. The tool, therefore, can be employed to monitor traffic flows in medium and large networks, at real-time, and detect possible anomalies. As an application, we provide results about the structural properties of ten years of samples of the Autonomous System network, obtained from the Univ. of Oregon Route Views project [1], that (once again) shows the ubiquitous presence of power-law distribution. © 2011 IEEE
- …
