Encoding syntactic dependencies using Random Indexing and Wikipedia as a corpus

Author: CAPUTO ANNALINA
BASILE PIERPAOLO
Publication date: 01/01/2012

Distributional approaches are based on a simple hypothesis: the meaning of a word can be inferred from its usage. Applying that idea to the vector space model makes it possible to construct a WordSpace in which words are represented as points in a geometric space. Similar words lie close together in this space, and the definition of "word usage" depends on the definition of the context used to build the space, which can be the whole document, the sentence in which the word occurs, a fixed window of words, or a specific syntactic context. However, in its original formulation WordSpace can take into account only one definition of context at a time. We propose an approach based on vector permutation and Random Indexing to encode several syntactic contexts in a single WordSpace. To build our WordSpace we adopt the WaCkypedia EN corpus, a 2009 dump of the English Wikipedia (about 800 million tokens) annotated with syntactic information provided by a full dependency parser. The effectiveness of our approach is evaluated using the GEometrical Models of natural language Semantics (GEMS) 2011 Shared Evaluation data.

Archivio istituzionale della ricerca - Università di Bari
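
The abstract above names two ingredients, Random Indexing and vector permutation, without giving details. The following is a minimal sketch of how the two could fit together, not the authors' implementation. The dimensionality, the number of non-zero entries, the use of coordinate rotation as the permutation, and the `RELATION_SHIFT` table are all assumptions made for illustration.

```python
import math
import random

DIM = 512      # assumed dimensionality of the WordSpace
NONZERO = 8    # assumed number of +1/-1 entries in each random index vector

def index_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse ternary random vector, seeded by the word so it is reproducible."""
    rng = random.Random(word)
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec

def rotate(vec, n):
    """Permute a vector by rotating its coordinates n places to the right."""
    n %= len(vec)
    return vec[-n:] + vec[:-n] if n else list(vec)

# Hypothetical mapping from a dependency relation to a rotation amount.
RELATION_SHIFT = {"subj": 1, "obj": 2, "mod": 3}

def build_space(triples, dim=DIM):
    """Accumulate, for each head word, the permuted index vectors of its dependents.

    triples: iterable of (head, relation, dependent) as produced by a parser.
    """
    space = {}
    for head, rel, dep in triples:
        target = space.setdefault(head, [0] * dim)
        permuted = rotate(index_vector(dep, dim), RELATION_SHIFT[rel])
        for i, value in enumerate(permuted):
            target[i] += value
    return space

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    triples = [
        ("eat", "subj", "dog"), ("eat", "obj", "bone"),
        ("chew", "subj", "dog"), ("chew", "obj", "bone"),
        ("bite", "subj", "bone"), ("bite", "obj", "dog"),
    ]
    space = build_space(triples)
    print("eat vs chew:", cosine(space["eat"], space["chew"]))
    print("eat vs bite:", cosine(space["eat"], space["bite"]))
```

In this sketch "eat" and "chew" see the same dependents in the same roles, so their vectors coincide, while "bite" sees the same words in swapped roles and ends up with a different vector. That is the sense in which one space can carry several syntactic contexts at once.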

UNIBA-SENSE at CLEF 2008: Semantic N-Levels Search Engine

Author: CAPUTO ANNALINA
SEMERARO Giovanni
BASILE PIERPAOLO
Publication date: 01/01/2008

Archivio istituzionale della ricerca - Università di Bari

UNIBA-CORE: Combining Strategies for Semantic Textual Similarity.

Author: CAPUTO ANNALINA
SEMERARO Giovanni
BASILE PIERPAOLO
Publication date: 01/01/2013

This paper describes the UNIBA participation in the Semantic Textual Similarity (STS) core task 2013. We exploited three different systems for computing the similarity between two texts. One system is used as a baseline and represents the best model that emerged from our previous participation in STS 2012. This system is based on a distributional model of semantics that can also take into account the syntactic structures that glue words together. In addition, we investigated two different learning strategies exploiting both syntactic and semantic features. The former uses a combination strategy to combine the best machine learning techniques trained on the 2012 training and test sets. The latter tries to overcome the limit of working with different datasets with varying characteristics by selecting only the most suitable dataset for training.

Archivio istituzionale della ricerca - Università di Bari

Encoding syntactic dependencies by vector permutation

Author: CAPUTO ANNALINA
SEMERARO Giovanni
BASILE PIERPAOLO
Publication date: 01/01/2011

Archivio istituzionale della ricerca - Università di Bari

An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model

Author: CAPUTO ANNALINA
SEMERARO Giovanni
BASILE PIERPAOLO
Publication date: 01/01/2014

This paper describes a new Word Sense Disambiguation (WSD) algorithm which extends two well-known variations of the Lesk WSD method. Given a word and its context, the Lesk algorithm selects the proper meaning by maximizing the number of shared words (overlaps) between the context of the word and the definition (gloss) of each of its senses. The main contribution of our approach is the use of a word similarity function defined on a distributional semantic space to compute the gloss-context overlap. As sense inventory we adopt BabelNet, a large multilingual semantic network built by exploiting both WordNet and Wikipedia. Besides linguistic knowledge, BabelNet also represents encyclopedic concepts coming from Wikipedia. The evaluation performed on the SemEval-2013 Multilingual Word Sense Disambiguation task shows that our algorithm outperforms both the most frequent sense baseline and the simplified version of the Lesk algorithm. Moreover, when compared with the other participants in the SemEval-2013 task, our approach outperforms the best system for English.

Archivio istituzionale della ricerca - Università di Bari
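
The abstract above replaces the exact word overlap of Lesk with a similarity computed in a distributional space. A minimal sketch of that idea follows, under stated assumptions: the three-dimensional `VECTORS`, the sense labels in `GLOSSES`, and the choice of summed word vectors compared by cosine are all hypothetical and are not taken from the paper.

```python
import math

# Toy word vectors (hypothetical values, for illustration only).
VECTORS = {
    "money":   [0.9, 0.1, 0.0],
    "deposit": [0.8, 0.2, 0.0],
    "loan":    [0.9, 0.0, 0.1],
    "river":   [0.0, 0.9, 0.1],
    "water":   [0.1, 0.8, 0.1],
    "shore":   [0.0, 0.9, 0.2],
}

# Hypothetical sense inventory: sense label -> content words of its gloss.
GLOSSES = {
    "bank#finance": ["institution", "deposit", "money"],
    "bank#river": ["land", "river", "shore"],
}

def text_vector(words):
    """Sum the vectors of the known words; unknown words contribute nothing."""
    total = [0.0, 0.0, 0.0]
    for word in words:
        for i, value in enumerate(VECTORS.get(word, [0.0, 0.0, 0.0])):
            total[i] += value
    return total

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def disambiguate(context, glosses):
    """Pick the sense whose gloss vector is closest to the context vector."""
    ctx = text_vector(context)
    return max(glosses, key=lambda sense: cosine(ctx, text_vector(glosses[sense])))

if __name__ == "__main__":
    # "loan" shares no word with either gloss, so plain overlap counting ties at zero.
    print(disambiguate(["loan"], GLOSSES))
    print(disambiguate(["water"], GLOSSES))
```

The context word "loan" appears in neither gloss, so counting shared words cannot separate the two senses. A graded similarity can still prefer the financial sense, because "loan" lies near "money" and "deposit" in the toy space.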

From terms to concepts: a revisited approach to Local Context Analysis

Author: CAPUTO ANNALINA
SEMERARO Giovanni
BASILE PIERPAOLO
Publication date: 01/01/2011

Archivio istituzionale della ricerca - Università di Bari

Entity Linking for the Semantic Annotation of Italian Tweets

Author: CAPUTO ANNALINA
SEMERARO Giovanni
BASILE PIERPAOLO
Publication date: 01/01/2016

Linking entity mentions in Italian tweets to concepts in a knowledge base is a challenging task, due to the short and noisy nature of these messages and the lack of specific resources for Italian. This paper proposes an adaptation to Italian tweets of a general-purpose Named Entity Linking algorithm that exploits a similarity measure computed over a Distributional Semantic Model. To evaluate the proposed algorithm, we introduce a new dataset of tweets for entity linking that we have developed specifically for the Italian language.

Archivio istituzionale della ricerca - Università di Bari

Entity linking for tweets

Author: Annalina Caputo
Pierpaolo Basile
Publication date: 01/01/2017

Named Entity Linking (NEL) is the task of semantically annotating entity mentions in a portion of text with links to a knowledge base. The automatic annotation, which requires the recognition and disambiguation of the entity mention, usually exploits contextual clues such as the context of usage and the coherence with respect to other entities. On Twitter, the limit of 140 characters gives rise to very short and noisy text messages that pose new challenges to the entity linking task. We present an overview of NEL methods, focusing on approaches specifically developed to deal with short messages such as tweets. NEL is a fundamental task for the extraction and annotation of concepts in tweets, which is necessary for making Twitter's huge amount of interconnected user-generated content machine readable and for enabling intelligent information access.

Crossref

Irish Universities

Archivio istituzionale della ricerca - Università di Bari

DCU Online Research Access Service

Temporal Random Indexing: A System for Analysing Word Meaning over Time

Author: CAPUTO ANNALINA
SEMERARO Giovanni
BASILE PIERPAOLO
Publication date: 01/01/2015

During the last decade the surge in available data spanning different epochs has inspired a new analysis of cultural, social, and linguistic phenomena from a temporal perspective. This paper describes a method that enables the analysis of how the meaning of a word evolves over time. We propose Temporal Random Indexing (TRI), a method for building WordSpaces that takes into account temporal information. We exploit this methodology to build geometrical spaces of word meanings that consider several periods of time. The TRI framework provides all the necessary tools to build WordSpaces over different time periods and to perform such temporal linguistic analysis. We present some examples of usage of our tool by analysing word meanings in two corpora: a collection of Italian books and a collection of English scientific papers about computational linguistics. This analysis enables the detection of linguistic events that emerge in specific time intervals and that can be related to social or cultural phenomena.

Archivio istituzionale della ricerca - Università di Bari
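
The abstract above describes building one WordSpace per time period so that word meanings can be compared across periods. A minimal sketch of one way to do this is shown below. It assumes that the random index vectors are shared by all periods, which is what keeps the separate spaces comparable; the window size, dimensionality, period labels, and toy sentences are hypothetical.

```python
import math
import random

DIM = 512      # assumed dimensionality
NONZERO = 8    # assumed number of +1/-1 entries in each random index vector

def index_vector(word, dim=DIM, nonzero=NONZERO):
    """Random index vector seeded by the word, so every period gets the same one."""
    rng = random.Random(word)
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec

def build_space(sentences, window=2, dim=DIM):
    """WordSpace for one period: sum the index vectors of words within the window."""
    space = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            vec = space.setdefault(word, [0] * dim)
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    for k, value in enumerate(index_vector(tokens[j], dim)):
                        vec[k] += value
    return space

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    # Hypothetical corpus, already split by period.
    corpus = {
        "period_1": [["mouse", "cat", "cheese"]],
        "period_2": [["mouse", "keyboard", "screen"]],
    }
    spaces = {period: build_space(sents) for period, sents in corpus.items()}
    drift = cosine(spaces["period_1"]["mouse"], spaces["period_2"]["mouse"])
    print("similarity of 'mouse' across periods:", drift)
```

A low cross-period similarity for a word is the kind of signal that could point to a change in usage. With real data each period would contain many documents rather than one sentence.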

Semantic Re-ranking in Ad-hoc Robust Retrieval

Author: CAPUTO ANNALINA
SEMERARO Giovanni
BASILE PIERPAOLO
Publication date: 01/01/2011

This paper investigates a re-ranking strategy presented at SIGIR 2010. In that work we described a re-ranking strategy in which the output of a semantic-based IR system is used to re-weigh documents by exploiting inter-document similarities computed on a vector space. The space is built using the Random Indexing technique. The effectiveness of the strategy was previously evaluated in the context of the CLEF Ad-Hoc Robust-WSD Task, while in this paper we propose new experiments on the TREC Ad-Hoc Robust Track 2004.

Archivio istituzionale della ricerca - Università di Bari
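
The abstract above does not give the re-weighting formula, so the following is only one possible reading of "re-weigh documents by exploiting inter-document similarities". In this sketch a document's baseline score is boosted by its mean cosine similarity to the documents returned by a second, semantic run. The function `rerank`, the `alpha` parameter, and the multiplicative boost are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rerank(baseline_scores, doc_vectors, semantic_top, alpha=0.5):
    """Re-order documents using similarity to a semantic system's results.

    baseline_scores: {doc_id: score} from a first-pass retrieval run
    doc_vectors:     {doc_id: vector} in some document vector space
    semantic_top:    doc ids returned by the semantic system
    alpha:           hypothetical weight of the similarity boost
    """
    rescored = {}
    for doc_id, score in baseline_scores.items():
        sims = [
            cosine(doc_vectors[doc_id], doc_vectors[other])
            for other in semantic_top
            if other != doc_id
        ]
        boost = sum(sims) / len(sims) if sims else 0.0
        rescored[doc_id] = score * (1 + alpha * boost)
    return sorted(rescored, key=rescored.get, reverse=True)

if __name__ == "__main__":
    scores = {"d1": 1.0, "d2": 0.9, "d3": 0.5}
    vectors = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.0, 1.0]}
    print(rerank(scores, vectors, semantic_top=["d3"]))
```

In the toy run, "d2" overtakes "d1" because it is close to the document favoured by the semantic run. In the setting described in the abstract, the document vectors would come from a space built with Random Indexing.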