Encoding syntactic dependencies using Random Indexing and Wikipedia as a corpus
Distributional approaches are based on a simple hypothesis: the meaning of a word can be inferred from its usage. Applying that idea to the vector space model makes it possible to construct a WordSpace in which words are represented by points in a geometric space. Similar words are represented close to each other in this space, and the definition of "word usage" depends on the definition of the context used to build the space, which can be the whole document, the sentence in which the word occurs, a fixed window of words, or a specific syntactic context. However, in its original formulation WordSpace can take into account only one definition of context at a time. We propose an approach based on vector permutation and Random Indexing to encode several syntactic contexts in a single WordSpace. To build our WordSpace we adopt the WaCkypedia EN corpus, a 2009 dump of the English Wikipedia (about 800 million tokens) annotated with syntactic information provided by a full dependency parser. The effectiveness of our approach is evaluated using the GEometrical Models of natural language Semantics (GEMS) 2011 Shared Evaluation data.
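The idea of encoding several syntactic contexts through vector permutation can be illustrated with a minimal sketch. It is not the authors' implementation: the dimensionality, the number of non-zero entries, the use of a cyclic rotation as the permutation, and the toy dependency triples are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

DIM = 512      # assumed vector dimensionality
NONZERO = 8    # assumed number of non-zero entries per random vector


def random_index_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse ternary random vector; seeding by word keeps it reproducible."""
    rng = random.Random(word)
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def permute(vec, shift):
    """Rotate vec to the right by `shift` positions (assumed permutation)."""
    shift %= len(vec)
    return vec[-shift:] + vec[:-shift] if shift else list(vec)


def build_wordspace(triples, relations):
    """For each head word, accumulate the permuted vectors of its dependents."""
    # Each dependency relation gets its own rotation amount (an assumption).
    shift_of = {rel: i + 1 for i, rel in enumerate(relations)}
    space = defaultdict(lambda: [0] * DIM)
    for head, rel, dep in triples:
        context = permute(random_index_vector(dep), shift_of[rel])
        space[head] = [a + b for a, b in zip(space[head], context)]
    return space


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy (head, relation, dependent) triples, invented for this example.
triples = [
    ("eat", "obj", "apple"),
    ("eat", "subj", "cat"),
    ("devour", "obj", "apple"),
    ("sleep", "subj", "cat"),
]
space = build_wordspace(triples, relations=["subj", "obj"])
print(cosine(space["eat"], space["devour"]))
print(cosine(space["devour"], space["sleep"]))
```

Because the same dependent is rotated differently for each relation, a single space can keep "apple as object" apart from "apple as subject". In this toy run, "eat" and "devour" share an object context and should score noticeably higher than "devour" and "sleep", which share none.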
UNIBA-CORE: Combining Strategies for Semantic Textual Similarity
This paper describes the UNIBA participation in the Semantic Textual Similarity (STS) core task 2013. We exploited three different systems for computing the similarity between two texts. One system is used as a baseline and represents the best model that emerged from our previous participation in STS 2012. This system is based on a distributional model of semantics that also takes into account the syntactic structures that glue words together. In addition, we investigated two different learning strategies that exploit both syntactic and semantic features. The former combines the best machine learning techniques trained on the 2012 training and test sets. The latter tries to overcome the limits of working with different datasets with varying characteristics by selecting only the most suitable dataset for training.
An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model
This paper describes a new Word Sense Disambiguation (WSD) algorithm which extends two well-known variations of the Lesk WSD method. Given a word and its context, the Lesk algorithm selects the proper meaning by looking for the maximum number of shared words (maximum overlap) between the context of the word and the definition (gloss) of each of its senses. The main contribution of our approach is the use of a word similarity function defined on a distributional semantic space to compute the gloss-context overlap. As sense inventory we adopt BabelNet, a large multilingual semantic network built by exploiting both WordNet and Wikipedia. Besides linguistic knowledge, BabelNet also represents encyclopedic concepts coming from Wikipedia. The evaluation performed on the SemEval-2013 Multilingual Word Sense Disambiguation task shows that our algorithm goes beyond the most frequent sense baseline and the simplified version of the Lesk algorithm. Moreover, when compared with the other participants in the SemEval-2013 task, our approach is able to outperform the best system for English.
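A minimal sketch can show how a distributional similarity might replace the exact word overlap of Lesk. It assumes that the context and each gloss are represented as the sum of their word vectors and compared by cosine similarity; the three-dimensional vectors, the sense labels, and the glosses are invented for illustration and do not come from BabelNet or from the paper.

```python
import math

# Toy word vectors, invented for this example (a real space would be learned).
TOY_VECTORS = {
    "money": [1.0, 0.0, 0.0],
    "deposit": [0.9, 0.1, 0.0],
    "financial": [0.8, 0.0, 0.2],
    "institution": [0.7, 0.0, 0.3],
    "river": [0.0, 1.0, 0.0],
    "water": [0.0, 0.9, 0.1],
    "sloping": [0.0, 0.7, 0.3],
    "land": [0.1, 0.8, 0.1],
}


def text_vector(words, vectors=TOY_VECTORS):
    """Sum the vectors of the known words; unknown words are skipped."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(vectors.get(word, [0.0] * dim)):
            total[k] += value
    return total


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(context, glosses):
    """Return the sense whose gloss vector is closest to the context vector."""
    context_vec = text_vector(context)
    return max(glosses, key=lambda s: cosine(context_vec, text_vector(glosses[s])))


# Hypothetical sense labels and glosses for the word "bank".
glosses = {
    "bank#finance": ["financial", "institution"],
    "bank#river": ["sloping", "land", "water"],
}
print(disambiguate(["deposit", "money"], glosses))
print(disambiguate(["river", "water"], glosses))
```

Note that neither context shares a single word with the gloss it is matched to, so a strict overlap count would be zero for both senses; the similarity-based score still separates them.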
Entity Linking for the Semantic Annotation of Italian Tweets
Linking entity mentions in Italian tweets to concepts in a knowledge base is a challenging task,
due to the short and noisy nature of these messages and the lack of specific resources for
Italian. This paper proposes an adaptation of a general purpose Named Entity Linking algorithm,
which exploits the similarity measure computed over a Distributional Semantic Model, in the
context of Italian tweets. In order to evaluate the proposed algorithm, we introduce a new dataset
of tweets for entity linking that we have developed specifically for the Italian language.
Entity linking for tweets
Named Entity Linking (NEL) is the task of semantically annotating entity mentions in a portion of text with links to a knowledge base. The automatic annotation, which requires the recognition and disambiguation of the entity mention, usually exploits contextual clues like the context of usage and the coherence with respect to other entities. On Twitter, the limit of 140 characters gives rise to very short and noisy text messages that pose new challenges to the entity linking task. We present an overview of NEL methods focusing on approaches specifically developed to deal with short messages, like tweets. NEL is a fundamental task for the extraction and annotation of concepts in tweets, which is necessary for making Twitter's huge amount of interconnected user-generated content machine readable and for enabling intelligent information access.
Temporal Random Indexing: A System for Analysing Word Meaning over Time
During the last decade the surge in available data spanning different epochs has inspired a new
analysis of cultural, social, and linguistic phenomena from a temporal perspective. This paper
describes a method that enables the analysis of the time evolution of the meaning of a word.
We propose Temporal Random Indexing (TRI), a method for building WordSpaces that takes
into account temporal information. We exploit this methodology in order to build geometrical
spaces of word meanings that consider several periods of time. The TRI framework provides all
the necessary tools to build WordSpaces over different time periods and perform such temporal
linguistic analysis. We present some examples of usage of our tool by analysing word meanings
in two corpora: a collection of Italian books and English scientific papers about computational
linguistics. This analysis enables the detection of linguistic events that emerge in specific time
intervals and that can be related to social or cultural phenomena.
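A minimal sketch of how WordSpaces for different periods might be made comparable: it assumes that every word keeps the same random vector in all periods, so that the vector of a word in one period can be compared directly with its vector in another. The window size, the dimensionality, and the toy sentences are illustrative assumptions, not details taken from the TRI framework.

```python
import math
import random
from collections import defaultdict

DIM = 256      # assumed vector dimensionality
NONZERO = 6    # assumed number of non-zero entries per random vector


def random_index_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse ternary random vector; the same word gets the same vector in every period."""
    rng = random.Random(word)
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def build_space(sentences, window=2):
    """Sum, for each word, the random vectors of its neighbours within the window."""
    space = defaultdict(lambda: [0] * DIM)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    context = random_index_vector(tokens[j])
                    space[target] = [a + b for a, b in zip(space[target], context)]
    return space


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpora keyed by hypothetical period labels.
corpus_by_period = {
    "period_1": [["mouse", "eats", "cheese"]],
    "period_2": [["mouse", "eats", "cheese"]],
    "period_3": [["mouse", "clicks", "icon"]],
}
spaces = {period: build_space(sents) for period, sents in corpus_by_period.items()}

stable = cosine(spaces["period_1"]["mouse"], spaces["period_2"]["mouse"])
shifted = cosine(spaces["period_1"]["mouse"], spaces["period_3"]["mouse"])
print(stable, shifted)
```

A drop in the similarity of a word with its own earlier vector, as between the first and third toy periods here, is the kind of signal that could point to a change in usage worth inspecting.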
Semantic Re-ranking in Ad-hoc Robust Retrieval
This paper investigates a re-ranking strategy presented at SIGIR 2010. In that work we described a re-ranking strategy in which the output of a semantic-based IR system is used to re-weight documents by exploiting inter-document similarities computed on a vector space. The space is built using the Random Indexing technique. The effectiveness of the strategy was evaluated in the context of the CLEF Ad-Hoc Robust-WSD Task; in this paper we present new experiments in the TREC Ad-Hoc Robust Track 2004.
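The abstract does not give the re-weighting formula, so the following is only a generic sketch of re-ranking with inter-document similarities, not the SIGIR 2010 strategy itself. It assumes that each document's new score interpolates its original score with its mean cosine similarity to the top-ranked documents; `top_k`, `alpha`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(ranked, vectors, top_k=2, alpha=0.5):
    """Re-score a ranked list of (doc_id, score) pairs.

    Scores are assumed to be normalised to [0, 1] and sorted in descending order.
    The new score mixes the original score with the mean similarity to the
    top_k documents (excluding the document itself).
    """
    seeds = [doc_id for doc_id, _ in ranked[:top_k]]
    rescored = []
    for doc_id, score in ranked:
        others = [s for s in seeds if s != doc_id]
        if others:
            sim = sum(cosine(vectors[doc_id], vectors[s]) for s in others) / len(others)
        else:
            sim = 0.0
        rescored.append((doc_id, alpha * score + (1 - alpha) * sim))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy initial ranking and document vectors, invented for this example.
ranked = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.69)]
vectors = {
    "d1": [1.0, 0.0],
    "d2": [1.0, 0.1],
    "d3": [0.0, 1.0],
    "d4": [0.9, 0.2],
}
print([doc_id for doc_id, _ in rerank(ranked, vectors)])
```

In this toy run the document that resembles the top results moves above the one that does not, even though its initial score was slightly lower.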
