Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0
Ehrmann M, Cecconi F, Vannella D, McCrae J, Cimiano P, Navigli R. Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0. Presented at the LREC 2014
ESC: Redesigning WSD with Extractive Sense Comprehension
Word Sense Disambiguation (WSD) is a historical NLP task aimed at linking words in context to discrete sense inventories, and it is usually cast as a multi-label classification task. Recently, several neural approaches have employed sense definitions to better represent word meanings. Yet these approaches do not observe the input sentence and the sense definition candidates all at once, thus potentially reducing model performance and generalization power. We cope with this issue by reframing WSD as a span extraction problem, which we call Extractive Sense Comprehension (ESC), and propose ESCHER, a transformer-based neural architecture for this new formulation. By means of an extensive array of experiments, we show that ESC unleashes the full potential of our model, leading it to outdo all of its competitors and to set a new state of the art on the English WSD task. In the few-shot scenario, ESCHER proves to exploit training data efficiently, attaining the same performance as its closest competitor while relying on almost three times fewer annotations. Furthermore, ESCHER can nimbly combine data annotated with senses from different lexical resources, achieving performance that was previously out of everyone’s reach. The model along with data is available at https://github.com/SapienzaNLP/esc
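As a rough illustration of the span-extraction framing described above, the sketch below concatenates a sentence with all of its candidate definitions and records the character span of the correct one. The function name `build_esc_example`, the `[SEP]` separator, and the example sentence are assumptions made for illustration; the abstract does not specify how ESCHER actually formats its input.

```python
from typing import List, Tuple


def build_esc_example(
    sentence: str,
    definitions: List[str],
    gold_index: int,
    sep: str = " [SEP] ",  # hypothetical separator, not taken from the paper
) -> Tuple[str, int, int]:
    """Concatenate the sentence with every candidate definition.

    Returns (text, start, end) such that text[start:end] is the gold
    definition, i.e. the span an extractive model would be trained to select.
    """
    text = sentence
    start = end = -1
    for i, definition in enumerate(definitions):
        text += sep
        if i == gold_index:
            # The gold span begins right after the separator.
            start = len(text)
            end = start + len(definition)
        text += definition
    if start < 0:
        raise ValueError("gold_index does not match any candidate definition")
    return text, start, end


# Hypothetical example: disambiguating "bank".
sentence = "He sat on the bank of the river."
definitions = [
    "a financial institution that accepts deposits",
    "sloping land beside a body of water",
]
text, start, end = build_esc_example(sentence, definitions, gold_index=1)
print(text[start:end])
```

Because the sentence and all candidates sit in one sequence, a model reading it can attend to every definition at once, which is the property the abstract contrasts with earlier definition-based approaches.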
Huge automatically extracted training sets for multilingual Word Sense Disambiguation
We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages for a total of millions of sense-tagged sentences. Experiments prove that these corpora can be effectively used as training sets for supervised WSD systems, surpassing the state of the art for low-resourced languages and providing competitive results for English, where manually annotated training sets are accessible. The data is available at trainomatic.org
Automated short answer grading: A simple solution for a difficult task
The task of short answer grading is aimed at assessing the outcome of an exam by automatically analysing students’ answers in natural language and deciding whether they should pass or fail the exam. In this paper, we tackle this task by training an SVM classifier on real data taken from a university statistics exam, showing that simple concatenated sentence embeddings used as features yield results around 0.90 F1, and that adding more complex distance-based features leads only to a slight improvement. We also release the dataset, which to our knowledge is the first freely available dataset of this kind in Italian.
A Comparative Study of Models for Answer Sentence Selection
Answer Sentence Selection is one of the steps typically involved in Question Answering. Question Answering is considered a hard task for natural language processing systems, since full solutions would require both natural language understanding and inference abilities. In this paper, we explore how the state of the art in answer selection has improved recently, comparing two of the best proposed models for tackling the problem: the Cross-attentive Convolutional Network and the BERT model. The experiments are carried out on two datasets, WikiQA and SelQA, both created for and used in open-domain question answering challenges. We also report on cross-domain experiments with the two datasets.
Is “manovra” Really “del popolo”? Linguistic Insights into Twitter Reactions to the Annual Italian Budget Law
Relying on linguistic cues obtained by means of structural topic modeling as well as descriptive lexical analyses, this study contributes to the general understanding of Twitter users’ response to the annual Italian budget law approved at the end of December 2018. Some topics contained in the dataset of tweets are procedural or generic, but besides those, it often emerges that Twitter users expressed their concern with respect to the provisions of this law. Supportive attitudes seem to be less frequent. This paper also advocates that findings from inductive studies on Twitter data should be interpreted with caution, since the nature of tweets might not be adequate for drawing far-reaching generalizations.
Enhancing a Text Summarization System with ELMo
Text summarization has gained a considerable amount of research interest due to deep learning based techniques. We leverage recent results in transfer learning for Natural Language Processing (NLP) using pre-trained deep contextualized word embeddings in a sequence-to-sequence architecture based on pointer-generator networks. We evaluate our approach on the two largest summarization datasets: CNN/Daily Mail and the recent Newsroom dataset. We show that using pre-trained contextualized embeddings on Newsroom significantly improves the state-of-the-art ROUGE-1 measure and obtains comparable scores on the other ROUGE values.
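For readers unfamiliar with the ROUGE-1 measure mentioned above, it counts the unigrams shared between a generated summary and a reference summary. The following is a simplified sketch that assumes lowercasing and whitespace tokenization, with no stemming or stopword handling; it is not the evaluation script used in the paper, and the function name `rouge_1` and the example sentences are hypothetical.

```python
from collections import Counter
from typing import Tuple


def rouge_1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical summaries: five of the six unigrams match.
p, r, f = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(round(p, 3), round(r, 3), round(f, 3))
```

Published ROUGE scores typically come from a standard toolkit with its own tokenization and stemming options, so numbers from this sketch are not directly comparable to those reported in the literature.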
Asymmetries in extraction from nominal copular sentences: A challenging case study for NLP tools
In this paper we discuss two types of nominal copular sentences (Canonical and Inverse, Moro 1997) and demonstrate that the peculiarities of these two configurations are hardly considered by the standard NLP tools that are currently publicly available. We show that example-based MT tools (e.g. Google Translate) as well as other NLP tools (UDpipe, LinguA, Stanford Parser, and Google Cloud AI API) fail to capture the critical distinctions between the two structures, producing wrong analyses and, in the case of MT tools, incorrect translations, possibly as a consequence of a non-coherent (or missing) structural analysis. To support the proposed analysis, we also present an empirical study showing that native speakers are indeed sensitive to these distinctions. This poses a sharp challenge for NLP tools that aim at being cognitively plausible or at least descriptively adequate (Chowdhury & Zamparelli 2018).
Text Frame Detector: Slot Filling Based On Domain Knowledge Bases
In this paper we present a system called Text Frame Detector (TFD), which aims at populating a frame-based ontology in a graph-based structure. Our system organizes textual information into frames, according to a predefined set of semantically informed patterns linking pre-coded information such as named entities, simple and complex terms. Given the semi-automatic expansion of such information with word embeddings, the system can be easily adapted to new domains.
Gender Detection and Stylistic Differences and Similarities between Males and Females in a Dream Tales Blog
In this paper, we present the results of a gender detection experiment carried out on a corpus we built by downloading dream tales from a blog. We also highlight stylistic differences and similarities concerning lexical choices between men and women. In order to carry out the experiment, we built a feed-forward neural network with traditional sparse n-hot encoding using the Keras open-source library.
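The n-hot encoding mentioned above represents each text as a vocabulary-sized vector with a 1 for every vocabulary word the text contains. The sketch below shows the idea in plain Python; the helper names `build_vocab` and `n_hot`, the whitespace tokenization, and the example sentences are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign each distinct lowercase token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def n_hot(text: str, vocab: Dict[str, int]) -> List[int]:
    """Return a binary vector marking which vocabulary tokens occur in the text."""
    vector = [0] * len(vocab)
    for token in text.lower().split():
        index = vocab.get(token)
        if index is not None:  # out-of-vocabulary tokens are ignored
            vector[index] = 1
    return vector


# Hypothetical corpus of two short dream tales.
corpus = ["I was flying over the sea", "the sea was dark"]
vocab = build_vocab(corpus)
print(n_hot("the dark sea", vocab))  # → [0, 0, 0, 0, 1, 1, 1]
```

Vectors of this kind are mostly zeros for any realistic vocabulary, which is why the encoding is described as sparse; they can be fed directly to the input layer of a feed-forward network.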
