MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)
Named Entity Recognition (NER) is the task of identifying named entities in texts and classifying them through specific semantic categories, a process which is crucial for a wide range of NLP applications. Current datasets for NER focus mainly on coarse-grained entity types, tend to consider a single textual genre and to cover a narrow set of languages, thus limiting the general applicability of NER systems. In this work, we design a new methodology for automatically producing NER annotations, and address the aforementioned limitations by introducing a novel dataset that covers 10 languages, 15 NER categories and 2 textual genres. We also introduce a manually-annotated test set, and extensively evaluate the quality of our novel dataset on both this new test set and standard benchmarks for NER. In addition, in our dataset, we include: i) disambiguation information to enable the development of multilingual entity linking systems, and ii) image URLs to encourage the creation of multimodal systems. We release our dataset at https://github.com/Babelscape/multinerd
NER4ID at SemEval-2022 Task 2: Named Entity Recognition for Idiomaticity Detection
Idioms are lexically-complex phrases whose meaning cannot be derived by compositionally interpreting their components. Although the automatic identification and understanding of idioms is essential for a wide range of Natural Language Understanding tasks, they are still largely under-investigated. This motivated the organization of the SemEval-2022 Task 2, which is divided into two multilingual subtasks: one about idiomaticity detection, and the other about sentence embeddings. In this work, we focus on the first subtask and propose a Transformer-based dual-encoder architecture to compute the semantic similarity between a potentially-idiomatic expression and its context and, based on this, predict idiomaticity. Then, we show how and to what extent Named Entity Recognition can be exploited to reduce the degree of confusion of idiom identification systems and, therefore, improve performance. Our model achieves 92.1 F1 in the one-shot setting and shows strong robustness towards unseen idioms achieving 77.4 F1 in the zero-shot setting. We release our code at https://github.com/Babelscape/ner4id
Consumption volatility: the role of home ownership in protecting against uninsurable risks and the preferences of Italian households
ID10M: Idiom Identification in 10 Languages
Idioms are phrases which present a figurative meaning that cannot be (completely) derived by looking at the meaning of their individual components. Identifying and understanding idioms in context is a crucial goal and a key challenge in a wide range of Natural Language Understanding tasks. Although efforts have been undertaken in this direction, the automatic identification and understanding of idioms is still a largely under-investigated area, especially when operating in a multilingual scenario. In this paper, we address such limitations and put forward several new contributions: we propose a novel multilingual Transformer-based system for the identification of idioms; we produce a high-quality automatically-created training dataset in 10 languages, along with a novel manually-curated evaluation benchmark; finally, we carry out a thorough performance analysis and release our evaluation suite at https://github.com/Babelscape/ID10M
Micro Data Fusion of Italian Expenditures and Incomes Surveys
The aim of this work is to match household consumption information from Indagine sui Consumi delle Famiglie (Household Budget Survey, HBS) by the Italian National Statistical Institute (ISTAT) with Indagine sui Bilanci delle Famiglie Italiane (Survey of Households' Income and Wealth, SHIW) by the Bank of Italy for the year 2010. The work offers a review of the main matching methodologies, coupled with a discussion of the underlying hypotheses (such as the conditional independence assumption, CIA) which, in our case, are less demanding to assume given the presence of consumption aggregates as common variables between the two surveys. Moreover, some tests measuring the validity of the matching procedure are presented in order to check the preservation of joint distributions. The resulting sample is expected to allow better distributional and micro-econometric analyses on consumption, income and wealth (e.g. Engel curves, consumption age/income profiles). Moreover, the very detailed integrated dataset would constitute a platform for an integrated microsimulation analysis of direct, indirect and wealth tax reforms which, so far, has not been feasible taking available sample surveys separately. Our matching achieves a good preservation of the marginal distributions of all consumption aggregates from the donor survey. However, a thorough comparison of the original distributions suggests that the HBS is a convenient donor for the imputation of non-durable commodities only. Consumption aggregates closer to the concept of wealth (such as durables and the extraordinary expenditure for dwelling maintenance) or savings (such as mortgages and private pensions) prove to be better assessed by the longer, and more issue-specific, recall of the SHIW. As secondary outcomes, the information derived from HBS on non-durables entails an increase in the dispersion and an upward adjustment of consumption profiles in the synthetic distribution relative to SHIW. This also implies a downsized average propensity to save for the household sector, which gets closer to the National Accounts figures.
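As a rough illustration of the kind of statistical matching the abstract above describes, the following minimal sketch imputes a donor variable into a recipient file by nearest-neighbour distance on the common variables. It is not the authors' procedure: the record layout, the variable names (`x`, `y`, `z`), the toy values and the unscaled Euclidean distance are all assumptions made for the example.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance between two tuples of common variables."""
    return sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def nearest_donor(recipient_x, donors):
    """Return the donor record whose common variables are closest."""
    return min(donors, key=lambda d: distance(recipient_x, d["x"]))

def match(recipients, donors):
    """Attach the nearest donor's z value to each recipient record.

    Under the conditional independence assumption, y and z are taken to be
    independent given the common variables x, so z can be borrowed from a
    donor that looks alike on x.
    """
    return [{**r, "z": nearest_donor(r["x"], donors)["z"]} for r in recipients]

# Hypothetical toy records: x = (common consumption aggregate, household size).
# In practice the common variables would be standardized before computing distances.
donors = [  # donor file: observes x and non-durable spending z
    {"x": (1200.0, 2), "z": 450.0},
    {"x": (2500.0, 4), "z": 900.0},
]
recipients = [  # recipient file: observes x and income y
    {"x": (1300.0, 2), "y": 28000.0},
    {"x": (2400.0, 3), "y": 52000.0},
]

fused = match(recipients, donors)
for record in fused:
    print(record)
```

Each fused record keeps the recipient's own `x` and `y` and gains a `z` taken from its closest donor, which is why checks on the preservation of marginal and joint distributions matter.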
Towards comprehensive and efficient information extraction across languages
The exponential growth of textual data shared online has created an urgent need for methods that can effectively extract, structure, and interpret information from vast and varied texts. Information Extraction (IE), a key area within Natural Language Processing (NLP), addresses this need by transforming unstructured text into structured formats enabling automated text analytics and decision-making. However, existing IE systems face substantial challenges in scalability and generalization. These challenges include limited labeled data for low-resource languages, computational demands that restrict accessibility to only well-resourced institutions, and a predominant focus on popular entities. Additionally, most IE tasks are entity-centric tasks (e.g. Named Entity Recognition, Entity Disambiguation, and Relation Extraction), thus overlooking the broader contextual richness present in many texts.
This thesis aims at advancing the field of IE by tackling these critical issues through novel resources, methodologies, and theoretical approaches aimed at fostering a multilingual, scalable, and semantically-enriched IE framework. To bridge the multilingual gap, we leverage a combination of neural and knowledge-based approaches and create multilingual datasets for NER and Relation Extraction, ensuring that IE systems can operate effectively across diverse linguistic settings. On the computational front, we propose optimizations designed to reduce the resource requirements of IE models, especially in the context of Entity Disambiguation, enabling broader adoption of NLP technologies by reducing dependence on high-performance hardware and extensive labeled datasets.
Additionally, this work challenges traditional IE frameworks by expanding the focus beyond named entities to encompass abstract concepts, idiomatic expressions, and tail entities, which are essential for a more nuanced and comprehensive understanding of texts. Through these contributions, this research aims to establish a robust foundation for multilingual, resource-efficient IE systems that can meet the evolving demands of global text analytics across varied languages, domains, and cultural contexts. Finally, to further encourage the usage and development of multilingual IE systems, we publicly release all the artifacts (datasets and models) introduced in this thesis.
Preferences for public education spending in hierarchical education systems: theory and empirical evidence from OECD countries
This paper analyses the factors affecting preferences for public education spending, focusing on household income and other individual characteristics as well as on institutional features. Standard redistributive arguments à la Meltzer and Richard (1981) suggest that the impact of household income on preferences should be negative, since richer families are likely to oppose the redistributive effect of public funding. However, the empirical evidence does not seem to confirm this prediction. To shed some light on this issue, our proposed interpretative key hinges on the hierarchical structure of the education system. To this purpose, we set up a model in which agents are heterogeneous in terms of income and education, and human capital is produced in a two-tier education system. We show that individual preferences for public education spending are affected by household income and by variables related to the socioeconomic context, such as income inequality and the social inclusiveness of the education system, which determine the ultimate redistributive effect of public spending. We are able to test some of the predictions of our model using individual-level data from the ISSP (2006 wave). The econometric analysis points out that household income is, unambiguously, a negative predictor of preferences when considering openly redistributive education expenses. By contrast, when considering general schooling expenses, the intensity and even the direction of the income effect are affected by income inequality and by the social inclusiveness of the education system. We also find significant residual variability in the income coefficient, due to unobserved factors, most of which arises at the individual within-country level rather than at the cross-country level.
Smokers are different: The impact of price increases on smoking reduction and downtrading
Using data from an ad hoc survey conducted in July 2016 on Italian smokers’ habits, we investigate how different categories of smokers react to different types of price changes by means of latent class econometric analysis. While the previous literature focused on the effects of general price changes and overlooked substitution effects among brands, the present analysis reveals that the probability of reducing cigarette consumption is always higher for uniform rather than uneven price increases across brands. Moreover, downtrading to cheaper products is found to increase with the size of price changes, provided that these are uneven across brands. Finally, we provide a range for the implicit elasticity of cigarette demand. While being inelastic on average, it ranges between 0.2 and 0.9 depending on the smoker category. These findings have important implications for the design of both health and tax policies, as they provide new insights into the potential reactions of smokers to policy interventions.
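To give a sense of what the reported elasticity range means, the short sketch below applies the usual first-order approximation: percentage change in quantity is roughly minus elasticity times percentage change in price. It assumes the 0.2 to 0.9 figures are absolute values of a negative price elasticity; the function name and the 10% price increase are illustrative choices, not taken from the paper.

```python
def quantity_change_pct(elasticity_abs, price_change_pct):
    """First-order approximation of the percentage change in quantity demanded.

    elasticity_abs is the absolute value of the price elasticity (assumed
    negative in sign), so a price increase yields a fall in demand.
    """
    return -elasticity_abs * price_change_pct

# Hypothetical scenario: a uniform 10% price increase, evaluated at both
# ends of the elasticity range reported in the abstract.
price_increase = 10.0
changes = {eps: quantity_change_pct(eps, price_increase) for eps in (0.2, 0.9)}

for eps, change in changes.items():
    print(f"elasticity {eps}: approx. {change:.1f}% change in cigarettes demanded")
```

Because both values are below 1 in absolute terms, the implied fall in consumption is proportionally smaller than the price rise in either case, which is what "inelastic" means here.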
