Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages

Author: Fransen Theodorus
Publication date: 01/01/2023

Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks, including unsupervised machine translation, named entity recognition and information retrieval. Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models, which under-perform for most under-resourced languages. This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages using morphological knowledge from closely related languages. We train an encoder to gain morphological knowledge of a language and transfer the knowledge to perform unsupervised and weakly-supervised cognate detection tasks with and without the pivot language for the closely-related languages. Being unsupervised, it overcomes the need for hand-crafted annotation of cognates. We performed experiments on different published cognate detection datasets across language families and observed that our method significantly outperforms the state-of-the-art supervised and unsupervised methods. Our model can be extended to a wide range of languages from any language family, as it does not require annotated cognate pairs for training.

PubliCatt

HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis

Author: Tripto Nafis Irtiza
Giannotti Fosca
Uchendu Adaku
Le Thai
Lee Dongwon
Setzu Mattia
Publication date: 01/01/2023

Authorship Analysis, also known as stylometry, has been an essential aspect of Natural Language Processing (NLP) for a long time. Likewise, the recent advancement of Large Language Models (LLMs) has made authorship analysis increasingly crucial for distinguishing between human-written and AI-generated texts. However, these authorship analysis tasks have primarily been focused on written texts, not considering spoken texts. Thus, we introduce the largest benchmark for spoken texts - HANSEN (Human ANd ai Spoken tExt beNchmark). HANSEN encompasses meticulous curation of existing speech datasets accompanied by transcripts, alongside the creation of novel AI-generated spoken text datasets. Together, it comprises 17 human datasets and AI-generated spoken texts created using 3 prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To evaluate and demonstrate the utility of HANSEN, we perform Authorship Attribution (AA) & Author Verification (AV) on human-spoken datasets and conduct Human vs. AI spoken text detection using state-of-the-art (SOTA) models. While SOTA methods, such as character n-gram or Transformer-based models, exhibit similar AA & AV performance in human-spoken datasets compared to written ones, there is much room for improvement in AI-generated spoken text detection. The HANSEN benchmark is available at: https://huggingface.co/datasets/HANSEN-REPO/HANSEN

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Archivio della Ricerca - Università di Pisa

Author Instructions

Author: Instructions Author
Publication date: 04/11/2013

Crossref

Cartographic Perspectives (E-Journal - North American Cartographic Information Society, NACIS)

Solving hard analogy questions with relation embedding chains

Author: Kumar Nitesh
Schockaert Steven

Modelling how concepts are related is a central topic in Lexical Semantics. A common strategy is to rely on knowledge graphs (KGs) such as ConceptNet, and to model the relation between two concepts as a set of paths. However, KGs are limited to a fixed set of relation types, and they are incomplete and often noisy. Another strategy is to distill relation embeddings from a fine-tuned language model. However, this is less suitable for words that are only indirectly related and it does not readily allow us to incorporate structured domain knowledge. In this paper, we aim to combine the best of both worlds. We model relations as paths but associate their edges with relation embeddings. The paths are obtained by first identifying suitable intermediate words and then selecting those words for which informative relation embeddings can be obtained. We empirically show that our proposed representations are useful for solving hard analogy questions.

Online Research @ Cardiff

Language Model Quality Correlates with Psychometric Predictive Power in Multiple Languages

Author: Pimentel Tiago
Cotterell Ryan
Wilcox Ethan Gotlieb
Meister Clara Isabel
Publication date: 2023

ETHzürich Repository for Publications and Research Data

VivesDebate-Speech: A Corpus of Spoken Argumentation to Leverage Audio Features for Argument Mining

Author: Ruiz-Dolz Ramon
Iranzo-Sánchez Javier
Publication date: 22/09/2022

The VivesDebate-Speech corpus contains the acoustic information of 29 different argumentative debates and the annotations of the segmentation (i.e., BIO tags) of the Argumentative Discourse Units identified in the spoken natural language discourse.
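
To illustrate what BIO segmentation tags look like, here is a minimal sketch that converts token spans into B (begin), I (inside) and O (outside) labels. The function name `spans_to_bio`, the example sentence and the span positions are all hypothetical and are not taken from the corpus; the corpus's actual tag inventory and file format may differ.

```python
from typing import List, Tuple


def spans_to_bio(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert half-open (start, end) token spans into BIO tags.

    Hypothetical helper for illustration only; it assumes spans do not overlap.
    """
    tags = ["O"] * num_tokens          # every token starts outside any unit
    for start, end in spans:
        tags[start] = "B"              # first token of a unit
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens inside the unit
    return tags


# Invented example: two argumentative units separated by a connective.
tokens = ["well", "taxes", "should", "rise", "because", "schools", "need", "funds"]
adu_spans = [(1, 4), (5, 8)]
bio_tags = spans_to_bio(len(tokens), adu_spans)

for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Each unit opens with a single B tag, so adjacent units remain distinguishable even when no O token separates them.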

Discovery Research Portal

Citance-Contextualized Summarization of Scientific Papers

Author: Al-Khatib Khalid
Hakimi Ahmad Dawar
Syed Shahbaz
Potthast Martin
Publication date: 01/01/2023

Current approaches to automatic summarization of scientific papers generate informative summaries in the form of abstracts. However, abstracts are not intended to show the relationship between a paper and the references cited in it. We propose a new contextualized summarization approach that can generate an informative summary conditioned on a given sentence containing the citation of a reference (a so-called “citance”). This summary outlines the content of the cited paper relevant to the citation location. Thus, our approach extracts and models the citances of a paper, retrieves relevant passages from cited papers, and generates abstractive summaries tailored to each citance. We evaluate our approach using WEBIS-CONTEXT-SCISUMM-2023, a new dataset containing 540K computer science papers and 4.6M citances therein.

University of Groningen

BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

Author: Jiang Yifan
Ma Kaixin
Sourati Zhivar
Ilievski Filip
Publication date: 01/01/2023

The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking and defy default commonsense associations. We design a three-step procedure for creating the first lateral thinking benchmark, consisting of data collection, distractor generation, and generation of reconstruction examples, leading to 1,100 puzzles with high-quality annotations. To assess the consistency of lateral reasoning by models, we enrich BRAINTEASER based on a semantic and contextual reconstruction of its questions. Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance, which is further widened when consistency across reconstruction formats is considered. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.

VU Research Portal

Dynamic Top-k Estimation Consolidates Disagreement between Feature Attribution Methods

Author: Fokkens Antske
Beinborn Lisa
Kamp Jonathan
Publication date: 01/01/2023

Feature attribution scores are used for explaining the prediction of a text classifier to users by highlighting k tokens. In this work, we propose a way to determine the optimal number k of tokens that should be displayed from sequential properties of the attribution scores. Our approach is dynamic across sentences, method-agnostic, and deals with sentence length bias. We compare agreement between multiple methods and humans on an NLI task, using fixed k and dynamic k. We find that perturbation-based methods and Vanilla Gradient exhibit the highest agreement on most method-method and method-human agreement metrics with a static k. Their advantage over other methods disappears with dynamic ks, which mainly improve Integrated Gradient and GradientXInput. To our knowledge, this is the first evidence that sequential properties of attribution scores are informative for consolidating attribution signals for human interpretation.
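
As a point of reference for the fixed-k setting mentioned in the abstract, the sketch below highlights the k tokens with the largest attribution scores. It covers only the static baseline; the paper's dynamic estimation of k is not described here and is not reproduced. The function name `top_k_tokens`, the sentence and the scores are invented for illustration.

```python
from typing import List


def top_k_tokens(tokens: List[str], scores: List[float], k: int) -> List[str]:
    """Return the k highest-scoring tokens, kept in sentence order.

    Hypothetical helper: a static top-k selection, not the dynamic method.
    """
    # Rank token positions by attribution score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Keep the top k positions and restore the original word order.
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


# Invented attribution scores for a short hypothesis sentence.
tokens = ["a", "man", "is", "sleeping", "outside"]
scores = [0.05, 0.40, 0.02, 0.90, 0.30]
highlighted = top_k_tokens(tokens, scores, k=2)
print(highlighted)
```

With a fixed k the same number of tokens is highlighted regardless of sentence length, which is the length bias the abstract says a dynamic k is meant to address.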

VU Research Portal

The intended uses of automated fact-checking artefacts: why, how and who

Author: Ousidhoum Nedjma
Vlachos Andreas
Schlichtkrull Michael
Publication date: 31/12/2023

Automated fact-checking is often presented as an epistemic tool that fact-checkers, social media consumers, and other stakeholders can use to fight misinformation. Nevertheless, few papers thoroughly discuss how. We document this by analysing 100 highly-cited papers, and annotating epistemic elements related to intended use, i.e., means, ends, and stakeholders. We find that narratives leaving out some of these aspects are common, that many papers propose inconsistent means and ends, and that the feasibility of suggested strategies rarely has empirical backing. We argue that this vagueness actively hinders the technology from reaching its goals, as it encourages overclaiming, limits criticism, and prevents stakeholder feedback. Accordingly, we provide several recommendations for thinking and writing about the use of fact-checking artefacts.

Online Research @ Cardiff