Journal for Language Technology and Computational Linguistics (JLCL)
253 research outputs found
Post hoc implementation of non-standard phonetic features in the context of aphasic speech analysis
Despite current progress, automatic speech recognition (ASR) often struggles with non-standard speech, for example speech influenced by dialectal or pathological features. (Re)training ASR models to accommodate these variations is not always possible due to limited data. This paper proposes applying knowledge about non-standard (aphasic and dialectal) phonetic features to the ASR transcription post hoc. Using speech data from German speakers with aphasia who speak the Thuringian-Upper Saxon dialect, this study evaluates the impact of these modifications on an ASR-based error analysis pipeline. The approach helps to reduce automatic error rates on the recordings manually labelled as error-free. The performance of the pipeline also improves both in general acceptance or rejection of the responses and in error attribution. General acceptance/rejection accuracy reaches a mean of 83.3%, which is considered sufficient for use in a digital application for speech and language therapy support.
Measuring the Contributions of Vision and Text Modalities
This dissertation investigates multimodal transformers that process both image and text modalities together to generate outputs for various tasks (such as answering questions about images). Specifically, methods are developed to assess the effectiveness of vision and language models in combining, understanding, utilizing, and explaining information from these two modalities. The dissertation contributes to the advancement of the field in three ways: (i) by measuring specific and task-independent capabilities of vision and language models, (ii) by interpreting these models to quantify the extent to which they use and integrate information from both modalities, and (iii) by evaluating their ability to provide self-consistent explanations of their outputs to users.
Political Bias in LLMs: Unaligned Moral Values in Agent-centric Simulations
Contemporary research in social sciences increasingly utilizes state-of-the-art generative language models to annotate or generate content. While these models achieve benchmark-leading performance on common language tasks, their application to novel out-of-domain tasks remains insufficiently explored. To address this gap, we investigate how personalized language models align with human responses on the Moral Foundation Theory Questionnaire. We adapt open-source generative language models to different political personas and repeatedly survey these models to generate synthetic data sets where model-persona combinations define our sub-populations. Our analysis reveals that models produce inconsistent results across multiple repetitions, yielding high response variance. Furthermore, the alignment between synthetic data and corresponding human data from psychological studies shows a weak correlation, with conservative persona-prompted models particularly failing to align with actual conservative populations. These results suggest that language models struggle to coherently represent ideologies through in-context prompting due to their alignment process. Thus, using language models to simulate social interactions requires measurable improvements in in-context optimization or parameter manipulation to align with psychological and sociological stereotypes properly.
GPT makes a poor AMR parser
This paper evaluates GPT models as out-of-the-box Abstract Meaning Representation (AMR) parsers using prompt-based strategies, including 0-shot, few-shot, Chain-of-Thought (CoT), and a two-step approach in which core arguments and non-core roles are handled separately. Our results show that GPT-3.5 and GPT-4o fall well short of state-of-the-art parsers, with a maximum Smatch score of 60 using GPT-4o in a 5-shot setting. While CoT prompting provides some interpretability, it does not improve performance. We further conduct fine-grained evaluations, revealing GPT’s limited ability to handle AMR-specific linguistic structures and complex semantic roles. Our findings suggest that, despite recent advances, GPT models are not yet suitable as standalone AMR parsers.
The Struggles of Large Language Models with Zero- and Few-Shot (Extended) Metaphor Detection
Extended metaphor is the use of multiple metaphoric words that express the same domain mapping. Although it would provide valuable insight for computational metaphor processing, detecting extended metaphor has been rather neglected. We fill this gap by providing a series of zero- and few-shot experiments on the detection of all linguistic metaphors and specifically on extended metaphors with LLaMa and GPT models. We find that no model was able to achieve satisfactory performance on either task, and that LLaMa in particular showed problematic overgeneralization tendencies. Moreover, our error analysis showed that LLaMa is not sufficiently able to construct the domain mappings relevant for metaphor understanding.
Exploring the Limits of LLMs for German Text Classification: Prompting and Fine-tuning Strategies Across Small and Medium-sized Datasets
Large Language Models (LLMs) are highly capable, state-of-the-art technologies and are widely used as text classifiers for various NLP tasks, including sentiment analysis, topic classification, and legal document analysis. In this paper, we present a systematic analysis of the performance of LLMs as text classifiers using five German datasets from social media across 13 different tasks. We investigate zero-shot (ZSC) and few-shot classification (FSC) approaches with multiple LLMs and provide a comparative analysis with fine-tuned models based on Llama-3.2, EuroLLM, Teuken and BübleLM. We concentrate on investigating the limits of LLMs and on accurately describing our findings and overall challenges.
Do LLMs fail in bridging generation?
In this work we investigate whether large language models (LLMs) ‘understand’ bridging relations and can use this knowledge effectively. We present the results obtained from two tasks: generation of texts containing bridging and filling in missing bridging spans. We show that in most cases LLMs fail to generate bridging reliably.
Can we Operationalize Conceptual Metaphor Cross-Lingually?
The conceptual nature of metaphorical expression is a long-discussed phenomenon, highly investigated by linguists, psychologists, translators, and philosophers, amongst others. In theoretical work, distinctions are made between conceptual metaphors (a phenomenon of human cognition) and linguistic metaphors (their concrete realizations in language), while most computational approaches have only addressed the latter. In the age of massive language models, metaphor and other phenomena of figurative speech are earning new attention as more and more textual analyses are built on top of neural-networking tools that do not necessarily make a distinction between the lexicalization of a concept and the concept itself. Hence, an investigation of conceptual metaphor using a more linguistics-driven perspective is of much importance.
In this work, we investigate the conceptuality of metaphoric expressions across two languages utilizing a parallel corpus of news commentaries from the web. We assume that a conceptual metaphor is represented by many instances of linguistic metaphors. This idea presupposes linguistic metaphor as an operationalization of conceptual metaphor. We perform several tests on how metaphors are translated between the languages, to assess whether distinct lexicalizations of a metaphor form conceptual clusters, and whether the usage of words in a metaphorical context is distinguishable from their usage in literal contexts.
We find that we are able to group linguistic metaphors in one language into semantically related sets by clustering their translations in another language. We argue that these semantically related sets constitute an operationalization of conceptual metaphors. In English, the clusters are formed by fewer, but more diverse lexelts (linguistic types), while in German we find more and bigger clusters composed primarily of derivatives and compounds. We also find that when a lexelt is translated similarly in unannotated instances to known metaphoric usages, then its contextual sense tends to be figurative as well.
A Study of Errors in the Output of Large Language Models for Domain-Specific Few-Shot Named Entity Recognition
This paper proposes an error classification framework for a comprehensive analysis of the output that large language models (LLMs) generate in a few-shot named entity recognition (NER) task in a specialised domain. The framework should be seen as an exploratory analysis complementary to established performance metrics for NER classifiers, such as F1 score, as it accounts for outcomes possible in a few-shot, LLM-based NER task. By categorising and assessing incorrect named entity predictions quantitatively, the paper shows how the proposed error classification could support a deeper cross-model and cross-prompt performance comparison, alongside a roadmap for a guided qualitative error analysis.