AI-Linguistica
24 research outputs found
AI-driven speech act annotation: accuracy and reproducibility across ChatGPT, LadderWeb and LLaMA
This study evaluates three machine learning systems for annotating pragmatic categories, focusing on cancellations after accepting an invitation. The systems include the supervised model LadderWeb and the pre-trained models ChatGPT-4o and LLaMA-3.2. LadderWeb, built on Apache OpenNLP, was specifically designed for cancellation annotation. ChatGPT-4o was tested through a web interface to simulate non-expert use, while LLaMA-3.2 was run locally to ensure control, reproducibility, and data security. Both large language models were prompted using a few-shot learning approach (Brocca et al., in review). System outputs were compared against a human baseline. GPT achieved the highest agreement across dimensions, with κ values ranging from substantial to almost perfect. LadderWeb also showed substantial agreement, whereas LLaMA performed considerably worse. Repeated testing after seven months revealed that GPT’s results varied, though accuracy remained high, while LadderWeb and LLaMA produced self-consistent outputs. Notably, LLaMA improved when parameters were adjusted. These findings highlight the potential of pre-trained large language models such as ChatGPT-4o to support pragmatic corpus annotation, while also emphasizing their reproducibility challenges—an issue not observed with LadderWeb or LLaMA
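As an illustration of the agreement measure mentioned in this abstract, the following is a minimal sketch of Cohen's κ between a human baseline and one system's annotations. The label names and the toy data are invented for the example and are not taken from the study. The abstract does not say which interpretive scale it uses; the function `agreement_band` assumes the commonly cited Landis and Koch bands, whose labels ("substantial", "almost perfect") match the abstract's wording.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Verbal label for kappa (assumed scale: Landis and Koch)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical toy annotations, not data from the study.
human = ["cancel", "cancel", "other", "other"]
system = ["cancel", "other", "other", "other"]
kappa = cohen_kappa(human, system)
print(round(kappa, 2), agreement_band(kappa))
```

Running the same comparison on outputs collected at two different times would give a simple check of the kind of reproducibility the abstract discusses.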
Tracing English interference in AI-generated German: An analysis of word order and syntactic fronting
Large language models (LLMs) constitute a transformative advancement in natural language processing, yet their development remains disproportionately skewed toward English. Despite the global linguistic landscape, non-English languages – including major languages like Spanish, French or Chinese – are effectively treated as low-resource in current LLM training paradigms. This study analyses two linguistic traits of AI-generated texts which mimic human-authored German newspaper articles and compares them with a purpose-built corpus of real journalistic texts. These features are (i) word order and (ii) pre-field occupation. Through quantitative and qualitative analyses of the outputs of four distinct LLMs, three key phenomena in AI’s German outputs were identified: (i) a marked preference for SVO word order; (ii) reduced syntactic variability compared to human-authored texts; and (iii) the emergence of stylistically marked constructions which mirror English linear progression rather than native German sentence bracketing. While some models approximate human-like syntactic patterns for certain variables, this equivalence remains limited and context-dependent, which may suggest a cross-linguistic interference from the overwhelming English predominance in LLM training data. The study emphasises the linguistic implications of LLM architectures and calls attention to the urgent need for more equitable representation of world languages in natural language processing development
Benchmarking AI acceptability and grammaticality in German: A study of ChatGPT and human judgments
The rapid development of large language models has opened new avenues for linguistic research, including areas traditionally reliant on native-speaker intuitions. One such domain is grammaticality and acceptability judgment, where speakers assess whether sentences are structurally well-formed and contextually appropriate. This study investigates the extent to which ChatGPT-4 can approximate human judgments in German, focusing on a diverse range of grammatical and usage-related phenomena. A carefully designed set of test items was presented to both the model and native speakers, allowing for a direct comparison. The results show a high degree of alignment in many cases, but also reveal systematic divergences, particularly in contexts involving gradience, sociolinguistic markedness or context-dependent acceptability. These findings demonstrate both the analytical potential and the current limitations of large language models in linguistic research, and contribute to ongoing discussions about their ability to approximate native speaker competence
Uncanny Semantics. How AI and Human Authors Use Language Differently in Academic Writing
This study explores the semantic differences between human-written and AI-generated academic texts by applying word embedding techniques to a curated corpus of 325 introductions from linguistic articles. The corpus includes human-authored texts and AI-generated texts produced by six language models (OpenAI, Google, and DeepSeek; base and advanced). Each topic was prompted in two different ways: plain and academic. Using cosine similarity, the most frequently occurring lemmas were grouped into semantic categories. The analysis reveals that AI-generated texts, especially under academic prompts, overuse positive-evaluative and methodological vocabulary (e.g., central, crucial, analysis, methodology) and explicitly refer to text structure more often than the plainly prompted texts (e.g., section, chapter). In contrast, human authors employ more epistemically cautious, critical, evaluative, and connective language (e.g., possibly, inconsistent, by no means). I propose that the relative absence of such epistemic markers in AI texts, combined with their tendency to exaggerate the importance of certain topics or data, reflects a pattern of pseudo-commitment: the models produce syntactically assertive, formally academic prose but only weakly modulate epistemic stance and critical engagement, which may contribute to the reported sense of weirdness in AI-generated academic writing
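To make the similarity measure named in this abstract concrete, here is a minimal sketch of cosine similarity between word vectors and a nearest-neighbour lookup. The two-dimensional vectors are invented toy values, and `most_similar` is a hypothetical helper; the study's actual embeddings and grouping procedure are not described in the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(lemma, embeddings):
    """Return the other lemma whose vector is closest to `lemma`."""
    target = embeddings[lemma]
    candidates = (w for w in embeddings if w != lemma)
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


# Invented toy vectors; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "crucial": [0.9, 0.1],
    "central": [0.8, 0.2],
    "possibly": [0.1, 0.9],
}
print(most_similar("crucial", toy_embeddings))
```

With vectors like these, lemmas whose pairwise similarity is high can be placed in the same semantic category, which is one plausible reading of the grouping step described above.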
ChatGPT as a linguistic informant. A comparison of human and AI-generated translations
Over the past sixty years, various research methods have been proposed to help researchers collect reliable data to which theoretical analyses can be applied. This paper investigates whether translations of selected excerpts from literary works generated by ChatGPT can help the researcher gain more insight into a linguistic phenomenon, viz. the acceptability of preverbal bare plural nouns in Romance, which have been argued to be much less acceptable than bare nouns in Germanic. The machine translations are checked against the official translations by professional human translators. Furthermore, the chatbot is queried about its choices. A quantitative and qualitative comparison reveals that the machine’s translations approach those of the human translators. However, while the chatbot shows some metalinguistic knowledge and can explain some of its choices, it fails where the explanation requires more analytical knowledge. The paper concludes that the chatbot’s observational adequacy may nevertheless help the researcher investigate specific linguistic phenomena for which data are difficult to obtain
Marking intersubjectivity in human-written and AI-generated editorials published in "Il Foglio"
In March 2025, the Milan-based broadsheet Il Foglio launched Il Foglio AI, a month-long experiment featuring a daily four-page supplement entirely generated by large language models (LLMs). Owing to the success of the experiment, the project has continued as a weekly feature since April 2025. Each edition of Il Foglio AI contains around 25 articles spanning diverse journalistic genres, including editorials, which form the focus of the present analysis. The paper compares human-written and LLM-generated editorials from Il Foglio and Il Foglio AI, examining the use of authorial stance markers to analyze how intersubjective positionings are conveyed. To this end, the study draws on Martin and White’s (2005) taxonomy of four “engagement” meanings typically expressed by markers of intersubjectivity. The analysis is particularly relevant for the description of AI-generated texts as a new textual typology, as LLMs lack experiential grounding and cannot hold attitudes, beliefs, or judgments. The dataset comprises two subcorpora of 25 editorials each, published between April and May 2025 in Il Foglio and Il Foglio AI
The design and discursive construction of a ‘speaking’ vacuum cleaning robot for assistive purposes: Findings on communication ideologies from a current research and development project
This sociolinguistically informed study deals with the design and discursive construction of a voice assistant (Amazon’s Alexa) in a project that develops a vacuum cleaning robot (‘Smart Companion’) capable of detecting a fallen person and providing assistance for older people in case of an emergency. The paper investigates sociotechnical ensembles along the co-constitutive lines of users, the technical device, and society, with communication ideologies (e.g., assumptions about the communicative nature of technical devices) emerging from and influencing this complex triad. The qualitative analysis of interactional data and interviews suggests that communication ideologies materialize in the way participants interact with Alexa, drawing on human-human communication strategies and adapting their linguistic behavior when interactional problems occur. Communication ideologies include assumptions about the ontology, (linguistic) agency, and purpose of the voice assistant / robot, and about the relationship between the user and the voice assistant / robot. Discursive constructions include anthropomorphization and hybrid, partly hesitant ontological categorizations, combining both human and non-human qualities. Participants show an interest in the Smart Companion, but do not yet consider themselves to be in need of assistive technologies like the Smart Companion, reiterating the importance of taking the users, their self-image and discourses into account
Assessing the effectiveness of ChatGPT-3.5 and ChatGPT-4o in simplifying Italian institutional texts
This research aims to describe the performance of ChatGPT-3.5 and ChatGPT-4o in the task of Automatic Text Simplification (ATS) of Italian institutional texts. The aim is to analyse the linguistic differences between the original texts and their simplified rewritings by ChatGPT, and the impact of these differences on non-expert users’ experience. A dataset of six short texts was compiled to be rewritten using a zero-shot instructional prompt. The methodological approach combined quantitative linguistic analyses, manual analysis and human judgment to assess the effectiveness of the simplification. For the quantitative linguistic analysis, an additional comparison was made between ChatGPT’s rewritings and human revisions, used as an external benchmark to better contextualize the AI’s simplification strategies. The study provides new insights into the linguistic structure of administrative-bureaucratic texts by examining readability parameters and collecting subjective assessments of comprehension and perceived comprehensibility. It also aims to contribute to the growing body of research on text simplification methods and the role of large language models (LLMs) in enhancing accessibility to complex institutional discourse
Prompt Engineering for evaluators: optimizing LLMs to judge linguistic proficiency
Prompt Engineering, the practice of optimizing the query posed to a Large Language Model (LLM), is closely linked to evaluation procedures. Depending on the type of task performed with LLMs, the evaluation metric may have high or low reliability, making Prompt Engineering more or less effective. LLM-as-a-judge represents a possible solution for performing Prompt Engineering in tasks that are hard to evaluate, although the reliability of this practice is not guaranteed and depends on the task and the language model. This paper presents an evaluation of general-purpose LLMs in an essay-scoring task using state-of-the-art small models. In particular, it evaluates the ability of language models to assign proficiency levels to short essays written by Italian L2 learners. Test data with expert annotations of CEFR scores are extracted from the Kolipsi-II corpus. Several prompting techniques are used to analyze the impact of Prompt Engineering on this task. Results show a wide difference in accuracy among the three LLMs and that choosing the right prompt radically changes their rating abilities
Syntactic patterns of Italo-Romance CP-layers in Transformers, ChatGPT and Deepseek: two case studies from Romansh and Neapolitan
Large language models (LLMs) have recently become the object of syntactic investigation, whether via standard probability scores in masked modeling or through interaction with conversational AI. This study reports results on some syntactic patterns of the Left Periphery of the clause in two Romance varieties: V2 and violations of V2 in Romansh and topic-subject agreement in Neapolitan. We study masking models with multilingual transformers (multilingual BERT and monolingual via adapters, Swiss-BERT for Romansh; multilingual BERT for Neapolitan) and through interactions with two conversational AI systems (ChatGPT, DeepSeek) by prompting an evaluation task in three languages (English, Italian for ChatGPT, Chinese for DeepSeek). Our results show asymmetries across models and structures: the theoretical predictions for Romansh are confirmed by the monolingual transformer and partially by the conversational AI systems, while the Neapolitan topic-subject agreement remains challenging