    Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation

    Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author (A) or by someone else (Ā). It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author. In this paper, we investigate the potential benefits of augmenting the classifier training set with (negative) synthetic examples. These synthetic examples are generated to imitate the style of A. We analyze the improvements in the classifier predictions that this augmentation brings to bear in the task of AV in an adversarial setting. In particular, we experiment with three different generator architectures (one based on Recurrent Neural Networks, another based on small-scale transformers, and another based on the popular GPT model) and with two training strategies (one inspired by standard Language Models, and another inspired by Wasserstein Generative Adversarial Networks). We evaluate our hypothesis on five datasets (three of which have been specifically collected to represent an adversarial setting) and using two learning algorithms for the AV classifier (Support Vector Machines and Convolutional Neural Networks). This experimentation yields negative results, revealing that, although our methodology proves effective in many adversarial settings, its benefits are too sporadic for practical application.

    MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts

    We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets, we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a text of unknown authorship was written by a candidate author. We also make available the source code of the authorship verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other researchers. We also describe the application of the above authorship verification system, using these datasets as training data, for investigating the authorship of two medieval epistles whose authorship has been disputed by scholars.

    The Epistle to Cangrande Through the Lens of Computational Authorship Verification

    The Epistle to Cangrande is one of the most controversial among the works of Italian poet Dante Alighieri. For more than a hundred years now, scholars have been debating its real paternity, i.e., whether it should be considered a true work by Dante or a forgery by an unnamed author. In this work we address this philological problem through the methodologies of (supervised) Computational Authorship Verification, by training a classifier that predicts whether a given work is by Dante Alighieri or not. We discuss the system we have set up for this endeavour, the training set we have assembled, the experimental results we have obtained, and some issues that this work leaves open.

    Explainable authorship identification in cultural heritage applications

    While a substantial amount of work has recently been devoted to improving the accuracy of computational Authorship Identification (AId) systems for textual data, little to no attention has been paid to endowing AId systems with the ability to explain the reasons behind their predictions. This substantially hinders the practical application of AId methods, since the predictions returned by such systems are hardly useful unless they are supported by suitable explanations. In this article, we explore the applicability of existing general-purpose eXplainable Artificial Intelligence (XAI) techniques to AId, with a focus on explanations addressed to scholars working in cultural heritage. In particular, we assess the relative merits of three different types of XAI techniques (feature ranking, probing, factual and counterfactual selection) on three different AId tasks (authorship attribution, authorship verification and same-authorship verification) by running experiments on real AId textual data. Our analysis shows that, while these techniques make important first steps towards XAI, more work remains to be done to provide tools that can be profitably integrated into the workflows of scholars.

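    L’Epistola a Cangrande al vaglio della Computational Authorship Verification: risultati preliminari (con una postilla sulla cosiddetta “XIV Epistola di Dante Alighieri”) [The Epistle to Cangrande under the scrutiny of Computational Authorship Verification: preliminary results (with a postscript on the so-called “14th Epistle of Dante Alighieri”)]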

    In this work we apply techniques from computational Authorship Verification (AV) to the problem of detecting whether the “Epistle to Cangrande” is an authentic work by Dante Alighieri or is instead the work of a forger. The AV algorithm we use is based on “machine learning”: the algorithm “trains” an automatic system (a “classifier”) to detect whether a certain Latin text is Dante’s or not Dante’s, by exposing it to a corpus of example Latin texts by Dante and example Latin texts by authors coeval to Dante. The detection is based on the analysis of a set of stylometric features, i.e., style-related linguistic traits whose usage frequencies tend to represent an author’s unconscious “signature”. The analysis carried out in this work suggests that, of the two parts into which the Epistle is traditionally subdivided, neither is Dante’s. Experiments in which we have applied our AV system to each text in the corpus suggest that the system has a fairly high degree of accuracy, thus lending credibility to its hypothesis about the authorship of the Epistle. In the last section of this paper we apply our system to what has been hypothesized to be “Dante’s 14th Epistle”; the system rejects, with very high confidence, the hypothesis that this epistle might be Dante’s.
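
To make the notion of stylometric "usage frequencies" concrete, the following is a minimal sketch of one common kind of feature: the relative frequency of function words in a text. The word list `FUNCTION_WORDS` and the function name are illustrative assumptions, not the feature set or code used in the paper; in a real system such vectors would then be passed to a classifier.

```python
import re
from collections import Counter

# Hypothetical, illustrative list of Latin function words.
# This is NOT the feature set used in the paper.
FUNCTION_WORDS = ["et", "in", "non", "ad", "ut", "cum", "quod", "sed"]

def function_word_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in `text`."""
    # Lowercase and keep only runs of ASCII letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    if total == 0:
        # Empty text: every frequency is zero by convention.
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / total for word in vocabulary]

# Toy sentence made up for illustration only.
sample = "Et in principio non erat lux, sed in tenebris et in silentio."
features = function_word_frequencies(sample)
```

Each position in `features` corresponds to one word in `FUNCTION_WORDS`, so texts of different lengths are mapped to vectors of the same size and can be compared directly.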

    Investigating topic-agnostic features for authorship tasks in Spanish political speeches

    Authorship Identification is the branch of authorship analysis concerned with uncovering the author of a written document. Methods devised for Authorship Identification typically employ stylometry (the analysis of unconscious traits that authors exhibit while writing), and are expected not to make inferences grounded on the topics the authors usually write about (as reflected in their past production). In this paper, we present a series of experiments evaluating the use of feature sets based on rhythmic and psycholinguistic patterns for Authorship Verification and Attribution in Spanish political language, via different approaches of text distortion used to actively mask the underlying topic. We feed these feature sets to an SVM learner, and show that they lead to results that are comparable to those obtained by the BETO transformer when the latter is trained on the original text, i.e., when potentially learning from topical information.
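
As an illustration of what topic masking by text distortion can look like, the sketch below replaces every word outside a small list of function words with asterisks of the same length, so that topical vocabulary disappears while word lengths and function words remain. This is one simple distortion scheme chosen for illustration; the `KEEP` list and the function name are assumptions and do not reproduce the specific approaches evaluated in the paper.

```python
import re

# Hypothetical list of Spanish function words to leave untouched.
# This is an assumption for illustration, not the paper's list.
KEEP = {"el", "la", "de", "que", "y", "en", "los", "no", "a"}

def mask_topic(text, keep=KEEP):
    """Replace each word not in `keep` with asterisks of equal length."""
    def replace(match):
        word = match.group(0)
        # Function words survive; all other words are masked.
        return word if word.lower() in keep else "*" * len(word)
    # [^\W\d_]+ matches runs of letters, including accented ones.
    return re.sub(r"[^\W\d_]+", replace, text)

masked = mask_topic("El gobierno no aprueba la reforma")
```

Here `masked` is `"El ******** no ******* la *******"`: the content words are hidden, but the sentence skeleton that style-based features rely on is preserved.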

    Rhythmic and psycholinguistic features for authorship tasks in the Spanish parliament : evaluation and analysis

    Among the many tasks of the authorship field, Authorship Identification aims at uncovering the author of a document, while Author Profiling focuses on the analysis of personal characteristics of the author(s), such as gender, age, etc. Methods devised for such tasks typically focus on the style of the writing, and are expected not to make inferences grounded on the topics that certain authors tend to write about. In this paper, we present a series of experiments evaluating the use of topic-agnostic feature sets for Authorship Identification and Author Profiling tasks in Spanish political language. In particular, we propose to employ features based on rhythmic and psycholinguistic patterns, obtained via different approaches of text masking that we use to actively mask the underlying topic. We feed these feature sets to an SVM learner, and show that they lead to results that are comparable to those obtained by a BETO transformer, when the latter is trained on the original text, i.e., potentially learning from topical information. Moreover, we further investigate the results for the different authors, showing that variations in performance are partially explainable in terms of the authors’ political affiliation and communication style.