Datasets of the article "From Classification to Quantification in Tweet Sentiment Analysis"

Author: Sebastiani Fabrizio
Gao Wei
Publication date: 24/11/2015

Datasets used for the following SNAM paper:
---------------------------------------------------------------------------------------------------
Title: From Classification to Quantification in Tweet Sentiment Analysis
Authors: Wei Gao and Fabrizio Sebastiani
Organization: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
---------------------------------------------------------------------------------------------------

[Content]

* SemEval2013, SemEval2014, SemEval2015 datasets:
  - semeval.train.feature.txt: Training set for learning sentiment models at development stage
  - semeval.dev.feature.txt: Held-out set for tuning parameters
  - semeval.train+dev.feature.txt: Training set for learning the final sentiment model
  - semeval13.test.feature.txt: SemEval2013 test set
  - semeval14.test.feature.txt: SemEval2014 test set
  - semeval15.test.feature.txt: SemEval2015 test set

* Other datasets: semeval2016, sanders, sst, omd, hcr, gasp, wa, wb
  - X.train.feature.txt: Training set for learning sentiment models at development stage
  - X.dev.feature.txt: Held-out set for tuning parameters
  - X.train+dev.feature.txt: Training set for learning the final sentiment model
  - X.test.feature.txt (or X.dev-test.feature.txt for semeval2016 only): Test set
  where X is one of semeval2016, sanders, sst, omd, hcr and gasp.

* Training files are saved in ./data/train directory, and held-out and test files are in ./data/test directory

For more details, please refer to the paper.

[Citation]

You can cite the following paper when referring to the dataset:

@article{gao2016classification,
  title={From classification to quantification in tweet sentiment analysis},
  author={Gao, Wei and Sebastiani, Fabrizio},
  journal={Social Network Analysis and Mining},
  volume={6},
  number={1},
  pages={19},
  year={2016},
  publisher={Springer}
}
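
The naming scheme described in the dataset notes can be sketched as a small path helper. This is only an illustration: the function name expected_files and the root argument are hypothetical, and placing the train+dev file in the train directory is an assumption based on its description as a training set. The internal format of the feature files is not described in the listing, so no parsing is attempted.

```python
from pathlib import PurePosixPath


def expected_files(name, root="data"):
    """Build the file paths one would expect for dataset `name`.

    Assumption: train and train+dev files live under <root>/train,
    held-out (dev) and test files under <root>/test.
    """
    train_dir = PurePosixPath(root) / "train"
    test_dir = PurePosixPath(root) / "test"
    # semeval2016 is the only dataset whose test file is named "dev-test"
    test_tag = "dev-test" if name == "semeval2016" else "test"
    return {
        "train": train_dir / f"{name}.train.feature.txt",
        "dev": test_dir / f"{name}.dev.feature.txt",
        "train+dev": train_dir / f"{name}.train+dev.feature.txt",
        "test": test_dir / f"{name}.{test_tag}.feature.txt",
    }


if __name__ == "__main__":
    for split, path in expected_files("sanders").items():
        print(split, path)
```

The SemEval2013-2015 group does not follow this pattern exactly (it shares one semeval.* training set and has three separately named test files), so it would need its own handling.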

ZENODO

A Critical Reassessment of the Saerens-Latinne-Decaestecker Algorithm for Posterior Probability Adjustment

Author: Molinari Alessio
Sebastiani Fabrizio
Esuli Andrea
Publication date: 20/12/2020

We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., difference in the distribution of the priors between the training and the unlabelled documents. Given a machine learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates them both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and SLD is still considered a top contender when we need to estimate the priors (a task that has become known as “quantification”). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive
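
The mutually recursive update described in the abstract can be sketched as follows. This is a minimal pure-Python rendering of the commonly cited expectation-maximization form of SLD, not the code used in the paper; the function name sld, the stopping rule, and the default tolerance are assumptions made for illustration, and all training priors are assumed to be strictly positive.

```python
def sld(posteriors, train_priors, max_iter=1000, tol=1e-8):
    """Iteratively re-estimate class priors and adjust posteriors.

    posteriors:   one row per unlabelled document, each row holding the
                  classifier's posterior probability for every class.
    train_priors: class prevalence in the training set (all > 0).
    Returns (estimated_priors, adjusted_posteriors).
    """
    n_classes = len(train_priors)
    priors = list(train_priors)
    adjusted = [list(row) for row in posteriors]
    for _ in range(max_iter):
        # Rescale the ORIGINAL posteriors by the ratio between the current
        # prior estimate and the training prior, then renormalize each row.
        adjusted = []
        for row in posteriors:
            scaled = [row[c] * priors[c] / train_priors[c]
                      for c in range(n_classes)]
            total = sum(scaled)
            adjusted.append([s / total for s in scaled])
        # The new prior estimate is the mean adjusted posterior per class.
        new_priors = [sum(r[c] for r in adjusted) / len(adjusted)
                      for c in range(n_classes)]
        converged = max(abs(a - b) for a, b in zip(new_priors, priors)) < tol
        priors = new_priors
        if converged:
            break
    return priors, adjusted


if __name__ == "__main__":
    docs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
    est_priors, est_posteriors = sld(docs, [0.5, 0.5])
    print(est_priors)
```

Because the loop is capped by max_iter, the sketch also returns when the estimates have not stabilized, which is the situation the abstract reports for many classes or non-calibrated classifiers.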

ZENODO

Archivio della Ricerca - Università di Pisa

MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts

Author: Moreo Alejandro
Corbara Silvia
Sebastiani Fabrizio
Tavoni Mirko
Publication date: 01/01/2022

We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets, we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a text of unknown authorship was written by a candidate author. We also make available the source code of the authorship verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other researchers. We also describe the application of the above authorship verification system, using these datasets as training data, for investigating the authorship of two medieval epistles whose authorship has been disputed by scholars.

Crossref

Archivio istituzionale della Ricerca - Scuola Normale Superiore

The Epistle to Cangrande Through the Lens of Computational Authorship Verification

Author: Moreo Alejandro
Corbara Silvia
Sebastiani Fabrizio
Tavoni Mirko
Publication date: 01/01/2019

The Epistle to Cangrande is one of the most controversial among the works of Italian poet Dante Alighieri. For more than a hundred years now, scholars have been debating over its real paternity, i.e., whether it should be considered a true work by Dante or a forgery by an unnamed author. In this work we address this philological problem through the methodologies of (supervised) Computational Authorship Verification, by training a classifier that predicts whether a given work is by Dante Alighieri or not. We discuss the system we have set up for this endeavour, the training set we have assembled, the experimental results we have obtained, and some issues that this work leaves open

Crossref

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Explainable authorship identification in cultural heritage applications

Author: Corbara Silvia
Sebastiani Fabrizio
Monreale Anna
Moreo Alejandro
Setzu Mattia
Publication date: 01/01/2024

While a substantial amount of work has recently been devoted to improving the accuracy of computational Authorship Identification (AId) systems for textual data, little to no attention has been paid to endowing AId systems with the ability to explain the reasons behind their predictions. This substantially hinders the practical application of AId methods, since the predictions returned by such systems are hardly useful unless they are supported by suitable explanations. In this article, we explore the applicability of existing general-purpose eXplainable Artificial Intelligence (XAI) techniques to AId, with a focus on explanations addressed to scholars working in cultural heritage. In particular, we assess the relative merits of three different types of XAI techniques (feature ranking, probing, factual and counterfactual selection) on three different AId tasks (authorship attribution, authorship verification and same-authorship verification) by running experiments on real AId textual data. Our analysis shows that, while these techniques make important first steps towards XAI, more work remains to be done to provide tools that can be profitably integrated into the workflows of scholars

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Preferential text classification: learning algorithms and evaluation measures

Author: CARDIN R
SEBASTIANI FABRIZIO
SPERDUTI ALESSANDRO
AIOLLI FABIO
Publication date: 01/01/2009

In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well established classification problems, such as single and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document d_i, category c' is preferred to category c''"; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.

Archivio istituzionale della ricerca - Università di Padova

L’Epistola a Cangrande al vaglio della Computational Authorship Verification: risultati preliminari (con una postilla sulla cosiddetta “XIV Epistola di Dante Alighieri”)

Author: Corbara Silvia
Sebastiani Fabrizio
Moreo Alejandro
Tavoni Mirko
Publication date: 01/01/2020

In this work we apply techniques from computational Authorship Verification (AV) to the problem of detecting whether the “Epistle to Cangrande” is an authentic work by Dante Alighieri or is instead the work of a forger. The AV algorithm we use is based on “machine learning”: the algorithm “trains” an automatic system (a “classifier”) to detect whether a certain Latin text is Dante’s or not Dante’s, by exposing it to a corpus of example Latin texts by Dante and example Latin texts by authors coeval to Dante. The detection is based on the analysis of a set of stylometric features, i.e., style-related linguistic traits whose usage frequencies tend to represent an author’s unconscious “signature”. The analysis carried out in this work suggests that, of the two parts into which the Epistle is traditionally subdivided, neither is Dante’s. Experiments in which we have applied our AV system to each text in the corpus suggest that the system has a fairly high degree of accuracy, thus lending credibility to its hypothesis about the authorship of the Epistle. In the last section of this paper we apply our system to what has been hypothesized to be “Dante’s 14th Epistle”; the system rejects, with very high confidence, the hypothesis that this epistle might be Dante’s.

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Preference Learning for Category-Ranking based Interactive Text Categorization

Author: Sebastiani Fabrizio
Sperduti Alessandro
Aiolli Fabio
Publication date: 01/01/2007

Category Ranking is a variant of the multi-label classification problem, in which, rather than performing a (hard) assignment to an object of categories from a predefined set, we rank all categories according to their estimated "degree of suitability" to the object. Category ranking has many applications, all pertaining to "interactive" classification contexts in which the system, rather than taking a final categorization decision, is simply required to support a human expert who is in charge of taking this decision. Despite its high applicative potential in information retrieval applications, and in text categorization in particular, category ranking has mainly been tackled by standard text categorization methods. In this paper, we take a radically different stand to category ranking, i.e. one in which supervision is provided to the learner not in the standard form of labels attached to training documents, but in the form of preferences of type "category c is to be preferred to category c2 for document d". We apply to this problem a recently proposed, very general model for preferential learning, and show, through experiments performed on the standard Reuters-21578 benchmark, that this largely outperforms support vector machines, the learning method which has up to now proved the best-performing one in text categorization comparative experiments
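
The two ideas in this abstract, ranking all categories by an estimated degree of suitability and expressing supervision as pairwise preferences, can be illustrated with a short sketch. It only shows the data representation; it is not the preferential learning model used in the paper, and the function names, scores, and category names are hypothetical.

```python
def rank_categories(scores):
    """Return all categories ordered from most to least suitable."""
    return sorted(scores, key=scores.get, reverse=True)


def violated_preferences(scores, preferences):
    """List the (preferred, other) pairs that the scores fail to respect.

    Each preference states that, for the document being scored, the first
    category should be ranked above the second one.
    """
    return [(preferred, other) for preferred, other in preferences
            if scores[preferred] <= scores[other]]


if __name__ == "__main__":
    # Hypothetical suitability scores for one document.
    scores = {"grain": 0.9, "wheat": 0.6, "trade": 0.1}
    print(rank_categories(scores))
    print(violated_preferences(scores, [("grain", "trade"), ("trade", "wheat")]))
```

A learner trained on preferences would adjust its scoring function so that the list of violated pairs shrinks over the training documents.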

Crossref

Archivio istituzionale della ricerca - Università di Padova

Same or Different? Diff-Vectors for Authorship Analysis

Author: Moreo Alejandro
Corbara Silvia
Sebastiani Fabrizio
Publication date: 01/01/2023

Crossref

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Discretizing continuous attributes in AdaBoost for text categorization

Author: Nardiello Pio
Sebastiani Fabrizio
Sperduti Alessandro
Publication date: 01/01/2003

We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, ADABOOST.MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multilabel text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
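
As a rough illustration of entropy-based discretization, the sketch below finds the single cut point on one continuous attribute (for example a term weight) that minimizes the weighted class entropy of the two resulting intervals, and then maps the attribute to a binary value. It is a simplified stand-in, not the specific discretization methods evaluated in the paper; the function names and the single-cut restriction are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def best_cut(values, labels):
    """Return the threshold minimizing the weighted entropy of the split.

    Candidate thresholds are midpoints between consecutive distinct
    attribute values; returns None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_threshold = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_score = score
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold


def binarize(values, threshold):
    """Map each continuous value to 1 if above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


if __name__ == "__main__":
    weights = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
    classes = [0, 0, 0, 1, 1, 1]
    cut = best_cut(weights, classes)
    print(cut, binarize(weights, cut))
```

Applying the same search recursively inside each interval, with a stopping criterion, would produce more than two intervals per attribute.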

Crossref

Archivio istituzionale della ricerca - Università di Padova