Datasets of the article "From Classification to Quantification in Tweet Sentiment Analysis"
Datasets used for the following SNAM paper:
---------------------------------------------------------------------------------------------------
Title: From Classification to Quantification in Tweet Sentiment Analysis
Authors: Wei Gao and Fabrizio Sebastiani
Organization: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
---------------------------------------------------------------------------------------------------
[Content]
* SemEval2013, SemEval2014, SemEval2015 datasets:
- semeval.train.feature.txt: Training set for learning sentiment models at development stage
- semeval.dev.feature.txt: Held-out set for tuning parameters
- semeval.train+dev.feature.txt: Training set for learning the final sentiment model
- semeval13.test.feature.txt: SemEval2013 test set
- semeval14.test.feature.txt: SemEval2014 test set
- semeval15.test.feature.txt: SemEval2015 test set
* Other datasets: semeval2016, sanders, sst, omd, hcr, gasp, wa, wb
- X.train.feature.txt: Training set for learning sentiment models at development stage
- X.dev.feature.txt: Held-out set for tuning parameters
- X.train+dev.feature.txt: Training set for learning the final sentiment model
- X.test.feature.txt (or X.dev-test.feature.txt for semeval2016 only): Test set
where X is one of semeval2016, sanders, sst, omd, hcr and gasp.
* Training files are saved in ./data/train directory, and held-out and test files are in ./data/test directory
For more details, please refer to the paper.
[Citation]
You can cite the following paper when referring to the dataset:
@article{gao2016classification,
title={From classification to quantification in tweet sentiment analysis},
author={Gao, Wei and Sebastiani, Fabrizio},
journal={Social Network Analysis and Mining},
volume={6},
number={1},
pages={19},
year={2016},
publisher={Springer}
}
A Critical Reassessment of the Saerens-Latinne-Decaestecker Algorithm for Posterior Probability Adjustment
We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., a difference in the distribution of the priors between the training and the unlabelled documents. Given a machine-learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates them both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and SLD is still considered a top contender when we need to estimate the priors (a task that has become known as “quantification”). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.
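The mutually recursive update described in this abstract can be illustrated with a minimal sketch of the expectation-maximization formulation usually associated with SLD. This is not the authors' experimental code: the function name `sld`, the parameters `train_priors`, `max_iter` and `tol`, and the L1-based stopping rule are all assumptions made for illustration.

```python
import numpy as np


def sld(posteriors, train_priors, max_iter=1000, tol=1e-6):
    """Sketch of an SLD-style EM update (hypothetical helper, not the paper's code).

    posteriors:   (n_docs, n_classes) array of classifier posteriors on unlabelled data
    train_priors: (n_classes,) array of class prevalences in the training set
    Returns the estimated priors on the unlabelled set and the adjusted posteriors.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    train_priors = np.asarray(train_priors, dtype=float)
    priors = train_priors.copy()          # start from the training priors
    adjusted = posteriors.copy()
    for _ in range(max_iter):
        # E-step: rescale each posterior by the ratio of current to training priors,
        # then renormalize so that every row sums to 1.
        scaled = posteriors * (priors / train_priors)
        adjusted = scaled / scaled.sum(axis=1, keepdims=True)
        # M-step: the new prior estimate is the mean adjusted posterior per class.
        new_priors = adjusted.mean(axis=0)
        # Assumed stopping rule: L1 change in the priors below `tol`.
        converged = np.abs(new_priors - priors).sum() < tol
        priors = new_priors
        if converged:
            break
    return priors, adjusted


if __name__ == "__main__":
    toy_posteriors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
    est_priors, adj = sld(toy_posteriors, train_priors=[0.5, 0.5])
    print(est_priors)
    print(adj)
```

The `max_iter` cap matters in practice: as the abstract notes, the iteration may converge slowly or not at all, so a sketch like this should not rely on the tolerance test alone to terminate.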
MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research
on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively,
labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and
treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis
tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets, we provide
experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a
text of unknown authorship was written by a candidate author. We also make available the source code of the authorship
verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other
researchers. We also describe the application of the above authorship verification system, using these datasets as training
data, for investigating the authorship of two medieval epistles whose authorship has been disputed by scholars.
The Epistle to Cangrande Through the Lens of Computational Authorship Verification
The Epistle to Cangrande is one of the most controversial among the works of Italian poet Dante Alighieri. For more than a hundred years now, scholars have been debating over its real paternity, i.e., whether it should be considered a true work by Dante or a forgery by an unnamed author. In this work we address this philological problem through the methodologies of (supervised) Computational Authorship Verification, by training a classifier that predicts whether a given work is by Dante Alighieri or not. We discuss the system we have set up for this endeavour, the training set we have assembled, the experimental results we have obtained, and some issues that this work leaves open.
Explainable authorship identification in cultural heritage applications
While a substantial amount of work has recently been devoted to improving the accuracy of computational Authorship Identification (AId) systems for textual data, little to no attention has been paid to endowing AId systems with the ability to explain the reasons behind their predictions. This substantially hinders the practical application of AId methods, since the predictions returned by such systems are hardly useful unless they are supported by suitable explanations. In this article, we explore the applicability of existing general-purpose eXplainable Artificial Intelligence (XAI) techniques to AId, with a focus on explanations addressed to scholars working in cultural heritage. In particular, we assess the relative merits of three different types of XAI techniques (feature ranking, probing, factual and counterfactual selection) on three different AId tasks (authorship attribution, authorship verification and same-authorship verification) by running experiments on real AId textual data. Our analysis shows that, while these techniques make important first steps towards XAI, more work remains to be done to provide tools that can be profitably integrated into the workflows of scholars.
Preferential text classification: learning algorithms and evaluation measures
In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single and/or multi-label multiclass classification; state-of-the-art learning technologies such as SVMs and kernel-based methods are used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document d_i, category c' is preferred to category c''"; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
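To make the "preferential form" concrete, the following minimal sketch turns one document's primary and secondary categories into preference pairs of the kind quoted in this abstract. The helper name `preference_pairs` and the specific ordering used (primary over secondary, and both over unassigned categories) are assumptions for illustration; the abstract does not spell out which pairs the authors generate.

```python
def preference_pairs(primary, secondary, all_categories):
    """Return (preferred, less_preferred) category pairs for one document.

    Assumed ordering (hypothetical, for illustration only):
    primary categories > secondary categories > all remaining categories.
    """
    others = [c for c in all_categories
              if c not in primary and c not in secondary]
    pairs = [(p, s) for p in primary for s in secondary]    # primary beats secondary
    pairs += [(p, o) for p in primary for o in others]      # primary beats unassigned
    pairs += [(s, o) for s in secondary for o in others]    # secondary beats unassigned
    return pairs


if __name__ == "__main__":
    # Toy document with one primary and one secondary category.
    print(preference_pairs(["A"], ["B"], ["A", "B", "C"]))
```

A learner trained on such pairs sees the primary/secondary distinction directly during training, which is the point the abstract makes about differentiating the two kinds of categories in the learning phase.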
L’Epistola a Cangrande al vaglio della Computational Authorship Verification: risultati preliminari (con una postilla sulla cosiddetta “XIV Epistola di Dante Alighieri”) [The Epistle to Cangrande under the scrutiny of Computational Authorship Verification: preliminary results (with a postscript on the so-called “14th Epistle of Dante Alighieri”)]
In this work we apply techniques from computational Authorship Verification (AV) to the problem of detecting whether the “Epistle to Cangrande” is an authentic work by Dante Alighieri or is instead the work of a forger. The AV algorithm we use is based on “machine learning”: the algorithm “trains” an automatic system (a “classifier”) to detect whether a certain Latin text is Dante’s or not Dante’s, by exposing it to a corpus of example Latin texts by Dante and example Latin texts by authors coeval to Dante. The detection is based on the analysis of a set of stylometric features, i.e., style-related linguistic traits whose usage frequencies tend to represent an author’s unconscious “signature”. The analysis carried out in this work suggests that, of the two parts into which the Epistle is traditionally subdivided, neither is Dante’s. Experiments in which we have applied our AV system to each text in the corpus suggest that the system has a fairly high degree of accuracy, thus lending credibility to its hypothesis about the authorship of the Epistle. In the last section of this paper we apply our system to what has been hypothesized to be “Dante’s 14th Epistle”; the system rejects, with very high confidence, the hypothesis that this epistle might be Dante’s.
Preference Learning for Category-Ranking based Interactive Text Categorization
Category Ranking is a variant of the multi-label classification problem, in which, rather than performing a (hard) assignment to an object of categories from a predefined set, we rank all categories according to their estimated "degree of suitability" to the object. Category ranking has many applications, all pertaining to "interactive" classification contexts in which the system, rather than taking a final categorization decision, is simply required to support a human expert who is in charge of taking this decision. Despite its high applicative potential in information retrieval applications, and in text categorization in particular, category ranking has mainly been tackled by standard text categorization methods. In this paper, we take a radically different stance on category ranking, i.e., one in which supervision is provided to the learner not in the standard form of labels attached to training documents, but in the form of preferences of type "category c is to be preferred to category c2 for document d". We apply to this problem a recently proposed, very general model for preferential learning, and show, through experiments performed on the standard Reuters-21578 benchmark, that this largely outperforms support vector machines, the learning method which has up to now proved the best-performing one in text categorization comparative experiments.
Discretizing continuous attributes in AdaBoost for text categorization
We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, ADABOOST.MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multilabel text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far.
A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations.
In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
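As a rough illustration of what entropy-based discretization involves, the sketch below picks a single cut point for one continuous term weight by minimizing the size-weighted class entropy of the two resulting intervals, then maps the weights to a binary attribute. This is a deliberate simplification and not the procedure used in the paper: the helper names `entropy`, `best_cut` and `binarize` are hypothetical, and the abstract does not say which entropy-based method, stopping criterion, or number of intervals the authors adopt.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return the threshold minimizing the weighted entropy of the two sides.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_threshold, best_score = (lo + hi) / 2, score
    return best_threshold


def binarize(values, threshold):
    """Map continuous weights to a binary attribute (1 if above the threshold)."""
    return [1 if v > threshold else 0 for v in values]


if __name__ == "__main__":
    weights = [0.0, 0.1, 0.2, 0.8, 0.9]   # toy term weights for five documents
    classes = [0, 0, 0, 1, 1]             # toy class membership
    cut = best_cut(weights, classes)
    print(cut, binarize(weights, cut))
```

Applying the cut recursively to each side would yield several intervals per term, each of which could then be exposed to a boosting learner as its own binary attribute.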
