From bachelor's to master's degree: the “brain drain” from the Mezzogiorno of Italy continues
Looking at the geography of the mobility of Southern Italian students over the academic years 2014/15 to 2017/18 in the transition from the bachelor's to the master's degree, Massimo Attanasio, Marco Enea and Alessandro Albano find that the outflow, already evident in the transition from high school to university, continues afterwards: universities in the Mezzogiorno keep losing potential enrolments to universities in the Centre-North
A two-stage LDA algorithm for ranking induced topic readability
Probabilistic topic models, such as LDA, are standard text analysis algorithms that provide predictive and latent topic representation for a corpus. However, due to the unsupervised training process, it is difficult to verify the assumption that the latent space discovered by these models is generally meaningful and valuable.
This paper introduces a two-stage LDA algorithm to estimate latent topics in text documents and use readability scores to link the identified topics to a linguistically motivated latent structure. We define a new interpretative tool called induced topic readability, which is used to rank topics from the one with the most complex linguistic structure to the one with the lowest semantic content readily. The usefulness of our method is shown with an application to real data, using articles from the New York Times
Statistically Validated Networks for assessing topic quality in LDA models
Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. In this framework, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned from the model may be characterized only by a set of irrelevant or unrelated words, making them useless for interpretation. Although many topic-quality metrics have been proposed (Newman et al., 2009; Aletras and Stevenson, 2013; Röder et al., 2015; Nikolenko et al., 2017), the automatic evaluation of the coherence of topics remains an open research area. The main contributions of this paper are: i) to define a coherence measure (SVN-Coherence) based on a rigorous statistical model that approximates human ratings better than state-of-the-art methods, and ii) to filter out marginal associations of words and facilitate the graphical representation and interpretation of the obtained topics through Statistically Validated Networks (SVN) (Tumminello et al., 2011). Specifically, the method builds a co-occurrence network for each topic whose most probable words are the nodes. We set a link between two nodes (words) in each network if their co-occurrences are statistically significant. The Hypergeometric distribution describes the probability mass function under the null hypothesis and models the probability of co-occurrence between words conditional on their marginals. Indeed, it allows taking into account the heterogeneity of the vocabulary across a collection of texts.
Finally, we derive a global measure of coherence for each topic by considering the number of statistically validated links, the strength of the association between word pairs, and the relative relevance of each word in the topic. We claim that these links carry relevant information about the structure of topics, i.e., the more connected the network, the more semantically coherent the corresponding topic. The new measure provides a coherence-based ranking that distinguishes between high-quality and low-quality topics. We designed a survey to obtain human judgments, which we use as ground truth, to compare our method with state-of-the-art coherence measures. Specifically, we asked 222 PhD students to evaluate the coherence of 32 topics (extracted from the New York Times articles dataset) on a 4-point scale. The results show that the proposed SVN-Coherence substantially outperforms all the state-of-the-art coherence metrics
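The link-validation step described in this abstract can be illustrated with a minimal sketch. It assumes sentence-level counts: `n_sentences` sentences in total, `n_a` and `n_b` sentences containing each word, and `n_ab` containing both. The function names and the Bonferroni threshold are illustrative assumptions, not taken from the paper.

```python
from math import comb


def cooccurrence_pvalue(n_sentences, n_a, n_b, n_ab):
    """One-sided hypergeometric p-value P(X >= n_ab).

    X counts the sentences containing both words under the null
    hypothesis of random co-occurrence, given the marginals n_a and n_b.
    """
    upper = min(n_a, n_b)
    total = comb(n_sentences, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_sentences - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / total


def is_validated(p_value, alpha=0.05, n_tests=1):
    """Assumed multiple-testing rule: Bonferroni-corrected threshold."""
    return p_value < alpha / n_tests


# Two words that each occur in 3 of 10 sentences, always together.
p = cooccurrence_pvalue(n_sentences=10, n_a=3, n_b=3, n_ab=3)
print(p, is_validated(p, alpha=0.05, n_tests=5))
```

A link between the two words would be kept only if the test passes; how the validated links are then combined into a coherence score is specific to the paper and is not reproduced here.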
Statistically Validated Network approach for document clustering and topic modeling
In machine learning, document clustering and topic modeling are scientific challenges
concerning the extraction of useful information from a collection of texts. Traditional
approaches, such as Latent Dirichlet Allocation (LDA), rely on maximising likelihood
functions. In this paper, we explore a paradigm shift towards network representation
of textual data and the associated challenges of community detection [3]. We
propose a new method to tackle the tasks of document clustering and topic modeling,
representing a collection of documents as a bipartite network. Then, we introduce the
application of Statistically Validated Networks (SVN) to filter out irrelevant connections
within the projected networks of words and documents. The SVN method is
promising in the framework of topic modeling. For instance, Simonetti et al. (2022) recently proposed a new application of SVN to measure the coherence of topics.
Instead, we aim to identify the topics themselves. By doing so, we can naturally find topics
with high coherence according to the measure proposed by the authors. Moreover, the
modularity contribution of each community (topic) can be interpreted as a measure of
coherence since it is an intensive quantity that assesses the tendency of words within a
given topic to occur in the same sentences jointly
MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS
Topic models arise from the need to understand and explore large text
document collections and to predict their underlying structure. Latent Dirichlet
Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular
text modelling techniques. The idea is that documents are represented as random
mixtures over latent topics, where a distribution over words characterizes each topic.
Unfortunately, topic models give no guarantee of the interpretability of their outputs.
The topics learned from texts may be characterized by a set of irrelevant or
unrelated words. Therefore, topic models require validation of the coherence of
estimated topics. However, the automatic evaluation of the latent space of a topic
model is a difficult task. Formerly, the most used metric for evaluating the quality of
a topic model was the held-out likelihood. Still, the literature has shown that this
method emphasizes complexity rather than interpretability. Although many
procedures were recently proposed (Röder et al., 2015), the automatic evaluation of
topic coherence remains an open research area. Our work aims to provide a new
technique based on Statistically Validated Networks (Tumminello et al., 2011). Our
approach consists in representing each topic as a network of its most probable
words. The presence of a link between each pair of words is assessed by statistically
validating their co-occurrences in sentences against the null hypothesis of random
co-occurrence. Thus, we propose a new coherence measure based on the structure of
the statistically validated network. Furthermore, the new measure provides a ranking
of topics and distinguishes high-quality from low-quality topics. The intuition is that
the pairwise association of words is closely related to the semantic coherence and
interpretability of a topic
Supervised vs Unsupervised Latent Dirichlet Allocation: topic detection in lyrics.
Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to a particular topic. It builds a fixed number of topics starting from the words in each document, modeled according to a Dirichlet distribution. In this work we apply LDA to a set of songs by four famous Italian songwriters and split them into topics. The work studies the use of themes in lyrics, using statistical analysis to detect topics. The aim of the work is to underline the main limits of standard unsupervised LDA and to propose a supervised extension based on Correspondence Analysis (CA) association theory
Distance-based aggregation and consensus for preference-approvals
This paper proposes a distance-based aggregation and consensus method for preference-approvals, a type of preference data where individuals provide a list of approved alternatives in addition to a strict ranking. The proposed method aims to synthesize individual preference-approvals into a unified consensus representing the group's collective view. The consensus is the preference-approval, which minimizes the average distance with the whole set of voters. The proposed method has potential applications in group decision-making, recommendation systems, and social choice theory
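As a rough illustration of the idea of a consensus that minimizes the average distance to the voters, here is a brute-force sketch for a handful of alternatives. The encoding (a ranking tuple plus the number `k` of approved top-ranked alternatives) and the distance (an equal mix of a normalized Kendall distance on rankings and a normalized Hamming distance on approval sets, controlled by a hypothetical `lam` parameter) are assumptions made for this example, not the definitions used in the paper.

```python
from itertools import combinations, permutations


def kendall(r1, r2):
    """Number of pairs of alternatives ordered differently by r1 and r2."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    return sum(
        (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
        for a, b in combinations(r1, 2)
    )


def distance(pa1, pa2, lam=0.5):
    """Assumed distance: weighted mix of ranking and approval disagreement."""
    (r1, k1), (r2, k2) = pa1, pa2
    m = len(r1)
    d_rank = kendall(r1, r2) / (m * (m - 1) / 2)
    d_appr = len(set(r1[:k1]) ^ set(r2[:k2])) / m
    return lam * d_rank + (1 - lam) * d_appr


def consensus(voters, lam=0.5):
    """Exhaustive search for the preference-approval with minimum average distance."""
    items = voters[0][0]
    candidates = [
        (r, k) for r in permutations(items) for k in range(len(items) + 1)
    ]
    return min(
        candidates,
        key=lambda c: sum(distance(c, v, lam) for v in voters) / len(voters),
    )


voters = [(("a", "b", "c"), 1), (("a", "b", "c"), 2), (("b", "a", "c"), 1)]
print(consensus(voters))
```

Exhaustive search grows factorially with the number of alternatives, so this is only practical for very small examples.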
Ensemble methods for item-weighted label ranking: a comparison
Label Ranking (LR), an emerging non-standard supervised classification problem, aims at training preference models that order a finite set of labels based on a set of predictor features. Traditional LR models regard all labels as equally important. However, in many cases, failing to predict the ranking position of a highly relevant label can be considered more severe than failing to predict a trivial one. Moreover, an efficient LR classifier should be able to take into account the similarity between the items to be ranked. Indeed, swapping two similar elements should be penalized less than swapping two dissimilar ones. The contribution of the present paper is to formulate more flexible item-weighted label ranking models based on three well-known decision tree ensemble methods: bagging, random forest and boosting. The three proposed weighted LR classifiers encode the similarity structure and the individual label importance provided by a domain expert. The predictive performances of the three algorithms are compared, through simulations, to determine which ensemble procedure produces the best results for different noise levels and weight sets
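To make the notion of item weights and similarities concrete, the sketch below computes a weighted Kendall-type distance between a true and a predicted ranking. The penalty for each discordant pair, the product of the two item weights optionally scaled by a pairwise dissimilarity, is an assumption chosen for illustration and may differ from the loss used in the paper.

```python
from itertools import combinations


def weighted_kendall(rank_true, rank_pred, weights, dissim=None):
    """Sum of penalties over label pairs ordered differently in the two rankings.

    rank_true[i] and rank_pred[i] are the positions assigned to label i.
    weights[i] is the importance of label i; dissim[i][j], if given,
    is the dissimilarity between labels i and j (values in [0, 1]).
    """
    total = 0.0
    for i, j in combinations(range(len(rank_true)), 2):
        discordant = (rank_true[i] - rank_true[j]) * (rank_pred[i] - rank_pred[j]) < 0
        if discordant:
            scale = 1.0 if dissim is None else dissim[i][j]
            total += weights[i] * weights[j] * scale
    return total


weights = [2.0, 1.0, 1.0]  # label 0 matters most
print(weighted_kendall([1, 2, 3], [2, 1, 3], weights))  # swap involving label 0
print(weighted_kendall([1, 2, 3], [1, 3, 2], weights))  # swap of two minor labels
```

With these weights, misplacing the important label costs more than swapping the two minor ones, which is the behaviour the abstract asks of an item-weighted classifier.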
Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches
Probabilistic topic models are machine learning tools for processing and understanding
large text document collections. Among the different models in the literature, Latent
Dirichlet Allocation (LDA) has turned out to be the benchmark of the topic modelling community. The key idea is to represent text documents as random mixtures
over latent semantic structures called topics. Each topic follows a multinomial distribution over the vocabulary words. In order to understand the result of a topic model, researchers usually select the top-n words (the essential words) with the highest probability
given a topic and look for meaningful and interpretable semantic themes.
This work proposes a new method for exploring topics in LDA models, using Statistically Validated Networks (SVNs). The main idea of the proposed method is to consider co-occurrence between essential words as a measure of association. Two different
approaches, called undirected and directed, are proposed. In the undirected approach, the symmetrical
association between two words is taken into account, i.e. how many times two words are
found in the same sentence. Conversely, in the directed approach, the order in which
the words appear in the sentence is also considered.
We use hypothesis testing to assess whether the co-occurrence between two words can
be attributed to chance or whether these links carry relevant information about the structure of topics. Specifically, textual data is represented as a bipartite network in which one set of nodes consists of sentences, and the other set of nodes consists of the essential words associated with a given topic. A link between a word and a sentence is set if the
word belongs to that sentence. Therefore, the projection of the bipartite network on the
set of words results in a word-co-occurrence network.
Note that the directed approach produces a directed network, while the undirected one
produces an undirected network. Indeed, a directed link from one word to another may be
validated while the reverse link is not. The two methods are applied to a real dataset,
highlighting the differences
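The projection of the sentence-word bipartite network onto the set of words can be sketched as follows. Tokenization by whitespace and the rule used for direction (a link from `a` to `b` when `a` first appears before `b` in the sentence) are simplifying assumptions for this example; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, essential_words):
    """Count sentence-level co-occurrences of essential words.

    Returns two Counters: undirected counts keyed by alphabetically
    sorted pairs, and directed counts keyed by (earlier, later) pairs.
    """
    vocab = set(essential_words)
    undirected, directed = Counter(), Counter()
    for sentence in sentences:
        tokens = [w for w in sentence.lower().split() if w in vocab]
        # Unique essential words in order of first appearance.
        ordered = list(dict.fromkeys(tokens))
        for a, b in combinations(sorted(ordered), 2):
            undirected[(a, b)] += 1
        for a, b in combinations(ordered, 2):
            directed[(a, b)] += 1
    return undirected, directed


sentences = ["stock market fell", "market stock rose", "stock price"]
und, dire = cooccurrence_counts(sentences, ["stock", "market", "price"])
print(und)
print(dire)
```

The raw counts are only the input to the statistical validation: each undirected or directed count would then be tested against the null hypothesis of random co-occurrence before a link is kept.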
A comparison of ensemble algorithms for item-weighted Label Ranking
Label Ranking (LR) is a non-standard supervised classification method
with the aim of ranking a finite collection of labels according to a set of predictor
variables. Traditional LR models assume indifference among alternatives. However,
misassigning the ranking position of a highly relevant label is frequently regarded
as more severe than failing to predict a trivial label. Moreover, switching two similar
alternatives should be considered less severe than switching two different ones.
Therefore, efficient LR classifiers should be able to take into account the similarities
and individual weights of the items to be ranked. The contribution of this paper is
to formulate and compare flexible item-weighted Label Ranking algorithms using
bagging, random forest, and boosting ensemble methods
