From bachelor's to master's degrees: the "brain drain" from Southern Italy continues

Author: Massimo Attanasio
Marco Enea
Alessandro Albano
Publication date: 01/01/2019

Looking at the geography of southern Italian students' mobility in the transition from bachelor's to master's degrees over the academic years 2014/15 to 2017/18, Massimo Attanasio, Marco Enea and Alessandro Albano find that the drain, already evident in the transition from secondary school to university, continues afterwards: universities in the South keep losing potential enrolments to universities in the Centre-North.

Archivio istituzionale della ricerca - Università di Palermo

A two-stage LDA algorithm for ranking induced topic readability

Author: Mariangela Sciandra
Alessandro Albano
Publication date: 2022

Probabilistic topic models, such as LDA, are standard text analysis algorithms that provide predictive and latent topic representation for a corpus. However, due to the unsupervised training process, it is difficult to verify the assumption that the latent space discovered by these models is generally meaningful and valuable. This paper introduces a two-stage LDA algorithm to estimate latent topics in text documents and use readability scores to link the identified topics to a linguistically motivated latent structure. We define a new interpretative tool called induced topic readability, which is used to rank topics from the one with the most complex linguistic structure to the one with the lowest semantic content readily. The usefulness of our method is shown with an application to real data, using articles from the New York Times

Archivio istituzionale della ricerca - Università di Palermo

Statistically Validated Networks for assessing topic quality in LDA models

Author: Andrea Simonetti
Alessandro Albano
Publication date: 01/01/2022

Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. In this framework, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has gained more and more popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned from the model may be characterized only by a set of irrelevant or unchained words, making them useless for interpretation. Although many topic-quality metrics have been proposed (Newman et al., 2009; Aletras and Stevenson, 2013; Röder et al., 2015; Nikolenko et al., 2017), the automatic evaluation of the coherence of topics remains an open research area. The main contributions of this paper are: i) to define a coherence measure (SVN-Coherence) based on a rigorous statistical model that approximates human ratings better than state-of-the-art methods, and ii) to filter out marginal associations of words and facilitate the graphical representation and interpretation of the obtained topics through Statistically Validated Networks (SVN) (Tumminello et al., 2011). Specifically, the method builds a co-occurrence network for each topic whose most probable words are the nodes. We set a link between two nodes (words) in each network if their co-occurrences are statistically significant. The Hypergeometric distribution describes the probability mass function under the null hypothesis and it models the probability of co-occurrence between words conditionally on their marginals. Indeed, it allows taking into account the heterogeneity of the vocabulary in a collection of texts.
Finally, we derive a global measure of coherence for each topic by considering the number of statistically validated links, the strength of the association between word pairs, and the relative relevance of each word in the topic. We claim that these links carry relevant information about the structure of topics, i.e., the more connected the network, the more semantically coherent the corresponding topic. The new measure provides a coherence-based ranking that distinguishes between high-quality and low-quality topics. We designed a survey to obtain human judgments, which we use as ground truth, to compare our method with the state-of-the-art coherence measures. Specifically, we asked 222 PhD students to evaluate the coherence of 32 topics (extracted from the New York Times articles dataset) on a 4-point scale. The results show that the proposed SVN-Coherence substantially outperforms all the state-of-the-art coherence metrics.

Archivio istituzionale della ricerca - Università di Palermo
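
The link-validation step described in the abstract above can be sketched as a one-sided hypergeometric test. This is a minimal illustration, not the authors' implementation: the function names, the sentence-level counts used as inputs, and the Bonferroni correction are all assumptions, since the abstract does not say which multiple-testing correction is applied.

```python
from math import comb


def cooccurrence_p_value(n_sentences, n_a, n_b, n_ab):
    """Probability of seeing at least n_ab co-occurrences by chance.

    Under the null hypothesis of random co-occurrence, the number of
    sentences containing both words follows a hypergeometric distribution
    conditioned on the marginals n_a and n_b (sentences containing each word).
    """
    upper = min(n_a, n_b)
    total = comb(n_sentences, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_sentences - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / total


def is_validated(p_value, n_tests, alpha=0.01):
    # Assumption: Bonferroni correction over all tested word pairs.
    return p_value < alpha / n_tests


# Two words that each appear in 3 of 10 sentences and always together.
p = cooccurrence_p_value(n_sentences=10, n_a=3, n_b=3, n_ab=3)
print(p, is_validated(p, n_tests=1))
```

A link between two words would be kept only when `is_validated` returns `True`; the threshold `alpha` and the number of tests are hypothetical parameters chosen for the example.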

Statistically Validated Network approach for document clustering and topic modeling

Author: Andrea Simonetti
Alessandro Albano
Publication date: 01/01/2023

In machine learning, document clustering and topic modeling are scientific challenges concerning the extraction of useful information from a collection of texts. Traditional approaches, such as Latent Dirichlet Allocation (LDA), rely on maximising likelihood functions. In this paper, we explore a paradigm shift towards network representation of textual data and the associated challenges of community detection [3]. We propose a new method to address the tasks of document clustering and topic modeling, representing a collection of documents as a bipartite network. Then, we introduce the application of Statistically Validated Networks (SVN) to filter out irrelevant connections within the projected networks of words and documents. The SVN method is promising in the framework of topic modeling. For instance, Simonetti et al. (2022) recently proposed a new application of SVN to measure the coherence of topics. Instead, we aim to identify the topics themselves. By doing so, we can naturally find topics with high coherence according to the measure proposed by the authors. Moreover, the modularity contribution of each community (topic) can be interpreted as a measure of coherence since it is an intensive quantity that assesses the tendency of words within a given topic to occur in the same sentences jointly

Archivio istituzionale della ricerca - Università di Palermo

MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS

Author: Andrea Simonetti
Alessandro Albano
Publication date: 01/01/2020

Topic models arise from the need to understand and explore large text document collections and to predict their underlying structure. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular text modelling techniques. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models give no guarantee on the interpretability of their outputs. The topics learned from texts may be characterized by a set of irrelevant or unchained words. Therefore, topic models require validation of the coherence of estimated topics. However, the automatic evaluation of the latent space of a topic model is a difficult task. Formerly, the most used metric for evaluating the quality of a topic model was the held-out likelihood. Still, the literature has shown that this method emphasizes complexity rather than interpretability. Although many procedures were recently proposed (Röder et al., 2015), the automatic evaluation of topic coherence remains an open research area. Our work aims to provide a new technique based on Statistically Validated Networks (Tumminello et al., 2011). Our approach consists in representing each topic as a network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrences in sentences against the null hypothesis of random co-occurrence. Thus, we propose a new coherence measure based on the structure of the statistically validated network. Furthermore, the new measure provides a ranking of topics and distinguishes high-quality from low-quality topics. The intuition is that the pairwise associations of words are strictly related to the semantic coherence and interpretability of a topic.

Archivio istituzionale della ricerca - Università di Palermo
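
The idea that documents are random mixtures over latent topics, each topic being a distribution over words, can be illustrated with a small sketch of the LDA generative process. Everything here is an assumption made for illustration (function names, the toy vocabulary, the two hand-written topics and the Dirichlet parameters); it samples documents from fixed topics and does not fit a model.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document: topic mixture first, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)  # document-level topic proportions
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words


# Hypothetical toy setup: two topics over a four-word vocabulary.
vocab = ["market", "stock", "match", "goal"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "finance" topic
    [0.0, 0.0, 0.5, 0.5],  # a "sport" topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, [0.5, 0.5], 8, rng)
print(theta, words)
```

Inference in LDA runs in the opposite direction: given only the words, it estimates the topic-word distributions and the per-document mixtures, which is why the resulting topics need a separate coherence check.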

Supervised vs Unsupervised Latent Dirichlet Allocation: topic detection in lyrics.

Author: Irene Carola Spera
Mariangela Sciandra
Alessandro Albano
Publication date: 01/01/2020

Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to a particular topic. It builds a fixed number of topics starting from the words in each document, modeled according to a Dirichlet distribution. In this work we apply LDA to a set of songs by four famous Italian songwriters and split them into topics. This work studies the use of themes in lyrics, using statistical analysis to detect topics. The aim of the work is to underline the main limits of the standard unsupervised LDA and to propose a supervised extension based on the Correspondence Analysis (CA) association theory.

Archivio istituzionale della ricerca - Università di Palermo

Distance-based aggregation and consensus for preference-approvals

Author: Mariangela Sciandra
Antonella Plaia
Alessandro Albano
Publication date: 01/01/2023

This paper proposes a distance-based aggregation and consensus method for preference-approvals, a type of preference data where individuals provide a list of approved alternatives in addition to a strict ranking. The proposed method aims to synthesize individual preference-approvals into a unified consensus representing the group's collective view. The consensus is the preference-approval that minimizes the average distance from the whole set of voters. The proposed method has potential applications in group decision-making, recommendation systems, and social choice theory.

Archivio istituzionale della ricerca - Università di Palermo
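
The notion of a consensus that minimizes the average distance from all voters can be sketched by brute force for a handful of alternatives. The abstract does not define the distance, so the one below is a stand-in assumption: a convex combination of a normalized Kendall distance between rankings and a normalized symmetric difference between approval sets. The representation of a preference-approval as a ranking plus a cutoff `k` (the top `k` items are approved), the weight `lam` and all function names are likewise hypothetical.

```python
from itertools import combinations, permutations


def kendall_distance(r1, r2):
    """Number of item pairs ordered differently by the two strict rankings."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def pa_distance(pa1, pa2, lam=0.5):
    """Assumed distance: weighted mix of ranking and approval disagreement."""
    (r1, k1), (r2, k2) = pa1, pa2
    m = len(r1)  # requires at least two alternatives
    rank_part = kendall_distance(r1, r2) / (m * (m - 1) / 2)
    approval_part = len(set(r1[:k1]) ^ set(r2[:k2])) / m
    return lam * rank_part + (1 - lam) * approval_part


def consensus(voters, lam=0.5):
    """Exhaustive search; only feasible for a small number of alternatives."""
    items = voters[0][0]
    candidates = [
        (perm, k) for perm in permutations(items) for k in range(len(items) + 1)
    ]
    return min(
        candidates,
        key=lambda c: sum(pa_distance(c, v, lam) for v in voters) / len(voters),
    )


voters = [
    (("a", "b", "c"), 2),
    (("a", "b", "c"), 2),
    (("b", "a", "c"), 1),
]
print(consensus(voters))
```

The exhaustive search grows factorially with the number of alternatives, so a practical method would need a smarter search strategy than this sketch.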

Ensemble methods for item-weighted label ranking: a comparison

Author: Mariangela Sciandra
Antonella Plaia
Alessandro Albano
Publication date: 01/01/2022

Label Ranking (LR), an emerging non-standard supervised classification problem, aims at training preference models that order a finite set of labels based on a set of predictor features. Traditional LR models regard all labels as equally important. However, in many cases, failing to predict the ranking position of a highly relevant label can be considered more severe than failing to predict a trivial one. Moreover, an efficient LR classifier should be able to take into account the similarity between the items to be ranked. Indeed, swapping two similar elements should be less penalized than swapping two dissimilar ones. The contribution of the present paper is to formulate more flexible item-weighted label ranking models that make use of well-known decision tree ensemble models; respectively: bagging, random forest and boosting. The three proposed weighted LR classifiers encode the similarity structure and the individual label importance provided by a domain expert. The predictive performances of the three algorithms are compared, through simulations, to determine which ensemble procedure produces the best results for different noise levels and weight sets

Archivio istituzionale della ricerca - Università di Palermo

Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches

Author: Mariangela Sciandra
Antonella Plaia
Alessandro Albano
Publication date: 01/01/2022

Probabilistic topic models are machine learning tools for processing and understanding large text document collections. Among the different models in the literature, Latent Dirichlet Allocation (LDA) has turned out to be the benchmark of the topic modelling community. The key idea is to represent text documents as random mixtures over latent semantic structures called topics. Each topic follows a multinomial distribution over the vocabulary words. In order to understand the result of a topic model, researchers usually select the top-n words with the highest probability given a topic (the essential words) and look for meaningful and interpretable semantic themes. This work proposes a new method for exploring topics in LDA models, using Statistically Validated Networks (SVNs). The main idea of the proposed method is to consider co-occurrence between essential words as a measure of association. Two different approaches, called undirected and directed, are proposed. In the undirected approach, the symmetrical association between two words is taken into account, i.e. how many times two words are found in the same sentence. Conversely, in the directed approach, the order in which the words appear in the sentence is also considered. We use hypothesis testing to assess whether the co-occurrence between two words can be attributed to chance or whether these links carry relevant information about the structure of topics. Specifically, textual data is represented as a bipartite network in which one set of nodes is made up of sentences, and the other set of nodes is made up of the essential words associated with a given topic. A link between a word and a sentence is set if the word belongs to that sentence. Therefore, the projection of the bipartite network on the set of words results in a word co-occurrence network. Note that the directed approach produces a directed network while the undirected one produces an undirected network.
Indeed, a directed link from one word to another may be validated, but not the other way around. The two methods are applied to a real dataset, highlighting the differences.

Archivio istituzionale della ricerca - Università di Palermo
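
The projection of the sentence-word bipartite network onto the set of essential words, in both its undirected and directed forms, can be sketched as follows. This is an illustrative reading of the abstract above rather than the authors' code: the function name, the tokenized-sentence input, and the rule that a directed link goes from the word whose first occurrence comes earlier in the sentence are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, essential_words, directed=False):
    """Count, per word pair, the sentences in which both words occur.

    Undirected: pairs are stored in alphabetical order.
    Directed: (a, b) means a first appears before b in the sentence.
    """
    keep = set(essential_words)
    counts = Counter()
    for sentence in sentences:
        # Bipartite links: which essential words belong to this sentence,
        # together with the position of their first occurrence.
        first_pos = {}
        for i, token in enumerate(sentence):
            if token in keep and token not in first_pos:
                first_pos[token] = i
        ordered = sorted(first_pos, key=first_pos.get)
        for a, b in combinations(ordered, 2):
            key = (a, b) if directed else tuple(sorted((a, b)))
            counts[key] += 1
    return counts


sentences = [
    ["market", "stock", "price"],
    ["stock", "market"],
    ["price", "fell"],
]
words = ["market", "stock", "price"]
print(cooccurrence_counts(sentences, words))
print(cooccurrence_counts(sentences, words, directed=True))
```

These raw counts are only the input to the validation step: each pair would still have to be tested against the null hypothesis of random co-occurrence before a link is kept.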

A comparison of ensemble algorithms for item-weighted Label Ranking

Author: Mariangela Sciandra
Antonella Plaia
Alessandro Albano
Publication date: 01/01/2023

Label Ranking (LR) is a non-standard supervised classification method with the aim of ranking a finite collection of labels according to a set of predictor variables. Traditional LR models assume indifference among alternatives. However, misassigning the ranking position of a highly relevant label is frequently regarded as more severe than failing to predict a trivial label. Moreover, switching two similar alternatives should be considered less severe than switching two different ones. Therefore, efficient LR classifiers should be able to take into account the similarities and individual weights of the items to be ranked. The contribution of this paper is to formulate and compare flexible item-weighted Label Ranking algorithms using bagging, random forest, and boosting ensemble methods

Archivio istituzionale della ricerca - Università di Palermo