Low-Rank Analysis of Topic Quality: Comparing LDA, CTM, and Fuzzy-LSA methods
The aim of this study is to evaluate the quality of topic solutions generated by Latent Dirichlet Allocation (LDA), Correlated Topic Model (CTM), and fuzzy Latent Semantic Analysis (fLSA). By introducing the CL, RL, and HO indices, the study focuses on structural properties such as oversimplification, redundancy, and homogeneity, offering a novel approach to complement traditional metrics like coherence and perplexity. This framework provides a nuanced perspective for assessing topic quality
Assessing CO2 emissions from electricity generation: a methodological review and comparative analysis
Accurate estimation of greenhouse gas (GHG) emissions is essential to meet carbon neutrality
targets, particularly through the calculation of direct CO2 emissions from electricity generation.
This work reviews and compares emission factor-based methods of accounting for
direct carbon emissions from electricity generation. The emission factor approach is commonly
used worldwide. Empirical comparisons are based on emission factors computed
using data from the Italian electricity market. The analyses reveal significant differences
in the CO2 estimates according to different methods. This, in turn, highlights the need
to select an appropriate method for reliable emissions estimates, which could support effective regulatory
compliance and informed policy-making. With regard, in particular, to the market
zones of the Italian electricity market, the results underscore the importance of tailoring
emission factors to accurately capture regional fuel variations
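The emission factor approach mentioned in this abstract can be illustrated with a minimal sketch: direct emissions are taken as generation per fuel multiplied by a fuel-specific emission factor, summed over fuels. The function name, fuel labels, and all numbers below are placeholders chosen for illustration; they are not values from the study or from the Italian electricity market.

```python
def direct_co2_emissions(generation_mwh, emission_factors):
    """Sum generation (MWh) times emission factor (tCO2/MWh) over all fuels."""
    return sum(
        generation_mwh[fuel] * emission_factors[fuel]
        for fuel in generation_mwh
    )


# Placeholder inputs, for illustration only (not real market data).
generation_mwh = {"gas": 100.0, "coal": 50.0, "hydro": 200.0}
emission_factors = {"gas": 0.4, "coal": 0.9, "hydro": 0.0}

total = direct_co2_emissions(generation_mwh, emission_factors)
print(f"Total direct emissions: {total:.1f} tCO2")
```

Under this kind of calculation, changing the fuel mix or the factors assigned to a zone changes the estimate directly, which is one way to read the abstract's point about tailoring emission factors to regional fuel variations.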
Predicting and Preventing gender-based violence: A strategic framework for long-term change
Gender-based violence (GBV) remains a critical global issue, requiring proactive prevention strategies to mitigate its long-term impact. This study examines the evolving landscape of GBV prevention, highlighting a shift from reactive interventions to forward-looking strategies. Using the futures cone and three horizons framework, we developed a sustainable model for GBV mitigation. Through Natural Language Processing analysis of survivor narratives, we identified linguistic and semantic patterns that reveal resilience and opportunities for early intervention. Our data-driven approach provides policymakers and advocates with actionable insights to drive systemic change and reduce GBV prevalence
Lost in Noise: When cleaning up clouds the picture. Fuzzy topic modeling and robust low-rank decomposition
This preliminary study assesses the impact of noise-removing techniques, such as Principal Component Pursuit (PCP), on the document-term matrix before topic modeling. Specifically, fuzzy Latent Semantic Analysis (fLSA) is applied to a benchmark dataset of Air France customer reviews to evaluate how different input representations – namely, the standard term-frequency matrix and its low-rank approximation via low-rank decomposition – affect topic coherence and interpretability. Initial results indicate that while fLSA effectively extracts meaningful topics, noise removal via PCP introduces distortions, altering topic structure
Functional Clustering for Survival Curves
This paper investigates the underexplored area of clustering multiple survival curves,
with a focus on the advantages of Functional Data Analysis for analyzing survival or
hazard functions to exploit their inherent continuous nature. We propose customized
functional methods, particularly leveraging Functional Principal Component Analysis,
and compare them with existing methods using two real datasets: the German Breast
Cancer Study (GBCS) and the Lung Cancer dataset. The results show that FDA-based
methods offer faster execution times and improve clustering quality overall, highlighting
the potential of FDA as a more natural and efficient approach for clustering survival
curves, making it a promising direction for future survival data analysis
A Novel Metric for Enhancing Online Review Relevance in E-commerce
In the realm of e-commerce, online reviews are a crucial resource for consumers, yet their usefulness is often hindered by the overwhelming quantity and variability of information. This study proposes an innovative approach to balancing numerical ratings with the sentiment extracted from review texts, leveraging the VADER (Valence Aware Dictionary and sEntiment Reasoner) model. The proposed metric identifies atypical and incongruent reviews by evaluating the consistency between numerical ratings and the sentiment conveyed in textual content.
Through the analysis of real-world review datasets, we demonstrate how this system enhances the relevance of information for consumers, enabling them to navigate reviews with greater ease. Tested on datasets comprising 3 million reviews, the results show that integrating this metric into e-commerce platforms can not only optimize the shopping experience but also provide businesses with an opportunity to increase transparency and foster customer loyalty. This work contributes to the ongoing discourse on the importance of AI-driven tools in supporting informed decision-making within digital marketing
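The abstract does not spell out the proposed metric, so the following is only a hypothetical sketch of one way to check consistency between a star rating and a text sentiment score. It assumes the sentiment score is a VADER-style compound value in [-1, 1], supplied as an input; the function names, the 1-to-5 star scale, and the threshold are all assumptions made for illustration.

```python
def rating_to_unit_scale(stars, min_stars=1, max_stars=5):
    """Linearly rescale a star rating to the interval [-1, 1]."""
    return 2 * (stars - min_stars) / (max_stars - min_stars) - 1


def incongruence(stars, compound):
    """Absolute gap between the rescaled rating and the sentiment score."""
    return abs(rating_to_unit_scale(stars) - compound)


def is_incongruent(stars, compound, threshold=1.0):
    """Flag a review whose rating and text sentiment disagree strongly.

    The threshold is an arbitrary illustrative choice, not a tuned value.
    """
    return incongruence(stars, compound) > threshold


# A 5-star review with clearly negative text is flagged;
# a 5-star review with positive text is not.
print(is_incongruent(5, -0.8))
print(is_incongruent(5, 0.9))
```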
Segmenting the spatial distribution of the adjusted dissimilarity index to detect residential segregation of foreigners in Campania
Residential segregation of the foreign population can depend on several socioeconomic and demographic factors related to both the resident population and the territorial context. By choosing the adjusted dissimilarity index to assess the evenness of the spatial distribution of foreign residents with respect to the Italian population, we propose to resort to conditional inference trees to identify the contextual variables, measured at two spatial domains, that are most strongly associated with the chosen measure of residential segregation in Campania, Italy. The analysis distinguishes between European and non-European foreigners, to highlight differences in their settlement models
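For orientation, the classic (unadjusted) dissimilarity index can be computed as half the sum of absolute differences between the two groups' shares across areal units. The sketch below shows only that baseline; the adjustment used in the study is not described in the abstract and is not reproduced here. The function name and the example counts are illustrative assumptions.

```python
def dissimilarity_index(group_a, group_b):
    """Classic dissimilarity index between two groups over the same areal units.

    group_a and group_b are per-unit counts (e.g. foreign and Italian
    residents). Returns a value in [0, 1]: 0 means identical spatial
    distributions, 1 means complete separation.
    """
    total_a = sum(group_a)
    total_b = sum(group_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs a positive total count")
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(group_a, group_b)
    )


# Illustrative counts only: complete separation, then perfectly even shares.
print(dissimilarity_index([10, 0], [0, 10]))
print(dissimilarity_index([5, 5], [50, 50]))
```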
Improved prediction of 100-meter sprint records
In recent years, the prediction of sports records has received increased attention from the scientific community. In particular, evaluating the quality of a record is of great interest. The application of extreme value theory in this context is quite natural. In this work, we use the Gumbel model to analyze the annual speed records in men’s and women’s 100-meter sprint races from 2001 to 2024. We propose a new calibration procedure to correctly estimate the probability of future records and the expected time needed to break the current world record
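As a rough illustration of how a Gumbel model yields these two quantities, the sketch below computes the probability that one year's best speed exceeds a given record, and the expected number of years until that happens. It assumes independent, identically distributed annual maxima, so the waiting time is geometric with mean 1/p. It does not implement the calibration procedure proposed in the paper, and the parameter values are placeholders, not estimates.

```python
import math


def gumbel_cdf(x, mu, beta):
    """Gumbel (maximum) CDF with location mu and scale beta > 0."""
    return math.exp(-math.exp(-(x - mu) / beta))


def exceedance_probability(record, mu, beta):
    """Probability that a single annual maximum exceeds the record."""
    return 1.0 - gumbel_cdf(record, mu, beta)


def expected_years_to_break(record, mu, beta):
    """Mean waiting time in years, assuming i.i.d. annual maxima."""
    return 1.0 / exceedance_probability(record, mu, beta)


# Placeholder parameters in m/s, for illustration only.
mu, beta, record = 10.30, 0.05, 10.44
p = exceedance_probability(record, mu, beta)
print(f"P(exceed in one year) = {p:.4f}")
print(f"Expected years to break = {expected_years_to_break(record, mu, beta):.1f}")
```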
Scoring ordinal variables for constructing composite indicators
In order to provide composite indicators of latent variables, for example of customer satisfaction, it is useful to identify the structure of the latent variable, in terms of the assignment of items to the subscales defining the latent variable. Adopting the reflective model, the impact of four different methods of scoring ordinal variables on the identification of the true structure of latent variables is investigated. A simulation study composed of 5 steps is conducted: (1) simulation of population data with continuous variables measuring a two-dimensional latent variable with known structure; (2) drawing of a number of random samples; (3) discretization of the continuous variables according to different distributional forms; (4) quantification of the ordinal variables obtained in step (3) according to different methods; (5) construction of composite indicators and verification of the correct assignment of variables to subscales by the multiple group method and factor analysis. Results show that the considered scoring methods perform similarly in assigning items to subscales, and that, when the latent variable is multinormal, the distributional form of the observed ordinal variables is not decisive in suggesting the best scoring method to use
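Steps (3) and (4) above can be sketched in a simplified form: a continuous variable is cut into ordered categories at chosen thresholds, and the category codes are then used as scores. The thresholds, the sample size, and the use of equally spaced integer scores are assumptions made for illustration; the abstract does not name the four scoring methods compared in the study.

```python
import bisect
import random


def discretize(values, thresholds):
    """Map each continuous value to an ordinal category 1..len(thresholds)+1.

    thresholds must be sorted in increasing order.
    """
    return [bisect.bisect_right(thresholds, v) + 1 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
continuous = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Two hypothetical distributional forms for the observed ordinal variable.
symmetric = discretize(continuous, [-1.5, -0.5, 0.5, 1.5])
skewed = discretize(continuous, [0.5, 1.0, 1.5, 2.0])

# Equally spaced integer scoring: the category code itself is the score.
mean_symmetric = sum(symmetric) / len(symmetric)
mean_skewed = sum(skewed) / len(skewed)
print(mean_symmetric, mean_skewed)
```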
