Frequent Itemset Mining with tidyclust in R
Unsupervised learning is closely associated with clustering, but other methods, such as data mining, also fall under this umbrella. In R, the tidyclust package provides a unified interface for clustering models but lacks support for data mining. This thesis addresses this gap by introducing the Apriori and ECLAT algorithms into tidyclust, with a focus on frequent itemset mining. Unlike traditional clustering models, frequent itemset mining produces groupings of column variables rather than cluster labels or partitions of observations. To address this, a novel clustering approach is proposed: items (columns) are grouped based on their “dominant” frequent itemset. A key contribution is a new prediction method, modeled as a recommender system, to predict missing items. This implementation extends tidyclust to support column-based clustering, with applications in market basket analysis and recommender systems.
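To make the idea of frequent itemset mining concrete, the following is a minimal, language-independent sketch of the Apriori level-wise search in plain Python. It is not the tidyclust interface and not the thesis's implementation. The function names `frequent_itemsets` and `dominant_itemset` are hypothetical, and the rule used for the "dominant" itemset (the largest frequent itemset containing the item, with ties broken by support and then alphabetically) is an assumption made only for illustration; the abstract does not define it.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset(items): support} for every itemset whose support
    (fraction of transactions containing it) is at least min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    current = {}
    for item in sorted({i for b in baskets for i in b}):
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result


def dominant_itemset(item, itemsets):
    """ASSUMPTION: 'dominant' = largest frequent itemset containing the item,
    ties broken by support, then alphabetically. Returns a sorted list."""
    containing = [(len(s), sup, sorted(s))
                  for s, sup in itemsets.items() if item in s]
    return max(containing)[2] if containing else None


# Toy market-basket data (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
for s, sup in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), sup)
print("dominant itemset for butter:", dominant_itemset("butter", itemsets))
```

Grouping each item by the itemset returned from `dominant_itemset` gives a column-based grouping in the spirit of the approach described above, under the stated assumption about what "dominant" means.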
Density-Based and Model-Based Clustering with Tidyclust in R
Clustering is a fundamental technique in unsupervised learning that can be used to find hidden patterns and structures within unlabeled data. The tidyclust package in R provides a unified interface for applying various clustering techniques to data. This paper outlines the addition of density-based clustering with DBSCAN and model-based clustering with Gaussian mixture models (GMMs) to the tidyclust package. DBSCAN can be performed using the db_clust() function, which uses the dbscan package implementation as its engine. GMMs can be fit using the gm_clust() function, which uses the mclust package implementation. This paper highlights the changes made to these underlying implementations in the process of bringing these methods into tidyclust, including changes to the model argument names, how the model is fit on data, and how the model is used to predict on future data.
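For readers unfamiliar with density-based clustering, the following is a minimal sketch of the DBSCAN procedure in plain Python. It is not the db_clust() interface or the dbscan package engine, and it covers fitting only, not prediction on new data. The function name `dbscan` and the parameter names `eps` and `min_pts` are illustrative choices, and the sketch assumes the common convention that a point counts as its own neighbor.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps of it.
    """
    n = len(points)
    # Brute-force neighborhoods; fine for a small illustration.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels


# Two tight groups and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

On this toy data the two tight groups form separate clusters and the isolated point is labeled as noise.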