1,721,101 research outputs found

    From Patterns in Data to Knowledge Discovery: What Data Mining Can Do

    No full text
    AbstractData mining is defined as the computational process of analyzing large amounts of data in order to extract patterns and useful information. In the last few decades, data mining has been widely recognized as a powerful yet versatile data-analysis tool in a variety of fields: information technology in primis, but also clinical medicine, sociology, physics.In this technical note we provide a high-level overview of the most prominent tasks and methods that form the basis of data mining. The note also focuses on some of the most recent yet promising interdisciplinary aspects of data mining

    The forget-set identification problem

    Full text link
    Machine Unlearning (MU) is the problem of removing the influence of user's unwanted evidence from a trained machine-learning model. MU is typically formulated so that the input unwanted evidence corresponds to a subset of the training set utilized to train the model upstream, which is commonly referred to as the "forget set". However, this requirement is often difficult to satisfy in real-world scenarios, as users may be unaware of the peculiarities of the training set or simply they do not have access to it. In a more realistic setting, users provide their unwanted evidence in a form that is more abstract than or anyway different from a precise subset of training data. In such cases, executing MU methods requires an essential and challenging preliminary step, which, to the best of our knowledge, has never been addressed so far: identifying the forget set based on user's unwanted evidence. In this paper, we fill this important gap in the MU literature and introduce the Forget-Set Identification (ForSId) problem: given a trained machine-learning model, an "unwanted set" of samples (evidence to unlearn), and a "wanted set" of samples (evidence to retain), identify the forget set as a subset of the training set, such that the similarity in the predictions of the original model and the model retrained on the training data remaining after the removal of the forget set is: (i) low on the unwanted set, indicating that the unwanted samples have been effectively unlearned by the model, and (ii) high on the wanted set, to ensure that the model keeps its original performance on the data to be retained. We define ForSId as an optimization problem, prove its NP-hardness, and devise an algorithm based on a theoretical connection to Red-Blue Set Cover. Our ForSId is a novel complementary problem to MU. It serves as a foundational step to be performed before executing MU methods, allowing for extending the range of applicability of MU to all those settings where user's unlearning evidence does not correspond to (or is too hard to be directly expressed in terms of) a forget set. We conduct extensive experiments based on the exact unlearning task (which is the most reliable one) on several real-world datasets and settings, involving nontrivial baselines. Results demonstrate high performance of our proposed algorithm and clear superiority over the baselines

    Correlation Clustering with Global Weight Bounds

    No full text
    Given a set of objects and nonnegative real weights expressing “positive” and “negative” feeling of clustering any two objects together, min-disagreement correlation clustering partitions the input object set so as to minimize the sum of the intra-cluster negative-type weights plus the sum of the inter-cluster positive-type weights. Min-disagreement correlation clustering is APX -hard, but efficient constant-factor approximation algorithms exist if the weights are bounded in some way. The weight bounds so far studied in the related literature are mostly local, as they are required to hold for every object-pair. In this paper, we introduce the problem of min-disagreement correlation clustering with global weight bounds, i.e., constraints to be satisfied by the input weights altogether. Our main result is a sufficient condition that establishes when any algorithm achieving a certain approximation under the probability constraint keeps the same guarantee on an input that violates the constraint. This extends the range of applicability of the most prominent existing correlation-clustering algorithms, including the popular Pivot, thus providing benefits, both theoretical and practical. Experiments demonstrate the usefulness of our approach, in terms of both worthiness of employing existing efficient algorithms, and guidance on the definition of weights from feature vectors in a task of fair clustering

    Correlation Clustering: From Local to Global Constraints

    Full text link
    Given a set of data objects, consider that object pairs are assigned two weights expressing the advantage of putting those objects in the same cluster or in separate clusters, respectively. Correlation clustering partitions the input object set so as to minimize the sum of the intra-cluster negative-type weights plus the sum of the inter-cluster positive-type weights. Existing approximation algorithms provide quality guarantees if the weights are bounded in some way. Regardless of the type, the weight bounds that have been so far studied are local bounds, i.e., constraints that are required to hold for every object pair in isolation. In this paper, we discuss global weight bounds in correlation clustering, and in particular, we derive bounds on edge weights' aggregate functions that are sufficient to lead to proved quality guarantees. Our formulation extends the range of applicability of the most prominent existing correlationclustering algorithms thus providing benefits, both theoretical and practical. Also, we showcase our results in a real-world scenario of feature selection for fair clustering
    corecore