
    Generalized spherical principal component analysis

    Outliers contaminating data sets are a challenge to statistical estimators. Even a small fraction of outlying observations can heavily influence most classical statistical methods. In this paper we propose generalized spherical principal component analysis, a new robust version of principal component analysis that is based on the generalized spatial sign covariance matrix. Theoretical properties of the proposed method including influence functions, breakdown values and asymptotic efficiencies are derived. These theoretical results are complemented with an extensive simulation study and two real-data examples. We illustrate that generalized spherical principal component analysis can combine great robustness with solid efficiency properties, in addition to a low computational cost.
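As a rough illustration of the idea behind this abstract, the sketch below implements classical spherical PCA, that is, PCA on the ordinary spatial sign covariance matrix. It is not the generalized variant proposed in the paper. The coordinatewise median used as the center, the function names and the toy data are all assumptions made for this example.

```python
import numpy as np

def spatial_signs(X, center):
    """Project each centered observation onto the unit sphere."""
    Z = X - center
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    # Observations that coincide with the center keep a zero sign.
    safe = np.where(norms > 0, norms, 1.0)
    return Z / safe

def spherical_pca(X, n_components=2):
    """Eigenvectors of the (ordinary) spatial sign covariance matrix."""
    X = np.asarray(X, dtype=float)
    # Assumption: coordinatewise median as a simple robust center.
    center = np.median(X, axis=0)
    S = spatial_signs(X, center)
    sscm = S.T @ S / len(S)
    eigvals, eigvecs = np.linalg.eigh(sscm)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]

# Toy data: most variance along the x-axis, plus gross outliers along y.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 2)) * np.array([5.0, 1.0])
outliers = np.tile([0.0, 200.0], (10, 1))
X = np.vstack([clean, outliers])

components, _ = spherical_pca(X, n_components=1)
robust_direction = components[:, 0]

# Classical PCA on the same data, for comparison.
_, vecs = np.linalg.eigh(np.cov(X.T))
classical_direction = vecs[:, -1]
```

In this toy setting the outliers pull the classical first component toward the y-axis, while the sign-based component stays close to the x-axis. Each outlier contributes only a unit-length vector to the sign covariance matrix, however far away it lies.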

    Computational Efficient Approximations of the Concordance Probability in a Big Data Setting

    Performance measurement is an essential task once a statistical model is created. The area under the receiver operating characteristic curve (AUC) is the most popular measure for evaluating the quality of a binary classifier. In this case, the AUC is equal to the concordance probability, a frequently used measure to evaluate the discriminatory power of the model. Contrary to AUC, the concordance probability can also be extended to the situation with a continuous response variable. Due to the staggering size of data sets nowadays, determining this discriminatory measure requires a tremendous amount of costly computations and is hence immensely time-consuming, certainly in the case of a continuous response variable. Therefore, we propose two estimation methods that calculate the concordance probability in a fast and accurate way and that can be applied to both the discrete and continuous setting. Extensive simulation studies show the excellent performance and fast computing times of both estimators. Finally, experiments on two real-life data sets confirm the conclusions of the artificial simulations.
    Sponsorship: This work was supported by the Allianz Research Chair Prescriptive business analytics in insurance at KU Leuven and the International Funds KU Leuven under Grant C16/15/068.
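To make the quantity in this abstract concrete, here is a minimal sketch of the exact, brute-force concordance probability. It is the quadratic-time computation that the paper's estimators aim to approximate, not one of those estimators. Counting tied predictions as one half follows the usual AUC convention and is an assumption here; the function name is hypothetical.

```python
from itertools import combinations

def concordance_probability(y_true, y_pred):
    """Fraction of comparable pairs whose predictions are ordered like their responses."""
    concordant = 0.0
    comparable = 0
    for (yi, pi), (yj, pj) in combinations(zip(y_true, y_pred), 2):
        if yi == yj:
            continue  # pairs with tied responses are not comparable
        comparable += 1
        if (yi - yj) * (pi - pj) > 0:
            concordant += 1.0
        elif pi == pj:
            concordant += 0.5  # assumption: tied predictions count one half
    if comparable == 0:
        raise ValueError("no comparable pairs: all responses are equal")
    return concordant / comparable

# Binary response: the value coincides with the AUC.
binary_c = concordance_probability([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Continuous response: the same definition applies unchanged.
continuous_c = concordance_probability([1.0, 2.0, 3.0], [1.5, 1.0, 4.0])
```

The loop visits all n(n-1)/2 pairs, which is what becomes prohibitive for large data sets, especially with a continuous response where almost every pair is comparable.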

    Portfolio optimization using cellwise robust association measures and clustering methods with application to highly volatile markets

    This paper introduces the minCluster portfolio, which is a portfolio optimization method combining the optimization of downside risk measures, hierarchical clustering and cellwise robustness. Using cellwise robust association measures, the minCluster portfolio is able to retrieve the underlying hierarchical structure in the data. Furthermore, it provides downside protection by using tail risk measures for portfolio optimization. We show through simulation studies and a real data example that the minCluster portfolio produces better out-of-sample results than mean-variance or other hierarchical clustering based approaches. Cellwise outlier robustness makes the minCluster method particularly suitable for stable optimization of portfolios in highly volatile markets, such as portfolios containing cryptocurrencies.

    Interpretable cost-sensitive regression through one-step boosting

    In most practical prediction problems, such as regression and classification, the different types of prediction errors are not equally costly in the decision-making process. Although there exist numerous real-world cost-sensitive regression problems, ranging from loan charge-off forecasting to house price predictions, the literature on cost-sensitive learning mainly focuses on classification and only a few solutions are proposed for regression problems. These regressions are typically characterized by an asymmetric cost structure, where over- and underpredictions of a similar magnitude face vastly different costs. In this paper, we present a one-step boosting method (OSB) for cost-sensitive regression. The proposed methodology leverages a secondary learner to incorporate cost-sensitivity into an already trained cost-insensitive regression model. The secondary learner is defined as a linear function of certain variables deemed interesting for cost-sensitivity. These variables do not necessarily need to be the same as in the already trained model. An efficient optimization algorithm is achieved through iteratively reweighted least squares using the asymmetric cost function. The obtained results become interpretable through bootstrapping, enabling decision makers to distinguish important variables for cost-sensitivity as well as facilitating statistical inference. Applying different cost functions and various initial cost-insensitive learning methods on several public datasets consistently yields a significant reduction in the average misprediction cost, illustrating the excellent performance of our approach.
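The mechanism described in this abstract can be sketched as follows, under stated assumptions: a linear secondary learner is fitted to the residuals of an already trained model by iteratively reweighted least squares. The asymmetric quadratic cost, the cost ratio of 4 to 1, the function names and the simulated data are all choices made for this illustration and are not taken from the paper.

```python
import numpy as np

def one_step_correction(Z, residual, cost_under=4.0, cost_over=1.0, n_iter=50):
    """Fit a linear correction Z @ beta to base-model residuals by IRLS.

    Assumed cost: cost_under * e**2 when the final model underpredicts (e > 0),
    cost_over * e**2 when it overpredicts (e < 0).
    """
    Z = np.asarray(Z, dtype=float)
    r = np.asarray(residual, dtype=float)
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        err = r - Z @ beta
        # Each observation is weighted by the cost of its current error sign.
        w = np.where(err > 0, cost_under, cost_over)
        ZW = Z * w[:, None]
        new_beta = np.linalg.solve(Z.T @ ZW, ZW.T @ r)
        if np.allclose(new_beta, beta):
            break
        beta = new_beta
    return beta

def average_cost(err, cost_under=4.0, cost_over=1.0):
    """Average asymmetric quadratic cost of the errors."""
    return float(np.mean(np.where(err > 0, cost_under, cost_over) * err ** 2))

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=300)
y = 2.0 * x + rng.normal(scale=0.5, size=300)

# Stand-in for an already trained, cost-insensitive regression model.
base_pred = 2.0 * x

# Secondary learner: intercept plus x (these need not match the base model's inputs).
Z = np.column_stack([np.ones_like(x), x])
beta = one_step_correction(Z, y - base_pred)
final_pred = base_pred + Z @ beta

cost_before = average_cost(y - base_pred)
cost_after = average_cost(y - final_pred)
```

Because underprediction is the costlier error in this setup, the fitted correction shifts predictions upward on average. Refitting on bootstrap resamples would give a distribution for each coefficient in `beta`, which is one way to read the interpretability claim in the abstract.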

    Fraud Analytics: A Decade of Research -- Organizing Challenges and Solutions in the Field

    The literature on fraud analytics and fraud detection has seen a substantial increase in output in the past decade. This has led to a wide range of research topics and overall little organization of the many aspects of fraud analytical research. The focus of academics ranges from identifying fraudulent credit card payments to spotting illegitimate insurance claims. In addition, there is a wide range of methods and research objectives. This paper aims to provide an overview of fraud analytics in research and to organize the discipline and its many subfields more narrowly. We analyze a sample of almost 300 records on fraud analytics published between 2011 and 2020. In a systematic way, we identify the most prominent domains of application, challenges faced, performance metrics, and methods used. In addition, we build a framework for fraud analytical methods and propose a keywording strategy for future research. One of the key challenges in fraud analytics is access to public datasets. To further aid the community, we provide eight requirements for suitable data sets in research, motivated by our own work. We structure our sample of the literature in a database, which is available online for fellow researchers to investigate and potentially build upon.

    Data engineering for fraud detection

    Financial institutions increasingly rely upon data-driven methods for developing fraud detection systems, which are able to automatically detect and block fraudulent transactions. From a machine learning perspective, the task of detecting suspicious transactions is a binary classification problem and therefore many techniques can be applied. Interpretability is, however, of utmost importance for the management to have confidence in the model and for designing fraud prevention strategies. Moreover, models that enable the fraud experts to understand the underlying reasons why a case is flagged as suspicious will greatly facilitate their job of investigating the suspicious transactions. Therefore, we propose several data engineering techniques to improve the performance of an analytical model while retaining the interpretability property. Our data engineering process is decomposed into several feature and instance engineering steps. We illustrate the improvement in performance of these data engineering steps for popular analytical models on a real payment transactions data set.

    direpack: A Python 3 package for state-of-the-art statistical dimensionality reduction methods

    The direpack package brings a set of modern statistical dimensionality reduction techniques into the Python universe as a single, consistent package. Several of the methods included are only available as open source through direpack, whereas the package also offers competitive Python implementations of methods previously only available in other programming languages. In its present version, the package is structured in three subpackages for different approaches to dimensionality reduction: projection pursuit, sufficient dimension reduction and robust M estimators. As a corollary, the package also provides access to regularized regression estimators based on these reduced dimension spaces, as well as a set of classical and robust preprocessing utilities, including very recent developments such as generalized spatial signs. Finally, direpack has been written to be consistent with the scikit-learn API, such that the estimators can seamlessly be included in (statistical and/or machine) learning pipelines in that framework.