Journal of Statistical Software
BoXHED2.0: Scalable Boosting of Dynamic Survival Analysis
Modern applications of survival analysis increasingly involve time-dependent covariates. The Python package BoXHED2.0 (Boosted eXact Hazard Estimator with Dynamic covariates) is a tree-boosted hazard estimator that is fully nonparametric and applicable to survival settings far more general than right-censoring, including recurring events and competing risks. BoXHED2.0 is also scalable to the point of being on the same order of speed as parametric boosted survival models, in part because its core is written in C++ and it supports the use of GPUs and multicore CPUs. BoXHED2.0 is available from PyPI and also from https://github.com/BoXHED
pymle: A Python Package for Maximum Likelihood Estimation and Simulation of Stochastic Differential Equations
This paper introduces the object-oriented Python package pymle, which provides core functionality for maximum likelihood estimation and simulation of univariate stochastic differential equations. The package supports maximum likelihood estimation using Euler, Elerian, Ozaki, Shoji-Ozaki, Hermite polynomial, and Kessler density approximations, as well as a recently proposed continuous-time Markov chain approximation scheme. Exact maximum likelihood estimation is also provided when available. The framework supports estimation and simulation for 21 stochastic differential equation models at the time of writing, and its object-oriented design facilitates easy extensions to new models and approximation methods.
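To make the Euler density approximation mentioned above concrete, here is a minimal sketch that does not use pymle at all. It assumes an Ornstein-Uhlenbeck model dX = kappa (mu - X) dt + sigma dW, chosen purely for illustration; the function names `simulate_ou_euler` and `fit_ou_euler` are hypothetical. Under the Euler approximation each transition is Gaussian, so the pseudo-likelihood for this model is maximized in closed form by a least-squares regression of the increments on the current state.

```python
import math
import random


def simulate_ou_euler(kappa, mu, sigma, x0, dt, n_steps, seed=0):
    """Simulate dX = kappa*(mu - X) dt + sigma dW with the Euler scheme."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x + kappa * (mu - x) * dt + noise)
    return path


def fit_ou_euler(path, dt):
    """Maximize the Euler pseudo-likelihood for the OU model.

    Euler transition: X[t+dt] | X[t] ~ N(X[t] + kappa*(mu - X[t])*dt, sigma^2*dt),
    so increments follow dx = intercept + slope * x + error with
    intercept = kappa*mu*dt and slope = -kappa*dt (ordinary least squares).
    """
    x = path[:-1]
    dx = [b - a for a, b in zip(path[:-1], path[1:])]
    n = len(x)
    x_bar = sum(x) / n
    dx_bar = sum(dx) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (di - dx_bar) for xi, di in zip(x, dx))
    slope = sxy / sxx
    intercept = dx_bar - slope * x_bar
    kappa = -slope / dt
    mu = intercept / (kappa * dt)
    rss = sum((di - intercept - slope * xi) ** 2 for xi, di in zip(x, dx))
    sigma = math.sqrt(rss / (n * dt))  # Gaussian MLE of the diffusion scale
    return kappa, mu, sigma


if __name__ == "__main__":
    path = simulate_ou_euler(kappa=2.0, mu=1.0, sigma=0.5, x0=1.0,
                             dt=0.01, n_steps=20000)
    print(fit_ou_euler(path, dt=0.01))
```

The other approximations named in the abstract replace the Gaussian Euler transition density with a more accurate one and generally require numerical optimization; they are not sketched here.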
stopp: An R Package for Spatio-Temporal Point Pattern Analysis
stopp is a novel R package specifically designed for the analysis of spatio-temporal point patterns which might have occurred in a subset of the Euclidean space or on some specific linear network, such as roads of a city. It represents the first package providing a comprehensive modeling framework for spatio-temporal Poisson point processes. While many specialized models exist in the scientific literature for analyzing complex spatio-temporal point patterns, we address the lack of general software for comparing simpler alternative models and their goodness of fit. The package's main functionalities include modeling and diagnostics, together with exploratory analysis tools and the simulation of point processes. A particular focus is given to local first-order and second-order characteristics. The package aggregates existing methods within one coherent framework, including those we proposed in recent papers, and it aims to welcome many further proposals and extensions from the R community
Stability Selection and Consensus Clustering in R: The R Package sharp
The R package sharp (Stability-enHanced Approaches using Resampling Procedures) provides an integrated framework for stability-enhanced variable selection, graphical modeling and clustering. In stability selection, a feature selection algorithm is combined with a resampling technique to estimate feature selection probabilities. Features with selection proportions above a threshold are considered stably selected. Similarly, in consensus clustering, a clustering algorithm is applied to multiple subsamples of items to compute co-membership proportions. The consensus clusters are obtained by clustering using co-membership proportions as a measure of similarity. We calibrate the hyper-parameters of stability selection (or consensus clustering) jointly by maximizing a consensus score calculated under the null hypothesis of equiprobability of selection (or co-membership), which characterizes instability. The package offers flexibility in the modeling, includes diagnostic and visualization tools, and allows for parallelization.
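The resampling idea behind stability selection can be sketched in a few lines. This is not the sharp API: the function names are hypothetical, the base selector (keep the k features most correlated with the response) is a stand-in chosen for brevity, and the threshold is fixed by hand rather than calibrated with the consensus score described in the abstract.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(sxy) / math.sqrt(sxx * syy)


def select_top_k(X, y, rows, k):
    """Stand-in selector: the k features most correlated with y on `rows`."""
    resp = [y[i] for i in rows]
    scores = []
    for j in range(len(X[0])):
        col = [X[i][j] for i in rows]
        scores.append((abs_corr(col, resp), j))
    scores.sort(reverse=True)
    return {j for _, j in scores[:k]}


def selection_proportions(X, y, k, n_resamples=100, frac=0.5, seed=0):
    """Share of subsamples in which each feature is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = int(frac * n)
    counts = [0] * p
    for _ in range(n_resamples):
        rows = rng.sample(range(n), m)  # subsample without replacement
        for j in select_top_k(X, y, rows, k):
            counts[j] += 1
    return [c / n_resamples for c in counts]


def stable_features(proportions, threshold):
    """Features whose selection proportion reaches the threshold."""
    return [j for j, prop in enumerate(proportions) if prop >= threshold]
```

Consensus clustering follows the same pattern, with counts of how often two items fall in the same cluster taking the place of the per-feature selection counts.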
Estimating Spatial Dynamic Panel Data Models with Unobserved Common Factors in Stata
This article introduces the spxtivdfreg package in Stata, which implements a general instrumental variables (IV) approach for estimating dynamic spatial panel data models with unobserved common factors or interactive effects, when the number of both cross-sectional and time series observations is large. The estimator has been developed in a recent paper by Cui, Sarafidis, and Yamagata (2023). The underlying idea is to project out the common factors from exogenous covariates using principal components analysis, and to run IV regression in each of two stages, using defactored covariates (and their spatial counterparts) as instruments. The resulting two-stage IV estimator is valid for models with homogeneous slope coefficients, and has several advantages relative to existing popular approaches. In addition, the spxtivdfreg package allows estimation of short-run and long-run direct and indirect effects, as well as total effects, accounting for the cumulative effects over time and across space. Standard errors for such effects are computed using the Delta method. Last, the spxtivdfreg package allows for heterogeneous slope coefficients, as in Chen, Cui, Sarafidis, and Yamagata (2025). In particular, we construct a "mean group" IV estimator, which involves averaging first-step IV estimates of individual-specific slope coefficients.
ebnm: An R Package for Solving the Empirical Bayes Normal Means Problem Using a Variety of Prior Families
The empirical Bayes normal means (EBNM) model is important to many areas of statistics, including (but not limited to) multiple testing, wavelet denoising, and gene expression analysis. There are several existing software packages that can fit EBNM models under different prior assumptions and using different algorithms. However, the differences across interfaces complicate direct comparisons, and a number of important prior assumptions do not yet have implementations. Motivated by these issues, we developed the R package ebnm, which provides a unified interface for efficiently fitting EBNM models using a variety of prior assumptions, including nonparametric approaches. In some cases, we incorporated existing implementations into ebnm; in others, we implemented new fitting procedures, with an emphasis on speed and numerical stability. We illustrate the use of ebnm in a detailed analysis of baseball statistics. By providing a unified and easily extensible interface, ebnm can facilitate development of new methods in statistics, genetics, and other areas; as an example, we briefly discuss the R package flashier, which harnesses ebnm for flexible and robust matrix factorization
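As a small worked instance of the EBNM problem, the sketch below handles the simplest case only: observations x_i ~ N(theta_i, s^2) with a common known standard error s, and a zero-mean normal prior theta_i ~ N(0, tau^2). These modeling choices and the function name `ebnm_normal` are assumptions made for illustration; this is not the ebnm interface. Marginally x_i ~ N(0, tau^2 + s^2), so the prior variance has a closed-form maximum likelihood estimate, and the posterior means shrink each observation toward zero.

```python
def ebnm_normal(x, s):
    """Empirical Bayes normal means with a N(0, tau^2) prior and common s > 0.

    Step 1 (empirical Bayes): estimate tau^2 by maximizing the marginal
    likelihood x_i ~ N(0, tau^2 + s^2), which gives max(0, mean(x^2) - s^2).
    Step 2: posterior of theta_i is N(w * x_i, w * s^2) with
    w = tau^2 / (tau^2 + s^2).
    """
    n = len(x)
    tau2 = max(0.0, sum(xi * xi for xi in x) / n - s * s)
    weight = tau2 / (tau2 + s * s)
    post_mean = [weight * xi for xi in x]
    post_sd = (weight * s * s) ** 0.5
    return tau2, post_mean, post_sd


if __name__ == "__main__":
    print(ebnm_normal([2.0, -2.0, 2.0, -2.0], s=1.0))
```

With observations of plus or minus 2 and s = 1, the estimated prior variance is 3, so every observation is shrunk by the factor 0.75. Richer prior families change both steps but keep the same structure: estimate the prior from the data, then compute posteriors under it.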
GLMcat: An R Package for Generalized Linear Models for Categorical Responses
In statistical modeling, there is a wide variety of generalized linear models for categorical response variables (nominal or ordinal responses); yet, there is no software embracing all these models together in a single generic framework. We propose and present GLMcat, an R package to estimate generalized linear models implemented under the unified specification (r, F, Z), where r represents the ratio of probabilities (reference, cumulative, adjacent, or sequential), F the cumulative distribution function for the linkage, and Z the design matrix. All classical models (and their variations) for categorical data can be written as an (r, F, Z) triplet; thus, they can be fitted with GLMcat. The functions in the package are intuitive and user-friendly. For each of the three components, there are multiple alternatives from which the user should carefully select those that best address the objectives of the analysis. The main strengths of the GLMcat package are the possibility of choosing from a large number of link functions (defined by the composition of F and r) and the simplicity of setting constraints in the linear prediction, either on the intercepts or on the slopes. This paper proposes a methodological and practical guide for the appropriate selection of a model considering the concordance between the nature of the data and the properties of the model.
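The role of the ratio r and the distribution function F can be illustrated with a short sketch. It is not GLMcat code: the function names are hypothetical, the linear predictors eta_j (which would come from the design matrix Z) are passed in directly, and the ratio definitions used are assumptions based on the usual formulation, namely r_j = P(Y <= j) for the cumulative ratio and r_j = pi_j / (pi_j + pi_J) for the reference ratio, with r_j = F(eta_j) in both cases.

```python
import math


def logistic_cdf(eta):
    """Logistic distribution function, one possible choice of F."""
    return 1.0 / (1.0 + math.exp(-eta))


def probs_cumulative(etas, cdf=logistic_cdf):
    """Cumulative ratio: P(Y <= j) = F(eta_j); etas must be increasing."""
    cum = [cdf(e) for e in etas] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


def probs_reference(etas, cdf=logistic_cdf):
    """Reference ratio: pi_j / (pi_j + pi_J) = F(eta_j), J the last category."""
    odds = [cdf(e) / (1.0 - cdf(e)) for e in etas]  # pi_j / pi_J
    denom = 1.0 + sum(odds)
    return [o / denom for o in odds] + [1.0 / denom]


if __name__ == "__main__":
    print(probs_cumulative([-1.0, 1.0]))  # three ordered categories
    print(probs_reference([0.0, 0.0]))    # three nominal categories
```

Under these assumptions, the logistic F combined with the cumulative ratio gives the proportional odds form and combined with the reference ratio gives the multinomial logit form; passing a different `cdf` changes the link while leaving the ratio untouched.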
TrendLSW: Trend and Spectral Estimation of Nonstationary Time Series in R
The TrendLSW R package has been developed to provide users with a suite of wavelet-based techniques to analyze the statistical properties of nonstationary time series. The key components of the package are (a) two approaches for the estimation of the evolutionary wavelet spectrum in the presence of trend; (b) wavelet-based trend estimation in the presence of locally stationary wavelet errors via both linear and nonlinear wavelet thresholding; and (c) the calculation of associated pointwise confidence intervals. Lastly, the package directly implements boundary handling methods that enable the methods to be performed on data of arbitrary length, not just dyadic length as is common for wavelet-based methods, ensuring no preprocessing of data is necessary. The key functionality of the package is demonstrated through two data examples, arising from biology and activity monitoring.
watson: An R Package for Fitting Mixtures of Watson Distributions
In this paper, we present and showcase the R package watson, which provides a computational framework for fitting and random sampling of the Watson distribution on a p-dimensional sphere. We first introduce the random sampling scheme of the package, which offers two sampling algorithms that are based on the results of Sablica, Hornik, and Leydold (2025). Moreover, the package offers a tool that combines these two methods: based on the selected parameters, it approximates the relative sampling speed of both methods and picks the faster one. In addition, we describe the main fitting function for mixtures of Watson distributions, which uses the expectation-maximization (EM) algorithm. Special features are the possibility to use multiple variants of the E-step and M-step, sparse matrices for the data representation, and a control parameter that dynamically eliminates small clusters whose overall contribution is smaller than this parameter. We also discuss the numerical issues of the whole fitting procedure and describe how they are handled in the package. Finally, we demonstrate the package on multiple examples involving a misspecified simulation study, estimation on the New Zealand earthquake data, and depth image clustering.
BayesMix: Bayesian Mixture Models in C++
We describe BayesMix, a C++ library for MCMC posterior simulation for general Bayesian mixture models. The goal of BayesMix is to provide computer scientists, statisticians and practitioners with a self-contained ecosystem for performing inference in mixture models. The key idea of this library is extensibility, as we wish the users to easily adapt our software to their specific Bayesian mixture models. In addition to the several models and MCMC algorithms for posterior inference included in the library, new users with little familiarity with mixture models and the related MCMC algorithms can extend our library with minimal coding effort. Our library is computationally very efficient when compared to competitor software. Examples show that the typical code runtimes are from two to 25 times faster than competitors for data dimension from one to ten. We also provide Python (bayesmixpy) and R (bayesmixr) interfaces. Our library is publicly available on GitHub at https://github.com/bayesmix-dev/bayesmix/