Tight Performance Guarantees of Imitator Policies with Continuous Actions

Author: Restelli Marcello
Metelli Alberto Maria
Maran Davide
Publication date: 01/01/2023

Behavioral Cloning (BC) aims at learning a policy that mimics the behavior demonstrated by an expert. The current theoretical understanding of BC is limited to the case of finite actions. In this paper, we study BC with the goal of providing theoretical guarantees on the performance of the imitator policy in the case of continuous actions. We start by deriving a novel bound on the performance gap based on the Wasserstein distance, applicable to continuous-action experts, holding under the assumption that the value function is Lipschitz continuous. Since this latter condition is hardly fulfilled in practice, even for Lipschitz Markov Decision Processes and policies, we propose a relaxed setting, proving that the value function is always Hölder continuous. This result is of independent interest and allows obtaining in BC a general bound for the performance of the imitator policy. Finally, we analyze noise injection, a common practice in which the expert’s action is executed in the environment after the application of a noise kernel. We show that this practice allows deriving stronger performance guarantees, at the price of a bias due to the noise addition.

Archivio istituzionale della ricerca - Politecnico di Milano

Importance Sampling Techniques for Policy Optimization

Author: Montali Nico
Restelli Marcello
Metelli Alberto Maria
Papini Matteo
Publication date: 01/01/2020

How can we effectively exploit the collected samples when solving a continuous control task with Reinforcement Learning? Recent results have empirically demonstrated that multiple policy optimization steps can be performed with the same batch by using off-distribution techniques based on importance sampling. However, when dealing with off-distribution optimization, it is essential to take into account the uncertainty introduced by the importance sampling process. In this paper, we propose and analyze a class of model-free policy search algorithms that extend the recent Policy Optimization via Importance Sampling (Metelli et al., 2018) by incorporating two advanced variance reduction techniques: per-decision and multiple importance sampling. For both of them, we derive a high-probability bound, of independent interest, and then we show how to employ it to define a suitable surrogate objective function that can be used for both action-based and parameter-based settings. The resulting algorithms are finally evaluated on a set of continuous control tasks, using both linear and deep policies, and compared with modern policy optimization methods.

Archivio istituzionale della ricerca - Politecnico di Milano
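
As a rough illustration of the per-decision importance sampling mentioned in the abstract above, the sketch below contrasts it with ordinary trajectory-level importance sampling on a single trajectory. It is not the algorithm from the paper: the function names, the list-based trajectory representation, and the default discount factor are assumptions made for this example only.

```python
def trajectory_is_return(rewards, target_probs, behavior_probs, gamma=0.99):
    """Ordinary IS: the whole discounted return is scaled by the product
    of all per-step probability ratios along the trajectory."""
    weight = 1.0
    for p, q in zip(target_probs, behavior_probs):
        weight *= p / q
    discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    return weight * discounted_return


def per_decision_is_return(rewards, target_probs, behavior_probs, gamma=0.99):
    """Per-decision IS: the reward at step t is scaled only by the ratios
    of the actions taken up to and including step t."""
    total = 0.0
    weight = 1.0
    for t, (r, p, q) in enumerate(zip(rewards, target_probs, behavior_probs)):
        weight *= p / q  # cumulative ratio up to step t
        total += gamma ** t * weight * r
    return total
```

Here `target_probs[t]` and `behavior_probs[t]` stand for the probabilities that the target and behavioral policies assign to the action actually taken at step `t`. When the two policies coincide, both functions reduce to the plain discounted return.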

On the Relation between Policy Improvement and Off-Policy Minimum-Variance Policy Evaluation

Author: Restelli Marcello
Metelli Alberto Maria
Meta Samuele
Publication date: 01/01/2023

Off-policy methods are the basis of a large number of effective Policy Optimization (PO) algorithms. In this setting, Importance Sampling (IS) is typically employed for off-policy evaluation, with the goal of estimating the performance of a target policy, given samples collected with a different behavioral policy. However, in Monte Carlo simulation, IS represents a variance minimization approach. In this field, a suitable behavioral distribution is employed for sampling, which allows reducing the variance of the estimator below the one achievable when sampling from the target distribution. In this paper, we analyze IS in these two guises in the context of PO. We provide a novel view of off-policy PO, showing a connection between the policy improvement and variance minimization objectives. Then, we illustrate how minimizing the off-policy variance can, in some circumstances, lead to a policy improvement, with the advantage, compared with direct off-policy learning, of implicitly enforcing a trust region. Finally, we present numerical simulations on continuous RL benchmarks, with a particular focus on the robustness to small batch sizes.

Archivio istituzionale della ricerca - Politecnico di Milano
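
The variance-minimization view of importance sampling described in the abstract above can be checked on a small discrete example. The sketch below computes the exact variance of the single-sample estimator f(x) p(x) / q(x) with x drawn from q; the function name and the toy distributions are assumptions for illustration and are not taken from the paper.

```python
def is_estimator_variance(p, q, f):
    """Exact variance of the single-sample IS estimator f(x) * p(x) / q(x),
    where x is drawn from the sampling distribution q.
    Assumes q(x) > 0 wherever p(x) * f(x) != 0."""
    mean = sum(pi * fi for pi, fi in zip(p, f))
    second_moment = sum(
        qi * (fi * pi / qi) ** 2 for pi, qi, fi in zip(p, q, f) if qi > 0
    )
    return second_moment - mean ** 2


# Toy problem: uniform target distribution over four outcomes.
target = [0.25, 0.25, 0.25, 0.25]
values = [1.0, 2.0, 3.0, 4.0]

# Sampling from the target itself (plain Monte Carlo).
variance_on_target = is_estimator_variance(target, target, values)

# Sampling proportionally to f(x) * p(x), the classical minimum-variance choice
# for a nonnegative f.
proportional = [0.1, 0.2, 0.3, 0.4]
variance_proportional = is_estimator_variance(target, proportional, values)
```

With these numbers, sampling from the target gives a variance of 1.25, while the proportional sampling distribution drives the variance to zero up to floating-point error.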

Propagating Uncertainty in Reinforcement Learning via Wasserstein Barycenters

Author: Metelli Alberto Maria
Likmeta Amarildo
Restelli Marcello
Publication date: 01/01/2019

Archivio istituzionale della ricerca - Politecnico di Milano

IWDA: Importance Weighting for Drift Adaptation in Streaming Supervised Learning Problems

Author: Fedeli Filippo
Restelli Marcello
Metelli Alberto Maria
Trovò Francesco
Publication date: 01/01/2023

Distribution drift is an important issue for practical applications of machine learning (ML). In particular, in streaming ML, the data distribution may change over time, yielding the problem of concept drift, which affects the performance of learners trained with outdated data. In this article, we focus on supervised problems in an online nonstationary setting, introducing a novel learner-agnostic algorithm for drift adaptation, namely importance weighting for drift adaptation (IWDA), with the goal of performing efficient retraining of the learner when drift is detected. IWDA incrementally estimates the joint probability density of input and target for the incoming data and, as soon as drift is detected, retrains the learner using importance-weighted empirical risk minimization. The importance weights are computed for all the samples observed so far, employing the estimated densities, thus using all available information efficiently. After presenting our approach, we provide a theoretical analysis in the abrupt drift setting. Finally, we present numerical simulations that illustrate how IWDA competes with and often outperforms state-of-the-art stream learning techniques, including adaptive ensemble methods, on both synthetic and real-world data benchmarks.

Archivio istituzionale della ricerca - Politecnico di Milano
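
The importance-weighted empirical risk minimization step described in the abstract above can be sketched as follows. This is a minimal illustration, not IWDA itself: it assumes the pre-drift and post-drift joint densities are already available as callables (the paper estimates them incrementally), and it uses a hypothetical one-parameter model y = theta * x with squared loss so that the weighted minimizer has a closed form.

```python
def importance_weights(samples, new_density, old_density):
    """Weight each (x, y) sample by the ratio of the post-drift joint density
    to the density it was actually drawn from. Both densities are assumed
    to be given as callables taking (x, y)."""
    return [new_density(x, y) / old_density(x, y) for x, y in samples]


def weighted_least_squares_slope(samples, weights):
    """Importance-weighted ERM for the model y = theta * x under squared loss:
    minimizes sum_i w_i * (y_i - theta * x_i)^2 in closed form."""
    numerator = sum(w * x * y for w, (x, y) in zip(weights, samples))
    denominator = sum(w * x * x for w, (x, y) in zip(weights, samples))
    return numerator / denominator


# Two samples generated before a drift (y = x) and two after it (y = 3x).
data = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0), (2.0, 6.0)]

# Uniform weights ignore the drift; weights that vanish on outdated samples
# recover the post-drift relation.
theta_uniform = weighted_least_squares_slope(data, [1.0, 1.0, 1.0, 1.0])
theta_reweighted = weighted_least_squares_slope(data, [0.0, 0.0, 1.0, 1.0])
```

In this toy case the uniform fit lands between the two regimes (slope 2), while the reweighted fit matches the post-drift slope (3). In practice the weights would come from `importance_weights` applied to estimated densities and would rarely be exactly zero or one.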

Lifelong Hyper-Policy Optimization with Multiple Importance Sampling Regularization

Author: Liotet Pierre
Metelli Alberto Maria
Vidaich Francesco
Restelli Marcello
Publication date: 01/01/2022

Learning in a lifelong setting, where the dynamics continually evolve, is a hard challenge for current reinforcement learning algorithms. Yet this would be a much-needed feature for practical applications. In this paper, we propose an approach which learns a hyper-policy, whose input is time, that outputs the parameters of the policy to be queried at that time. This hyper-policy is trained to maximize the estimated future performance, efficiently reusing past data by means of importance sampling, at the cost of introducing a controlled bias. We combine the future performance estimate with the past performance to mitigate catastrophic forgetting. To avoid overfitting the collected data, we derive a differentiable variance bound that we embed as a penalization term. Finally, we empirically validate our approach, in comparison with state-of-the-art algorithms, on realistic environments, including water resource management and trading.

Archivio istituzionale della ricerca - Politecnico di Milano

Association for the Advancement of Artificial Intelligence: AAAI Publications

Wasserstein Actor-Critic: Directed Exploration via Optimism for Continuous-Actions Control

Author: Metelli Alberto Maria
Restelli Marcello
Likmeta Amarildo
Sacco Matteo
Publication date: 01/01/2023

Uncertainty quantification has been extensively used as a means to achieve efficient directed exploration in Reinforcement Learning (RL). However, state-of-the-art methods for continuous actions still suffer from high sample complexity requirements. Indeed, they either completely lack strategies for propagating the epistemic uncertainty throughout the updates, or they mix it with aleatoric uncertainty while learning the full return distribution (e.g., distributional RL). In this paper, we propose Wasserstein Actor-Critic (WAC), an actor-critic architecture inspired by the recent Wasserstein Q-Learning (WQL), that employs approximate Q-posteriors to represent the epistemic uncertainty and Wasserstein barycenters for uncertainty propagation across the state-action space. WAC enforces exploration in a principled way by guiding the policy learning process with the optimization of an upper bound of the Q-value estimates. Furthermore, we study some peculiar issues that arise when using function approximation, coupled with the uncertainty estimation, and propose a regularized loss for the uncertainty estimation. Finally, we evaluate our algorithm on standard MuJoCo tasks as well as a suite of continuous-action domains, where exploration is crucial, in comparison with state-of-the-art baselines. Additional details and results can be found in the supplementary material with our arXiv preprint.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Association for the Advancement of Artificial Intelligence: AAAI Publications
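
To give a feel for the Wasserstein barycenters and optimistic upper bounds mentioned in the abstract above, the sketch below uses the simplest case: one-dimensional Gaussian posteriors under the 2-Wasserstein distance, where the barycenter is again Gaussian with the weighted average of the means and of the standard deviations. The Gaussian assumption, the function names, and the `mean + k * std` form of the upper bound are illustrative choices, not a description of how WAC is implemented.

```python
def gaussian_w2_barycenter(means, stds, weights):
    """2-Wasserstein barycenter of 1D Gaussians N(mean_i, std_i^2).
    In one dimension the barycenter averages quantile functions, which for
    Gaussians yields a Gaussian with averaged mean and averaged std."""
    total = sum(weights)
    mean = sum(w * m for w, m in zip(weights, means)) / total
    std = sum(w * s for w, s in zip(weights, stds)) / total
    return mean, std


def optimistic_value(mean, std, k=1.0):
    """Hypothetical optimistic estimate: the posterior mean plus k standard
    deviations, used here as a stand-in for an upper bound on the Q-value."""
    return mean + k * std


# Combine two posteriors with equal weight, then take an optimistic estimate.
bary_mean, bary_std = gaussian_w2_barycenter([0.0, 2.0], [1.0, 3.0], [0.5, 0.5])
upper = optimistic_value(bary_mean, bary_std, k=2.0)
```

Averaging standard deviations rather than variances is what distinguishes this barycenter from a moment-matched mixture, so the propagated uncertainty does not collapse as quickly when posteriors are combined.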

Information-Theoretic Regret Bounds for Bandits with Fixed Expert Advice

Author: Metelli Alberto Maria
Cesa-Bianchi Nicolò
Eldowa Khaled
Restelli Marcello
Publication date: 01/01/2023

We investigate the problem of bandits with expert advice when the experts are fixed and known distributions over the actions. Improving on previous analyses, we show that the regret in this setting is controlled by information-theoretic quantities that measure the similarity between experts. In some natural special cases, this allows us to obtain the first regret bound for EXP4 that can get arbitrarily close to zero if the experts are similar enough. For a different algorithm, we provide another bound that describes the similarity between the experts in terms of the KL-divergence, and we show that this bound can be smaller than the one of EXP4 in some cases. Additionally, we provide lower bounds for certain classes of experts, showing that the algorithms we analyzed are nearly optimal in some cases.

Archivio istituzionale della ricerca - Politecnico di Milano

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

A Tale of Sampling and Estimation in Discounted Reinforcement Learning

Author: Restelli Marcello
Metelli Alberto Maria
Mutti Mirco
Publication date: 01/01/2023

The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study on the pitfalls of episodic sampling and on how to do it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis on a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties w.r.t. the alternative estimators, as it matches the lower bound without requiring a careful tuning of the episode horizon.

Archivio istituzionale della ricerca - Politecnico di Milano
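
One simple way to picture sampling under a discount factor, as opposed to truncating episodes at a hand-tuned horizon, is to stop each rollout at a geometrically distributed time. The sketch below does this with independent rollouts; it estimates the normalized discounted mean (1 - gamma) * sum_t gamma^t E[reward(s_t)]. It is an assumption-laden illustration and not necessarily the estimator analyzed in the paper: the `step` and `reward` callables, the function names, and the use of independent rollouts are all choices made for this example.

```python
import random


def sample_geometric_horizon(gamma, rng):
    """Draw T with P(T = t) = (1 - gamma) * gamma**t for t = 0, 1, 2, ..."""
    t = 0
    while rng.random() < gamma:
        t += 1
    return t


def discounted_mean_estimate(step, reward, initial_state, gamma, n_samples, rng):
    """Average reward(s_T) over independent rollouts, each stopped at a
    geometrically distributed time T. No fixed episode horizon is needed."""
    total = 0.0
    for _ in range(n_samples):
        state = initial_state
        for _ in range(sample_geometric_horizon(gamma, rng)):
            state = step(state, rng)  # hypothetical transition function
        total += reward(state)
    return total / n_samples


# Toy chain: the state counts elapsed steps, so reward(s_T) = T and the
# estimate approximates E[T] = gamma / (1 - gamma).
rng = random.Random(0)
estimate = discounted_mean_estimate(
    step=lambda s, r: s + 1,
    reward=lambda s: float(s),
    initial_state=0,
    gamma=0.9,
    n_samples=20000,
    rng=rng,
)
```

For gamma = 0.9 the estimate should be close to 9, the mean of the geometric stopping time.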

Wasserstein Actor-Critic: Directed Exploration via Optimism for Continuous-Actions Control

Author: Restelli Marcello
Metelli Alberto Maria
Sacco Matteo
Likmeta Amarildo
Publication date: 01/01/2023

Archivio istituzionale della ricerca - Politecnico di Milano