Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate
In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target. Specifically, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, k-nearest neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free, as it requires neither estimating the state distribution of any policy nor modeling the transition dynamics. We then show empirically that MEPOL learns a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and that this policy facilitates learning meaningful reward-based tasks downstream.
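To make the "k-nearest neighbors estimate of the state distribution entropy" concrete, here is a minimal sketch of a plain Kozachenko-Leonenko style k-NN entropy estimator. It is not MEPOL itself: the abstract does not give the estimator's exact form, and the function names, the brute-force neighbor search, and the toy data are all assumptions made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_entropy(samples, k=4):
    """k-NN (Kozachenko-Leonenko style) estimate of differential entropy in nats.

    samples: list of equal-length tuples of floats (hypothetical "states").
    Assumes no duplicate points, since a zero distance would break the log.
    """
    n = len(samples)
    d = len(samples[0])
    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    log_radii = 0.0
    for i, x in enumerate(samples):
        # Brute-force distances to every other sample (fine for small n).
        dists = sorted(math.dist(x, y) for j, y in enumerate(samples) if j != i)
        log_radii += math.log(dists[k - 1])  # distance to the k-th neighbor
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * log_radii / n


random.seed(0)
wide = [(random.random(), random.random()) for _ in range(300)]
narrow = [(0.1 * a, 0.1 * b) for a, b in wide]  # same points, shrunk 10x
print(knn_entropy(wide), knn_entropy(narrow))
```

Shrinking the samples by a factor of 10 in two dimensions lowers the estimate by 2 log 10, which is the behavior one expects of an entropy estimate: states spread over a larger region score higher.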
A Tale of Sampling and Estimation in Discounted Reinforcement Learning
The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to do it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process has compelling statistical properties compared with the alternative estimators, as it matches the lower bound without requiring careful tuning of the episode horizon.
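One standard way to draw a state from a discounted state distribution is to stop the chain at each step with probability 1 - gamma, so no episode horizon has to be chosen. The sketch below illustrates that idea on a two-state chain and compares the Monte Carlo estimate with the exact normalized discounted mean. The chain, the function f, and all names are hypothetical, and the abstract does not confirm that this is the exact estimator the paper analyzes.

```python
import random

# Hypothetical two-state Markov chain used only for illustration.
P = [[0.9, 0.1], [0.2, 0.8]]   # transition probabilities
mu = [1.0, 0.0]                # initial state distribution
f = [0.0, 1.0]                 # function whose discounted mean we want
gamma = 0.9                    # discount factor


def exact_discounted_mean(P, mu, f, gamma, steps=2000):
    """(1 - gamma) * sum_t gamma^t * E[f(s_t)], truncated after `steps` terms."""
    dist = list(mu)
    total = 0.0
    for t in range(steps):
        total += (1 - gamma) * gamma**t * sum(p * v for p, v in zip(dist, f))
        # Push the state distribution one step forward: dist <- dist P.
        dist = [sum(dist[i] * P[i][j] for i in range(len(P)))
                for j in range(len(P))]
    return total


def sample_discounted_state(P, mu, gamma, rng):
    """Run the chain, continuing with probability gamma at every step.

    The stopping time T satisfies P(T = t) = (1 - gamma) * gamma^t, so the
    returned state follows the normalized discounted state distribution.
    """
    state = rng.choices(range(len(mu)), weights=mu)[0]
    while rng.random() < gamma:
        state = rng.choices(range(len(P)), weights=P[state])[0]
    return state


rng = random.Random(0)
n_samples = 20000
estimate = sum(f[sample_discounted_state(P, mu, gamma, rng)]
               for _ in range(n_samples)) / n_samples
exact = exact_discounted_mean(P, mu, f, gamma)
print(estimate, exact)
```

For this particular chain the exact value works out to 9/37, and the sampled estimate should land close to it without any horizon parameter being tuned.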
Challenging Common Assumptions in Convex Reinforcement Learning
The classic Reinforcement Learning (RL) formulation concerns the maximization
of a scalar reward function. More recently, convex RL has been introduced to
extend the RL formulation to all the objectives that are convex functions of
the state distribution induced by a policy. Notably, convex RL covers several
relevant applications that do not fall into the scalar formulation, including
imitation learning, risk-averse RL, and pure exploration. In classic RL, it is
common to optimize an infinite trials objective, which accounts for the state
distribution instead of the empirical state visitation frequencies, even though
the actual number of trajectories is always finite in practice. This is
theoretically sound since the infinite trials and finite trials objectives can
be proved to coincide and thus lead to the same optimal policy. In this paper,
we show that this hidden assumption does not hold in the convex RL setting. In
particular, we show that erroneously optimizing the infinite trials objective
in place of the actual finite trials one, as it is usually done, can lead to a
significant approximation error. Since the finite trials setting is the default
in both simulated and real-world RL, we believe shedding light on this issue
will lead to better approaches and methodologies for convex RL, impacting
relevant research areas such as imitation learning, risk-averse RL, and pure
exploration, among others. Comment: NeurIPS 202
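The gap between the infinite trials and finite trials objectives can be seen in a toy calculation. The sketch below assumes a deliberately simplified setting that is not taken from the paper: each trial visits a single state out of two, drawn independently, and the objective is either a linear reward or the state entropy (the pure-exploration case). All names are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability states."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def linear_reward(dist):
    """Classic RL objective: expected reward, with reward 1 in state 1."""
    return dist[1]


def infinite_trials_value(objective, p):
    """Objective applied to the true state distribution (1 - p, p)."""
    return objective([1 - p, p])


def finite_trials_value(objective, p, n):
    """Expected objective of the empirical distribution over n trials.

    Toy assumption: each trial visits one state, state 1 with probability p,
    so the visit count of state 1 is Binomial(n, p).
    """
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p**k * (1 - p) ** (n - k)
        total += prob * objective([1 - k / n, k / n])
    return total


for n in (1, 2, 10):
    print(n,
          finite_trials_value(linear_reward, 0.5, n),
          finite_trials_value(entropy, 0.5, n))
print("infinite",
      infinite_trials_value(linear_reward, 0.5),
      infinite_trials_value(entropy, 0.5))
```

With the linear reward the two values agree for every n, whereas with the entropy the single-trial value is 0 while the infinite trials value is log 2. This mirrors, in miniature, the mismatch the abstract describes for objectives that are not linear in the state distribution.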
Configurable Markov Decision Processes
In many real-world problems, it is possible to configure, to a limited extent, some environmental parameters to improve the performance of a learning agent. In this paper, we propose a novel framework, Configurable Markov Decision Processes (Conf-MDPs), to model this new type of interaction with the environment. Furthermore, we provide a new learning algorithm, Safe Policy-Model Iteration (SPMI), to jointly and adaptively optimize the policy and the environment configuration. After introducing our approach and deriving some theoretical results, we present an experimental evaluation on two illustrative problems to show the benefits of environment configurability for the performance of the learned policy.
The Importance of Non-Markovianity in Maximum State Entropy Exploration
In the maximum state entropy exploration framework, an agent interacts with a
reward-free environment to learn a policy that maximizes the entropy of the
expected state visitations it is inducing. Hazan et al. (2019) noted that the
class of Markovian stochastic policies is sufficient for the maximum state
entropy objective, and exploiting non-Markovianity is generally considered
pointless in this setting. In this paper, we argue that non-Markovianity is
instead paramount for maximum state entropy exploration in a finite-sample
regime. Specifically, we recast the objective to target the expected entropy of
the induced state visitations in a single trial. Then, we show that the class
of non-Markovian deterministic policies is sufficient for the introduced
objective, while Markovian policies suffer non-zero regret in general. However,
we prove that the problem of finding an optimal non-Markovian policy is
NP-hard. Despite this negative result, we discuss avenues to address the
problem in a tractable way and how non-Markovian exploration could benefit the
sample efficiency of online reinforcement learning in future work. Comment: ICML 202
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
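The difference between the two counting schemes can be sketched in a few lines. The abstract does not spell out the exact counting rules, so this sketch assumes that each pair of authors is counted at most once per citing paper and that co-authors of the same cited work also count as co-cited. The data and names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors):
    """Count author co-citations, using the first `max_authors` of each cited work.

    citing_papers: list of reference lists; each reference is a list of
    author names in byline order. max_authors=1 gives first-author counting.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Assumption: each author pair counts once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Two hypothetical citing papers, each with two references.
citing_papers = [
    [["Smith", "Lee"], ["Chen"]],
    [["Lee", "Smith"], ["Chen", "Park"]],
]

first_author = cocitation_counts(citing_papers, max_authors=1)
first_five = cocitation_counts(citing_papers, max_authors=5)
print(dict(first_author))
print(dict(first_five))
```

In this toy data, first-author counting sees Smith and Lee as never co-cited, while counting the first five authors links them through both citing papers.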
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance. Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
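For readers who want to see the two best-known measures side by side, the sketch below computes the Pearson correlation and the cosine for a pair of hypothetical cocitation profiles. It also shows one well-known property: appending entries where both profiles are zero changes the Pearson correlation but leaves the cosine untouched. The abstract does not state the authors' own argument, so this property should not be read as theirs, and the two new measures they propose are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant inputs)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cx = [a - mean_x for a in x]
    cy = [b - mean_y for b in y]
    dot = sum(a * b for a, b in zip(cx, cy))
    return dot / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))


def cosine(x, y):
    """Cosine of the angle between two equal-length, non-zero profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


# Hypothetical cocitation counts of two authors with three other authors.
x = [1, 2, 3]
y = [3, 2, 1]

# The same profiles after adding three authors co-cited with neither.
x_padded = x + [0, 0, 0]
y_padded = y + [0, 0, 0]

print(pearson(x, y), pearson(x_padded, y_padded))
print(cosine(x, y), cosine(x_padded, y_padded))
```

Here the Pearson correlation moves from -1 to 0.5 once the zero entries are added, while the cosine stays at 10/14 in both cases.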
Dispelling the Myths Behind First-author Citation Counts
We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how much the author rankings resulting from different citation counting methods actually differ, and to demonstrate the capability of emerging data and tools on the Web to support more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and the high costs of using more realistic citation counting methods that are not well supported by the ISI databases. It is argued that increasingly available digital full-text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods.
