Constraint Propagation and Reverse Multi-Agent Learning
The development of multi-agent reinforcement learning has been largely driven by the question of how to design learning algorithms that reach some particular notion of optimality of strategies, e.g. Nash equilibria. The set of optimal strategies is not known before the learning algorithm is executed; however, we can often immediately identify a set of clearly undesirable outcomes. Therefore, we propose to consider a dual problem: given a collection of agent algorithms and a collection of unwanted strategy profiles, can one identify a set of starting strategies that invariably lead there? This leads us to study the algorithmic problem of backpropagation of constraints defining the forbidden region by learning dynamics, through the lens of set-valued maps and interval arithmetic. Accepted author manuscript. Interactive Intelligenc
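To make the interval-arithmetic idea concrete, here is a minimal sketch that is not taken from the paper. It assumes a hypothetical two-player coordination game in which each player holds a single probability and follows projected gradient ascent; the payoff gradients `2y - 1` and `2x - 1`, the step size, and all function names are illustrative assumptions. The sketch pushes a whole box of starting strategies forward through the learning map and checks whether the box ends up inside a forbidden box. It propagates sets forward rather than backward, and it ignores floating-point outward rounding.

```python
from typing import Tuple

Interval = Tuple[float, float]


def affine(iv: Interval, a: float, b: float) -> Interval:
    """Exact interval image of t -> a*t + b."""
    lo, hi = a * iv[0] + b, a * iv[1] + b
    return (min(lo, hi), max(lo, hi))


def add(p: Interval, q: Interval) -> Interval:
    """Interval sum."""
    return (p[0] + q[0], p[1] + q[1])


def clip01(iv: Interval) -> Interval:
    """Project an interval onto [0, 1]; clipping is monotone, so this is exact."""
    return (min(1.0, max(0.0, iv[0])), min(1.0, max(0.0, iv[1])))


def step(x: Interval, y: Interval, eta: float = 0.1) -> Tuple[Interval, Interval]:
    """One projected gradient-ascent step applied to a box of strategies.

    Assumed (hypothetical) payoff gradients: 2y - 1 for player 1, 2x - 1 for player 2.
    """
    grad_x = affine(y, 2.0, -1.0)
    grad_y = affine(x, 2.0, -1.0)
    new_x = clip01(add(x, affine(grad_x, eta, 0.0)))
    new_y = clip01(add(y, affine(grad_y, eta, 0.0)))
    return new_x, new_y


def contained(inner: Interval, outer: Interval) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def leads_into(x: Interval, y: Interval,
               forbidden_x: Interval, forbidden_y: Interval,
               steps: int = 50) -> bool:
    """True if every start in the box x × y is enclosed by the forbidden box after `steps` updates."""
    for _ in range(steps):
        x, y = step(x, y)
    return contained(x, forbidden_x) and contained(y, forbidden_y)


if __name__ == "__main__":
    forbidden = (0.9, 1.0)
    print(leads_into((0.6, 0.7), (0.6, 0.7), forbidden, forbidden))
    print(leads_into((0.4, 0.6), (0.4, 0.6), forbidden, forbidden))
```

A box that straddles the unstable interior point cannot be certified, because its image spreads over the whole strategy space; splitting such a box into smaller pieces is the usual remedy.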
Poincaré-Bendixson Limit Sets in Multi-Agent Learning
A key challenge of evolutionary game theory and multi-agent learning is to characterize the limit behavior of game dynamics. Whereas convergence is often a property of learning algorithms in games satisfying a particular reward structure (e.g., zero-sum games), even basic learning models, such as the replicator dynamics, are not guaranteed to converge for general payoffs. Worse yet, chaotic behavior is possible even in rather simple games, such as variants of the Rock-Paper-Scissors game. Although chaotic behavior in learning dynamics can be precluded by the celebrated Poincaré-Bendixson theorem, it is only applicable to low-dimensional settings. Are there other characteristics of a game that can force regularity in the limit sets of learning? We show that behavior consistent with the Poincaré-Bendixson theorem (limit cycles, but no chaotic attractor) can follow purely from the topological structure of the interaction graph, even for high-dimensional settings with an arbitrary number of players and arbitrary payoff matrices. We prove our result for a wide class of follow-the-regularized-leader (FoReL) dynamics, which generalize replicator dynamics, for binary games characterized by interaction graphs where the payoffs of each player are only affected by one other player (i.e., interaction graphs of indegree one). Since chaos occurs already in games with only two players and three strategies, this class of non-chaotic games may be considered maximal.
Moreover, we provide simple conditions under which such behavior translates into efficiency guarantees, implying that FoReL learning achieves time-averaged sum of payoffs at least as good as that of a Nash equilibrium, thereby connecting the topology of the dynamics to social-welfare analysis. Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public. Interactive Intelligenc
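The non-convergence of replicator dynamics mentioned above can be illustrated with a short sketch. It is not the paper's multi-player FoReL setting: it assumes the textbook single-population replicator equation on standard zero-sum Rock-Paper-Scissors, integrated with a hand-written fourth-order Runge-Kutta step. The payoff matrix `RPS`, the step size, and the function names are illustrative assumptions. In this game the product of the three strategy probabilities is conserved along continuous-time orbits, so trajectories cycle around the uniform strategy instead of converging to it.

```python
import math
from typing import List

# Standard zero-sum Rock-Paper-Scissors payoffs (row strategy against column strategy).
RPS = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]


def replicator_rhs(x: List[float], payoff: List[List[float]]) -> List[float]:
    """Right-hand side of dx_i/dt = x_i * ((A x)_i - x^T A x)."""
    n = len(x)
    fitness = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    average = sum(x[i] * fitness[i] for i in range(n))
    return [x[i] * (fitness[i] - average) for i in range(n)]


def rk4_step(x: List[float], payoff: List[List[float]], dt: float) -> List[float]:
    """One classical Runge-Kutta step of the replicator equation."""
    k1 = replicator_rhs(x, payoff)
    k2 = replicator_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)], payoff)
    k3 = replicator_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)], payoff)
    k4 = replicator_rhs([xi + dt * k for xi, k in zip(x, k3)], payoff)
    return [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate(x0: List[float], payoff: List[List[float]],
             dt: float = 0.01, steps: int = 2000) -> List[List[float]]:
    """Return the trajectory, including the starting point."""
    trajectory = [list(x0)]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], payoff, dt))
    return trajectory


def distance_to_uniform(x: List[float]) -> float:
    return math.sqrt(sum((xi - 1.0 / len(x)) ** 2 for xi in x))


if __name__ == "__main__":
    traj = simulate([0.5, 0.3, 0.2], RPS)
    print("closest approach to the uniform strategy:",
          min(distance_to_uniform(x) for x in traj))
```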
Non-chaotic limit sets in multi-agent learning
Non-convergence is an inherent aspect of adaptive multi-agent systems, and even basic learning models, such as the replicator dynamics, are not guaranteed to equilibrate. Limit cycles, and even more complicated chaotic sets, are in fact possible even in rather simple games, including variants of the Rock-Paper-Scissors game. A key challenge of multi-agent learning theory lies in characterizing these limit sets, based on qualitative features of the underlying game. Although chaotic behavior in learning dynamics can be precluded by the celebrated Poincaré–Bendixson theorem, it is only applicable directly to low-dimensional settings. In this work, we attempt to find other characteristics of a game that can force regularity in the limit sets of learning. We show that behavior consistent with the Poincaré–Bendixson theorem (limit cycles, but no chaotic attractor) follows purely from the topological structure of interactions, even for high-dimensional settings with an arbitrary number of players, and arbitrary payoff matrices. We prove our result for a wide class of follow-the-regularized-leader (FoReL) dynamics, which generalize replicator dynamics, for binary games characterized by interaction graphs where the payoffs of each player are only affected by one other player (i.e., interaction graphs of indegree one). Moreover, for cyclic games we provide further insight into the planar structure of limit sets, and in particular limit cycles.
We propose simple conditions under which learning comes with efficiency guarantees, implying that FoReL learning achieves time-averaged sum of payoffs at least as good as that of a Nash equilibrium, thereby connecting the topology of the dynamics to social-welfare analysis. Interactive Intelligenc
Influence-Based Abstraction in Deep Reinforcement Learning
thousands, or even millions of state variables. Unfortunately, applying reinforcement learning algorithms to handle complex tasks becomes more and more challenging as the number of state variables increases. In this paper, we build on the concept of influence-based abstraction which tries to tackle such scalability issues by decomposing large systems into small regions. We explore this method in the context of deep reinforcement learning, showing that by keeping track of a small set of variables in the history of previous actions and observations we can learn policies that can effectively control a local region in the global system. Interactive Intelligenc
Alternating Maximization with Behavioral Cloning
The key difficulty of cooperative, decentralized planning lies in making accurate predictions about the behavior of one’s teammates. In this paper we introduce a planning method of Alternating maximization with Behavioural Cloning (ABC) – a trainable online decentralized planning algorithm based on Monte Carlo Tree Search (MCTS), combined with models of teammates learned from previous episodic runs. Our algorithm relies on the idea of alternating maximization, where agents adapt their models one at a time in round-robin manner. Under the assumption of perfect policy cloning, and with a sufficient amount of Monte Carlo samples, successive iterations of our method are guaranteed to improve joint policies, and eventually converge. Interactive Intelligenc
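The round-robin idea can be sketched on a toy problem. The example below is not the ABC algorithm: it replaces MCTS and learned teammate models with exact best responses over a small, made-up shared-reward table for two agents. The table values and the name `alternating_maximization` are illustrative assumptions. Because only one agent changes its action at a time and the reward is shared, the joint reward never decreases, and the loop stops at a profile from which no single agent can improve.

```python
from typing import List, Sequence, Tuple


def alternating_maximization(reward: Sequence[Sequence[float]],
                             start: Tuple[int, int],
                             max_rounds: int = 100) -> Tuple[Tuple[int, int], List[float]]:
    """Round-robin best responses on a shared-reward table reward[a0][a1].

    Returns the final joint action and the joint reward recorded after each update.
    """
    joint = list(start)
    n_actions = (len(reward), len(reward[0]))
    history = [reward[joint[0]][joint[1]]]
    for _ in range(max_rounds):
        changed = False
        for agent in (0, 1):
            def value(action: int) -> float:
                # Joint reward if `agent` plays `action` and the teammate stays fixed.
                trial = joint.copy()
                trial[agent] = action
                return reward[trial[0]][trial[1]]

            best = max(range(n_actions[agent]), key=value)
            if value(best) > value(joint[agent]):
                joint[agent] = best
                changed = True
            history.append(reward[joint[0]][joint[1]])
        if not changed:
            break  # no agent can improve unilaterally
    return (joint[0], joint[1]), history


if __name__ == "__main__":
    # Made-up team rewards with several coordination points of different quality.
    table = [[5.0, 0.0, 0.0],
             [0.0, 3.0, 0.0],
             [0.0, 0.0, 10.0]]
    print(alternating_maximization(table, (0, 2)))
    print(alternating_maximization(table, (1, 1)))
```

Starting from a poor coordination point, the procedure can stay there: the guarantee is monotone improvement and convergence, not global optimality.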
Decentralized MCTS via Learned Teammate Models
Decentralized online planning can be an attractive paradigm for cooperative multi-agent systems, due to improved scalability and robustness. A key difficulty of such an approach lies in making accurate predictions about the decisions of other agents. In this paper, we present a trainable online decentralized planning algorithm based on decentralized Monte Carlo Tree Search, combined with models of teammates learned from previous episodic runs. By only allowing one agent to adapt its models at a time, under the assumption of ideal policy approximation, successive iterations of our method are guaranteed to improve joint policies, and eventually lead to convergence to a Nash equilibrium. We test the efficiency of the algorithm by performing experiments in several scenarios of the spatial task allocation environment introduced in [Claes et al., 2015]. We show that deep learning and convolutional neural networks can be employed to produce accurate policy approximators which exploit the spatial features of the problem, and that the proposed algorithm improves over the baseline planning performance for particularly challenging domain configurations. Virtual/online event due to COVID-19, moved to January 2021. Interactive Intelligenc
Safe Multi-agent Learning via Trapping Regions
One of the main challenges of multi-agent learning lies in establishing convergence of the algorithms, as, in general, a collection of individual, self-serving agents is not guaranteed to converge in their joint policy when learning concurrently. This is in stark contrast to most single-agent environments, and sets a prohibitive barrier for deployment in practical applications, as it induces uncertainty in the long-term behavior of the system. In this work, we apply the concept of trapping regions, known from the qualitative theory of dynamical systems, to create safety sets in the joint strategy space for decentralized learning. We propose a binary partitioning algorithm for verification that candidate sets form trapping regions in systems with known learning dynamics, and a heuristic sampling algorithm for scenarios where learning dynamics are not known. We demonstrate the applications to a regularized version of the Dirac Generative Adversarial Network, a four-intersection traffic control scenario run in SUMO, a state-of-the-art open-source microscopic traffic simulator, and a mathematical model of economic competition. Interactive Intelligenc
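A rough sketch of how binary partitioning can verify a trapping region when the dynamics are known is given below. It is an illustration under stated assumptions, not the paper's algorithm: the candidate set is an axis-aligned box in the plane, the vector field `damped_field` (f(x, y) = (-x + y - y³, -y + x - x³)) is hypothetical, and interval bounds are computed without outward rounding. Each face of the box is checked for a strictly inward-pointing field; when the interval bound is inconclusive, the face is bisected and both halves are checked recursively.

```python
from typing import Callable, Tuple

Interval = Tuple[float, float]
Field = Callable[[Interval, Interval], Tuple[Interval, Interval]]


def add(p: Interval, q: Interval) -> Interval:
    return (p[0] + q[0], p[1] + q[1])


def neg(p: Interval) -> Interval:
    return (-p[1], -p[0])


def mul(p: Interval, q: Interval) -> Interval:
    corners = [p[0] * q[0], p[0] * q[1], p[1] * q[0], p[1] * q[1]]
    return (min(corners), max(corners))


def damped_field(x: Interval, y: Interval) -> Tuple[Interval, Interval]:
    """Interval enclosure of the hypothetical field (-x + y - y^3, -y + x - x^3)."""
    fx = add(neg(x), add(y, neg(mul(mul(y, y), y))))
    fy = add(neg(y), add(x, neg(mul(mul(x, x), x))))
    return fx, fy


def face_points_inward(field: Field, x: Interval, y: Interval,
                       axis: int, sign: int, depth: int = 20) -> bool:
    """Check that sign * f[axis] < 0 on a face piece, bisecting when the bound is inconclusive."""
    lo, hi = field(x, y)[axis]
    out_lo, out_hi = (lo, hi) if sign > 0 else (-hi, -lo)
    if out_hi < 0.0:
        return True   # the whole piece points strictly inward
    if out_lo >= 0.0 or depth == 0:
        return False  # the whole piece fails, or the bisection budget is exhausted
    if axis == 0:     # face of constant x: split along y
        mid = 0.5 * (y[0] + y[1])
        halves = [(x, (y[0], mid)), (x, (mid, y[1]))]
    else:             # face of constant y: split along x
        mid = 0.5 * (x[0] + x[1])
        halves = [((x[0], mid), y), ((mid, x[1]), y)]
    return all(face_points_inward(field, hx, hy, axis, sign, depth - 1)
               for hx, hy in halves)


def is_trapping_box(field: Field, x: Interval, y: Interval) -> bool:
    """True if the field is verified to point strictly inward on all four faces."""
    faces = [((x[0], x[0]), y, 0, -1), ((x[1], x[1]), y, 0, +1),
             (x, (y[0], y[0]), 1, -1), (x, (y[1], y[1]), 1, +1)]
    return all(face_points_inward(field, fx, fy, axis, sign)
               for fx, fy, axis, sign in faces)


if __name__ == "__main__":
    print(is_trapping_box(damped_field, (-1.0, 1.0), (-1.0, 1.0)))
    print(is_trapping_box(lambda x, y: (x, y), (-1.0, 1.0), (-1.0, 1.0)))
```

For the cubic field the naive interval bound over a whole face is too loose to decide, so the verification only succeeds after the faces are split; this is the situation in which partitioning pays off.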
Using Game Theory to Analyse Local and Global Performance of Traffic Signal Control Strategies
One of the most important bottlenecks contributing to traffic congestion is non-optimal traffic signal control. Techniques investigated to optimise traffic signal control have focused on improving the traffic flow through individual intersections. However, if intersections are optimised based only on reducing local congestion, this could introduce congestion in other places. Therefore, this paper introduces a normal form game that can be used to analyse the impact on local and global performance when optimising traffic signal control. Results obtained from a realistic traffic simulation suggest that it is sometimes possible for each player to choose a traffic signal control strategy that optimises its own welfare but also maximises the social welfare. The results also indicate that sub-optimal traffic signal control strategy profiles can become optimal when exposed to certain traffic intensities. CSE3000 Research Project. Computer Science and Engineerin
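The kind of local-versus-global analysis a normal form game allows can be sketched as follows. The payoff numbers are made up for illustration and do not come from the paper's simulation; the two strategies per intersection (labelled "fixed" and "adaptive" here) and the function names are likewise assumptions. The sketch enumerates pure Nash equilibria of a two-player game and compares them with the profile that maximises the sum of payoffs.

```python
from typing import List, Sequence, Tuple

Profile = Tuple[int, int]


def pure_nash_equilibria(a: Sequence[Sequence[float]],
                         b: Sequence[Sequence[float]]) -> List[Profile]:
    """Profiles (i, j) where neither player gains by deviating unilaterally.

    a[i][j] is the payoff of player 1 and b[i][j] the payoff of player 2.
    """
    rows, cols = len(a), len(a[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            row_best = all(a[i][j] >= a[k][j] for k in range(rows))
            col_best = all(b[i][j] >= b[i][l] for l in range(cols))
            if row_best and col_best:
                equilibria.append((i, j))
    return equilibria


def social_optimum(a: Sequence[Sequence[float]],
                   b: Sequence[Sequence[float]]) -> Profile:
    """Profile with the largest sum of payoffs (first one found on ties)."""
    profiles = [(i, j) for i in range(len(a)) for j in range(len(a[0]))]
    return max(profiles, key=lambda p: a[p[0]][p[1]] + b[p[0]][p[1]])


if __name__ == "__main__":
    # Hypothetical payoffs for two neighbouring intersections.
    # Strategy 0 = "fixed" timing, strategy 1 = "adaptive" timing.
    payoff_1 = [[3.0, 1.0],
                [4.0, 2.0]]
    payoff_2 = [[3.0, 4.0],
                [1.0, 2.0]]
    print("pure Nash equilibria:", pure_nash_equilibria(payoff_1, payoff_2))
    print("social optimum:", social_optimum(payoff_1, payoff_2))
```

With these invented numbers the only equilibrium differs from the social optimum, which is the local-optimisation pitfall described above; other payoff tables make the two coincide.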
Exploring the Effects of Conditioning Independent Q-Learners on the Sufficient Statistic for Dec-POMDPs
In this study, we investigate the effects of conditioning Independent Q-Learners (IQL) not solely on the individual action-observation history, but additionally on the sufficient plan-time statistic for Decentralized Partially Observable Markov Decision Processes. In doing so, we attempt to address a key shortcoming of IQL, namely that it is likely to converge to a Nash Equilibrium that can be arbitrarily poor. We identify a novel exploration strategy for IQL when it conditions on the sufficient statistic, and furthermore show that sub-optimal equilibria can be escaped consistently by sequencing the decision-making during learning. The practical limitation is the exponential complexity of both the sufficient statistic and the decision rules. Interactive Intelligenc
Evaluating Design Choices in Tripartite Graph-Based Recommender Systems to Improve Long Tail Recommendations
Even though the ability to recommend items in the long tail is one of the main strengths of recommendation systems, modern models still show decreased performance when recommending these niche items. Various bipartite and tripartite graph-based models have been proposed that are specifically tailored to solving this long tail issue. This study aims to investigate the effect of the design of the additional layer introduced by tripartite graph-based recommender systems on their performance. All options available in the MovieLens 1M dataset are evaluated on recall and diversity. Experimental results suggest that tripartite graphs based on latent information describing the users perform better than ones utilising item-based latent information, but both these options hardly outperform the baseline bipartite model. Regardless of the graph used, normalising the transition matrix is found to significantly increase performance. It is hypothesised that larger user-focused additional layers show increased diversity over smaller options when normalised. Issues regarding the reproducibility of previous research are identified and addressed, and the development of unified evaluation metrics is advocated to prevent such problems in the future. CSE3000 Research Project. Computer Science and Engineerin
- …
