Designing policy optimization algorithms for multi-agent reinforcement learning

Zeng, Sihan
Romberg, Justin
Multi-agent reinforcement learning (RL) studies sequential decision-making in settings where multiple agents coexist in an environment and jointly determine its transitions. The relationship between the agents can be cooperative, competitive, or mixed, depending on how their rewards are aligned. Compared to single-agent RL, multi-agent RL has a unique and complicated structure that has not been fully recognized. The overall objective of this thesis is to deepen the understanding of this structure in various settings and to build reliable and efficient algorithms that exploit and/or respect it.

First, we observe that many data-driven RL algorithms, such as gradient temporal difference learning and actor-critic methods, essentially solve a bi-level optimization problem by tracking an artificial auxiliary variable alongside the decision variable and updating the two at different rates. We propose a two-time-scale stochastic gradient descent method under a special type of gradient oracle, which abstracts these algorithms and their analyses into a unified framework, and we characterize the convergence rates of the two-time-scale gradient algorithm under several structural properties of the objective function common in RL problems. Targeting single-agent RL, this framework builds the mathematical foundation for designing and studying the data-driven multi-agent RL algorithms that we deal with later.

Second, we consider multi-agent RL in the fully cooperative setting, where a connected, decentralized network of agents collaborates to solve multiple RL tasks. Our first problem formulation deploys one agent per task and considers learning a single policy that maximizes the average cumulative return over all tasks. We characterize the key structural differences between multi-task RL and its single-task counterpart, which make multi-task RL a fundamentally more challenging problem.
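The two-time-scale idea described above (an auxiliary variable updated at a fast rate, a decision variable updated at a slow rate) can be sketched on a toy stochastic quadratic problem. This is a minimal illustration under assumed problem data and step sizes, not the thesis's actual method or oracle:

```python
import numpy as np

# Illustrative two-time-scale stochastic gradient iteration: the auxiliary
# variable y tracks the noisy gradient of a fixed quadratic at a fast rate,
# while the decision variable x descends along the tracked gradient at a
# slow rate. All problem data and step-size choices are assumptions.

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d)) / np.sqrt(d)
Q = M @ M.T + np.eye(d)         # strongly convex quadratic f(x) = 0.5 x'Qx - b'x
b = rng.standard_normal(d)
x_star = np.linalg.solve(Q, b)  # exact minimizer, for reference

x = np.zeros(d)
y = np.zeros(d)                 # auxiliary variable tracking grad f(x)
for k in range(1, 50001):
    alpha = 1.0 / k**0.6        # fast time scale (auxiliary update)
    beta = 1.0 / (k + 100)      # slow time scale (decision update)
    grad_sample = Q @ x - b + 0.1 * rng.standard_normal(d)  # noisy oracle
    y += alpha * (grad_sample - y)  # y averages the noisy gradient samples
    x -= beta * y                   # x moves along the tracked gradient

print(np.linalg.norm(x - x_star))   # small tracking error
```

The key design point mirrored here is that `alpha` decays more slowly than `beta`, so the auxiliary variable equilibrates to the current gradient before the decision variable moves appreciably.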
We then extend this formulation by maximizing the average return subject to constraints on the return of each task, which forms a more flexible framework and is potentially more practical for modeling real-life multi-task RL applications. We propose and study decentralized (constrained) policy gradient algorithms for optimizing the objectives in these two formulations and validate our analysis with illustrative numerical simulations.

Third, shifting from cooperative to competitive agents, we study the two-player zero-sum Markov game, a special case of competitive multi-agent RL that is naturally formulated as a nonconvex-nonconcave minimax optimization program, and consider solving it with the simple gradient descent ascent (GDA) algorithm. The nonconvexity/nonconcavity of the underlying objective function poses significant challenges to the analysis of GDA. We introduce strong structure into the Markov game through entropy regularization, apply GDA to the regularized objective, and propose schemes for adjusting the regularization weight so that GDA converges efficiently to the global Nash equilibrium.

The works discussed so far treat RL from the perspective of optimization; in the final chapter, we apply RL to solve optimization problems themselves. Specifically, we develop a multi-agent-RL-based penalty parameter selection method for the alternating current optimal power flow (ACOPF) problem solved via ADMM, with the goal of minimizing the number of iterations until convergence. Our method leads to significantly accelerated ADMM convergence compared to state-of-the-art hand-designed parameter selection schemes and exhibits superior generalizability.
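The effect of entropy regularization on GDA can be illustrated on a toy zero-sum matrix game, a much simpler analogue of the Markov game above. In the sketch below, the game matrix, regularization weight, and step size are all illustrative assumptions; the thesis's actual scheme (including how the regularization weight is adjusted) is not reproduced here:

```python
import numpy as np

# Entropy-regularized matrix game:
#   min_x max_y  x'Ay + tau*sum(x*log x) - tau*sum(y*log y)
# over the probability simplex, solved by simultaneous (mirror-form)
# gradient descent ascent. The regularization makes the problem strongly
# convex-concave, which is what lets plain GDA converge.

rng = np.random.default_rng(1)
n, tau, eta = 4, 1.0, 0.05
A = rng.standard_normal((n, n))

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.ones(n) / n             # min player's mixed strategy
y = np.ones(n) / n             # max player's mixed strategy
for _ in range(5000):
    gx = A @ y + tau * (np.log(x) + 1.0)    # gradient w.r.t. x
    gy = A.T @ x - tau * (np.log(y) + 1.0)  # gradient w.r.t. y
    # simultaneous entropic GDA step, staying on the simplex
    x = softmax(np.log(x) - eta * gx)
    y = softmax(np.log(y) + eta * gy)

# At the regularized Nash equilibrium, x* = softmax(-A y*/tau) and
# y* = softmax(A' x*/tau); report the fixed-point residuals.
print(np.linalg.norm(x - softmax(-A @ y / tau)),
      np.linalg.norm(y - softmax(A.T @ x / tau)))
```

Without the `tau` terms, the same iteration on a matrix game with a mixed equilibrium would cycle rather than converge, which is the behavior the regularization is introduced to repair.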