Theories and Algorithms for Efficient and Robust Sequential Decision Making

Author(s)
Li, Yan
Editor(s)
Associated Organization(s)
Supplementary to:
Abstract
Reinforcement learning refers to the problem of finding the optimal policy of a Markov decision process (MDP) when the underlying transition kernel and cost function are only accessible through samples. This emerging problem class has enabled transformative technologies in healthcare, operations research, and modern machine learning. Before harnessing its potential in broader contexts, we are confronted with a rather unpleasant reality in how this problem is solved. Within current practice: (1) overwhelming computational resources have become a prerequisite; (2) empirical successes are obtained without guarantees, and breakdowns offer no interpretability; (3) learned policies often exhibit brittle robustness, with significant performance degradation under slight environment shifts. These obstacles have collectively prevented the widespread deployment of reinforcement learning, particularly in resource-scarce and high-stakes domains. This thesis aims to develop efficient and scalable first-order methods for solving reinforcement learning problems. The algorithms developed here, termed policy gradient methods, directly use the gradient information of the non-convex objective with respect to the policy, yet offer global convergence guarantees and, oftentimes, optimal computational and statistical complexities. Unlike traditional methods such as Q-learning, the methods developed here also admit natural variants for handling large-scale problems, and are capable of learning robust policies.

In Chapter 2, we design a novel policy gradient method for solving reinforcement learning problems with large state spaces. At each iteration, the computation involves only performing the policy update for a randomly sampled state, and is hence independent of the size of the state space. With a uniform sampling distribution, the total computation of the resulting method is comparable to that of conventional policy gradient methods with a batch update rule. We further show that, with an instance-dependent sampling scheme, the resulting method achieves substantial acceleration over existing alternatives.

In Chapter 3, we develop the first policy gradient method with provable convergence in both the value and policy spaces, thereby addressing an important open problem concerning the computational behavior of policy gradient methods. The developed method adopts a mirror-descent-type policy update with a diminishing, decomposable convex regularizer. In particular, we reveal global linear and local superlinear convergence of the optimality gap. This global-to-local phase transition is subsequently exploited by the diminishing regularization to induce convergence in the policy space. Notably, we show that the limiting policy is precisely the optimal policy with maximal entropy.

In Chapter 4, we proceed to address an important statistical challenge of reinforcement learning, namely exploration of the action space. This refers to the notorious yet inevitable bias of the stochastic gradient around the optimal policy. Existing approaches, such as the $\epsilon$-greedy strategy, offer unsatisfactory patches that yield non-optimal sample complexities. We instead develop a novel construction of the stochastic policy gradient whose bias can be effectively corrected by the policy update. We subsequently establish an optimal sample complexity for the resulting method, even though there is no explicit exploration over the actions.
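To make the policy updates of Chapters 2-4 concrete, a schematic form of the mirror-descent-type step (a sketch under generic choices of stepsize $\eta_k$, regularization weight $\tau_k$, convex regularizer $h$, and Bregman divergence $D$, rather than the exact update analyzed in the thesis) reads
\[
\pi_{k+1}(\cdot \mid s) \in \operatorname*{argmin}_{p \in \Delta_{\mathcal{A}}} \Big\{ \eta_k \big\langle Q^{\pi_k}(s,\cdot),\, p \big\rangle + \tau_k h(p) + D\big(p, \pi_k(\cdot \mid s)\big) \Big\},
\]
where $\Delta_{\mathcal{A}}$ is the probability simplex over actions and $Q^{\pi_k}$ is the action-value function of the current policy (or a stochastic estimate thereof). In Chapter 2 the update is applied only at a randomly sampled state $s$, while in Chapter 3 the diminishing regularization weight $\tau_k$ drives the iterates toward the maximal-entropy optimal policy.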
In Chapter 5, we turn our attention to the problem of learning robust policies. To this end, we investigate the formulation of robust MDPs. Such problems consider finding the optimal robust policy that minimizes the worst-case value function when the underlying transition kernel falls inside an ambiguity set. We introduce a unifying dynamic game formulation that subsumes all existing case-by-case studies of robust MDPs. Notably, this reveals the dynamic nature of robust MDPs, which have mostly been discussed in a static form, and offers a new way of constructing ambiguity sets that admit tractable static formulations. The formulation can also naturally handle ambiguity in the cost function, showing that cost-robust MDPs reduce to standard MDPs with a convex, policy-dependent cost function. We also establish strong duality between the game and static formulations, and discuss issues associated with history-dependent policies.

In Chapter 6, we consider optimizing robust MDPs and subsequently learning robust policies (i.e., robust reinforcement learning). In particular, we design a policy gradient method that performs a mirror descent update to improve the policy at each iteration, with its first-order information constructed using the method to be discussed in Chapter 7. We establish linear convergence when the ambiguity is known and optimal sample complexities when the ambiguity is unknown. Notably, the method introduced here is the first and only scalable method in the literature applicable to solving large-scale robust MDPs.

In Chapter 7, we consider evaluating the worst-case performance of a policy, which can be viewed as a non-concave maximization problem over the kernel ambiguity set. This problem plays an essential role in constructing the first-order information when designing policy gradient methods for robust MDPs. We exploit the dynamic nature of this robust evaluation problem and formulate an MDP of nature whose optimal value function is precisely the worst-case performance of the policy. We then design an efficient policy gradient method for solving the MDP of nature, with optimal computational and statistical complexities. Importantly, the method introduced here is the first and only method capable of incorporating function approximation, thereby addressing an open problem that has stood for over a decade.
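In schematic form, and assuming a discounted infinite-horizon setting for concreteness (an assumption made here only for illustration, with $c$ the cost function, $\gamma \in (0,1)$ the discount factor, $\rho$ the initial state distribution, and $\mathcal{P}$ the kernel ambiguity set), the robust MDP problem of Chapters 5-7 can be written as
\[
\min_{\pi} \; \max_{P \in \mathcal{P}} \; V^{\pi}_{P}(\rho),
\qquad
V^{\pi}_{P}(\rho) := \mathbb{E}^{\pi}_{P}\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t) \;\Big|\; s_0 \sim \rho \Big],
\]
where the inner maximization is the robust evaluation problem studied in Chapter 7, and the outer minimization is carried out by the policy gradient method of Chapter 6.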
Sponsor
Date
2024-07-23
Extent
Resource Type
Text
Resource Subtype
Dissertation
Rights Statement
Rights URI