Data-driven stochastic optimization in the presence of distributional uncertainty
Author(s)
Lin, Yifan
Abstract
Stochastic optimization is a mathematical framework that models decision making under uncertainty. It usually assumes that the decision maker has full knowledge about the underlying uncertainty through a known probability distribution and minimizes (or maximizes) a functional of the cost (or reward) function. However, the probability distribution of the randomness in the system is rarely known in practice and is often estimated from historical data. The goal of the decision maker is therefore to select the optimal decision under this distributional uncertainty. This thesis aims to address the distributional uncertainty in the context of stochastic optimization by proposing new formulations and devising new approaches.
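For concreteness, the generic problem underlying the thesis can be sketched in its most common expected-value form; the notation below is assumed for illustration and is not taken verbatim from the thesis.

```latex
% Generic data-driven stochastic optimization problem (illustrative notation).
% The decision x is chosen before the randomness \xi is realized, and the distribution
% P_\theta of \xi is known only through a parameter \theta estimated from data.
\[
\min_{x \in \mathcal{X}} \; \mathbb{E}_{\xi \sim P_{\theta}}\!\big[ h(x, \xi) \big],
\qquad \theta \ \text{unknown and estimated from historical data.}
\]
```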
In Chapter 2, we consider stochastic optimization under distributional uncertainty, where the unknown distributional parameter is estimated from streaming data that arrive sequentially over time. Moreover, the data may depend on the decision in effect at the time they are generated. For both decision-independent and decision-dependent uncertainties, we propose an approach that jointly estimates the distributional parameter via the Bayesian posterior distribution and updates the decision by applying stochastic gradient descent (SGD) to the Bayesian average of the objective function. Our approach converges asymptotically over time and achieves the convergence rates of classical SGD in the decision-independent case.
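A minimal sketch of this idea in the decision-independent case is given below, using a toy quadratic objective and a conjugate normal prior; the model, step sizes, and objective are illustrative assumptions, not the thesis implementation.

```python
# Toy sketch: jointly update a Bayesian posterior on the unknown mean theta and run SGD
# on the Bayesian average of h(x, xi) = (x - xi)^2, where xi ~ N(theta, 1).
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0            # unknown distributional parameter
mu, tau2 = 0.0, 10.0        # conjugate normal prior on theta: N(mu, tau2)
x = 0.0                     # decision variable

for t in range(1, 1001):
    xi = rng.normal(theta_true, 1.0)        # streaming observation (decision-independent)
    # closed-form posterior update for a N(theta, 1) likelihood with a normal prior
    prec = 1.0 / tau2 + 1.0
    mu, tau2 = (mu / tau2 + xi) / prec, 1.0 / prec
    # SGD step on the Bayesian average: sample theta from the posterior, then a scenario
    theta_s = rng.normal(mu, np.sqrt(tau2))
    xi_s = rng.normal(theta_s, 1.0)
    x -= (2.0 * (x - xi_s)) / t             # stochastic gradient of (x - xi)^2, step 1/t
```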
In Chapter 3, we deviate from the static stochastic optimization studied in the previous chapters and instead focus on a multistage setting. Specifically, we consider a class of sequential decision-making problems known as multi-armed bandits (MABs). The MAB is an online decision-making problem with limited feedback; it is a class of reinforcement learning (RL) problems in which there is no state transition and the action is a single choice from a fixed and finite set of options. In certain situations, the decision maker may also be provided with contexts (also known as covariates or side information). We consider the contextual MAB with linear payoffs under a risk-averse criterion. At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we adopt mean-variance as the risk criterion, so the best arm is the one with the largest mean-variance reward. We apply the Thompson Sampling algorithm and provide a comprehensive regret analysis for a variant of the proposed algorithm.
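The round structure can be illustrated with a simplified Thompson Sampling loop for linear payoffs and a mean-variance arm score; the Gaussian reward model, the risk weight `rho`, and the per-arm empirical variance estimate are illustrative assumptions rather than the exact algorithm analyzed in the chapter.

```python
# Simplified Thompson Sampling sketch for a linear contextual bandit with a
# mean-variance arm score (illustrative, not the chapter's exact algorithm).
import numpy as np

rng = np.random.default_rng(1)
d, n_arms, rho = 5, 3, 1.0
theta_true = rng.normal(size=d)

B = np.eye(d)                  # posterior precision of the reward parameter (prior N(0, I))
f = np.zeros(d)                # running sum of context * reward
mean_est = np.zeros(n_arms)    # per-arm running mean reward
var_est = np.zeros(n_arms)     # per-arm running reward variance (illustrative risk proxy)
counts = np.zeros(n_arms)

for t in range(2000):
    contexts = rng.normal(size=(n_arms, d))      # contexts revealed for each arm
    theta_s = rng.multivariate_normal(np.linalg.solve(B, f), np.linalg.inv(B))
    scores = contexts @ theta_s - rho * var_est  # mean-variance score per arm
    a = int(np.argmax(scores))
    r = contexts[a] @ theta_true + rng.normal(scale=0.5)
    # update the linear-Gaussian posterior and the per-arm variance estimate
    B += np.outer(contexts[a], contexts[a])
    f += r * contexts[a]
    counts[a] += 1
    delta = r - mean_est[a]
    mean_est[a] += delta / counts[a]
    var_est[a] += (delta * (r - mean_est[a]) - var_est[a]) / counts[a]
```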
In Chapter 4, we consider the multistage stochastic optimization problem in the context of Markov decision processes (MDPs). An MDP is a paradigm for modeling sequential decision making under distributional uncertainty.
In the first half of the chapter, we consider finite-horizon MDPs where parameters, such as the transition probabilities, are unknown and estimated from data. The popular distributionally robust approach to addressing distributional uncertainty can sometimes be overly conservative. We propose a new formulation, the Bayesian risk Markov decision process (BR-MDP), to address distributional uncertainty in MDPs, where a risk functional is applied in nested form to the expected total cost with respect to the Bayesian posterior distributions of the unknown parameters. The proposed formulation provides more flexible risk attitudes towards distributional uncertainty and takes into account the availability of data in future time stages. To solve the proposed formulation with the conditional value-at-risk (CVaR) risk functional, we propose an efficient approximation algorithm by deriving an analytical approximation of the value function and utilizing the convexity of CVaR.
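Schematically, the nested BR-MDP recursion with CVaR can be written as follows; the notation is assumed for illustration and is not copied from the thesis.

```latex
% Schematic BR-MDP value function with CVaR applied with respect to the Bayesian
% posterior \mu_t over the unknown parameter \theta (illustrative notation).
\[
V_t(s_t, \mu_t) \;=\; \min_{a_t \in \mathcal{A}} \;
\mathrm{CVaR}^{\,\theta \sim \mu_t}_{\alpha}
\Big[\, \mathbb{E}_{s_{t+1} \sim P_{\theta}(\cdot \mid s_t, a_t)}
\big[ c(s_t, a_t) + V_{t+1}(s_{t+1}, \mu_{t+1}) \big] \Big],
\]
% where \mu_{t+1} denotes the posterior updated with the transition observed at stage t.
```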
In the second half of the chapter, we consider the infinite-horizon BR-MDP. To solve the infinite-horizon BR-MDP with a class of convex risk measures, we propose a computationally efficient approach based on approximate bilevel difference convex programming. The optimization is performed offline and produces a policy, represented as a finite-state controller, with desirable performance guarantees. We also demonstrate the empirical performance of the infinite-horizon BR-MDP formulation and the proposed algorithms.
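For illustration only, a finite-state controller can be represented as below; this sketch shows the kind of policy object such a solver would output, not the approximate bilevel difference convex programming procedure itself.

```python
# Generic finite-state controller (FSC) representation; the arrays would be produced
# offline by a solver. Shapes: action_probs[k, a] and node_trans[k, a, o, k'].
import numpy as np

class FiniteStateController:
    def __init__(self, action_probs, node_trans, rng=None):
        self.action_probs = action_probs   # per-node distribution over actions
        self.node_trans = node_trans       # per (node, action, observation) next-node distribution
        self.node = 0                      # current controller node
        self.rng = rng or np.random.default_rng()

    def act(self):
        # sample an action from the current node's action distribution
        return int(self.rng.choice(self.action_probs.shape[1], p=self.action_probs[self.node]))

    def update(self, action, observation):
        # move to the next controller node after observing the environment's feedback
        dist = self.node_trans[self.node, action, observation]
        self.node = int(self.rng.choice(dist.shape[0], p=dist))
```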
In Chapter 5, we consider a more general RL setting and focus on improving the sample efficiency of policy optimization algorithms. The success of RL largely depends on the amount of data it can utilize, so the efficient use of historical trajectories obtained from previous policies is essential for expediting policy optimization. Empirical evidence has shown that policy gradient methods based on importance sampling work well; however, the existing literature often neglects the interdependence between trajectories from different iterations, and the good empirical performance lacks a rigorous theoretical justification. In this chapter, we study a variant of the natural policy gradient method that reuses historical trajectories via importance sampling. We show that the bias of the proposed gradient estimator is asymptotically negligible, the resulting algorithm is convergent, and reusing past trajectories helps improve the convergence rate. We further apply the proposed estimator to popular policy optimization algorithms such as trust region policy optimization. Our theoretical results are verified on classical benchmarks.
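A toy sketch of the reuse idea is given below; the one-step Gaussian policy and the plain gradient step are illustrative assumptions, whereas the chapter analyzes a natural policy gradient variant.

```python
# Toy sketch: score-function policy gradient that reuses trajectories generated by
# earlier policies via importance sampling. Policy a ~ N(theta, 1), reward r(a) = -(a - 3)^2.
import numpy as np

rng = np.random.default_rng(2)
theta = 0.0
buffer = []                          # stored (action, reward, policy parameter at generation)

def log_prob(a, th):
    return -0.5 * (a - th) ** 2      # log N(a; th, 1) up to an additive constant

for it in range(200):
    a_new = rng.normal(theta, 1.0, size=8)          # small fresh batch under current policy
    buffer.extend((a, -(a - 3.0) ** 2, theta) for a in a_new)
    buffer = buffer[-200:]                          # keep history from several past policies

    grad = 0.0
    for a, r, th_b in buffer:
        w = np.exp(log_prob(a, theta) - log_prob(a, th_b))   # importance weight
        grad += w * (a - theta) * r                          # reweighted score-function term
    theta += 0.05 * grad / len(buffer)               # plain gradient ascent step
```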
Date
2024-04-09
Resource Type
Text
Resource Subtype
Dissertation