General theory of MDPs:
Basic concepts of Markov Decision Processes (MDPs); main types, such as stochastic shortest path, total discounted cost, and average (ergodic) cost problems. Control policies, value functions, Bellman operators and their core properties (monotonicity, contraction), optimality equations, greedy policies, and approximation possibilities. Existence and properties of optimal control policies.
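For orientation, the total discounted cost case can be summarized as follows (cost-minimization convention, illustrative notation): the Bellman operator T acts on value functions by
    (T V)(s) = \min_{a \in A(s)} \Big[ c(s,a) + \alpha \sum_{s'} p(s' \mid s,a)\, V(s') \Big], \qquad \| T V - T W \|_\infty \le \alpha \, \| V - W \|_\infty,
so T is a monotone sup-norm contraction with modulus \alpha \in (0,1), V^* is its unique fixed point (V^* = T V^*), and any policy that is greedy with respect to V^* is optimal.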
Standard RL methods for MDPs:
Main solution approaches for MDPs: iterative approximation of value functions (e.g., value iteration, Q-learning, SARSA); direct search in the space of policies (e.g., policy iteration, policy gradient); the linear programming formulation and its complexity. Generalizations: unbounded costs, partial observability.
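As an illustration of the first approach (not part of the topic list itself), a minimal value iteration sketch for a finite, tabular discounted-cost MDP; the array layout P[a, s, s'] and c[s, a] is an assumption made only for this example:

    # Minimal value iteration sketch for a finite discounted-cost MDP (illustrative only).
    import numpy as np

    def value_iteration(P, c, alpha=0.9, tol=1e-8):
        """P[a, s, s2]: transition probabilities, c[s, a]: one-step costs (assumed layout)."""
        n_states = c.shape[0]
        V = np.zeros(n_states)
        while True:
            # Q[s, a] = c(s, a) + alpha * sum_{s'} p(s' | s, a) V(s')
            Q = c + alpha * np.einsum('asj,j->sa', P, V)
            V_new = Q.min(axis=1)                 # apply the Bellman operator
            if np.max(np.abs(V_new - V)) < tol:   # sup-norm stopping rule
                return V_new, Q.argmin(axis=1)    # value function and a greedy policy
            V = V_new

    # Usage with random data: 2 actions, 3 states.
    rng = np.random.default_rng(0)
    P = rng.random((2, 3, 3)); P /= P.sum(axis=2, keepdims=True)
    c = rng.random((3, 2))
    V_star, pi_star = value_iteration(P, c)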
Temporal difference (TD) learning based policy evaluation: Monte Carlo methods, eligibility traces, TD(0), TD(1), and TD(λ); online, offline, first-visit, and every-visit variants; convergence theorems; optimistic policy improvement. Examples, such as Q-learning for stochastic shortest path (SSP) problems. Off-policy policy evaluation.
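For concreteness, a sketch of the online TD(λ) policy-evaluation loop with accumulating eligibility traces; the env_step sampling interface and the discounted-cost setting are assumptions of this illustration:

    # Tabular TD(lambda) policy evaluation with accumulating eligibility traces (sketch).
    import numpy as np

    def td_lambda(env_step, policy, n_states, alpha=0.9, lam=0.8,
                  step_size=0.05, n_steps=10000, s0=0):
        """env_step(s, a) -> (cost, next_state) is an assumed sampling interface."""
        V = np.zeros(n_states)
        e = np.zeros(n_states)       # eligibility traces
        s = s0
        for _ in range(n_steps):
            a = policy(s)
            cost, s_next = env_step(s, a)
            delta = cost + alpha * V[s_next] - V[s]   # temporal difference error
            e *= alpha * lam                          # decay all traces
            e[s] += 1.0                               # accumulate trace of the current state
            V += step_size * delta * e                # update every state via its trace
            s = s_next
        return V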
Small-sample theory for RL:
Multi-armed bandit problems, the exploration-exploitation trade-off, regret, Hannan consistency, the canonical bandit model, subgaussian bandits, the explore-then-commit algorithm, the upper confidence bound (UCB) algorithm, the optimism principle, regret bounds for UCB, asymptotic optimality of UCB, contextual bandits, stochastic linear bandits, confidence regions for bandit problems, online confidence sets.
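As an illustration, a minimal UCB sketch for a K-armed bandit with 1-subgaussian rewards (reward-maximization convention; the pull(a) sampling interface and the particular confidence width are assumptions of this example, not prescribed by the course):

    # UCB sketch: optimism in the face of uncertainty for a K-armed bandit.
    import numpy as np

    def ucb(pull, n_arms, horizon):
        """pull(a) -> reward sample; confidence width sqrt(2 log t / n_a) (assumed)."""
        counts = np.zeros(n_arms, dtype=int)
        means = np.zeros(n_arms)
        for t in range(1, horizon + 1):
            if t <= n_arms:
                a = t - 1                               # play each arm once first
            else:
                bonus = np.sqrt(2.0 * np.log(t) / counts)
                a = int(np.argmax(means + bonus))       # optimistic index
            r = pull(a)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]      # incremental mean update
        return means, counts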
Large-sample theory for RL:
General adaptive algorithms and stochastic approximation (SA); fixed-point and root-finding problems. Classical SA algorithms and their RL analogues: Robbins-Monro (e.g., Q-learning), Kiefer-Wolfowitz (e.g., policy gradient), SPSA (simultaneous perturbation stochastic approximation), stochastic gradient descent (SGD) and its acceleration methods (e.g., momentum).
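A minimal Robbins-Monro sketch for root finding from noisy observations (the noisy_h interface and the 1/k step sizes are assumptions of this illustration; tabular Q-learning can be viewed as an instance of such a scheme):

    # Robbins-Monro iteration: find theta with h(theta) = 0 from noisy evaluations of h.
    import numpy as np

    def robbins_monro(noisy_h, theta0, n_iter=10000, gamma0=1.0):
        """noisy_h(theta) returns h(theta) plus zero-mean noise."""
        theta = np.asarray(theta0, dtype=float)
        for k in range(1, n_iter + 1):
            gamma_k = gamma0 / k          # step sizes: sum gamma_k = inf, sum gamma_k^2 < inf
            theta = theta + gamma_k * noisy_h(theta)
        return theta

    # Example: estimating a mean as the root of h(theta) = E[X] - theta.
    rng = np.random.default_rng(0)
    est = robbins_monro(lambda th: rng.normal(2.0, 1.0) - th, theta0=0.0)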
Asymptotic analysis of general adaptive algorithms assuming martingale difference noise sequences. Convergence results based on Lyapunov (potential) functions. Applications to classical examples: SGD with a Lipschitz continuous gradient, and Euclidean-norm pseudo-contractions. Convergence analysis based on contraction and monotonicity properties, illustrated through RL examples.
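One standard set of assumptions behind such convergence results, in illustrative notation, is
    \theta_{k+1} = \theta_k + \gamma_k \big( h(\theta_k) + M_{k+1} \big), \qquad \sum_k \gamma_k = \infty, \quad \sum_k \gamma_k^2 < \infty, \qquad \mathbb{E}\big[ M_{k+1} \mid \mathcal{F}_k \big] = 0,
together with a Lyapunov (potential) function U decreasing along the mean dynamics, e.g. \nabla U(\theta)^{\top} h(\theta) < 0 for \theta \neq \theta^*; in the contraction-based analyses the role of U is played by the distance to the fixed point of a pseudo-contraction.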
Recommended literature:
- Bertsekas, D. P. & Tsitsiklis, J. N.: Neuro-Dynamic Programming, Athena Scientific, 1996.
- Lattimore, T. & Szepesvári, Cs.: Bandit Algorithms, Cambridge University Press, 2020.
- Feinberg, E. A. & Shwartz, A. (eds.): Handbook of Markov Decision Processes, Kluwer Academic Publishers, 2002.
- Kushner, H. & Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications, 2nd edition, Springer, 2003.
- Sutton, R. S. & Barto, A. G.: Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018.