MA661 Dynamic Programming and Reinforcement Learning

Course Catalog Description


The main purpose of this course is to present an introduction to dynamic programming as the most popular methodology for learning and control of dynamic stochastic systems. We discuss basic models, some theoretical results and numerical methods for these problems. They will be developed starting from basic models of dynamical systems, through finite-horizon stochastic problems, to infinite-horizon stochastic models of fully or partially observable systems. Throughout the class special attention will be paid on the application of dynamic programming to statistical learning. The class will include introduction to approximate dynamic programming techniques, which are used in statistical learning, such as tree-based methods for classification, Bayesian learning, etc. The concepts and methods will be illustrated by various applications including learning in stochastic networks, engineering, business, and finance.


Professor Email Office
Darinka Dentcheva

More Information

Course Outcomes

1. Knowledge of the models listed in the syllabus and the modelling ability to identify suitable problem formulation for typical applications.
2. Know and understanding of the notions of value function, decision rules, policy, and dynamic programming equations; knowledge of the different types of decision rules and policies.
3. Ability to formulate and use the principle of optimality, to analyze the optimality equations to gain insight of the optimal policy.
4. Familiarity with the methods of backward induction, value and policy iteration, as well as linear programming techniques for solving the dynamic models discussed in class.
5. Ability to identify the optimal policy for a given problem (as a result of a numerical or theoretical analysis), its value, and interpretation in the context of problem.
6. Know and understand the concept of partial information and the mathematical models addressing partially observable dynamical systems.
7. Ability to properly formulate approximate dynamic programming for learning; feature space equations and model framework.
8. Ability to properly choose a numerical approach to a given model, apply the method correctly, and be aware of its convergence.

Basic knowledge of analysis, linear algebra, and probability is necessary. Large portion of the class will deal with controlled Markov chains, and knowledge of some fundamental notions and results of the theory of Markov chains is useful.
The students who need a review of the theory of Markov chains may consult the following textbooks:
1. Çinlar: “Introduction to Stochastic Processes”, Prentice Hall, 1975 (chapters 5 and 6)
2. R. Grimmett and D.R. Stirzaker, “Probability and Random Processes”, Oxford 1992 (chapter 6)
In fact, almost any book on stochastic processes will provide the necessary basic information.

Textbook : Lecture notes distributed in class.

Supplementary material :
M.L. Puterman, Markov Decision Processes, John Wiley & Sons, 2005
D. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific; 3rd edition, 2005 and Vol. II (Approximate Dynamic Programming), Athena Scientific; 4th edition, 2012
W. B. Powell, Approximate Dynamic Programming, John Wiley & Sons, 2007
None of the books is required. The lectures will draw from many sources.


Grading Policies

Seven-eight assignments constituting 60% of the final score.
Class work 10%
Final Exam 30%

Lecture Outline

Date Topic Reading
Week 1 Sequential decision problems. The principle of optimality and the shortest path.
Week 2 Stochastic discrete-time control problems. Controlled Markov chains
Week 3 Dynamic programming equations. Monotonicity properties.
Week 4 Discounted infinite horizon problems.
Week 5 Value and policy iteration methods. Linear programming approach.
Week 6 Undiscounted infinite horizon problems.
Week 7 Optimal stopping.
Week 8 Average Cost Problem.
Week 9 Multiarmed bandit problems.
Week 10 Partially observable Markov decision problems.
Week 11 Examples
Week 12 Introduction to approximate dynamic programming. Feature space Equations
Week 13 Temporal difference methods. Q-learning.
Week 14 Policy gradient method.