Reinforcement learning (RL) builds on the idea that an agent learns an optimal behavior through iterative interaction with an environment. In model-free reinforcement learning, the agent has no access to a model of the world (i.e., to the reward process and environment dynamics); its task is to deduce an optimal behavior, i.e., a policy, solely from the observations and rewards it receives as a consequence of its actions. Model-free RL has achieved strong results in a variety of tasks, including robotics control (Schulman, 2015), supply chain management (Valluri, 2009), helicopter control (Kim, 2003), and games (Mnih, 2013).
However, these remarkable results come at the cost of sample efficiency. In practice, learning good controllers for such applications requires a good balance between exploration and exploitation, and in many cases simulators are not available or not accurate enough. Exploration in the real world, in turn, is not a viable option for many use cases. Consider, for instance, the helicopter application from above. Finding a good policy that successfully flies complex maneuvers (without any prior knowledge or expert behavior) requires exploration, which can easily damage the real helicopter (Kim, 2003). In practice, such policies are usually generated through imitation learning of expert behavior. Here, we make use of either behavioral cloning (Sakmann and Gruben-Trejo, 2016), i.e., we try to copy the expert, or inverse reinforcement learning (Abbeel, 2012), i.e., we try to estimate the reward function that the expert is optimizing. However, the performance of the policies derived from both of these methods is bounded by the performance of the underlying expert policies.
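The idea of copying the expert can be illustrated with a minimal sketch of behavioral cloning in a tabular setting. The state and action names and the demonstration data below are hypothetical and chosen only for illustration; a practical implementation would fit a function approximator instead of a lookup table.

```python
from collections import Counter, defaultdict

def behavioral_cloning(demonstrations):
    """Return a state -> action lookup that copies the expert's most frequent action."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Hypothetical expert demonstrations as (state, action) pairs.
expert_data = [
    ("hover", "hold"), ("hover", "hold"), ("hover", "climb"),
    ("drift_left", "bank_right"), ("drift_right", "bank_left"),
]
policy = behavioral_cloning(expert_data)
```

The cloned policy reproduces the expert where data exists and has no entry for a state the expert never visited (`policy.get("stall")` returns `None`), which is one way to see why its performance is bounded by the expert.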
Offline reinforcement learning (Levine, 2020; Kumar, 2019; Seita, 2020) approaches a similar problem in a different way. The idea of offline RL is to use existing rollouts/trajectories that have been generated under many sub-optimal baseline policies, i.e., “good” controllers, in order to derive an overall better policy. A practical example comes from autonomous driving (Levine, 2019). Here, we aim to use rollouts (state-action-reward sequences) from different controllers (drivers) to learn a policy that drives better than the baseline controllers. Offline RL uses only existing data and turns the RL problem into a supervised learning setting (i.e., data-driven reinforcement learning). It has received increasing interest lately (Fujimoto, 2019a; Fujimoto, 2019b; Agarwal, 2020), as rollouts are becoming available in a growing variety of applications.
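As a rough illustration of learning from logged rollouts only, the following sketch runs naive tabular Q-learning over a fixed dataset. The chain environment, the random baseline policy, and all names and hyperparameters are assumptions made for this example; it does not include the corrections that dedicated offline RL algorithms apply on top of this naive scheme.

```python
import random
from collections import defaultdict

# Hypothetical 5-state chain: action 1 moves right, action 0 moves left.
# Reaching the last state yields reward 1 and ends the episode.
GOAL, GAMMA = 4, 0.9

def step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def collect_rollouts(n_episodes, p_right, rng, max_steps=50):
    """Log (s, a, r, s', done) transitions from a sub-optimal random baseline."""
    data = []
    for _ in range(n_episodes):
        state = 0
        for _ in range(max_steps):
            action = 1 if rng.random() < p_right else 0
            next_state, reward, done = step(state, action)
            data.append((state, action, reward, next_state, done))
            state = next_state
            if done:
                break
    return data

def offline_q_learning(data, sweeps=200, lr=0.5):
    """Sweep the fixed dataset repeatedly; the environment is never queried."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s2, done in data:
            target = r if done else r + GAMMA * max(q[(s2, 0)], q[(s2, 1)])
            q[(s, a)] += lr * (target - q[(s, a)])
    return q

rng = random.Random(0)
dataset = collect_rollouts(200, p_right=0.5, rng=rng)
q_table = offline_q_learning(dataset)
# Greedy action per non-terminal state, derived purely from the logged data.
greedy_policy = [max((0, 1), key=lambda a: q_table[(s, a)]) for s in range(GOAL)]
```

In this toy setting the baseline moves right only half of the time, while the greedy policy extracted from its rollouts moves right in every state. With function approximation and limited data coverage, the naive approach is generally less reliable.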
Task definition
The goal of this thesis is to evaluate existing offline reinforcement learning algorithms on a variety of available environments in the Coach framework (Caspi, 2017). First, the thesis should outline the requirements for the development of offline RL (e.g., interfaces to monitoring applications such as Tensorboard (https://www.tensorflow.org/tensorboard), MLflow (https://mlflow.org), Coach Dashboard (https://nervanasystems.github.io/coach/dashboard.html), OpenShift with ML Lifecycle (https://www.openshift.com/learn/topics/ai-ml), or others; data handling; access to databases; etc.) and evaluate the algorithms implemented in Coach against available implementations in other frameworks. Second, the thesis should integrate publicly available offline RL algorithms (such as Naïve RL (e.g., DQN, QR-DQN (https://nervanasystems.github.io/coach/components/agents/value_optimization/qr_dqn.html)), batch-constrained Q-learning (BCQ) (Fujimoto, 2019a), bootstrapping error accumulation reduction (BEAR) (Kumar, 2019), behavioral cloning baselines, etc.), select reasonable environments that pose different challenges, and implement relevant metrics to benchmark the algorithms. Third, the thesis should evaluate and interpret experimental results from the offline RL algorithms on the selected environments using the implemented metrics. As a last step, the thesis should evaluate the algorithms on available rollouts generated by policies that are used in an autonomous driving application in the Carla simulator. As only the datasets are used, there is no need to implement an interface to Carla.
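One simple candidate for such a metric is the average discounted return over a set of rollouts. The sketch below assumes each rollout is given as a plain list of rewards; the function names and the discount factor are illustrative choices, not part of Coach or any other framework.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma per time step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def mean_return(rollouts, gamma=0.99):
    """Average discounted return over a non-empty list of reward sequences."""
    returns = [discounted_return(rewards, gamma) for rewards in rollouts]
    return sum(returns) / len(returns)
```

For example, `mean_return([[1, 1, 1], [0, 0, 4]], gamma=0.5)` averages the returns 1.75 and 1.0 and yields 1.375.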
Milestones:
- Literature review (batch/offline RL baseline algorithms, Coach environment); compare the capabilities of Coach to those of other frameworks that allow for offline RL (on the basis of their specifications, not through implementation and actual benchmarks)
- Selecting benchmark environments and configurations
- Design and implementation of a benchmarking suite
- Integrating baseline algorithms and Coach into the benchmarking suite
- Designing and running experiments to evaluate the behavior of the offline RL algorithms
- Evaluating Coach and offline RL on rollouts from an autonomous driving application that runs in the Carla simulator
Requirements: Python programming, RL.
Supervisors: Dr. Georgios Kontes (IIS), Dr.-Ing. Christopher Mutschler (IIS)
References
- John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan and Pieter Abbeel: Trust Region Policy Optimization. International Conference on Machine Learning. 2015.
- Kaspar Sakmann and Michael Gruben-Trejo: Behavioral Cloning. Available online: https://github.com/ksakmann/CarND-BehavioralCloning. 2016.
- Annapurna Valluri, Michael J. North, and Charles M. Macal: Reinforcement Learning in Supply Chains. International Journal of Neural Systems. Vol. 19, No. 05. pp. 331-334. 2009.
- H. Jin Kim, Michael I. Jordan, Shankar Sastry and Andrew Y. Ng: Autonomous Helicopter Flight via Reinforcement Learning. In Advances in Neural Information Processing Systems (NIPS). 2003.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller: Playing Atari with Deep Reinforcement Learning. In NIPS Deep Learning Workshop. 2013.
- Pieter Abbeel: Inverse Reinforcement Learning. In Lecture Notes UC Berkeley EECS. https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/inverseRL.pdf. 2012.
- Sergey Levine, Aviral Kumar, George Tucker and Justin Fu: Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv preprint arXiv:2005.01643v1. 2020.
- Aviral Kumar, Justin Fu, George Tucker and Sergey Levine: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In Advances in Neural Information Processing Systems (NeurIPS). 2019.
- Daniel Seita: Offline (Batch) Reinforcement Learning: A Review of Literature and Applications. Available online: https://danieltakeshi.github.io/2020/06/28/offline-rl/. 2020.
- Scott Fujimoto, David Meger and Doina Precup: Off-Policy Deep Reinforcement Learning without Exploration. International Conference on Machine Learning. 2019a.
- Rishabh Agarwal, Dale Schuurmans and Mohammad Norouzi: An Optimistic Perspective on Offline Reinforcement Learning. International Conference on Machine Learning. 2020.
- Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh and Joelle Pineau: Benchmarking Batch Deep Reinforcement Learning Algorithms. NeurIPS Deep RL Workshop. 2019b.
- Sergey Levine: Imitation, Prediction, and Model-based Reinforcement learning for Autonomous Driving. International Conference on Machine Learning. 2019.
- Itai Caspi, Gal Leibovich, Gal Novik and Shadi Endrawis: Reinforcement Learning Coach. Available online: https://github.com/NervanaSystems/coach. 2017.