Continual Reinforcement Learning via Deep Network Compression [Master Thesis]
Towards Verifiable and Interpretable Reinforcement Learning [Master Thesis]
Training Interpretable Reinforcement Learning Policies through Local Lipschitzness [Master Thesis]
Daniel Landgraf: Hierarchical Learning and Model Predictive Control [Master Thesis @FAU]
Lukas Frieß: Model-based Reinforcement Learning with First-Principle Models [Master Thesis @FAU]
Reinforcement learning (RL) is increasingly used in robotics to learn complex tasks from repeated interactions with the environment. For example, a mobile robot can learn to avoid an obstacle by testing a large number of randomized motion strategies. Success is measured by a reward function that is to be maximized over the course of the iterations. Reinforcement learning algorithms are commonly divided into model-free and model-based approaches: the former learn a policy directly, while the latter use the interactions to build a model. In contrast, a model predictive controller solves a dynamic optimization problem in each time step to determine the optimal control strategy. Although reinforcement learning and model predictive control can be used to solve the same tasks, the two approaches are rarely compared directly.
In a collaboration between the Chair of Automatic Control and the Machine Learning and Information Fusion Group of Fraunhofer IIS, possible combinations of reinforcement learning and real-time nonlinear model predictive control are investigated. The goal of this master thesis is to implement a model-based reinforcement learning algorithm using methodologies commonly used in model predictive control for dynamical systems, such as adjoint-based sensitivity analysis. To compare the approach with existing algorithms, one of the examples in the freely available benchmark suite by Tingwu Wang et al. can be used.
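To illustrate the adjoint idea in its simplest form, here is a minimal sketch (not the thesis method): the gradient of a finite-horizon quadratic cost with respect to an open-loop control sequence is computed by a backward adjoint recursion for an assumed toy linear model. All matrices, horizon, and step size are illustrative placeholders.

```python
# Minimal sketch (not the thesis method) of adjoint-based sensitivity analysis:
# the gradient of a finite-horizon quadratic cost with respect to an open-loop
# control sequence is obtained from a backward adjoint recursion.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # toy double-integrator dynamics x_{t+1} = A x_t + B u_t
B = np.array([[0.0], [0.1]])
Q = np.eye(2)                             # state cost weight
R = 0.01 * np.eye(1)                      # control cost weight
T = 20                                    # horizon length

def rollout(x0, U):
    """Simulate the model forward and return the state trajectory."""
    X = [x0]
    for t in range(T):
        X.append(A @ X[-1] + B @ U[t])
    return X

def cost_and_gradient(x0, U):
    """Total cost J and its gradient dJ/dU computed with the adjoint recursion."""
    X = rollout(x0, U)
    J = sum(x @ Q @ x + u @ R @ u for x, u in zip(X[:-1], U)) + X[-1] @ Q @ X[-1]
    lam = 2.0 * Q @ X[-1]                      # terminal adjoint lambda_T
    grad = np.zeros_like(U)
    for t in reversed(range(T)):
        grad[t] = 2.0 * R @ U[t] + B.T @ lam   # dJ/du_t
        lam = 2.0 * Q @ X[t] + A.T @ lam       # backward step: lambda_t
    return J, grad

# Plain gradient descent on the controls; a real MPC solver would be more elaborate.
x0 = np.array([1.0, 0.0])
U = np.zeros((T, 1))
for _ in range(300):
    J, g = cost_and_gradient(x0, U)
    U -= 0.05 * g
print("cost after optimization:", J)
```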
Requirements: Basic knowledge of control theory and model predictive control; programming experience in MATLAB, C, and Python is of advantage.
Supervisors: Dr.-Ing. Andreas Völz (Chair of Automatic Control, Friedrich-Alexander-Universität), Prof. Dr.-Ing. Knut Graichen (Chair of Automatic Control, Friedrich-Alexander-Universität), Dr. Georgios Kontes (IIS), Dr.-Ing. Christopher Mutschler (IIS)
Ilona Bamiller: Benchmarking Offline Reinforcement Learning on an Autonomous Driving Application [Bachelor Thesis @LMU]
Reinforcement Learning (RL) builds on the idea that an agent learns an optimal behavior through iterative interaction with an environment. In model-free reinforcement learning the agent does not have access to world knowledge (i.e., to the reward process and the environment dynamics); its task is to deduce an optimal behavior, i.e., a policy, solely from the observations and rewards it receives as a consequence of the actions it takes. Model-free RL has achieved tremendous results in a variety of tasks ranging from robotic control (Schulman, 2015), supply chain management (Valluri, 2009), and helicopter control (Kim, 2003) to games (Mnih, 2013).
However, these remarkable results come at the cost of sample efficiency. In practice, learning good controllers for such applications requires a good balance between exploration and exploitation, and in many cases simulators are either not available or not accurate enough. At the same time, exploration in the real world is not a viable option for many use cases. Consider, for instance, the helicopter application mentioned above: finding a policy that successfully flies complex maneuvers (without any prior knowledge or expert behavior) requires exploration, which can easily damage the real helicopter (Kim, 2003). In practice, such policies are usually obtained by imitation learning from expert behavior. Here, we either use behavioral cloning (Sakmann and Gruben-Trejo, 2016), i.e., we try to copy the expert, or inverse reinforcement learning (Abbeel, 2012), i.e., we try to estimate the reward function that the expert is optimizing. However, the performance of the policies derived with both of these methods is bounded by the underlying expert policies.
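To make the behavioral cloning baseline concrete, the following is a minimal, hypothetical sketch: a small classifier is fit to logged (state, expert action) pairs with a supervised cross-entropy loss. The data and network below are random placeholders, not the expert data of any of the cited works.

```python
# Minimal, hypothetical sketch of behavioral cloning: fit a small classifier to
# logged (state, expert action) pairs with a supervised cross-entropy loss.
import torch
import torch.nn as nn

state_dim, n_actions = 4, 3
states = torch.randn(5000, state_dim)                   # logged expert states (synthetic here)
expert_actions = torch.randint(0, n_actions, (5000,))   # logged expert actions (synthetic here)

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(1000):
    idx = torch.randint(0, len(states), (128,))
    loss = nn.functional.cross_entropy(policy(states[idx]), expert_actions[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```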
Offline reinforcement learning (Levine, 2020; Kumar, 2019; Seita, 2020) instead addresses a closely related problem in a different way. The idea of offline RL is to use existing rollouts/trajectories that have been generated under many sub-optimal baseline policies, i.e., "good" controllers, in order to generate an overall better policy. A practical example comes from autonomous driving (Levine, 2019): here, we aim to use rollouts (state-action-reward sequences) from different controllers (drivers) to learn a policy that drives better than any of the baseline controllers. Offline RL only uses existing data and turns the RL problem into a supervised learning setting (i.e., data-driven reinforcement learning); it has received increasing interest lately (Fujimoto, 2019a; Fujimoto, 2019b; Agarwal, 2020), as logged rollouts are becoming available in more and more applications.
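As an illustration of this data-driven view, the sketch below trains a DQN-style agent purely from a fixed buffer of logged transitions, with no environment interaction. It is a naïve offline baseline on assumed synthetic data, not one of the algorithms to be evaluated in the thesis.

```python
# Minimal, hypothetical sketch of "naive" offline RL: a DQN-style agent trained
# purely from a fixed buffer of logged transitions, with no environment interaction.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99

# Fixed, logged dataset: (state, action, reward, next state, terminal flag).
N = 10_000
states      = torch.randn(N, state_dim)
actions     = torch.randint(0, n_actions, (N,))
rewards     = torch.randn(N)
next_states = torch.randn(N, state_dim)
dones       = torch.zeros(N)

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

for step in range(2000):
    idx = torch.randint(0, N, (256,))           # minibatch sampled from the fixed buffer
    q = q_net(states[idx]).gather(1, actions[idx].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states[idx]).max(dim=1).values
        target = rewards[idx] + gamma * (1 - dones[idx]) * q_next
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 200 == 0:
        target_net.load_state_dict(q_net.state_dict())   # periodic target network sync
```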
The goal of this thesis is to evaluate existing offline reinforcement learning algorithms on a variety of available environments in the Coach framework (Caspi, 2017). First, the thesis should outline the requirements for developing offline RL (e.g., interfaces to monitoring applications such as Tensorboard (https://www.tensorflow.org/tensorboard), MLflow (https://mlflow.org), the Coach Dashboard (https://nervanasystems.github.io/coach/dashboard.html), OpenShift with ML Lifecycle (https://www.openshift.com/learn/topics/ai-ml, https://www.openshift.com), or others; data handling; access to databases; etc.) and evaluate the algorithms implemented in Coach against available implementations in other frameworks. Second, the thesis should integrate publicly available offline RL algorithms (such as naïve RL (e.g., DQN, QR-DQN (https://nervanasystems.github.io/coach/components/agents/value_optimization/qr_dqn.html)), batch-constrained Q-learning (BCQ) (Fujimoto, 2019a), bootstrapping error accumulation reduction (BEAR) (Kumar, 2019), behavioral cloning baselines, etc.), select reasonable environments that pose different challenges, and implement relevant metrics to benchmark the algorithms. Third, the thesis should evaluate and interpret experimental results from the offline RL algorithms on the selected environments using the implemented metrics. As a last step, the thesis should use available rollouts generated by policies from an autonomous driving application in the Carla simulator. As only the datasets are used, there is no need to implement an interface to Carla.
Requirements: Python programming, RL.
Matthias Gruber: Learning to avoid your Supervisor [Master Thesis @LMU]
Hyeyoung Park: Towards Interpretable (and Robust) Reinforcement Learning Policies through Local Lipschitzness and Randomization [Consulting Project @LMU]
Reinforcement Learning is broadly applicable to diverse tasks across many domains and has achieved superhuman performance on many problems. However, the black-box neural networks used by modern RL algorithms are hard to interpret, and their predictions are not easily explainable. Furthermore, except for some limited architectures and settings, neither their correctness nor their safety is verifiable. Thus, we cannot give any safety guarantee for their operation. This limits the application of reinforcement learning in many domains where such guarantees are essential, e.g., autonomous vehicles or medical tasks. Previous work has shown that adding a loss based on the local Lipschitzness of the classifier leads to more robust policies, especially against adversarial attacks.
This project should investigate what effect a local Lipschitzness constraint (see the visualization in Figure 1) or domain randomization applied to the original policy has on the generated policies; a minimal sketch of such a regularizer is given below. For an environment that is easy to visualize, policies should be trained with the above additions and visualized. The decision boundaries are of particular interest here (see the overfitted decision boundary in Figure 2). In terms of robustness, the policy may become more robust to slight variations of the environment dynamics (so-called model mismatches). In terms of explainability, smooth and wide boundaries (see Figure 3) may allow an easier extraction of rule-based (e.g., decision tree) policies.
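As a rough illustration of the regularization idea (not the exact loss of the cited work), the following sketch adds a finite-difference estimate of the local Lipschitz constant of a toy policy network to the training loss. The network, perturbation scale, and penalty weight are illustrative assumptions.

```python
# Rough sketch of a local-Lipschitzness regularizer: in addition to the usual RL
# loss, penalize how much the policy's outputs change under a small random
# perturbation of the state.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # toy policy network

def local_lipschitz_penalty(policy, states, eps=0.05):
    """Finite-difference estimate of the local output change per unit of input change."""
    delta = eps * torch.randn_like(states)
    diff = policy(states + delta) - policy(states)
    return (diff.norm(dim=1) / delta.norm(dim=1).clamp_min(1e-8)).mean()

states = torch.randn(32, 4)             # a batch of observed states
rl_loss = torch.tensor(0.0)             # placeholder for the usual policy/value loss
loss = rl_loss + 0.1 * local_lipschitz_penalty(policy, states)
loss.backward()
```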
Recommended requirements for students
Main contact at the department: Dr.-Ing. Christopher Mutschler
Sebastian Fischer: Back to the Basics: Offline Reinforcement Learning with Least-Squares Methods for Policy Iteration [Master Thesis @LMU]
Recently, offline (sometimes also called 'batch') Reinforcement Learning (RL) algorithms have gained significant research traction. The reason is that, unlike in the classical reinforcement learning formulation, they do not require interaction with the environment. Instead, when provided with logged data, they are able to generate a better policy than the one used to collect the data. Obviously, this has significant practical applications; e.g., we could learn a better driving policy using only recorded data from real drivers.
Even though offline RL algorithms are appealing, they can be quite unstable during training. The main reason behind this instability is the bootstrapping error: state-action values can be overestimated in regions far from the actually experienced data, which leads to erroneous target updates.
In value-based RL, two ways to mitigate this problem have been proposed. The first approach penalizes policies (or, more precisely, actions selected in specific states) that are not included in the original data (or whose probability according to the data is very low) [2,3]; a sketch of this idea is given below. A different approach uses an ensemble of models and, through a form of dropout, ensures that outlier values of the state-action value function are cancelled out.
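The sketch below illustrates the first idea in the spirit of discrete batch-constrained Q-learning: a behavior-cloning model estimates how likely each action is under the logged data, and the bootstrapped target only considers actions that are sufficiently likely relative to the most likely one. The threshold, networks, and data are assumptions made for this snippet only.

```python
# Illustrative sketch in the spirit of discrete batch-constrained Q-learning:
# bootstrap only from actions the behavior model pi_b considers likely enough.
import torch
import torch.nn as nn

state_dim, n_actions, gamma, tau = 4, 3, 0.99, 0.3

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
pi_b  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))   # behavior model

def constrained_target(next_states, rewards, dones):
    """Bootstrapped target that ignores actions the behavior model finds unlikely."""
    with torch.no_grad():
        probs = pi_b(next_states).softmax(dim=1)
        allowed = probs / probs.max(dim=1, keepdim=True).values > tau   # mask out-of-distribution actions
        q_next = q_net(next_states)
        q_next[~allowed] = -1e8                                         # never bootstrap from masked actions
        return rewards + gamma * (1 - dones) * q_next.max(dim=1).values

targets = constrained_target(torch.randn(5, state_dim), torch.randn(5), torch.zeros(5))
```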
Even though these approaches are in line with recent deep RL algorithms, we have to consider that there is a family of classical offline RL methods (e.g., see [5,6]) that have shown remarkable results and stable performance, even though they are based on linear function approximation of the state-action value function. So the question here is whether we can combine the theory and findings of the classical algorithms with modern deep learning and reinforcement learning techniques.
The main goal of this thesis is to analyze the properties of the Least-Squares Policy Iteration (LSPI) algorithm and to design a new deep offline RL algorithm based on key properties and insights of LSPI. For example, Radial Basis Function (RBF) approximators can be used to avoid the extrapolation error in the state-action value function; the widely used Q-learning loss function could be replaced by the LSPI state-action approximator weight update (sketched below); the benefits of a multi-head network that provides different weights for each set of actions could be explored; and the added benefit of pre-training with Behavioral Cloning (BC) should be determined.
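For reference, the LSTD-Q weight update at the core of LSPI can be sketched as follows, with RBF features over a discrete action space. The centers, data, and ridge term are synthetic placeholders; the snippet only shows the linear-algebraic form of the update that would replace the Q-learning loss.

```python
# Sketch of the LSTD-Q weight update at the core of LSPI, with RBF features.
import numpy as np

gamma, n_actions, state_dim = 0.99, 3, 2
centers = np.random.randn(10, state_dim)            # fixed RBF centers (assumed given)

def features(s, a):
    """Block feature vector: RBF activations of s placed in the block of action a."""
    rbf = np.exp(-0.5 * np.sum((centers - s) ** 2, axis=1))
    phi = np.zeros(n_actions * len(centers))
    phi[a * len(centers):(a + 1) * len(centers)] = rbf
    return phi

def lstdq(data, w):
    """One LSPI iteration: solve A w_new = b for the greedy policy induced by w."""
    k = n_actions * len(centers)
    A, b = 1e-6 * np.eye(k), np.zeros(k)             # small ridge term for numerical stability
    for s, a, r, s_next in data:
        greedy = max(range(n_actions), key=lambda a2: features(s_next, a2) @ w)
        phi = features(s, a)
        A += np.outer(phi, phi - gamma * features(s_next, greedy))
        b += phi * r
    return np.linalg.solve(A, b)

# Offline policy iteration on a fixed batch of (s, a, r, s') transitions.
data = [(np.random.randn(state_dim), np.random.randint(n_actions),
         np.random.randn(), np.random.randn(state_dim)) for _ in range(500)]
w = np.zeros(n_actions * len(centers))
for _ in range(10):
    w = lstdq(data, w)
```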
The new algorithm should be benchmarked against state-of-the-art results [2,3,4], both in classical control environments and on the offline Atari games dataset. In addition, the possibility of extending the algorithm to continuous action spaces should be explored.
David Kießling: Real-Time Non-Linear Model Predictive Control for Collision Avoidance of Vehicles [Master Thesis @ FAU]
Leonid Butyrev: Overcoming Catastrophic Forgetting in Deep Reinforcement Learning [Master Thesis @ FAU]
Andreas Mühlroth: Transparent and Interpretable Reinforcement Learning [Master Thesis @ FAU]
Tim Nisslbeck: Learning a Car Driving Simulator to enable Deep Reinforcement Learning [Master Thesis @ FAU]
Leonid Butyrev: Exploiting Reinforcement Learning for Complex Trajectory Planning for Mobile Robots [Bachelor Thesis @ FAU]