Open

Continual Reinforcement Learning via Deep Network Compression [Master Thesis]

    Continual Learning is the ability to learn continually from new data, building on what was learned previously while being able to adapt to new situations. There are a number of desiderata for continual learning [1]:

    • Online learning: there is no fixed order of tasks to be learned or clear boundaries between them;
    • Forward and backward transfer: the model should not only use previous knowledge to adapt quickly to new tasks, but also make use of more recent experience to improve in older tasks;
    • No direct access to previous data: the model could use a limited amount of selected past datapoints, but no access to the full past datasets should be required;
    • Bounded system size: the model’s learning capacity should be fixed and available resources should be used intelligently.

    One of the main problems hindering the realization of continual learning is catastrophic forgetting [2]. Here, if the input distribution shifts over time, a learned model will overfit to the most recently seen data, forgetting how to perform well in past tasks.

    There are four families of approaches that address the Continual Learning problem: rehearsal methods that keep samples from previous tasks, regularization methods that constrain the model weights to maintain knowledge about previous tasks, dynamic network architectures that (usually) expand the network capacity for each new task, and generative replay methods, where a generative model is used to generate samples from past tasks.
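    As an illustration of the regularization family, a quadratic penalty in the style of EWC [2] can be added to the loss of the current task. The sketch below assumes a PyTorch model and a precomputed, hypothetical `fisher_diag` importance estimate per parameter; it is a minimal example, not a full implementation.

```python
import torch

def ewc_penalty(model, old_params, fisher_diag, lam=1000.0):
    """EWC-style quadratic penalty: parameters that were important for previous
    tasks (high diagonal Fisher value) are pulled back towards the values they
    had after those tasks were learned."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Training on a new task (sketch):
# loss = task_loss(model, batch) + ewc_penalty(model, old_params, fisher_diag)
```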

    The Continual Learning problem becomes even more difficult in the Reinforcement Learning setting, since the demanding training process adds another level of complexity. Different approaches have been proposed here (see, e.g., [2-4]), but no single method exists that is scalable, easy to tune, and guarantees zero performance loss on past tasks.

    In this thesis, we plan to leverage recent developments in deep network compression that have been applied to the Continual Learning problem [5,6], but with a focus on Reinforcement Learning scenarios.
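    To make the compression idea concrete, the following sketch shows a PackNet-style [6] pruning step: after training on a task, only the largest-magnitude weights not yet claimed by earlier tasks are kept and frozen, and the rest are released for future tasks. The function name and the `keep_fraction` parameter are illustrative assumptions.

```python
import torch

@torch.no_grad()
def prune_for_task(weight, frozen_mask, keep_fraction=0.5):
    """PackNet-style pruning sketch: among the weights not yet claimed by
    earlier tasks, keep the top `keep_fraction` by magnitude for the current
    task, zero out the rest, and return the updated frozen mask."""
    free = ~frozen_mask                       # weights still available
    scores = weight.abs() * free              # ignore weights owned by old tasks
    k = int(keep_fraction * free.sum().item())
    if k == 0:
        return frozen_mask
    threshold = scores[free].topk(k).values.min()
    task_mask = free & (scores >= threshold)  # weights assigned to this task
    weight.mul_((frozen_mask | task_mask).to(weight.dtype))  # release the rest
    return frozen_mask | task_mask            # now frozen for future tasks
```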

    Supervisors: Axel Plinge, Christopher Mutschler

    References

    1. Continual Learning workshop, NeurIPS 2018: https://sites.google.com/view/continual2018/home
    2. J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, p. 201611835, 2017.
    3. Schwarz, J., Luketina, J., Czarnecki, W. M., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., & Hadsell, R. (2018). Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370.
    4. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
    5. Golkar, S., Kagan, M., & Cho, K. (2019). Continual learning via neural pruning. arXiv preprint arXiv:1903.04476.
    6. Mallya, A., & Lazebnik, S. (2018). Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7765-7773).

Towards Verifiable and Interpretable Reinforcement Learning [Master Thesis]

    Reinforcement Learning is broadly applicable to diverse tasks across many domains. On many problems, it has achieved superhuman performance [5]. However, the black-box neural networks used by modern RL algorithms are hard to interpret and their predictions are not easily explainable. Furthermore, except for some limited architectures and settings [3], their correctness and safety are not verifiable. Thus, we cannot give any safety guarantee for their operation. This limits the application of reinforcement learning in many domains where this is essential, e.g., for autonomous vehicles or medical tasks.

    Previous works have made some contributions towards verifiable reinforcement learning. VIPER [1] extracts decision tree policies through imitation learning from a pretrained DQN agent [5]. The authors use Z3 [2], an SMT solver, to show that, for some simple tasks, their agents provably solve the problem in a safe manner. However, the extracted decision trees are very large even for simple problems. This makes them hard to understand and interpret, as a human is not able to comprehend their operation in a reasonable amount of time [7].
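    For illustration, the extraction loop of VIPER [1] can be sketched roughly as a DAgger-style imitation loop in which a decision tree is repeatedly refit to expert labels on the states it visits. This is a simplified sketch (it omits VIPER's Q-value-based resampling) and assumes the classic Gym step/reset API together with a callable `expert_act`, e.g. the greedy action of a pretrained DQN.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_tree_policy(env, expert_act, n_iters=10, rollouts_per_iter=20, max_depth=8):
    """Simplified VIPER-like loop: roll out the current tree policy, label the
    visited states with the expert action, and refit the tree on all data."""
    states, actions = [], []
    tree = None
    for _ in range(n_iters):
        for _ in range(rollouts_per_iter):
            s, done = env.reset(), False
            while not done:
                a = expert_act(s) if tree is None else int(tree.predict([s])[0])
                states.append(s)
                actions.append(expert_act(s))   # labels always come from the expert
                s, _, done, _ = env.step(a)
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(np.array(states), np.array(actions))
    return tree
```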

    A possible solution to this problem is to use more powerful decision boundaries at the nodes. MoET [6] uses linear decision boundaries at the decision tree nodes to generate smaller and more interpretable trees. Coppens et al. [8] applied previous work on the extraction of so-called soft decision trees [9] from neural networks to reinforcement learning by extracting a decision tree from a policy network. Soft decision trees use a perceptron’s activation as the decision boundary at each node. This makes the individual decisions easy to visualize. However, the dense filters (both for MoET and for soft decision trees) prohibit an obvious, easy-to-grasp interpretation.
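    A single inner node of such a soft decision tree can be sketched as follows: a linear filter whose sigmoid activation gives the probability of routing an input to the right child. This is only a minimal illustration of the node type, not the full tree architecture from [9].

```python
import torch
import torch.nn as nn

class SoftTreeNode(nn.Module):
    """Inner node of a soft decision tree: a single linear filter whose sigmoid
    output is the probability of routing the input to the right child."""
    def __init__(self, in_dim):
        super().__init__()
        self.gate = nn.Linear(in_dim, 1)

    def forward(self, x):
        p_right = torch.sigmoid(self.gate(x))   # dense, linear decision boundary
        return p_right, 1.0 - p_right
```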

    The milestones of the thesis are the following:

    • Literature and algorithm review (1 month) 
    • Implementation (2 months)
    • Evaluation (1 month)
    • Writing (1 month)

    Supervisors: Lukas Schmidt, Christopher Mutschler

    References

    1. Bastani, Osbert, et al. “Verifiable Reinforcement Learning via Policy Extraction.” NIPS 2018: The 32nd Annual Conference on Neural Information Processing Systems, 2018, pp. 2494–2504.
    2. Moura, Leonardo De, and Nikolaj Bjørner. “Z3: An Efficient SMT Solver.” TACAS’08/ETAPS’08 Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2008, pp. 337–340.
    3. Katz, Guy, et al. “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks.” International Conference on Computer Aided Verification, 2017, pp. 97–117.
    4. Cuccu, Giuseppe, et al. “Playing Atari with Six Neurons.” Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019, pp. 998–1006.
    5. Mnih, Volodymyr, et al. “Playing Atari with Deep Reinforcement Learning.” ArXiv Preprint ArXiv:1312.5602, 2013.
    6. Vasic, Marko, et al. “MoET: Interpretable and Verifiable Reinforcement Learning via Mixture of Expert Trees.” ArXiv Preprint ArXiv:1906.06717, 2019.
    7. Lipton, Zachary C. “The Mythos of Model Interpretability.” ACM Queue, vol. 61, no. 10, 2018, pp. 36–43.
    8. Coppens, Youri, et al. “Distilling Deep Reinforcement Learning Policies in Soft Decision Trees.” Proceedings of the IJCAI 2019 Workshop on Explainable Artificial Intelligence, 2019.
    9. Frosst, Nicholas, and Geoffrey E. Hinton. “Distilling a Neural Network Into a Soft Decision Tree.” CEx@AI*IA, 2017.

Training Interpretable Reinforcement Learning Policies through Local Lipschitzness [Master Thesis]

    Reinforcement Learning is broadly applicable for diverse tasks across many domains. On many problems, it has achieved superhuman performance [6]. However, the black-box neural networks used by modern RL algorithms are hard to interpret and their action-selection or high-level strategies are not easily explainable. Furthermore, except for some limited architectures and settings [2], their correctness and safety cannot be verified. Thus, we cannot give any safety guarantee for their operation. This limits the application of reinforcement learning for many domains where this is essential, e.g., for autonomous vehicles or medical tasks.

    Previous works have addressed a few of these issues and have made contributions towards verifiable reinforcement learning. For instance, VIPER [1] extracts decision tree policies through imitation learning from a pretrained DQN agent [3] and MoET derives a mixture-of-experts tree [4]. However, such approaches have several limitations in practice. First, the extracted decision trees are very large even for simple problems. This makes them hard to understand and interpret, as a human is not able to comprehend their operation in a reasonable amount of time [5]. Second, neural network policies are highly optimized to the particular dynamics of the given MDP. This is an issue not only for the robustness of the agent in the final application (since the agent is overfit to a fixed snapshot of the simulation environment, which is different from the real world), but it also implies that the decision boundary of the policy will be overfit to the specific training dynamics.

    The goal of this thesis is to train a policy with smooth decision boundaries (for example, smoothly separable ones) by regularizing the policy optimization with a Lipschitz-continuity constraint [9] (though there are also other possibilities, such as the use of a contrastive loss [10]). This enforces LIME-like [8] local interpretability. From machine learning theory we know that smoother decision boundaries lead to better generalization and can be expressed by less complex models [2]. Such simpler models then allow well-known algorithms such as VIPER [1] to extract small and compact decision trees that can in turn be easily interpreted and formally verified. Furthermore, we can then apply well-known visualization approaches such as t-SNE embeddings [6] to see what the policy is actually focusing on.
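    One way such a Lipschitz-continuity regularizer could be realized (an assumption on the concrete form, inspired by the gradient-penalty view in [9]) is to penalize large input gradients of the policy network around the visited states:

```python
import torch

def lipschitz_penalty(policy_net, states, target_norm=1.0):
    """Gradient-penalty sketch: penalize input gradients of the policy output
    whose norm exceeds `target_norm`, encouraging locally smooth decision
    boundaries around the visited states."""
    states = states.clone().requires_grad_(True)
    logits = policy_net(states)
    grad = torch.autograd.grad(logits.sum(), states, create_graph=True)[0]
    grad_norm = grad.flatten(1).norm(dim=1)
    return torch.clamp(grad_norm - target_norm, min=0.0).pow(2).mean()

# total loss (sketch): loss = rl_loss + beta * lipschitz_penalty(policy_net, batch_states)
```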

    A second, closely related approach is the combination with domain randomization and adversarial noise. Here, a policy is trained on a set of environments whose dynamics change across runs. Such approaches are primarily employed to train robust policies that are less sensitive to small variations in the domain. Initial work also applies a Lipschitz-continuity constraint, however over the set of different environments [7]. At times, such policies also exhibit smoother decision boundaries, which might result in easier-to-interpret policies.
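    Domain randomization itself can be sketched as an environment wrapper that resamples selected dynamics parameters at every episode start. The wrapper below assumes the classic Gym API and uses CartPole's pole length as an illustrative parameter; the sampling range is arbitrary.

```python
import numpy as np
import gym

class RandomizedDynamicsWrapper(gym.Wrapper):
    """Resamples selected physics parameters at every episode start so the
    policy never sees a single fixed snapshot of the dynamics."""
    def __init__(self, env, length_range=(0.4, 0.7)):
        super().__init__(env)
        self.length_range = length_range

    def reset(self, **kwargs):
        # `length` is the pole half-length of CartPole; adapt for other envs.
        self.env.unwrapped.length = np.random.uniform(*self.length_range)
        return self.env.reset(**kwargs)

# usage (sketch): env = RandomizedDynamicsWrapper(gym.make("CartPole-v1"))
```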

    Milestones:

    • Literature and algorithm review (1 month)
    • Setting up training environment and implementing continuity/smoothness constraints and domain randomization (3 months)
    • Evaluation (1 month)
    • Writing (1 month)

    Supervisors: Lukas Schmidt, Christopher Mutschler

    Literature:

    1. Bastani, Osbert, et al. “Verifiable Reinforcement Learning via Policy Extraction.” NIPS 2018: The 32nd Annual Conference on Neural Information Processing Systems, 2018, pp. 2494–2504.
    2. Cuccu, Giuseppe, et al. “Playing Atari with Six Neurons.” Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019, pp. 998–1006.
    3. Mnih, Volodymyr, et al. “Playing Atari with Deep Reinforcement Learning.” ArXiv Preprint ArXiv:1312.5602, 2013.
    4. Vasic, Marko, et al. “MoET: Interpretable and Verifiable Reinforcement Learning via Mixture of Expert Trees.” ArXiv Preprint ArXiv:1906.06717, 2019.
    5. Lipton, Zachary C. “The Mythos of Model Interpretability.” ACM Queue, vol. 61, no. 10, 2018, pp. 36–43.
    6. Mnih, V., et al. “Human-Level Control through Deep Reinforcement Learning”. Nature 518, 529-533. 2015.
    7. Slaoui, Reda Bahi et al. “Robust Visual Domain Randomization for Reinforcement Learning”. BeTR-RL Workshop at International Conference on Learning Representations. 2020. (Longer version: https://openreview.net/forum?id=H1xSOTVtvH)
    8. https://christophm.github.io/interpretable-ml-book/lime.html
    9. https://ucsdml.github.io/jekyll/update/2020/05/04/adversarial-robustness-through-local-lipschitzness.html
    10. He et al. “Momentum Contrast for Unsupervised Visual Representation Learning”. Computer Vision and Pattern Recognition. 2020.

     

    Running

    Daniel Landgraf: Hierarchical Learning and Model Predictive Control [Master Thesis @FAU]

      Motivation

      Recent progress in (deep) learning algorithms has shown great potential to learn complex tasks purely from large numbers of samples or interactions with the environment. However, performing such interactions can be difficult if sub-optimal control strategies can lead to serious damage. Therefore, the interactions are often performed in simulation environments instead of on the real system. On the other hand, recent advances in model predictive control allow such methods to be applied to highly complex nonlinear systems in real time. However, these approaches are difficult to use if the optimization problem is non-convex and exhibits multiple local minima that correspond to fundamentally different solutions. Therefore, it is of interest to combine modern predictive control methods with learning algorithms to solve such problems.

      Task definition

      In a collaboration between the Chair of Automatic Control and the Machine Learning and Information Fusion Group of Fraunhofer IIS, research is conducted on possible combinations of reinforcement learning and real-time nonlinear model predictive control. The goal of this thesis is to use a learning algorithm at the higher level and a model predictive controller at the lower level to solve a collision avoidance problem for an autonomous vehicle. Here, the controller is responsible for dealing with the nonlinear dynamics of the vehicle, while the learning strategy should solve the decision-making problem, e.g., whether to avoid the obstacle on the left or right side or to perform emergency braking.
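      A rough sketch of the intended hierarchy is given below: the high-level agent only picks a discrete maneuver, while a hypothetical `mpc_track` controller turns it into continuous control inputs. All names, the decision period, and the classic Gym-style interface are illustrative assumptions.

```python
MANEUVERS = ["avoid_left", "avoid_right", "emergency_brake"]

def run_episode(env, high_level_agent, mpc_track, decision_period=20):
    """The high-level agent selects a discrete maneuver every `decision_period`
    steps; the low-level MPC turns the maneuver into continuous steering and
    braking inputs while handling the nonlinear vehicle dynamics."""
    obs, total_reward, done, t = env.reset(), 0.0, False, 0
    maneuver = 0
    while not done:
        if t % decision_period == 0:
            maneuver = high_level_agent.act(obs)      # index into MANEUVERS
        u = mpc_track(obs, MANEUVERS[maneuver])       # low-level control input
        obs, reward, done, _ = env.step(u)
        total_reward += reward
        t += 1
    return total_reward
```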

      Requirements: Basic knowledge of control theory and model predictive control; programming experience in MATLAB, C, and Python is an advantage.

      Supervisors: Dr.-Ing. Andreas Völz (Chair of Automatic Control Friedrich-Alexander-Universität), Prof. Dr.-Ing. Knut Graichen (Chair of Automatic Control Friedrich-Alexander-Universität), Dr. Georgios Kontes (IIS), Dr.-Ing. Christopher Mutschler (IIS)

      References

      1. P. L. Bacon et al., The Option-Critic Architecture. In AAAI Conference on Artificial Intelligence, 2017.
      2. Tingwu Wang et al., Benchmarking Model-Based Reinforcement Learning, arXiv:1907.02057.

    Lukas Frieß: Model-based Reinforcement Learning with First-Principle Models [Master Thesis @FAU]

      Motivation

      Reinforcement learning (RL) is increasingly used in robotics to learn complex tasks from repeated interactions with the environment. For example, a mobile robot can learn to avoid an obstacle by testing a large number of randomized motion strategies. The success is measured by a reward function that should be maximized over the course of the iterations. Algorithms for reinforcement learning are commonly divided into model-free and model-based approaches, where the former directly learn a policy and the latter use the interactions to build a model. In contrast, a model predictive controller solves a dynamic optimization problem in each time step to determine the optimal control strategy. Although reinforcement learning and model predictive control can be used to solve the same tasks, the two approaches are rarely compared directly.

      Task definition

      In a collaboration between the Chair of Automatic Control and the Machine Learning and Information Fusion Group of Fraunhofer IIS, research is conducted on possible combinations of reinforcement learning and real-time nonlinear model predictive control. The goal of this master thesis is to implement a model-based reinforcement learning algorithm using methodologies commonly used in model predictive control for dynamical systems, such as adjoint-based sensitivity analysis. In order to compare the approach to existing algorithms, one of the examples in the freely available benchmark suite by Tingwu Wang et al. [1] can be used.
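      As a minimal illustration of the model-based loop, the following random-shooting planner rolls candidate action sequences through a dynamics model (which could be a learned model or an integrated first-principle model) and executes the first action of the best sequence, MPC-style. The `dynamics` and `reward_fn` callables are placeholders, not part of the thesis specification.

```python
import numpy as np

def random_shooting(dynamics, reward_fn, state, horizon=15, n_candidates=500, action_dim=1):
    """Model-based control sketch: sample random action sequences, roll them out
    through the dynamics model, and return the first action of the best sequence
    (receding-horizon, MPC-style)."""
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, ret = state, 0.0
        for a in actions:
            s = dynamics(s, a)          # one-step prediction of the model
            ret += reward_fn(s, a)
        if ret > best_return:
            best_return, best_action = ret, actions[0]
    return best_action
```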

      Requirements: Basic knowledge of control theory and model predictive control; programming experience in MATLAB, C, and Python is an advantage.

      Supervisors: Dr.-Ing. Andreas Völz (Chair of Automatic Control Friedrich-Alexander-Universität), Prof. Dr.-Ing. Knut Graichen (Chair of Automatic Control Friedrich-Alexander-Universität), Dr. Georgios Kontes (IIS), Dr.-Ing. Christopher Mutschler (IIS)

      References

      1. Tingwu Wang et al., Benchmarking Model-Based Reinforcement Learning, arXiv:1907.02057

    Ilona Bamiller: Benchmarking Offline Reinforcement Learning on an Autonomous Driving Application [Bachelor Thesis @LMU]

      Motivation

      Reinforcement Learning (RL) builds on the idea that an agent learns an optimal behavior through iterative interaction with an environment. In model-free reinforcement learning the agent does not have access to world knowledge (i.e., to the reward process and environment dynamics) and its task is to deduce an optimal behavior, i.e., a policy, solely by receiving observations and rewards as a consequence of the actions taken. Model-free RL has achieved tremendous results in a variety of tasks ranging from robotics control (Schulman, 2015) and supply chain management (Valluri, 2009) to helicopter control (Kim, 2003) and games (Mnih, 2013).

      However, these remarkable results come at the cost of sample efficiency. In practice, learning good controllers for such applications requires a good balance between exploration and exploitation, and in many cases simulators are not available or not accurate enough. At the same time, exploration in the real world might not be a good option for many use cases. Consider for instance the helicopter application from above. Finding a good policy that successfully flies complex maneuvers (without any prior knowledge or expert behavior) requires exploration, which can easily lead to damage to the real helicopter (Kim, 2003). In practice, such policies are usually generated through imitation learning of expert behavior. Here, we either make use of behavioral cloning (Sakmann and Gruben-Trejo, 2016), i.e., we try to copy the expert, or inverse reinforcement learning (Abbeel, 2012), i.e., we try to estimate the reward function that is being optimized by the expert. However, the performance of the policies derived from both of these methods is bounded by the underlying expert policies.

      Instead, offline reinforcement learning (Levine, 2020; Kumar, 2019; Seita, 2020) approaches a similar problem in a different way. The idea of offline RL is to use existing rollouts/trajectories that have been generated under many sub-optimal baseline policies, i.e., “good” controllers, in order to generate an overall better policy. A practical example comes from autonomous driving (Levine, 2019). Here, we aim to use rollouts (state-action-reward sequences) from different controllers (drivers) to learn a policy that is a better driver than the baseline controllers. Offline RL only uses existing data and turns the RL problem into a supervised learning setting (i.e., data-driven reinforcement learning). It has received increasing interest lately (Fujimoto, 2019a; Fujimoto, 2019b; Agarwal, 2020), as rollouts are increasingly available in a variety of applications.
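      As the simplest point of comparison mentioned above, behavioral cloning reduces to supervised learning on the logged state-action pairs. A minimal PyTorch sketch, assuming a `dataset` of (state, action) tensors and a discrete action space:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def behavioral_cloning(policy_net, dataset, epochs=10, lr=1e-3, batch_size=256):
    """Behavioral cloning baseline: fit the policy to the logged actions with a
    plain classification loss; no environment interaction is needed."""
    opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for states, actions in loader:
            loss = loss_fn(policy_net(states), actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy_net
```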

      Task definition

      The goal of this thesis is to evaluate existing offline reinforcement learning algorithms on a variety of available environments in the Coach framework (Caspi, 2017). First, the thesis should lay out the necessary requirements for the development of offline RL (e.g., interfaces to monitoring applications such as Tensorboard (https://www.tensorflow.org/tensorboard), MLflow (https://mlflow.org), Coach Dashboard (https://nervanasystems.github.io/coach/dashboard.html), OpenShift with ML Lifecycle (https://www.openshift.com/learn/topics/ai-ml), or others, data handling, access to databases, etc.) and evaluate the algorithms implemented in Coach against available implementations in other frameworks. Second, the thesis should integrate publicly available offline RL algorithms (such as naïve RL (e.g., DQN, QR-DQN (https://nervanasystems.github.io/coach/components/agents/value_optimization/qr_dqn.html)), batch-constrained Q-learning (BCQ) (Fujimoto, 2019a), bootstrapping error accumulation reduction (BEAR) (Kumar, 2019), behavioral cloning baselines, etc.), select reasonable environments that pose different challenges, and implement relevant metrics to benchmark the algorithms. Third, the thesis should evaluate and interpret experimental results from the offline RL algorithms on the selected environments using the implemented metrics. As a last step, the thesis should use available rollouts that are generated by policies used in an autonomous driving application in the Carla simulator. As only the datasets are used, there is no need to implement an interface to Carla.

      Milestones:

      • Literature review (batch/offline RL baseline algorithms, Coach framework); compare the capabilities of Coach to other frameworks that allow for offline RL (on the basis of their specifications, not through implementation and actual benchmarks)
      • Selecting benchmark environments and configurations
      • Design and implementation of a benchmarking suite
      • Integrating baseline algorithms and Coach into the benchmarking suite
      • Design and run experiments to evaluate the behavior of the offline RL algorithms
      • Evaluate Coach and offline RL on rollouts from an autonomous driving application that runs in the Carla simulator

      Requirements: Python programming, RL.

      Supervisors: Dr. Georgios Kontes (IIS), Dr.-Ing. Christopher Mutschler (IIS)

      References

      • John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan and Pieter Abbeel: Trust Region Policy Optimization. International Conference on Machine Learning. 2015.
      • Kaspar Sakmann and Michael Gruben-Trejo: Behavioral Cloning. Available online: https://github.com/ksakmann/CarND-BehavioralCloning. 2016.
      • Annapurna Valluri, Michael J. North, and Charles M. Macal: Reinforcement Learning in Supply Chains. International Journal of Neural Systems. Vol. 19, No. 05. pp. 331-334. 2009.
      • J. Kim, Michael I. Jordan, Shankar Sastry and Andrew Y. Ng: Autonomous Helicopter Flight via Reinforcement Learning. In Advances in Neural Information Processing Systems (NIPS). 2003.
      • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller: Playing Atari with Deep Reinforcement Learning. In NIPS Deep Learning Workshop. 2013.
      • Pieter Abbeel: Inverse Reinforcement Learning. In Lecture Notes UC Berkeley EECS. https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/inverseRL.pdf. 2012.
      • Sergey Levine, Aviral Kumar, George Tucker and Justin Fu: Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv preprint arXiv:2005.01643v1. 2020.
      • Aviral Kumar, Justin Fu, George Tucker and Sergey Levine: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In Advances in Neural Information Processing Systems (NeurIPS). 2019.
      • Daniel Seita: Offline (Batch) Reinforcement Learning: A Review of Literature and Applications. Available online: https://danieltakeshi.github.io/2020/06/28/offline-rl/. 2020.
      • Scott Fujimoto, David Meger and Doina Precup: Off-Policy Deep Reinforcement Learning without Exploration. International Conference on Machine Learning. 2019a.
      • Rishabh Agarwal, Dale Schuurmans and Mohammad Norouzi: An Optimistic Perspective on Offline Reinforcement Learning. International Conference on Machine Learning. 2020.
      • Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh and Joelle Pineau: Benchmarking Batch Deep Reinforcement Learning Algorithms. NeurIPS Deep RL Workshop. 2019b.
      • Sergey Levine: Imitation, Prediction, and Model-based Reinforcement learning for Autonomous Driving. International Conference on Machine Learning. 2019.
      • Itai Caspi, Gal Leibovich, Gal Novik and Shadi Endrawis: Reinforcement Learning Coach. Available online: https://github.com/NervanaSystems/coach. 2017.

    Matthias Gruber: Learning to avoid your Supervisor [Master Thesis @LMU]

      Reinforcement Learning is broadly applicable for diverse tasks across many domains. On many problems, it has achieved superhuman performance [5]. However, the black-box neural networks used by modern RL algorithms are hard to interpret and their predictions are not easily explainable. Furthermore, except for some limited architectures and settings [3], both their correctness and safety are not verifiable. Thus, we cannot give any safety guarantee for their operation. This limits the application of reinforcement learning for many domains where this is an essential requirement, e.g., for autonomous vehicles or medical tasks.

      For many domains, safety regulations are in place that define “safe” working conditions. For factory robots moving in a hall, this might be a minimum distance from any humans or unidentified obstacles. Strict adherence to these regulations is often a requirement in these situations. However, this also limits the application of many modern reinforcement learning techniques, as it is usually hard (if not intractable) to prove that the agent acts correctly in all possible situations. Existing approaches [1,3] limit the capabilities of the agent (e.g., by using decision trees or only small DNNs as policy models). 

      One approach to ensure the safety of a black-box model is to implement a “supervisor” that monitors the safety regulations and brings the system to a non-catastrophic stop. An example of this would be emergency brakes in an autonomous robot that automatically stop the robot if an obstacle is within a safety perimeter. As long as the supervisor does not detect a critical situation, the black-box agent can implement any policy and maximize its reward in the environment. Such supervisors have been constructed manually [1], synthesized from logical specifications [10], learned either from previous violations [15] or from world models and a continuity assumption [11, 12], or based on Lyapunov functions [13, 14].

      Simply adding a supervisor on top of an agent that is already trained, however, might not be an optimal solution: a black-box policy might use smaller safety margins to empirically maximize the reward, i.e., take more risks than needed. Such a policy would necessitate frequent interventions by the supervisor, which (by employing a “hard” safety stop, for example) would lead to reduced reward for the agent. A promising idea, then, is to train a reinforcement learning algorithm in the “safe” environment defined by the original environment and the supervisor, and to punish the agent whenever an intervention is needed. The agent thus learns to avoid critical situations and hence learns a policy that respects the specified safety regulations. The supervised agent is trivially safe, as the supervisor will ensure that safety regulations are upheld, and it should achieve a higher reward than a purely white-box solution.
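      The agent-with-a-supervisor setup described above can be sketched as an environment wrapper in which unsafe actions are replaced by a safe fallback and each intervention is penalized. The `is_safe` check, the `safe_action` fallback, and the penalty value are placeholders for a concrete supervisor; the classic Gym API is assumed.

```python
import gym

class SupervisorWrapper(gym.Wrapper):
    """Replaces unsafe actions with a safe fallback and penalizes the agent for
    every intervention, so it learns to stay away from critical situations."""
    def __init__(self, env, is_safe, safe_action, intervention_penalty=1.0):
        super().__init__(env)
        self.is_safe = is_safe                    # callable(obs, action) -> bool
        self.safe_action = safe_action            # callable(obs) -> safe action
        self.intervention_penalty = intervention_penalty
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        penalty = 0.0
        if not self.is_safe(self._last_obs, action):
            action = self.safe_action(self._last_obs)   # e.g. an emergency stop
            penalty = self.intervention_penalty
        obs, reward, done, info = self.env.step(action)
        self._last_obs = obs
        return obs, reward - penalty, done, info
```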

      This thesis should investigate the agent-with-a-supervisor scenario in more detail within one or more environments: drone delivery [16], the train environment [17], AirSim [18] and/or CARLA [19]. First, an agent should be trained in a test environment with and without a (manual) supervisor (as a baseline), to evaluate the influence of the latter on safety and performance. Then, different modifications of the supervisor and the training procedure should be implemented and evaluated. The supervisor could be (i) logical constraints (e.g., getting too close to an obstacle deploys a non-catastrophic stop), (ii) an RL agent that decides between deploying a safe action and the black-box agent, or (iii) control-barrier functions [15]. The idea is to train the algorithm with different minimum safety distances to the obstacles and allow it to violate them occasionally. For the supervisor, it should be investigated whether current synthesis options lead to satisfactory results. In the case of manipulating an unsafe action into a safe one, it is necessary to predict whether the action proposed by the black-box agent leads to a safe or unsafe state. An interesting direction for this is the synthesis of a supervisor on static, pre-generated, unsafe datasets of one of the environments. For training, different reward-based methods that shape the reward of the agent based on the original environment, the current risk, the speed of learning, the number of safety violations, and the number of interventions by the supervisor (maybe negative curiosity, or related ideas) should be evaluated. The result should be a pipeline that allows the (safe) training of safe, performant agents, combining the training enhancements and the synthesized supervisor.

      The milestones of the thesis are as follows:

      • Literature and algorithm review (1 month)
      • Implementation (3 months)
        • Implementation and adaptation of any environments needed
        • Implementation of the baseline training framework
        • Implementation of supervisor synthesis
        • Implementation of training and reward modifications
      • Evaluation (1 month)
        • Evaluation of baseline
        • Evaluation of the full supervisor and training framework
      • Writing (1 month)

      Supervisors: Lukas Schmidt, Christopher Mutschler

      References

      1. Bastani, Osbert, et al. “Verifiable Reinforcement Learning via Policy Extraction.” NIPS 2018: The 32nd Annual Conference on Neural Information Processing Systems, 2018, pp. 2494–2504.
      2. Moura, Leonardo De, and Nikolaj Bjørner. “Z3: An Efficient SMT Solver.” TACAS’08/ETAPS’08 Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2008, pp. 337–340.
      3. Katz, Guy, et al. “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks.” International Conference on Computer Aided Verification, 2017, pp. 97–117.
      4. Cuccu, Giuseppe, et al. “Playing Atari with Six Neurons.” Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019, pp. 998–1006.
      5. Mnih, Volodymyr, et al. “Playing Atari with Deep Reinforcement Learning.” ArXiv Preprint ArXiv:1312.5602, 2013.
      6. Vasic, Marko, et al. “MoET: Interpretable and Verifiable Reinforcement Learning via Mixture of Expert Trees.” ArXiv Preprint ArXiv:1906.06717, 2019.
      7. Lipton, Zachary C. “The Mythos of Model Interpretability.” ACM Queue, vol. 61, no. 10, 2018, pp. 36–43.
      8. Coppens, Youri, et al. “Distilling Deep Reinforcement Learning Policies in Soft Decision Trees.” Proceedings of the IJCAI 2019 Workshop on Explainable Artificial Intelligence, 2019.
      9. ­­Frosst, Nicholas, and Geoffrey E. Hinton. “Distilling a Neural Network Into a Soft Decision Tree.” CEx@AI*IA, 2017.
      10. Alshiekh et al. “Safe Reinforcement Learning via Shielding”, 2017
      11. Dalal et al. “Safe Exploration in Continuous Action Spaces”, 2018
      12. Cheng et al. “End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks”, 2019
      13. Chow et al. “A Lyapunov-based Approach to Safe Reinforcement Learning”, 2018
      14. Berkenkamp et al. “Safe Model-based Reinforcement Learning with Stability Guarantees”, 2017
      15. Mohit Srinivasan, Amogh Dabholkar, Samuel Coogan, and Patricio A. Vela: Synthesis of Control Barrier Functions Using a Supervised Machine Learning Approach. CoRR abs/2003.04950 (2020)
      16. https://github.com/IBM/vsrl-examples/blob/master/examples/EnvBuild/DroneDelivery.md
      17. https://flatland.aicrowd.com/intro.html
      18. https://github.com/Microsoft/AirSim
      19. https://carla.org/

    Hyeyoung Park: Towards Interpretable (and Robust) Reinforcement Learning Policies through Local Lipschitzness and Randomization [Consulting Project @LMU]

      Reinforcement Learning is broadly applicable for diverse tasks across many domains. On many problems, it has achieved superhuman performance [5]. However, the black-box neural networks used by modern RL algorithms are hard to interpret and their predictions are not easily explainable. Furthermore, except for some limited architectures and settings [3], both their correctness and safety are not verifiable. Thus, we cannot give any safety guarantee for their operation. This limits the application of reinforcement learning for many domains where this is essential, e.g., for autonomous vehicles or medical tasks. Previous work has shown that adding a loss based on the local Lipschitzness of the classifier leads to more robust policies, especially against adversarial attacks [8].

      Figure 1: Lipschitz continuity: the gradient is below some threshold. Figure 2: Overfitted decision boundary. Figure 3: Smooth decision boundary.

      This project should investigate what effect a local Lipschitzness constraint (see the visualization in Figure 1) or domain randomization during training has on the generated policies. For an environment that is easily visualized, policies should be trained with the above additions and visualized. In particular, the decision boundaries are of interest here (see the overfitted decision boundary in Figure 2). In terms of robustness, the policy may become more robust to slight variations of the environment dynamics (so-called model mismatches). In terms of explainability, smooth and wide boundaries (see Figure 3) may lead to an easier extraction of rule-based (e.g., decision tree) policies (e.g., through [1]).
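      For a two-dimensional state space, the decision boundaries can be inspected with a simple grid plot of the selected actions. The sketch below assumes a callable `policy_act` that maps a 2-D state to a discrete action; the ranges and resolution are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(policy_act, x_range=(-1.0, 1.0), y_range=(-1.0, 1.0), resolution=200):
    """Evaluate a discrete policy on a dense grid of 2-D states and plot which
    action is chosen where; smooth, wide regions hint at policies that are
    easier to distill into rule-based models."""
    xs = np.linspace(*x_range, resolution)
    ys = np.linspace(*y_range, resolution)
    grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
    actions = np.array([policy_act(s) for s in grid]).reshape(resolution, resolution)
    plt.imshow(actions, origin="lower", extent=[*x_range, *y_range], aspect="auto")
    plt.xlabel("state dimension 1")
    plt.ylabel("state dimension 2")
    plt.colorbar(label="selected action")
    plt.show()
```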

      Recommended requirements for students
      • Experience with Reinforcement Learning in theory and practice
      • Experience with Python and (preferably) PyTorch

      Main contact at the department: Dr.-Ing. Christopher Mutschler

      External point of contact: Lukas Schmidt, Dr. Georgios Kontes (Self-Learning Systems Group, Fraunhofer IIS, Nuremberg)

      References

      1. Bastani, Osbert, et al. “Verifiable Reinforcement Learning via Policy Extraction.” NIPS 2018: The 32nd Annual Conference on Neural Information Processing Systems, 2018, pp. 2494–2504.
      2. Moura, Leonardo De, and Nikolaj Bjørner. “Z3: An Efficient SMT Solver.” TACAS’08/ETAPS’08 Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2008, pp. 337–340.
      3. Katz, Guy, et al. “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks.” International Conference on Computer Aided Verification, 2017, pp. 97–117.
      4. Cuccu, Giuseppe, et al. “Playing Atari with Six Neurons.” Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019, pp. 998–1006.
      5. Mnih, Volodymyr, et al. “Playing Atari with Deep Reinforcement Learning.” ArXiv Preprint ArXiv:1312.5602, 2013.
      6. Vasic, Marko, et al. “MoET: Interpretable and Verifiable Reinforcement Learning via Mixture of Expert Trees.” ArXiv Preprint ArXiv:1906.06717, 2019.
      7. Lipton, Zachary C. “The Mythos of Model Interpretability.” ACM Queue, vol. 61, no. 10, 2018, pp. 36–43.
      8. Adversarial Robustness through local lipschitzness – https://ucsdml.github.io/jekyll/update/2020/05/04/adversarial-robustness-through-local-lipschitzness.html

    Sebastian Fischer: Back to the Basics: Offline Reinforcement Learning with Least-Squares Methods for Policy Iteration [Master Thesis @LMU]

      Recently, offline (sometimes also called ‘batch’) Reinforcement Learning (RL) algorithms have gained significant research traction [1]. The reason behind this is that – unlike in the classical Reinforcement Learning formulation – they do not require interaction with the environment. Instead, when provided with logged data they are able to generate a better policy than the one used to collect the data. Obviously, this has significant practical applications, e.g., we could learn a better driving policy using only recorded data from real drivers.

      Even though offline RL algorithms are appealing, they can be quite unstable during training. The main reason behind this instability is the bootstrapping error [4]. Here, the problem is that the state-action values can be overestimated in regions far from the actually experienced data, which leads to erroneous target updates [3].

      In value-based RL, two ways to mitigate this problem have been proposed. The first approach tries to penalize policies (or, to be more precise, actions selected in specific states) that are not included in the original data (or whose probability according to the data is very low) [2,3]. A different approach uses an ensemble of models and, through a form of dropout, ensures that outlier values of the state-action value function are cancelled out.

      Even though these approaches are in line with recent deep RL algorithms, we have to consider that there is a family of classical offline RL methods (e.g., see [5,6]) that have shown remarkable results and stable performance, even though they are based on linear function approximation of the state-action value function. So the question here is whether we can combine the theory and findings of classical algorithms with modern deep learning and reinforcement learning techniques.

      The main goal of this thesis is to analyze the properties of the Least-Squares Policy Iteration (LSPI) algorithm [5] and to design a new deep offline RL algorithm based on key properties and insights of LSPI. For example, Radial Basis Function (RBF) approximators can be used to avoid the extrapolation error in the state-action value function, the widely used Q-learning loss function could be replaced by the LSPI state-action approximator weight update, the benefits of having a multi-head network that provides different weights for each set of actions could be explored, and the added benefits of pre-training with Behavioral Cloning (BC) should be determined.
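      For reference, the inner LSTD-Q step of LSPI [5] can be sketched as a closed-form weight update computed from the fixed batch of samples. The feature map `phi` (e.g., an RBF expansion over state-action pairs) and the greedy `policy` are placeholders; the ridge term is an implementation assumption.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.99, ridge=1e-3):
    """LSTD-Q sketch (the inner step of LSPI): solve for weights w such that
    Q(s, a) ~ phi(s, a) @ w, using only a fixed batch of transitions."""
    k = phi(*samples[0][:2]).shape[0]
    A = ridge * np.eye(k)               # small ridge term keeps A invertible
    b = np.zeros(k)
    for s, a, r, s_next, done in samples:
        f = phi(s, a)
        f_next = np.zeros(k) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

# LSPI alternates: w = lstdq(...); policy(s) = argmax over a of phi(s, a) @ w
```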

      The new algorithm should be benchmarked against state-of-the-art results [2,3,4], both in classical control environments and on the offline Atari games dataset [7]. In addition, the possibility of extending the algorithm to continuous action spaces should be explored.

      Supervisors: Dr. Georgios Kontes, Dr.-Ing. Christopher Mutschler

      References

      1. Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
      2. Fujimoto, S., Meger, D., & Precup, D. (2019, May). Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning (pp. 2052-2062).
      3. Kumar, A., Fu, J., Soh, M., Tucker, G., & Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems (pp. 11784-11794).
      4. Agarwal, R., Schuurmans, D., & Norouzi, M. (2020). An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning.
      5. Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of machine learning research, 4(Dec), 1107-1149.
      6. Nedić, A., & Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2), 79-110.
      7. Gulcehre, C., Wang, Z., Novikov, A., Paine, T. L., Colmenarejo, S. G., Zolna, K., … & Dulac-Arnold, G. (2020). RL Unplugged: Benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888.

     

    Finished

    David Kießling: Real-Time Non-Linear Model Predictive Control for Collision Avoidance of Vehicles [Master Thesis @ FAU]

    Leonid Butyrev: Overcoming Catastrophic Forgetting in Deep Reinforcement Learning [Master Thesis @ FAU]

    Andreas Mühlroth: Transparent and Interpretable Reinforcement Learning [Master Thesis @ FAU]

    Tim Nisslbeck: Learning a Car Driving Simulator to enable Deep Reinforcement Learning [Master Thesis @ FAU]

    Leonid Butyrev: Exploiting Reinforcement Learning for Complex Trajectory Planning for Mobile Robots [Bachelor Thesis @ FAU]