2021: Learning to Avoid your Supervisor [Matthias Gruber @ LMU München]

Reinforcement learning (RL) is applicable to diverse tasks across many domains, and on several problems it has achieved superhuman performance [5]. However, the black-box neural networks used by modern RL algorithms are hard to interpret, and their predictions are not easily explainable. Furthermore, except for some limited architectures and settings [3], neither their correctness nor their safety can be verified, so no safety guarantee can be given for their operation. This limits the application of reinforcement learning in domains where such guarantees are an essential requirement, e.g., autonomous vehicles or medical tasks.

For many domains, safety regulations are in place that define “safe” working conditions. For factory robots moving in a hall, this might be a minimum distance from any humans or unidentified obstacles. Strict adherence to these regulations is often required in such settings. However, this also limits the application of many modern reinforcement learning techniques, as it is usually hard (if not intractable) to prove that the agent acts correctly in all possible situations. Existing approaches [1, 3] obtain guarantees only by limiting the capabilities of the agent (e.g., by using decision trees or only small DNNs as policy models).

One approach to ensuring the safety of a black-box model is to implement a “supervisor” that monitors the safety regulations and brings the system to a non-catastrophic stop when they are about to be violated. An example of this would be emergency brakes in an autonomous robot that automatically stop the robot if an obstacle is within a safety perimeter. As long as the supervisor does not detect a critical situation, the black-box agent can implement any policy and maximize its reward in the environment. Such supervisors have been constructed manually [1], synthesized from logical specifications [10], learned, either from previous violations [15] or from world models and a continuity assumption [11, 12], and derived from Lyapunov functions [13, 14].
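The emergency-brake example can be written down as a minimal rule-based supervisor. This is only a sketch under simplifying assumptions: the class name `DistanceSupervisor`, the `BRAKE` action, and the scalar velocity action are hypothetical and not taken from any of the cited works.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical "safe" action: a commanded velocity of zero (assumption).
BRAKE = 0.0


@dataclass
class DistanceSupervisor:
    """Rule-based supervisor that brakes inside a safety perimeter."""

    safety_perimeter: float  # minimum allowed obstacle distance (assumed unit: metres)

    def filter(self, obstacle_distance: float, proposed_action: float) -> Tuple[float, bool]:
        """Return the action to execute and whether the supervisor intervened."""
        if obstacle_distance < self.safety_perimeter:
            # Critical situation: override the black-box agent.
            return BRAKE, True
        # No critical situation: the agent's action passes through unchanged.
        return proposed_action, False


supervisor = DistanceSupervisor(safety_perimeter=1.0)
print(supervisor.filter(obstacle_distance=3.0, proposed_action=2.0))
print(supervisor.filter(obstacle_distance=0.5, proposed_action=2.0))
```

The supervisor never inspects the policy itself, only the observed distance and the proposed action, which is why the agent behind it can remain a black box.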

Simply adding a supervisor on top of an agent that is already trained, however, might not be an optimal solution: a black-box policy might use smaller safety margins to empirically maximize the reward, i.e., take more risks than needed. Such a policy would trigger frequent interventions by the supervisor, which (by employing a “hard” safety stop, for example) would reduce the reward of the agent. A promising idea, then, is to train a reinforcement learning algorithm in the “safe” environment defined by the original environment together with the supervisor, and to penalize the agent whenever an intervention is needed. The agent thus learns to avoid critical situations and hence learns a policy that takes the specified safety regulations into account. The supervised agent is trivially safe, as the supervisor ensures that the safety regulations are upheld, and it should achieve a higher reward than a purely white-box solution.
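The idea of training inside the “safe” environment can be illustrated with a small wrapper that combines an environment with a supervisor and subtracts a penalty on every intervention. Everything here is a hypothetical sketch: the `Corridor` toy environment, the `SupervisedEnv` wrapper, the penalty value, and the choice to end the episode after a hard stop are assumptions made for illustration only.

```python
class Corridor:
    """Hypothetical toy environment: drive along a line toward a wall."""

    def __init__(self, wall: float = 10.0):
        self.wall = wall
        self.pos = 0.0

    def reset(self) -> float:
        self.pos = 0.0
        return self.pos

    def step(self, velocity: float):
        self.pos += velocity
        reward = velocity             # reward is the progress made this step
        done = self.pos >= self.wall  # reaching the wall ends the episode
        return self.pos, reward, done


class SupervisedEnv:
    """Original environment plus supervisor, with a penalty per intervention."""

    def __init__(self, env: Corridor, perimeter: float = 2.0, penalty: float = 5.0):
        self.env = env
        self.perimeter = perimeter  # safety distance monitored by the supervisor
        self.penalty = penalty      # assumed cost of a supervisor intervention

    def reset(self) -> float:
        return self.env.reset()

    def step(self, velocity: float):
        distance = self.env.wall - self.env.pos
        intervened = distance < self.perimeter
        if intervened:
            velocity = 0.0  # hard safety stop replaces the agent's action
        obs, reward, done = self.env.step(velocity)
        if intervened:
            reward -= self.penalty  # the agent is penalized for needing the stop
            done = True             # assumption: a hard stop ends the episode
        return obs, reward, done, {"intervened": intervened}


env = SupervisedEnv(Corridor(wall=10.0), perimeter=2.0, penalty=5.0)
env.reset()
for _ in range(4):
    print(env.step(3.0))
```

An agent trained against `SupervisedEnv` only ever sees the penalized reward, so a policy that keeps its own margin and avoids the stop collects more reward than one that relies on the supervisor.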

This thesis should investigate the agent-with-a-supervisor scenario in more detail within one or more environments: drone delivery [16], the Flatland train environment [17], AirSim [18], and/or CARLA [19]. First, as a baseline, an agent should be trained in a test environment with and without a (manual) supervisor to evaluate the influence of the supervisor on safety and performance. Then, different modifications of the supervisor and the training procedure should be implemented and evaluated. The supervisor could be based on (i) logical constraints (e.g., getting too close to an obstacle triggers a non-catastrophic stop), (ii) an RL agent that decides between deploying a safe action and deferring to the black-box agent, or (iii) control barrier functions [15]. The idea is to train the algorithm with different minimum safety distances to the obstacles and to allow it to violate them occasionally. For the supervisor, the thesis should determine whether current synthesis options lead to satisfactory results. If the supervisor replaces an unsafe action with a safe one, it must predict whether the action proposed by the black-box agent leads to a safe or an unsafe state. An interesting direction here is the synthesis of a supervisor from static, pre-generated datasets of unsafe situations from one of the environments. For training, different methods should be evaluated that shape the reward of the agent based on the original environment, the current risk, the speed of learning, the number of safety violations, and the number of interventions by the supervisor (e.g., negative curiosity or related ideas). The result should be a pipeline that allows the (safe) training of safe, performant agents, combining the training enhancements and the synthesized supervisor.
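The requirement that a supervisor predicts whether a proposed action leads to a safe state can be sketched as a one-step lookahead. This is a hedged illustration, not the method to be developed in the thesis: the function `shield_action`, the transition model `predict_next`, and the safety predicate `is_safe` are hypothetical placeholders, and in practice the model could be given, learned, or synthesized from data.

```python
from typing import Callable, Tuple, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def shield_action(
    state: State,
    proposed_action: Action,
    fallback_action: Action,
    predict_next: Callable[[State, Action], State],
    is_safe: Callable[[State], bool],
) -> Tuple[Action, bool]:
    """Replace the proposed action if its predicted successor state is unsafe.

    Returns the action to execute and whether the supervisor intervened.
    """
    predicted_state = predict_next(state, proposed_action)
    if is_safe(predicted_state):
        return proposed_action, False
    return fallback_action, True


# Toy usage (all values are assumptions): the state is the distance to an
# obstacle, the action is a velocity, and 1.0 is the minimum safety distance.
def predict_next(distance: float, velocity: float) -> float:
    return distance - velocity


def is_safe(distance: float) -> bool:
    return distance >= 1.0


print(shield_action(3.0, 1.0, 0.0, predict_next, is_safe))
print(shield_action(1.5, 1.0, 0.0, predict_next, is_safe))
```

Unlike a purely reactive stop, this variant intervenes before the safety distance is violated, at the cost of depending on the accuracy of the transition model.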

The milestones of the thesis are as follows:

  • Literature and algorithm review (1 month)
  • Implementation (3 months)
    • Implementation and adaptation of any environments needed
    • Implementation of the baseline training framework
    • Implementation of supervisor synthesis
    • Implementation of training and reward modifications
  • Evaluation (1 month)
    • Evaluation of baseline
    • Evaluation of the full supervisor and training framework
  • Writing (1 month)

Supervisors: Lukas Schmidt, Christopher Mutschler


  1. Bastani, Osbert, et al. “Verifiable Reinforcement Learning via Policy Extraction.” NIPS 2018: The 32nd Annual Conference on Neural Information Processing Systems, 2018, pp. 2494–2504.
  2. Moura, Leonardo De, and Nikolaj Bjørner. “Z3: An Efficient SMT Solver.” TACAS’08/ETAPS’08 Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2008, pp. 337–340.
  3. Katz, Guy, et al. “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks.” International Conference on Computer Aided Verification, 2017, pp. 97–117.
  4. Cuccu, Giuseppe, et al. “Playing Atari with Six Neurons.” Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019, pp. 998–1006.
  5. Mnih, Volodymyr, et al. “Playing Atari with Deep Reinforcement Learning.” ArXiv Preprint ArXiv:1312.5602, 2013.
  6. Vasic, Marko, et al. “MoET: Interpretable and Verifiable Reinforcement Learning via Mixture of Expert Trees.” ArXiv Preprint ArXiv:1906.06717, 2019.
  7. Lipton, Zachary C. “The Mythos of Model Interpretability.” Communications of the ACM, vol. 61, no. 10, 2018, pp. 36–43.
  8. Coppens, Youri, et al. “Distilling Deep Reinforcement Learning Policies in Soft Decision Trees.” Proceedings of the IJCAI 2019 Workshop on Explainable Artificial Intelligence, 2019.
  9. Frosst, Nicholas, and Geoffrey E. Hinton. “Distilling a Neural Network Into a Soft Decision Tree.” CEx@AI*IA, 2017.
  10. Alshiekh et al. “Safe Reinforcement Learning via Shielding.” 2017.
  11. Dalal et al. “Safe Exploration in Continuous Action Spaces.” 2018.
  12. Cheng et al. “End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks.” 2019.
  13. Chow et al. “A Lyapunov-based Approach to Safe Reinforcement Learning.” 2018.
  14. Berkenkamp et al. “Safe Model-based Reinforcement Learning with Stability Guarantees.” 2017.
  15. Srinivasan, Mohit, Amogh Dabholkar, Samuel Coogan, and Patricio A. Vela. “Synthesis of Control Barrier Functions Using a Supervised Machine Learning Approach.” CoRR abs/2003.04950, 2020.
  16. https://github.com/IBM/vsrl-examples/blob/master/examples/EnvBuild/DroneDelivery.md
  17. https://flatland.aicrowd.com/intro.html
  18. https://github.com/Microsoft/AirSim
  19. https://carla.org/