Thorsten Bescher: Schedule-Net and Gumbel-Alpha-Zero Play-to-Plan: a Hybrid Approach for the Job-Shop Scheduling Problem (FAU Erlangen-Nürnberg, 2024)

Motivation / Related Work

Recently, reinforcement learning (RL) has been applied frequently to combinatorial optimization problems [1] with real-world applications, e.g., the job-shop scheduling problem (JSSP) or the travelling salesman problem (TSP). Of special interest are RL approaches that solve these problems end-to-end, i.e., where the agent is trained from scratch without relying on heuristics or solver-generated solutions.

For example, ScheduleNet [2] solves multi-agent scheduling problems with a graph neural network (GNN) architecture. After training, the approach can be applied to problems of arbitrary size (including sizes larger than those seen during training). It was shown that ScheduleNet outperforms many heuristics and other RL-based approaches for the JSSP.
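To make the message-passing idea concrete, the following minimal sketch shows a single GNN layer operating on node embeddings of a scheduling graph. It is purely illustrative: the layer, its dimensions, and the dense adjacency representation are simplifying assumptions and do not reproduce the actual type-aware ScheduleNet layer from [2].

import torch
import torch.nn as nn

class GraphMessageLayer(nn.Module):
    """One round of message passing; a stand-in for a ScheduleNet-style layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.message_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, dim) node embeddings; adj: (num_nodes, num_nodes) 0/1 mask.
        n, d = h.shape
        src = h.unsqueeze(0).expand(n, n, d)  # sender embedding for each edge (i, j)
        dst = h.unsqueeze(1).expand(n, n, d)  # receiver embedding for each edge (i, j)
        msgs = self.message_mlp(torch.cat([dst, src], dim=-1))
        agg = (adj.unsqueeze(-1) * msgs).sum(dim=1)  # sum messages over neighbors
        return self.update_mlp(torch.cat([h, agg], dim=-1))

# Usage: six operation nodes with a random adjacency standing in for a
# disjunctive JSSP graph; a size-agnostic layer like this is what allows
# generalization to instances larger than those seen during training.
layer = GraphMessageLayer(dim=16)
h = torch.randn(6, 16)
adj = (torch.rand(6, 6) > 0.5).float()
h_next = layer(h, adj)  # (6, 16)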

Another approach that shows promising results for the JSSP is Gumbel AlphaZero Play-to-Plan (GAZ PTP) [3]. It addresses the issue that AlphaZero-type algorithms may stop improving on single-player tasks when the value network guiding the tree search is unable to approximate the outcome of an episode sufficiently well. To overcome this, a historical baseline is incorporated into the planning process: building on the recently introduced Gumbel AlphaZero [4], the agent learns to find strong trajectories by planning against possible strategies of its past self. With a transformer architecture, the approach was shown to outperform various single-player variants based on [4].
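The self-competition signal itself is compact: at the end of an episode, the learner's objective value is compared with that of a greedy rollout of a frozen past checkpoint on the same instance, and only the sign of the difference is rewarded. The following sketch is a simplified reading of [3]; the function name and the ternary reward are illustrative assumptions, not the code-base API.

def self_competition_reward(learner_makespan: float,
                            baseline_makespan: float) -> float:
    """Ternary terminal reward: +1 if the learner beats its past self,
    -1 if it is worse, 0 on a tie (JSSP makespans are minimized)."""
    if learner_makespan < baseline_makespan:
        return 1.0
    if learner_makespan > baseline_makespan:
        return -1.0
    return 0.0

# Usage: the baseline makespan stems from a deterministic rollout of a
# historical copy of the agent on the same JSSP instance.
reward = self_competition_reward(learner_makespan=812.0,
                                 baseline_makespan=845.0)  # -> 1.0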

Overall goal

The overall goal of this thesis is to implement a hybrid approach based on ScheduleNet and GAZ PTP. In particular, the costly transformer architecture of GAZ PTP is to be replaced by the GNN architecture used in ScheduleNet. The resulting framework will thus combine the scalability of ScheduleNet with the planning capabilities of GAZ PTP.
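What such a replacement has to provide is fixed by the search: a Gumbel-AlphaZero-style planner queries a policy (a distribution over legal actions) and a scalar value at every tree node. The sketch below shows one plausible way to attach these two heads to a graph encoder; the class names, the stand-in encoder, and the per-node action logits are illustrative assumptions, not the thesis design.

import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in graph encoder; in the thesis this slot would be filled by
    the ScheduleNet-style GNN."""
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, h, adj):
        return torch.relu(self.lin(adj @ h) + h)

class PolicyValueGNN(nn.Module):
    """Wraps a graph encoder with the policy/value heads a planner queries."""
    def __init__(self, encoder: nn.Module, dim: int):
        super().__init__()
        self.encoder = encoder
        self.policy_head = nn.Linear(dim, 1)  # one logit per node (= per action)
        self.value_head = nn.Linear(dim, 1)   # scalar value from pooled embedding

    def forward(self, h, adj, action_mask):
        z = self.encoder(h, adj)                  # (num_nodes, dim)
        logits = self.policy_head(z).squeeze(-1)  # (num_nodes,)
        logits = logits.masked_fill(~action_mask, float("-inf"))
        value = self.value_head(z.mean(dim=0))    # graph-level value
        return logits, value

# Usage: mask out nodes whose operations cannot be dispatched yet.
net = PolicyValueGNN(TinyEncoder(16), dim=16)
h, adj = torch.randn(6, 16), (torch.rand(6, 6) > 0.5).float()
mask = torch.tensor([True, True, False, True, False, False])
logits, value = net(h, adj, mask)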

In a first step, the student will familiarize themselves with [2] and implement a basic version of the approach, which will then be refined stepwise towards ScheduleNet. The resulting neural network architecture is to be integrated into an existing code-base for GAZ PTP [3]. Using this framework, the student will run experiments to train an agent on JSSP instances of sizes similar to those considered in [2] and [3]. In a final step, the student will analyze and refine the framework with respect to its hyperparameters. Within this analysis, special attention should be paid to the influence of the tree search depth on the overall performance, since GAZ approaches are known to work well with very few tree search simulations [4].
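One reason a small simulation budget can already work well is the Gumbel-Top-k trick that Gumbel AlphaZero [4] applies at the root: perturbing the policy logits with Gumbel noise and keeping the top k yields k distinct actions sampled without replacement, so even a handful of simulations covers a policy-guided, non-redundant subset of moves. The sketch below illustrates only this sampling step, not the full search.

import numpy as np

def gumbel_top_k(logits: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Indices of k actions sampled without replacement (Gumbel-Top-k trick)."""
    gumbels = rng.gumbel(size=logits.shape)
    return np.argsort(logits + gumbels)[::-1][:k]

# Usage: with a small simulation budget, only the k sampled root actions
# are expanded further by the search.
rng = np.random.default_rng(0)
logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))  # root policy over 4 actions
chosen = gumbel_top_k(logits, k=2, rng=rng)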

Timetable (6 months, 24 person weeks (PW))

3 PW               Familiarization with relevant work in the subject areas and literature, especially [2] and [3].

7 PW               Implementation of the approach described in [2] in Python (starting from a basic GNN and refining it stepwise towards the ScheduleNet architecture).

4 PW               Integration of the GNN architecture of the previous step into the existing codebase for GAZ PTP [3].

4 PW               Experiments and refinement of the method w.r.t. hyperparameters (especially the tree search depth).

6 PW               Writing of the thesis.

Expected Results and Scientific Contributions

  • A Python implementation of the RL framework described in [2] that can be used to approach JSSPs of variable size. In particular, the implementations of the environment and the GNN should be modular so that they can be used on top of the existing code-base for GAZ PTP [3] (see the interface sketch after this list).
  • A Python implementation of a hybrid framework between ScheduleNet and GAZ PTP (code-base already exists).
  • An evaluation report that analyzes and compares the performance of the resulting hybrid framework with the results reported in [2] and [3]. Additionally, the report should analyze the influence of the tree search depth on the performance of the hybrid framework.
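As a point of orientation for the required modularity, the following sketch shows what a minimal, self-contained JSSP environment interface could look like (dispatching one job at a time and tracking machine and job availability). All names and the interface itself are hypothetical assumptions; the actual GAZ PTP [3] code-base may expose a different API.

from dataclasses import dataclass

@dataclass
class JSSPState:
    """Partial schedule."""
    next_op: list          # per job: index of its next unscheduled operation
    machine_free_at: list  # earliest free time per machine
    job_free_at: list      # earliest time each job's next operation may start

class JSSPEnv:
    """Steps through a JSSP instance one dispatching decision at a time."""

    def __init__(self, durations, machines):
        # durations[j][o] / machines[j][o]: processing time and machine of
        # operation o of job j.
        self.durations, self.machines = durations, machines

    def reset(self) -> JSSPState:
        n_jobs = len(self.durations)
        n_machines = 1 + max(m for row in self.machines for m in row)
        return JSSPState([0] * n_jobs, [0.0] * n_machines, [0.0] * n_jobs)

    def legal_actions(self, s: JSSPState) -> list:
        return [j for j, o in enumerate(s.next_op) if o < len(self.durations[j])]

    def step(self, s: JSSPState, job: int) -> JSSPState:
        o = s.next_op[job]
        m = self.machines[job][o]
        start = max(s.machine_free_at[m], s.job_free_at[job])
        end = start + self.durations[job][o]
        next_op = list(s.next_op); next_op[job] = o + 1
        mfa = list(s.machine_free_at); mfa[m] = end
        jfa = list(s.job_free_at); jfa[job] = end
        return JSSPState(next_op, mfa, jfa)

# When no legal actions remain, max(s.machine_free_at) is the makespan.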

Supervisors: Dr. Christopher Mutschler (Fraunhofer IIS), Dr. Quirin Göttl (Fraunhofer IIS), Prof. Dr. Björn Eskofier (FAU)

References

[1]       Mazyavkina, N., Sviridov, S., Ivanov, S., Burnaev, E. 2021. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134, 105400.

[2]       Park, J., Bakhtiyar, S., Park, J. 2021. ScheduleNet: Learn to solve multi-agent scheduling problems with reinforcement learning. arXiv:2106.03051v1.

[3]       Pirnay, J., Göttl, Q., Burger, J., Grimm, D.G. 2023. Policy-Based Self-Competition for Planning Problems. ICLR.

[4]       Danihelka, I., Guez, A., Schrittwieser, J., Silver, D. 2022. Policy improvement by planning with Gumbel. ICLR.