Offline (sometimes also called ‘batch’) Reinforcement Learning (RL) algorithms have recently gained significant research traction. The reason is that, unlike the classical RL formulation, they do not require interaction with the environment. Instead, given only logged data, they aim to learn a better policy than the one used to collect the data. This has significant practical applications; for example, we could learn a better driving policy using only recorded data from real drivers.
Although offline RL algorithms are appealing, they can be quite unstable during training. The main reason for this instability is the bootstrapping error: state-action values can be overestimated in regions far from the data that was actually experienced, and because these estimates are used to form the update targets, the targets themselves become erroneous.
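To make the source of the error concrete, the standard Q-learning target and loss can be written as below. The notation is an assumption introduced here for illustration: Q_θ is the learned state-action value function, D is the logged dataset, and γ is the discount factor. The maximization ranges over all actions a′, including actions that never appear in D for the state s′, so an overestimated value at such an action feeds directly into the target y.

```latex
y = r + \gamma \max_{a'} Q_\theta(s', a'),
\qquad
\mathcal{L}(\theta) =
  \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
  \left[ \left( Q_\theta(s, a) - y \right)^2 \right]
```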
In value-based RL, two ways to mitigate this problem have been proposed. The first approach penalizes policies (or, to be more precise, the actions selected in specific states) that are not represented in the original data, or whose actions have very low probability according to the data [2,3]. The second approach employs an ensemble of models and, through a form of dropout, ensures that outlier values of the state-action value function cancel out.
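Both ideas can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the exact methods of the cited papers: the first function restricts greedy action selection to actions that are sufficiently likely under an estimated behavior policy, and the second mixes the heads of an ensemble with random convex weights, which is one possible reading of "a form of dropout". The function names, the `threshold` parameter, and the example numbers are all hypothetical.

```python
import numpy as np


def constrained_greedy_action(q_values, behavior_probs, threshold=0.3):
    """Greedy action among actions that are well supported by the data.

    An action is allowed only if its (estimated) behavior-policy probability
    is at least `threshold` times that of the most likely action.
    """
    q_values = np.asarray(q_values, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    allowed = behavior_probs / behavior_probs.max() >= threshold
    # Unsupported actions can never win the argmax.
    masked_q = np.where(allowed, q_values, -np.inf)
    return int(np.argmax(masked_q))


def random_ensemble_q(q_heads, rng):
    """Random convex combination of ensemble heads.

    `q_heads` has shape (n_heads, n_actions); the result has shape (n_actions,).
    """
    q_heads = np.asarray(q_heads, dtype=float)
    weights = rng.random(q_heads.shape[0])
    weights /= weights.sum()  # non-negative weights that sum to one
    return weights @ q_heads


# Action 1 has the highest Q-value but is almost absent from the data.
q = [1.0, 5.0, 2.0]
probs = [0.6, 0.01, 0.39]
unconstrained = int(np.argmax(q))
constrained = constrained_greedy_action(q, probs)
```

In this toy example the unconstrained greedy choice is the rarely observed action, while the constrained choice falls back to the best action that the data actually supports.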
Although these approaches are in line with recent deep RL algorithms, there is also a family of classical offline RL methods (see, e.g., [5,6]) that have shown remarkable results and stable performance, even though they are based on linear function approximation of the state-action value function. The question, then, is whether we can combine the theory and findings of these classical algorithms with modern deep learning and reinforcement learning techniques.
The main goal of this thesis is to analyze the properties of the Least Squares Policy Iteration (LSPI) algorithm and to design a new offline deep RL algorithm based on the key properties and insights of LSPI. Several directions should be investigated. Radial Basis Function (RBF) approximators can be used to avoid extrapolation error in the state-action value function. The widely used Q-learning loss function could be replaced by the LSPI weight update for the state-action value approximator. The benefits of a multi-head network that provides different weights for each set of actions could be explored. Finally, the added benefit of pre-training with Behavioral Cloning (BC) should be determined.
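As a point of reference for the LSPI weight update mentioned above, the following is a minimal sketch of the classical linear algorithm with RBF features, following the least-squares solve described by Lagoudakis and Parr. It is not the proposed deep algorithm. The feature layout (one block of RBFs plus a bias term per discrete action), the ridge term `reg`, and all function and parameter names are assumptions made for this illustration.

```python
import numpy as np


def rbf_features(state, action, centers, width, n_actions):
    """Block features: one bias term plus one RBF per center, per action."""
    state = np.atleast_1d(np.asarray(state, dtype=float))
    sq_dist = np.sum((centers - state) ** 2, axis=1)
    block = np.concatenate(([1.0], np.exp(-sq_dist / (2.0 * width ** 2))))
    phi = np.zeros(n_actions * block.size)
    phi[action * block.size:(action + 1) * block.size] = block
    return phi


def lstdq(transitions, policy, featurize, n_features, gamma=0.99, reg=1e-3):
    """Solve A w = b for the weights of Q under `policy` from logged data."""
    A = reg * np.eye(n_features)  # small ridge term keeps A invertible
    b = np.zeros(n_features)
    for s, a, r, s_next, done in transitions:
        phi = featurize(s, a)
        if done:
            phi_next = np.zeros(n_features)  # no bootstrapping past terminals
        else:
            phi_next = featurize(s_next, policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)


def lspi(transitions, featurize, n_features, n_actions,
         gamma=0.99, n_iters=20, tol=1e-6):
    """Alternate policy evaluation (LSTDQ) and greedy policy improvement."""
    w = np.zeros(n_features)
    for _ in range(n_iters):
        def greedy(s, w=w):
            return int(np.argmax([featurize(s, a) @ w
                                  for a in range(n_actions)]))
        w_new = lstdq(transitions, greedy, featurize, n_features, gamma)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

A usage example would pass a list of `(state, action, reward, next_state, done)` tuples together with a feature function such as `lambda s, a: rbf_features(s, a, centers, width, n_actions)`. Because the weights come from a closed-form solve over the whole dataset rather than from gradient steps on a bootstrapped loss, there is no learning rate to tune, which is one reason the classical method tends to be stable.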
The new algorithm should be benchmarked against state-of-the-art results [2,3,4], both in classical control environments and on the offline Atari games dataset. In addition, the possibility of extending the algorithm to continuous action spaces should be explored.
- Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- Fujimoto, S., Meger, D., & Precup, D. (2019, May). Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning (pp. 2052-2062).
- Kumar, A., Fu, J., Soh, M., Tucker, G., & Levine, S. (2019). Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems (pp. 11784-11794).
- Agarwal, R., Schuurmans, D., & Norouzi, M. (2020). An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning.
- Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec), 1107-1149.
- Nedić, A., & Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2), 79-110.
- Gulcehre, C., Wang, Z., Novikov, A., Paine, T. L., Colmenarejo, S. G., Zolna, K., … & Dulac-Arnold, G. (2020). RL Unplugged: Benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888.