Safe Policy Improvement with Baseline Bootstrapping
A common goal in Reinforcement Learning is to derive a good policy from a limited batch of data. In this paper, we propose a new approach for computing a safe policy, guaranteed to perform at least as well as a given baseline policy. We argue that the assumptions made in previous work are too strong for real-world applications, and we propose new algorithms that require those assumptions to hold only on a subset of the state-action pairs. While significantly relaxing the assumptions, our algorithms achieve the same accuracy guarantees as previous work and are also much more computationally efficient. We further show that the algorithms can be adapted to model-free Reinforcement Learning.
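To make the core idea concrete, the following is a minimal sketch of one way a baseline-bootstrapped improvement step could look: the candidate policy is allowed to deviate from the baseline only on state-action pairs that are well covered by the batch, and copies the baseline elsewhere. The count threshold `n_min`, the array shapes, and all function and variable names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def safe_policy_improvement(q, baseline, counts, n_min):
    """Greedy one-step improvement that falls back to the baseline
    on state-action pairs with too little data.

    q        : (S, A) action-value estimates from the batch
    baseline : (S, A) baseline policy probabilities (rows sum to 1)
    counts   : (S, A) state-action visit counts in the batch
    n_min    : count below which a pair is treated as uncertain
    """
    n_states, _ = q.shape
    new_policy = np.zeros_like(baseline)
    for s in range(n_states):
        uncertain = counts[s] < n_min
        # Keep the baseline's probability mass on uncertain pairs.
        new_policy[s, uncertain] = baseline[s, uncertain]
        # Reassign the remaining mass greedily among well-estimated actions.
        free_mass = baseline[s, ~uncertain].sum()
        if free_mass > 0:
            safe_actions = np.flatnonzero(~uncertain)
            best = safe_actions[np.argmax(q[s, safe_actions])]
            new_policy[s, best] += free_mass
    return new_policy
```

Each row of the returned policy still sums to one: probability mass is only moved between actions whose value estimates are supported by enough data, which is what lets the relaxed assumptions apply only on that well-covered subset of state-action pairs.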