While reinforcement learning (RL) has been applied to a wide range of challenging domains, from game playing to real-world applications such as effective canal control, more widespread deployment in the real world is hampered by the lack of guarantees provided with the learned policies. Although some RL algorithms are guaranteed to converge in the limit in the discrete setting (and even in some continuous settings with function approximation), these guarantees are lost when more advanced techniques such as deep RL, which rely on general nonlinear function approximators to handle continuous Markov decision processes (MDPs), are applied. In this work, we apply such advanced RL algorithms to unknown continuous MDPs with (safety-constrained) reachability or discounted-reward objectives, and we consider the challenge of simplifying and verifying the learned policies. Our goal is to enable model checking by learning an accurate, tractable model of the environment.
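To make the model-checking step concrete, the following is a minimal, hypothetical sketch: given a small learned MDP abstraction (the three states and transition matrices below are illustrative stand-ins, not a model from this work), it computes the maximal probability of reaching a goal state by value iteration and checks it against a threshold, as a probabilistic reachability query would.

```python
import numpy as np

# States: 0 = initial, 1 = unsafe trap (absorbing), 2 = goal (absorbing).
# P[a][s, s'] = probability of moving from s to s' under action a.
# These numbers are made up for illustration only.
P = {
    "a": np.array([[0.5, 0.2, 0.3],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]]),
    "b": np.array([[0.6, 0.3, 0.1],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]]),
}
GOAL = 2

def max_reach_prob(P, goal, iters=200):
    """Value iteration for v(s) = max_a sum_{s'} P(s' | s, a) * v(s'),
    with v(goal) pinned to 1: the maximal reachability probability."""
    n = next(iter(P.values())).shape[0]
    v = np.zeros(n)
    v[goal] = 1.0
    for _ in range(iters):
        v = np.max(np.stack([M @ v for M in P.values()]), axis=0)
        v[goal] = 1.0
    return v

v = max_reach_prob(P, GOAL)
print(v[0])          # maximal probability of eventually reaching the goal
print(v[0] >= 0.5)   # an example model-checking query: P_max(reach goal) >= 0.5
```

In practice the same query would be posed to a probabilistic model checker over the learned abstraction; the fixed-point iteration above is the textbook computation such tools perform for reachability objectives.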