The idea behind significantly improving the performance on hard real-world tasks is to train a value function, condition the model on advantages computed from the value function, and running an iterative improvement loop where the model learns from it’s own data.
Research Scientist @physical_int.
Formerly Google DeepMind