The key to efficient learning is not the verifiability of the final result; it's getting feedback at every step. E.g., when you drive a car, every moment you observe the difference between your predictions and what really happened. In domains like games, coding and math, your actions have deterministic results and no such learning occurs (or is needed). RL has overfit to those.