I still maintain that reflective learning is the future of learning algorithms. This is related to, but quite a bit richer than, thinking about how to make value functions that work.
The problem with conventional supervised and reinforcement learning is that the system gets subjected to gradient updates it didn’t curate from experiments it didn’t design. If we’re all about “scaling thinking time”, shouldn’t the agent think much harder about how it learns?
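To make the contrast concrete, here is a minimal toy sketch (my own illustration, not anyone's published algorithm), assuming a linear model, squared loss, and a small "probe" set the agent uses to judge its own updates: the passive learner accepts every gradient step it is handed, while the reflective one simulates each candidate step and only keeps the ones that actually help.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def grad(w, x, y):
    # Gradient of squared error for a single example x.
    return 2 * (x @ w - y) * x

# Toy task: noisy pool of training examples, clean probe set for self-evaluation.
w_true = np.array([2.0, -1.0])
X_pool = rng.normal(size=(200, 2))
y_pool = X_pool @ w_true + rng.normal(scale=2.0, size=200)
X_probe = rng.normal(size=(32, 2))
y_probe = X_probe @ w_true

lr = 0.05

# Conventional learner: subjected to every gradient update it is handed.
w_passive = np.zeros(2)
for x, t in zip(X_pool, y_pool):
    w_passive -= lr * grad(w_passive, x, t)

# "Reflective" learner (crude stand-in): simulates each candidate update and
# only accepts it if it reduces loss on the probe set, i.e. it curates which
# updates it lets into its own weights.
w_reflective = np.zeros(2)
for x, t in zip(X_pool, y_pool):
    candidate = w_reflective - lr * grad(w_reflective, x, t)
    if loss(candidate, X_probe, y_probe) < loss(w_reflective, X_probe, y_probe):
        w_reflective = candidate

print("passive probe loss:   ", loss(w_passive, X_probe, y_probe))
print("reflective probe loss:", loss(w_reflective, X_probe, y_probe))
```

This is the thinnest possible version of "thinking about how it learns"; the interesting versions would also design the experiments that generate the data in the first place.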