Out of the box, with no game-specific RL or tuning? No way. I'd be genuinely shocked. As an interaction problem, this is way, way harder than driving or humanoid control. And most of the game-LLM results so far have been bullshit: Voyager for Minecraft calls high-level actions like "go mine a coal" and leans on tons of public example scripts. This would be a standalone, from-scratch RL task.

Chess just so happens to record games in the exact text format you'd want for LLMs, and there are lots of them. When you don't have that but do have access to a sim? Small-model RL crushes. We have several examples of superhuman play trained in seconds on a single GPU at https://t.co/wPfmdJfe1d. It's not just games, either: most of the fancy sims we build for clients end up being easier to RL than even relatively simple games.

In my mind, the best result in our field by far was OpenAI Five: beat top pros at Dota 2 with ~1,000 GPUs. You could probably do it with 64-256 H100s now. The CPUs are a real killer, but hey, that's why we build fast custom sims for problems we really care about. We continually see RL come up with solutions that I don't see how in hell an LLM will ever just zero-shot. Interaction is fundamental to intelligence.

If you RL-finetune an LLM by playing the game? Sure, and it will be more sample-efficient than training from scratch, but vastly compute-inefficient. We've got pretty good evidence that scaling laws in RL tend toward way smaller model sizes and way more data. This is the bet I've made in my own research, and so far so good.

So what if you really wanted to use Grok for an impressive RL result? My bet is on bridging the gap between giant and tiny models: take almost all (>>99.9%) of the actions during training with the small model, play lots of games, and use the large model to guide exploration. There are some results on this already in games and robotics, but nothing really satisfactory yet.
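To make the "fast sim + tiny model" point concrete, here's a minimal sketch: tabular Q-learning on a toy 5x5 gridworld. The env, reward shaping, and all hyperparameters are my own stand-ins (nothing here is from the thread or the linked repo), but it shows the shape of the argument: when the sim is cheap, a tiny learner reaches the optimal policy in a fraction of a second on one CPU core.

```python
import random

# Hypothetical toy sim: 5x5 grid, start at (0,0), goal at (4,4).
SIZE = 5
GOAL = (SIZE - 1, SIZE - 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def env_step(state, a):
    """Deterministic sim step: move, clip to the grid, small step penalty."""
    r, c = state
    dr, dc = MOVES[a]
    ns = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    done = ns == GOAL
    return ns, (1.0 if done else -0.01), done

def train(episodes=2000, alpha=0.5, gamma=0.99, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in range(4)}
    for _ in range(episodes):
        s, done = (0, 0), False
        for _ in range(200):  # per-episode step cap
            if done:
                break
            # epsilon-greedy behavior policy
            if rng.random() < eps:
                a = rng.randrange(4)
            else:
                a = max(range(4), key=lambda b: Q[(s, b)])
            ns, rew, done = env_step(s, a)
            target = rew + (0.0 if done else gamma * max(Q[(ns, b)] for b in range(4)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = ns
    return Q

def greedy_path_len(Q, max_steps=50):
    """Roll out the greedy policy and count steps to the goal."""
    s, n = (0, 0), 0
    while s != GOAL and n < max_steps:
        s, _, _ = env_step(s, max(range(4), key=lambda b: Q[(s, b)]))
        n += 1
    return n

Q = train()
print(greedy_path_len(Q))  # 8: the shortest path on a 5x5 grid
```

Obviously this is many orders of magnitude simpler than Dota, but the same economics hold: the per-step cost is the sim plus a tiny policy, so you can afford millions of interactions where an LLM-in-the-loop could afford thousands.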
I'm not working on it because imo the small-model RL side has way more potential right now w/ a very clear path forward at even small scale.
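For what the "large model guides exploration" bet could look like, here's a self-contained sketch on the same kind of toy gridworld. Everything is hypothetical: the `teacher_action` stub stands in for a query to the big model, and the `ask_rate` is wildly exaggerated relative to the >>99.9% small-model split above so the effect shows up at toy scale. The small model (a Q-table here) takes the vast majority of actions; the teacher is only consulted on rare exploration steps.

```python
import random

# Hypothetical toy sim, same shape as a gridworld: start (0,0), goal (4,4).
SIZE = 5
GOAL = (SIZE - 1, SIZE - 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def env_step(state, a):
    r, c = state
    dr, dc = MOVES[a]
    ns = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    return ns, (1.0 if ns == GOAL else -0.01), ns == GOAL

def teacher_action(state):
    """Stub for the expensive large-model query: nudge toward the goal."""
    r, _ = state
    return 1 if r < GOAL[0] else 3  # move down, else right

def train(episodes=500, alpha=0.5, gamma=0.99, ask_rate=0.05, seed=0):
    rng = random.Random(seed)
    Q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in range(4)}
    for _ in range(episodes):
        s, done = (0, 0), False
        for _ in range(200):  # per-episode step cap
            if done:
                break
            if rng.random() < ask_rate:
                a = teacher_action(s)  # rare, expensive teacher query
            else:
                a = max(range(4), key=lambda b: Q[(s, b)])  # cheap small model
            ns, rew, done = env_step(s, a)
            target = rew + (0.0 if done else gamma * max(Q[(ns, b)] for b in range(4)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = ns
    return Q

def reaches_goal(Q, max_steps=200):
    s, n = (0, 0), 0
    while s != GOAL and n < max_steps:
        s, _, _ = env_step(s, max(range(4), key=lambda b: Q[(s, b)]))
        n += 1
    return s == GOAL
```

The design choice worth noticing: the teacher only shapes the *behavior* policy, while the value estimates (and the final deployed policy) belong entirely to the small model, so the big model's cost amortizes away after training.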