Andrej Karpathy, a co-founder of OpenAI, former director of AI at Tesla, and one of the most influential AI researchers in the world, has just released his 2025 LLM year-end review.

The first major change: a paradigm shift in training methods. Before 2025, training a usable large model involved three steps: pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). That formula had been in place for years, stable and reliable. In 2025 a crucial fourth step was added: RLVR, reinforcement learning from verifiable rewards.

What does that mean? Simply put, the model practices repeatedly in an environment that has "standard answers." In math, an answer is either right or wrong; no human scoring is needed. The same goes for code: either the tests pass or they don't.

How is this fundamentally different from earlier training? SFT and RLHF are essentially imitation: the model learns to reproduce whatever samples humans provide. RLVR is different; it lets the model discover its own problem-solving strategies. It is like learning to swim: before, you watched instructional videos and imitated the strokes; now you are simply thrown into the water, and as long as you reach the other side, how you paddle doesn't matter.

The result? The model "figured out" something that looks like reasoning on its own. It learned to break big problems into smaller steps, and to back up and start over when it goes astray. Humans cannot demonstrate these strategies, because even humans cannot articulate what a "correct thought process" looks like.

This change triggered a chain reaction: the allocation of compute has shifted. Most compute used to go into pre-training; now more and more goes into the RL stage. Parameter counts haven't grown much, but reasoning ability has skyrocketed. OpenAI's o1 was the starting point of this path, and o3 was the inflection point where people really "felt the difference."

There is another new lever: spending more compute at inference time. Letting the model "think longer" produces longer reasoning chains and better results. In effect, the model gained an adjustable knob for how hard it tries.

The second major change: we finally understand what "shape" AI's intelligence has. Karpathy uses a brilliant analogy: we are not "raising animals," we are "summoning ghosts." Human intelligence evolved, and its optimization target was "help the tribe survive in the jungle." The intelligence of large models is trained, and its optimization target is "imitate human text, score points on math problems, and climb benchmark leaderboards." The optimization targets are completely different, so the results are naturally completely different too.

AI's intelligence is therefore "jagged intelligence." In some fields it behaves like an omniscient scholar; in others it makes mistakes a primary-school student wouldn't. One second it is deriving complex formulas for you; the next it is tricked by a simple jailbreak prompt into leaking data.

Why? Because models grow "spikes" wherever rewards are verifiable. Math has standard answers and code can be tested, so progress there is rapid. But in areas like common sense, social interaction, and creativity, "right" is hard to define, which makes it much harder for models to learn efficiently.
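The "verifiable reward" idea at the root of both changes is easy to make concrete. Below is a minimal sketch of what such reward functions can look like; the function names and grading rules are assumptions made for illustration, not taken from any specific lab's training stack.

```python
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, reference_answer: str) -> float:
    # A program, not a person, does the grading: exact match on the
    # final answer. Right is 1.0, wrong is 0.0, no partial credit.
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(generated_code: str, test_suite: str) -> float:
    # Same idea for code: write the candidate program plus its tests
    # to a file and run it. The tests either pass or they don't.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_suite)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0

# In an RLVR loop the model samples many attempts per problem and the
# policy is pushed toward attempts that score 1.0. No human ever grades
# the intermediate reasoning; only the verifiable outcome is scored.
```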
The same dynamic made Karpathy lose faith in benchmarks. The reason is simple: the test sets themselves are "verifiable environments," and a model can be optimized against them. Dominating benchmarks has become an art. It is entirely possible to max out every benchmark and still fall far short of true general intelligence.

The third major change: the LLM application layer has emerged. Cursor became hugely popular this year, but Karpathy argues its greatest significance is not the product itself; it proved the existence of a new species, the "LLM application." People now talk about building "the Cursor for X," which signals that a new software paradigm is forming.

What do these applications do?
1. Context engineering: gather the relevant information, organize it, and feed it to the model.
2. Orchestration of multiple model calls: the backend may juggle a pile of API calls while balancing performance against cost.
3. Purpose-built interfaces for specialized scenarios, letting humans intervene at the key points.
4. An "autonomy slider" for the user: you decide how much the model does on its own (jobs 1, 2, and 4 are sketched in code below).

One question was debated all year: how "thick" is this application layer? Will the model vendors eat all the applications? Karpathy's assessment: model makers train "university graduates with general skills," while LLM applications recruit, train, and deploy those graduates, turning them into professional teams that can work in specific industries. Data, sensors, actuators, feedback loops: that is all application-layer work.
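To make the shape of this layer concrete, here is a minimal sketch of jobs 1, 2, and 4 from the list above, under stated assumptions: `call_model`, the model names, and the 0.5 autonomy threshold are placeholders invented for this sketch, not any real vendor's API.

```python
from dataclasses import dataclass

def call_model(model: str, prompt: str) -> str:
    # Placeholder for a real LLM API call; swap in an actual client here.
    return f"[{model} response to {len(prompt)} chars of prompt]"

@dataclass
class Task:
    query: str
    documents: list[str]   # domain data the application owns
    autonomy: float        # the slider: 0.0 = confirm everything, 1.0 = hands off

def run_task(task: Task) -> str:
    # Job 1, context engineering: select what is relevant and pack it in.
    relevant = [d for d in task.documents
                if any(w in d.lower() for w in task.query.lower().split())]
    prompt = "Context:\n" + "\n".join(relevant) + f"\n\nTask: {task.query}"

    # Job 2, orchestration: a cheap model drafts, an expensive one refines,
    # trading cost against quality across multiple calls.
    draft = call_model("small-fast-model", prompt)
    answer = call_model("large-capable-model", f"Refine this draft:\n{draft}")

    # Job 4, the autonomy slider: below a threshold, a human signs off
    # (job 3 is the interface where that intervention happens).
    if task.autonomy < 0.5:
        print("Proposed answer:\n", answer)
        if input("Accept? [y/n] ").lower() != "y":
            return "escalated to human"
    return answer
```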
The fourth major change: AI has moved into your computer. Claude Code is one of the products that impressed Karpathy most this year. It shows what an "AI agent" should look like: it calls tools, reasons, runs in a loop, and solves complex problems. More importantly, it runs on your machine, using your environment, your data, and your context.

Karpathy believes OpenAI misjudged this. They pointed Codex and their agents at cloud containers, orchestrated from ChatGPT. That looks like a bet on the "AGI endgame," but we are not there yet. In reality, AI capability is uneven, and humans still need to supervise and assist. Putting the agent locally, working alongside the developer, is the more sensible approach right now. Claude Code achieves this with a minimalist command-line interface. AI is no longer just a website you visit; it is a little sprite "living" in your computer. That is a genuinely new paradigm of human-computer interaction.
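To pin down what "calls tools, reasons, and runs in a loop" means, here is a minimal sketch of such an agent loop. It is a guess at the general shape, not Claude Code's actual internals; `call_model`, the message format, and both tools are assumptions made for illustration.

```python
import subprocess
from pathlib import Path

def call_model(transcript: list[dict]) -> dict:
    # Placeholder for the LLM call. Assume it returns either a tool request
    # like {"tool": "read_file", "arg": "main.py"} or {"answer": "..."}.
    raise NotImplementedError("wire up a real model here")

# The point of running locally: the tools act on *your* files and *your*
# shell, not on a cloud sandbox.
TOOLS = {
    "read_file": lambda arg: Path(arg).read_text(),
    "run_shell": lambda arg: subprocess.run(
        arg, shell=True, capture_output=True, text=True).stdout,
}

def agent_loop(task: str, max_steps: int = 20) -> str:
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(transcript)
        if "answer" in action:            # the model decided it is done
            return action["answer"]
        observation = TOOLS[action["tool"]](action["arg"])
        transcript.append({"role": "tool", "content": observation})
    return "step budget exhausted"
```

The loop is the whole trick: the model proposes an action, the local environment responds, and the result goes back into context until the task is done.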
The fifth major change: vibe coding has taken off. In 2025, AI capability crossed a threshold: you can describe what you want purely in English and have the program written for you, without caring what the code looks like. Karpathy casually tweeted about this style of programming, called it "vibe coding," and the term went viral.

What does this mean? Programming is no longer the exclusive domain of professional programmers; ordinary people can do it too. That breaks every previous pattern of technology diffusion. New technologies used to be mastered first by large companies, governments, and professionals, and only then trickle outward. This time the pattern is reversed: ordinary people benefit first, and arguably more than the professionals do.

And it is not only about "letting non-programmers program." For people who can already code, many small programs that used to be "not worth writing" are suddenly worth writing. Karpathy himself has done plenty of projects with vibe coding: a custom tokenizer in Rust, several utility apps, even a one-off program written just to find a single bug. Code has become cheap and disposable, something you can dash off like notes on scrap paper. That will reshape both the form of software and the job of the programmer.

The sixth major change: the "graphical interface era" of large models is coming. Google's Nano Banana (Gemini's image-generation model) is one of the most underrated products of the year. It can generate images, infographics, and animations in real time from the conversation, "drawing" its replies instead of "writing" them.

Karpathy places this in a larger historical arc: large models are the next major computing paradigm, much like computers in the 1970s and 80s, so we should expect a similar evolutionary path. "Chatting" with a large model today is a bit like typing commands into a terminal in the 1980s. Text is a format machines prefer, not one humans do. Reading text is slow and tiring; people prefer pictures, video, and spatial layout, which is exactly why traditional computing invented the graphical interface. Large models need their own "GUI" too, one that speaks to us the way we like: images, slides, whiteboards, animations, mini-apps. Today's emoji and Markdown are just rudimentary forms, mere "dressing up" of text.

What will a true LLM GUI look like? Nano Banana is an early hint. The most interesting part is that this is not just image generation: it requires interleaving text generation, image generation, and world knowledge, all integrated into the model weights.

Karpathy's conclusion: the large models of 2025 are both smarter and dumber than he expected, and both things are true at once. One thing is certain, though: even at current capability levels, we have tapped less than 10% of the potential. There are still so many ideas to try; the whole field feels wide open. On Dwarkesh's podcast he said something that sounds contradictory: he believes progress will continue at a rapid pace, and at the same time that there is still a lot of work to be done. These two things are not contradictory. Buckle up and keep accelerating in 2026.
