I work on AI at OpenAI. Former VP AI and Distinguished Scientist at Microsoft.


This problem fits in a broader context of understanding THE SHAPE OF LEARNING CURVES. The most basic property of such shapes is that hopefully ... they are decreasing! Specifically, from the statistical perspective: if you add more data, can you prove that your test loss will be lower?

Surprisingly this is quite non-obvious and there are many counterexamples. This was discussed at length in the classic book [Devroye, Györfi, Lugosi, 1996] (which I remember reading voraciously 20 years ago, but that's a different story!). More recently, a 2019 COLT open problem pointed out that some extremely basic versions of this question are still open, such as: if you estimate the (co)variance of an unknown Gaussian, is the risk monotone (i.e., does adding more data help you estimate this covariance better)?

@MarkSellke asked this question to GPT-5.2 and ... it solved it! Mark then engaged in a back and forth with the model to keep generalizing the result (with no mathematical input from Mark except asking good questions), and it kept going ... eventually this became a nice paper, with results for both Gaussian and Gamma distributions under forward KL, and more general exponential families under reverse KL. You can read more about it here: https://t.co/XLETMtURcd
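For intuition (a Monte Carlo sanity check, not a proof!), here is the simplest instance of the question in a few lines: fit the variance of a standard Gaussian by maximum likelihood (with the mean assumed known, a simplification I'm making here) and watch the average forward KL risk shrink as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_normal(var_p, var_q):
    # KL( N(0, var_p) || N(0, var_q) ) between two zero-mean Gaussians
    return 0.5 * (np.log(var_q / var_p) + var_p / var_q - 1.0)

def mean_risk(n, trials=50_000):
    # Fit the variance of N(0, 1) by MLE from n samples (mean known to be 0),
    # then average the forward KL from the truth to the fitted model.
    x = rng.standard_normal((trials, n))
    var_hat = (x ** 2).mean(axis=1)  # MLE of the variance
    return kl_normal(1.0, var_hat).mean()

risks = [mean_risk(n) for n in (6, 12, 24, 48)]
print(risks)  # decreasing in n, consistent with monotone risk
```

Of course a simulation like this only shows the average risk trending down at a few sample sizes; the open problem was about proving monotonicity at every n, which is exactly what the exchange with the model settled.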


If you recall, this was the first problem I used to show GPT-5's research capabilities: the goal was to determine the step-size condition under which gradient descent on a smooth convex function admits a learning curve that is itself convex. There was a nice paper showing that eta < 1/L is sufficient and eta < 1.75/L is necessary, and a v2 of that paper closed the gap, showing that 1.75/L is the right "if and only if" condition. Back in August (4 months ago!), given the v1 of the paper in context, GPT-5 was able to improve the sufficient condition from 1/L to 1.5/L (so short of the optimal 1.75/L). Now GPT-5.2, given NOTHING, derives BOTH the necessary and sufficient condition of 1.75/L! To derive the necessary part it uses code to search over counterexamples ... (and of course the corresponding paper is still beyond GPT-5.2's knowledge cutoff)
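To make "convex learning curve" concrete, here is a toy numerical sketch (my own illustration, with an arbitrary smooth convex test function and a step size picked inside the proven-sufficient eta <= 1.75/L regime): run gradient descent, record f(x_k), and check that the sequence of losses has nonnegative second differences.

```python
import math

# Smooth convex test function with minimizer at 0:
#   f(x) = log(1 + e^x) + log(1 + e^{-x}),  f''(x) = 2*s*(1-s) <= 1/2, so L = 0.5
def f(x):
    a = abs(x)
    return a + 2.0 * math.log1p(math.exp(-a))  # numerically stable form

def grad(x):
    return math.tanh(x / 2.0)  # = sigmoid(x) - sigmoid(-x)

L = 0.5
eta = 1.5 / L  # step size inside the eta <= 1.75/L regime
x, losses = 3.0, []
for _ in range(60):
    losses.append(f(x))
    x -= eta * grad(x)

# A sequence is convex iff its second differences are nonnegative
second_diffs = [losses[k] - 2.0 * losses[k + 1] + losses[k + 2]
                for k in range(len(losses) - 2)]
print(min(second_diffs) >= -1e-9)  # True: the loss curve is convex here
```

This only checks one function and one step size; the hard part, which the model did with a counterexample search, is showing that above 1.75/L you can construct smooth convex functions where some second difference goes negative.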

