Thread Easy · Explore: browse tweet threads, newest first

This problem fits in a broader context of understanding THE SHAPE OF LEARNING CURVES. The most basic property of such shapes is that hopefully ... they are decreasing! Specifically, from the statistical perspective: if you add more data, can you prove that your test loss will be lower?

Surprisingly this is quite non-obvious and there are many counterexamples. This was discussed at length in the classic book [Devroye, Gyorfi, Lugosi, 1996] (which I remember reading voraciously 20 years ago, but that's a different story!). More recently, a 2019 COLT Open Problem pointed out that some extremely basic versions of this question are still open, such as: if you estimate the (co)variance of an unknown Gaussian, is the risk monotone (i.e., does adding more data help you estimate this covariance better)?

@MarkSellke asked GPT-5.2 this question and ... it solved it! Mark then engaged in a back-and-forth with the model to keep generalizing the result (with no mathematical input from Mark except asking good questions), and it kept going ... eventually this became a nice paper, with results for both Gaussian and Gamma distributions for forward KL, and more general exponential families for reverse KL. You can read more about it here: https://t.co/XLETMtURcd
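To make the open question concrete, here is a minimal Monte Carlo sketch (my own illustration, not from the thread or the paper) of its simplest instance: estimate the variance of a 1-D Gaussian with known mean by the MLE and look at the forward-KL risk as a function of the sample size n. The setup and all names below are assumptions made for illustration; a simulation can only suggest that the curve is decreasing, whereas the open problem asks for a proof.

```python
# Minimal Monte Carlo sketch (illustration only, not from the paper):
# forward-KL risk of the plug-in variance estimator for a 1-D Gaussian
# with known mean, as a function of the sample size n.
import numpy as np

def kl_gauss_zero_mean(var_true, var_est):
    # KL( N(0, var_true) || N(0, var_est) ) for 1-D Gaussians.
    return 0.5 * (var_true / var_est - 1.0 + np.log(var_est / var_true))

def forward_kl_risk(n, var_true=1.0, trials=100_000, seed=0):
    # Monte Carlo estimate of E[ KL(true || fitted) ] with n samples.
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(var_true), size=(trials, n))
    var_hat = (x ** 2).mean(axis=1)  # MLE of the variance (mean known to be 0)
    return kl_gauss_zero_mean(var_true, var_hat).mean()

if __name__ == "__main__":
    for n in (3, 5, 10, 20, 50, 100):
        print(f"n = {n:4d}   estimated forward-KL risk = {forward_kl_risk(n):.4f}")
    # Empirically the risk decreases with n; proving monotonicity in general
    # (and for covariance matrices / other families) is the hard part.
```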

I work on AI at OpenAI. Former VP AI and Distinguished Scientist at Microsoft.

Sebastien Bubeck
Thu Dec 11 18:34:25
If you recall, this was the first problem I used to show GPT-5's research capabilities: the goal was to determine the step-size condition under which gradient descent for smooth convex optimization admits a learning curve which is itself convex! There was a nice paper showing that eta < 1/L is sufficient and eta < 1.75/L is necessary, and a v2 of that paper closed the gap, showing that 1.75/L is the right "if and only if" condition.

Back in August (4 months ago!), given the v1 of the paper in context, GPT-5 was able to improve the sufficient condition from 1/L to 1.5/L (so still short of the optimal 1.75/L).

Now GPT-5.2, given NOTHING, derives BOTH the necessary and sufficient condition of 1.75/L! To derive the necessary part, it uses code to search for counterexamples ...

(and of course the corresponding paper is still beyond the knowledge cutoff of 5.2)
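As a hedged illustration of the kind of numerical search mentioned above (my own sketch, not the model's actual code), the snippet below runs gradient descent on a smooth convex test function and checks whether the learning curve f(x_0), f(x_1), ... is convex by testing its second differences. The Huber-like test function, the step sizes, and the helper names are placeholders; genuine counterexamples near the 1.75/L threshold require much more carefully constructed functions, as in the paper discussed above.

```python
# Sketch of a numerical convexity check for the gradient-descent learning
# curve f(x_0), f(x_1), ... of a smooth convex function, for a given step
# size eta relative to the smoothness constant L. The Huber-like test
# function is only a placeholder: real counterexamples near eta = 1.75/L
# need carefully constructed functions.
import numpy as np

DELTA = 1.0  # Huber threshold; this function is 1-smooth, so L = 1

def f(x):
    a = abs(x)
    return 0.5 * x * x if a <= DELTA else DELTA * (a - 0.5 * DELTA)

def grad(x):
    return x if abs(x) <= DELTA else DELTA * np.sign(x)

def learning_curve(x0, eta, steps=50):
    xs, x = [x0], x0
    for _ in range(steps):
        x = x - eta * grad(x)
        xs.append(x)
    return np.array([f(v) for v in xs])

def is_convex(curve, tol=1e-12):
    # A sequence is convex iff all of its second differences are nonnegative.
    second_diff = curve[2:] - 2.0 * curve[1:-1] + curve[:-2]
    return bool(np.all(second_diff >= -tol))

if __name__ == "__main__":
    L = 1.0
    for eta in (0.5 / L, 1.5 / L, 1.75 / L, 1.9 / L):
        convex = is_convex(learning_curve(x0=3.0, eta=eta))
        print(f"eta*L = {eta * L:.2f}  ->  learning curve convex: {convex}")
```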

Sebastien Bubeck
Thu Dec 11 18:34:24
GPT-5.2 is our best model for science yet: 92.4% GPQA, 40% FrontierMath, 52.9% ARC-AGI-2, 89% CharXiv (w. tools), 45% HLE (w. tools) ...

Moreover, at the research level the model has become a lot more reliable. It now one-shots the convex optimization problem to its optimal value!

Sebastien Bubeck
Thu Dec 11 18:34:23
ARC-AGI-1 is not the reference it used to be, especially after contamination.

making models learn • eXperiments lab

tokenbender
Thu Dec 11 18:31:57
it’s ridiculous that evals are still improving so fast this late into the AI era. top models are only keeping SOTA for months, sometimes even just weeks

dei ex machina @openai, past: posttraining o3/4o, sora 1 & 2, applied research

will depue
Thu Dec 11 18:31:50
RT @mikeknoop: On an energy basis, my best estimate is human efficiency for solving simple ARC v1 tasks is 1,000,000X higher than last Dece…

Co-founder @ndea. Co-founder @arcprize. Creator of Keras and ARC-AGI. Author of 'Deep Learning with Python'.

François Chollet
Thu Dec 11 18:31:44