I work on AI at OpenAI. Former VP AI and Distinguished Scientist at Microsoft.


This problem fits in a broader context of understanding THE SHAPE OF LEARNING CURVES. The most basic property of such shapes is that hopefully ... they are decreasing! Specifically, from the statistical perspective: if you add more data, can you prove that your test loss will be lower?

Surprisingly this is quite non-obvious and there are many counterexamples. This was discussed at length in the classic book [Devroye, Györfi, Lugosi, 1996] (which I remember reading voraciously 20 years ago, but that's a different story!). More recently, a 2019 COLT open problem pointed out that some extremely basic versions of this question are still open, such as: if you estimate the (co)variance of an unknown Gaussian, is the risk monotone (i.e., does adding more data help you estimate this covariance better)?

@MarkSellke asked this question to GPT-5.2 and ... it solved it! Mark then engaged in a back and forth with the model to keep generalizing the result (with no mathematical input from Mark except asking good questions), and it kept going ... eventually this became a nice paper, with results for both Gaussian and Gamma distributions under forward KL, and more general exponential families under reverse KL. You can read more about it here: https://t.co/XLETMtURcd
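For intuition (a Monte Carlo sanity check, not a proof!), here is the simplest instance of the question in a few lines: fit the variance of a standard Gaussian by maximum likelihood (with the mean assumed known, a simplification I'm making here) and watch the average forward KL risk shrink as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_normal(var_p, var_q):
    # KL( N(0, var_p) || N(0, var_q) ) between two zero-mean Gaussians
    return 0.5 * (np.log(var_q / var_p) + var_p / var_q - 1.0)

def mean_risk(n, trials=50_000):
    # Fit the variance of N(0, 1) by MLE from n samples (mean known to be 0),
    # then average the forward KL from the truth to the fitted model.
    x = rng.standard_normal((trials, n))
    var_hat = (x ** 2).mean(axis=1)  # MLE of the variance
    return kl_normal(1.0, var_hat).mean()

risks = [mean_risk(n) for n in (6, 12, 24, 48)]
print(risks)  # decreasing in n, consistent with monotone risk
```

Of course a simulation like this only shows the average risk trending down at a few sample sizes; the open problem was about proving monotonicity at every n, which is exactly what the exchange with the model settled.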


If you recall, this was the first problem I used to show GPT-5's research capabilities: the goal was to determine the step-size condition under which gradient descent on a smooth convex function admits a learning curve that is itself convex. There was a nice paper showing that eta < 1/L is sufficient and eta < 1.75/L is necessary, and a v2 of that paper closed the gap, showing that 1.75/L is the right "if and only if" condition. Back in August (4 months ago!), given the v1 of the paper in context, GPT-5 was able to improve the sufficient condition from 1/L to 1.5/L (so short of the optimal 1.75/L). Now GPT-5.2, given NOTHING, derives BOTH the necessary and sufficient condition of 1.75/L! To derive the necessary part it uses code to search over counterexamples ... (and of course the corresponding paper is still beyond GPT-5.2's knowledge cutoff)
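To make "convex learning curve" concrete, here is a toy numerical sketch (my own illustration, with an arbitrary smooth convex test function and a step size picked inside the proven-sufficient eta <= 1.75/L regime): run gradient descent, record f(x_k), and check that the sequence of losses has nonnegative second differences.

```python
import math

# Smooth convex test function with minimizer at 0:
#   f(x) = log(1 + e^x) + log(1 + e^{-x}),  f''(x) = 2*s*(1-s) <= 1/2, so L = 0.5
def f(x):
    a = abs(x)
    return a + 2.0 * math.log1p(math.exp(-a))  # numerically stable form

def grad(x):
    return math.tanh(x / 2.0)  # = sigmoid(x) - sigmoid(-x)

L = 0.5
eta = 1.5 / L  # step size inside the eta <= 1.75/L regime
x, losses = 3.0, []
for _ in range(60):
    losses.append(f(x))
    x -= eta * grad(x)

# A sequence is convex iff its second differences are nonnegative
second_diffs = [losses[k] - 2.0 * losses[k + 1] + losses[k + 2]
                for k in range(len(losses) - 2)]
print(min(second_diffs) >= -1e-9)  # True: the loss curve is convex here
```

This only checks one function and one step size; the hard part, which the model did with a counterexample search, is showing that above 1.75/L you can construct smooth convex functions where some second difference goes negative.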

