> I am quite curious what the score would have looked like if the model had produced an output for every sample without exceeding the maximum output token limit. They really need to reduce reasoning verbosity and/or extend the context window to 256K+. DSA makes that economical, in theory.