Because of things like this, evals should all report:
- cost per task
- time per task
alongside overall performance and token metrics.

Users care about their experience, so we should measure that experience (even across different hardware). If a task gets done cheaper or faster, people want to see that.
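As a rough sketch of what reporting these numbers could look like: the snippet below averages cost and wall-clock time per task next to pass rate and token counts. The `TaskResult` record, its field names, and the `summarize` helper are all hypothetical, made up for illustration and not taken from any particular eval harness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    # Hypothetical record for one eval task; every field name is an assumption.
    task_id: str
    passed: bool
    cost_usd: float      # total spend attributed to this task
    wall_seconds: float  # end-to-end wall-clock time, as a user would experience it
    tokens: int          # input plus output tokens


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Report per-task cost and time next to pass rate and token usage."""
    if not results:
        raise ValueError("no task results to summarize")
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in results),
        "cost_per_task_usd": mean(r.cost_usd for r in results),
        "time_per_task_s": mean(r.wall_seconds for r in results),
        "tokens_per_task": mean(r.tokens for r in results),
    }


# Made-up example values, only to show the shape of the report.
results = [
    TaskResult("task-a", True, 0.50, 12.0, 4000),
    TaskResult("task-b", False, 0.25, 8.0, 2000),
]
report = summarize(results)
print(report)
```

Means are the simplest choice here; medians or percentiles may describe the user experience better when a few tasks are much slower or more expensive than the rest.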