RT @bemikelive: We released OfficeQA today -- a hard benchmark for evaluating agents on grounded reasoning tasks. More details in our blog…