OfficeQA stands in contrast to "superintelligence" benchmarks that test esoteric or abstract knowledge but do not necessarily translate into better performance on real work. One way to view it is "can ASI make it through one day at the office?"

Matei Zaharia

Tue Dec 09 22:36:26

OfficeQA is neat because we believe any new grad can do the tasks reliably, but it highlights the challenges enterprises have with AI. Elaborate agents with our latest document AI tools do a bit better, but there is still plenty of headroom. We hope researchers find this useful!

CTO at @Databricks and CS prof at @UCBerkeley. Working on data+AI, including @ApacheSpark, @DeltaLakeOSS, @MLflow, @DSPyOSS. https://t.co/nmRYAKFsWr

Matei Zaharia

Tue Dec 09 22:36:26

LLMs are claimed to reach PhD intelligence, but still fail mundane tasks. To understand this challenge, Databricks launched OfficeQA, a benchmark of useful tasks that require reliability&diligence, not specialized knowledge. We're also doing a competition! https://t.co/W8PFESKXAF

Matei Zaharia

Tue Dec 09 22:36:25

RT @bemikelive: We released OfficeQA today -- a hard benchmark for evaluating agents on grounded reasoning tasks. More details in our blog…

Matei Zaharia

Tue Dec 09 22:13:28

RT @MLflow: MLflow 3.7.0 is here and it brings major features and improvements for GenAI Observability, Evaluation, and Prompt Management!…

Matei Zaharia

Mon Dec 08 19:01:29

RT @andykonwinski: The open frontier only moves forward if we work together. We’re bringing the leaders of open research in AI into one roo…

Matei Zaharia

Sun Dec 07 02:07:43
