TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.5]
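We build a self-contained environment with data that mimics a small software company. The most competitive agent can complete 24% of the tasks autonomously. This paints a nuanced picture of task automation with LM agents.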
Reference translation of the paper (metadata) (Wed, 18 Dec 2024 18:55:40 GMT)
A benchmark described as follows: "TheAgentCompany measures the progress of these LLM agents’ performance on performing real-world professional tasks, by providing an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers." At present Claude 3.5 Sonnet achieves the highest performance, but I am curious how o1 and o3 would fare.
Project site: TheAgentCompany; leaderboard: TheAgentCompany
