EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies [61.3]
対話型経済における継続的計画・実行意思決定のためのベンチマークであるEcoGymを紹介する。 EcoGymは、透明性のある長期的なエージェント評価のためのオープンなテストベッドとしてリリースされ、現実的な経済環境下でのコントロール可能性とユーティリティのトレードオフを研究するためのものだ。
論文参考訳（メタデータ） (Wed, 11 Feb 2026 08:59:16 GMT)
「EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies.」というベンチマーク。「Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient actions executions.」というのは興味深く得意・不得意があるよう（安定性が良くないという指摘もある）
リポジトリはGitHub – OPPO-PersonalAI/EcoGym: Official Repo for “EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies”

コメントを残す

コメントを残す コメントをキャンセル