Interactive Benchmarks – Introducing the Latest arXiv Papers

Interactive Benchmarks [45.7]
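We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an interactive process under budget constraints. We instantiate this framework in two settings: Interactive Proofs, in which the model interacts with a judge to deduce objective truths or answers in logic and mathematics.
Reference translation of the paper (metadata) (Thu, 05 Mar 2026 02:18:26 GMT)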
"By actively collecting information, the agent can update its beliefs and make better decisions under uncertainty. To evaluate a model’s ability to reason while actively acquiring information, we draw inspiration from the concept of Interactive Proofs in computational complexity theory (Goldwasser et al., 2019) and propose a unified evaluation paradigm, which we call Interactive Benchmarks." This is a benchmark of the kind where the model has to find the answer while taking actions. It is an important task in practice. It is also striking that this kind of behavior is becoming possible with general-purpose models.
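To make the idea of reasoning under a query budget concrete, here is a minimal toy sketch in Python. It is not the paper's protocol or code: the `Judge` class, the hidden-integer task, and the functions `solve_with_budget` and `success_rate` are all hypothetical names made up for illustration. The solver may ask the judge yes/no questions, each question costs one unit of budget, and the run counts as a failure if the budget runs out before the answer is pinned down.

```python
from dataclasses import dataclass


@dataclass
class Judge:
    """Hypothetical judge holding a hidden integer and answering yes/no queries."""
    secret: int
    queries_used: int = 0

    def is_at_most(self, guess: int) -> bool:
        # Every question asked consumes one unit of the interaction budget.
        self.queries_used += 1
        return self.secret <= guess


def solve_with_budget(judge: Judge, low: int, high: int, budget: int):
    """Binary-search the hidden value in [low, high] using at most `budget` queries.

    Returns the value if it was determined within the budget, otherwise None.
    """
    while low < high:
        if judge.queries_used >= budget:
            return None  # budget exhausted before the answer was determined
        mid = (low + high) // 2
        if judge.is_at_most(mid):
            high = mid
        else:
            low = mid + 1
    return low


def success_rate(secrets, low: int, high: int, budget: int) -> float:
    """Fraction of hidden values the solver identifies within the budget."""
    secrets = list(secrets)
    solved = sum(
        solve_with_budget(Judge(s), low, high, budget) == s for s in secrets
    )
    return solved / len(secrets)


if __name__ == "__main__":
    for budget in (3, 4):
        rate = success_rate(range(1, 17), 1, 16, budget)
        print(f"budget={budget}: success rate {rate:.2f}")
```

In this toy setting, the score depends on how well each question narrows the remaining possibilities, which is the general flavor of evaluating a model while it actively acquires information. A real benchmark would replace the binary-search solver with a model and the integer task with logic or mathematics problems.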
The repository is on GitHub: interactivebench/InteractiveBench: Official Project Page for Interactive Benchmarks
