CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios [30.2]
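The ability of large language models to use external tools has enabled them to tackle an increasingly diverse range of tasks. As tasks become more complex and long-horizon, the intricate tool-use process can trigger a variety of unexpected errors. How to handle such errors effectively, including identifying, diagnosing, and recovering from them, has emerged as an important research direction for advancing tool learning.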
Reference translation of the paper (metadata) (Wed, 11 Jun 2025 17:59:18 GMT)
The benchmark is described as "CRITICTOOL, the first self-critique evaluation benchmark for tool utilization of LLMs. Distinct from prior result-oriented evaluation methods, we categorize error patterns more finely and evaluate models from multiple perspectives, enabling a deeper exploration of LLMs' tool-use capabilities in error-prone scenarios." It will be interesting to see how the latest models score on it.
The repository is GitHub – Shellorley0513/CriticTool