Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions [77.8]
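We propose Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents. In experiments on 17 recent LLMs, Auto-Arena showed the highest correlation with human preferences.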
Reference translation of the paper (metadata) (Thu, 30 May 2024 17:19:19 GMT)
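A proposed method for evaluating LLMs, and an agent-style one: "By using LLM agents to generate questions, employing LLM candidates in peer battles, and evaluating responses using LLM committee discussions, Auto-Arena produces less-contaminated, robust, and trustworthy evaluation results." If evaluation can be automated, automated improvement seems feasible too; I wonder how far performance would rise if a committee of LLMs were used to build good data and models were then fine-tuned on it.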
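To make the three stages in that quote concrete, here is a minimal sketch of how such a loop could be wired together. It is not the authors' implementation: every function name is hypothetical, the "models" are plain Python callables standing in for LLM calls, the committee "discussion" is reduced to a single round in which each judge sees the others' first votes, and the ranking is a simple win count rather than whatever scoring the paper uses.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical stand-ins for LLM calls (assumptions, not a real API).
Model = Callable[[str], str]             # prompt -> reply
Judge = Callable[[str, List[str]], str]  # (transcript, peers' votes) -> "A" or "B"


def generate_questions(examiner: Model, topics: List[str]) -> List[str]:
    """Stage 1: an examiner agent drafts one question per topic."""
    return [examiner(f"Write a question about {t}") for t in topics]


def peer_battle(question: str, a: Model, b: Model, rounds: int = 2) -> List[str]:
    """Stage 2: two candidates answer in turns, each seeing the transcript so far."""
    transcript = [f"Q: {question}"]
    for _ in range(rounds):
        for label, model in (("A", a), ("B", b)):
            transcript.append(f"{label}: {model(chr(10).join(transcript))}")
    return transcript


def committee_verdict(transcript: List[str], judges: List[Judge]) -> str:
    """Stage 3: judges vote, see the other judges' votes, vote again; majority wins.

    Assumes an odd number of judges so that a two-way vote cannot tie.
    """
    text = "\n".join(transcript)
    first = [judge(text, []) for judge in judges]
    final = [judge(text, first[:i] + first[i + 1:]) for i, judge in enumerate(judges)]
    return Counter(final).most_common(1)[0][0]


def run_arena(candidates: Dict[str, Model], examiner: Model,
              judges: List[Judge], topics: List[str]) -> Dict[str, int]:
    """Run every pair of candidates on every question and count wins."""
    wins = Counter({name: 0 for name in candidates})
    questions = generate_questions(examiner, topics)
    for (name_a, a), (name_b, b) in combinations(candidates.items(), 2):
        for question in questions:
            verdict = committee_verdict(peer_battle(question, a, b), judges)
            wins[name_a if verdict == "A" else name_b] += 1
    return dict(wins)


# Toy usage with deterministic stand-ins instead of real LLMs.
def toy_examiner(prompt: str) -> str:
    return prompt.replace("Write a question about ", "What is ") + "?"


def length_judge(text: str, peer_votes: List[str]) -> str:
    """Toy judge: prefers whichever side wrote more characters; ignores peers."""
    lines = text.splitlines()
    a_len = sum(len(line) for line in lines if line.startswith("A: "))
    b_len = sum(len(line) for line in lines if line.startswith("B: "))
    return "A" if a_len >= b_len else "B"


toy_candidates: Dict[str, Model] = {
    "verbose": lambda prompt: "a long and detailed answer",
    "terse": lambda prompt: "ok",
}
result = run_arena(toy_candidates, toy_examiner, [length_judge] * 3, ["math", "law"])
print(result)
```

Keeping the examiner, the candidates, and the judges as separate callables mirrors the separation of roles in the quote, and it makes it easy to swap the toy functions for real model calls.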
The project site and leaderboard are at Embedded Streamlit App (auto-arena.github.io); interestingly, the rankings differ considerably between English and Chinese.
