METAL: Towards Multilingual Meta-Evaluation – arXiv最新論文の紹介

METAL: Towards Multilingual Meta-Evaluation [12.9]
本研究では,多言語シナリオにおいて,Large Language Models (LLMs) を評価対象としてエンド・ツー・エンド評価を行うためのフレームワークを提案する。要約作業のための母国語話者判定を含む10言語を対象としたデータセットを作成する。 GPT-3.5-Turbo, GPT-4, PaLM2を用いたLCM評価器の性能の比較を行った。
論文参考訳（メタデータ） (Tue, 02 Apr 2024 06:14:54 GMT)
マルチリンガルなLLM評価フレームワークの提案、GPT-4はやはり優秀。だが「Finally, we analyze human and LLM reasoning and observe that LLMs often provide incorrect justifications for their scores, thus showing that more research is needed to be able to use LLM-based evaluators with confidence in the multilingual setting.」・・・。わりとよく言われていることではある・・・。
リポジトリはhadarishav/METAL: Code and data repo for NAACL’24 findings paper “METAL: Towards Multilingual Meta Evaluation” (github.com)

コメントを残す

コメントを残す コメントをキャンセル