Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

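Large language models (LLMs) have shown impressive performance on natural language processing (NLP) tasks. Current evaluation techniques are limited by a lack of appropriate benchmarks and metrics, by cost, and by limited access to human annotators. This paper examines whether LLM-based evaluators can help scale up multilingual evaluation.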
Reference translation of the paper (metadata) (Thu, 14 Sep 2023 06:41:58 GMT)
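This paper evaluates whether LLMs work well as NLP evaluators in multilingual settings. The authors report: "We see that the PA between the annotators and GPT is lowest compared to the PA between the human annotators for Japanese and Czech" (PA: Percentage Agreement), and conclude: "Our work indicates that LLM-based evaluators need to be used cautiously in the multilingual setting, particularly on languages on which LLMs are known to perform poorly."
With a model like GPT-4, methods that were effective in English also work (or at least appear to work) in Japanese, but you still need to verify that they are actually behaving correctly. An obvious result, if you think about it.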
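To make the Percentage Agreement (PA) comparison quoted above concrete, here is a minimal sketch. It assumes PA is simply the fraction of items on which two raters assign the same label; the paper's exact computation (for example, how multiple annotators are aggregated) may differ. The function name `percentage_agreement` and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def percentage_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of items on which two raters gave the same label.

    Assumption: this is the simplest reading of "Percentage Agreement";
    it is not necessarily the exact procedure used in the paper.
    """
    if len(a) != len(b):
        raise ValueError("both raters must label the same number of items")
    if len(a) == 0:
        raise ValueError("need at least one labeled item")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Hypothetical toy scores on a 0-2 scale (NOT data from the paper).
human_1 = [2, 1, 0, 2, 1]
human_2 = [2, 1, 0, 2, 2]
llm = [2, 2, 2, 2, 1]

pa_humans = percentage_agreement(human_1, human_2)
pa_human_llm = percentage_agreement(human_1, llm)

print(f"human vs human: {pa_humans:.2f}")
print(f"human vs LLM:   {pa_human_llm:.2f}")
```

In this made-up example the LLM agrees with a human less often than the two humans agree with each other, which is the kind of gap the quoted finding describes for Japanese and Czech. Running the same check on a small human-annotated sample in the target language is one way to do the verification mentioned above before relying on an LLM evaluator.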
