A proposal for a multilingual LLM evaluation framework; GPT-4 is, as expected, strong. However: "Finally, we analyze human and LLM reasoning and observe that LLMs often provide incorrect justifications for their scores, thus showing that more research is needed to be able to use LLM-based evaluators with confidence in the multilingual setting." ... That said, this is a point that gets made fairly often.