LLM as a judge – Introduction to the Latest arXiv Papers

Audio-Aware Large Language Models as Judges for Speaking Styles

Audio-Aware Large Language Models as Judges for Speaking Styles [123.4]
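This work uses audio-aware large language models (ALLMs) as automatic judges to evaluate the speaking styles of speech. Four spoken language models (SLMs) are used to complete two tasks, and both human and ALLM judges evaluate the SLMs' responses. The results suggest that current SLMs, even GPT-4o-audio, still have room for improvement in controlling speaking style and generating natural dialogue.
Reference translation of the paper (metadata) (Fri, 06 Jun 2025 11:05:48 GMT)
The authors report: "By comparing the evaluation results from human and ALLM judges, we find that ALLMs can be used as automatic judges on these two tasks and achieve agreement with human judges comparable to the agreement within human judges." Here ALLM stands for audio-aware large language models.
Since these models can perceive audio, it is unsurprising that they can also judge it, but the result is still useful. Limitations in multilingual settings have been reported for LLM as a judge, and it would be interesting to know whether the same holds here.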
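The claim that judge-to-human agreement is comparable to human-to-human agreement can be illustrated with a small sketch. This post does not say which agreement statistic the paper uses, so Cohen's kappa is swapped in here as one common choice, and all ratings below are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters' categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 style ratings (not data from the paper).
human_1 = [5, 4, 2, 3, 1, 4]
human_2 = [5, 3, 2, 3, 1, 4]
allm_judge = [5, 4, 2, 2, 1, 4]

print("human vs human:", cohen_kappa(human_1, human_2))
print("human vs ALLM: ", cohen_kappa(human_1, allm_judge))
```

If the two printed values are close, the automatic judge agrees with a human about as well as another human would, which is the kind of comparison the quoted sentence describes.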

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge [44.6]
Large language models (LLMs) have demonstrated remarkable intelligence across a variety of tasks. LLM-as-a-Judge systems, however, are susceptible to adversarial attacks that can manipulate evaluation results. Existing evaluation methods for LLM-based judges are often fragmented and lack a unified framework for comprehensive assessment.
Reference translation of the paper (metadata) (Wed, 11 Jun 2025 06:48:57 GMT)
The authors report: "This work presents the first scalable and fully automated framework to evaluate the robustness and reliability of LLM-as-a-Judge systems across multiple attack scenarios. We systematically benchmarked state-of-the-art LLM-based evaluators under various adversarial settings and found that they are vulnerable to manipulation, often producing biased or incorrect judgments when exposed to crafted inputs." The evaluation is carried out with RobustJudge, a framework designed to systematically assess the robustness of LLM-as-a-Judge systems.
The repository is GitHub – S3IC-Lab/RobustJudge

Quantitative LLM Judges

Quantitative LLM Judges [48.7]
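This work proposes quantitative LLM judges, which align the evaluation scores of existing LLM judges with human scores in a given domain. The models are trained to improve the original judge's score using the judge's textual evaluation and score. Experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 14:44:23 GMT)
Quantitative evaluation based on the following approach: "We introduce quantitative judges, a family of LLM judges that disentangle qualitative reasoning from quantitative score prediction in LLM-as-a-judge. Our approach has two stages: the qualitative stage, where a frozen LLM judge generates an evaluation, and the quantitative stage, where these outputs are used by a lightweight model to predict a human score."
This seems like a practical design choice.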
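The two-stage idea can be sketched in a few lines. This is a simplified illustration, not the paper's model: it assumes the frozen judge has already produced numeric scores, ignores the judge's textual evaluation entirely, and uses a one-feature least-squares line as the "lightweight model". The function names and data are hypothetical.

```python
def fit_linear_calibrator(judge_scores, human_scores):
    """Fit human_score ~ slope * judge_score + intercept by least squares."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need at least two paired scores")
    n = len(judge_scores)
    mean_x = sum(judge_scores) / n
    mean_y = sum(human_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in judge_scores)
    if var_x == 0:
        raise ValueError("judge scores must not all be identical")
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(judge_scores, human_scores)
    )
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_human_score(params, judge_score):
    """Map a raw judge score to a calibrated estimate of the human score."""
    slope, intercept = params
    return slope * judge_score + intercept


# Hypothetical paired scores: the frozen judge is systematically too generous.
judge = [8.0, 9.0, 7.0, 10.0]
human = [5.0, 7.0, 3.0, 9.0]

params = fit_linear_calibrator(judge, human)
print(predict_human_score(params, 8.5))
```

The frozen judge is never retrained; only the small post-hoc mapping is fitted on human-labelled examples, which is what makes this kind of design cheap to adapt to a new domain.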

How Reliable is Multilingual LLM-as-a-Judge?

How Reliable is Multilingual LLM-as-a-Judge? [11.6]
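Five models from different model families are evaluated on five diverse tasks covering 25 languages. Consistency varies significantly across languages, with particularly poor performance in low-resource languages. The authors propose an ensemble strategy that improves the consistency of multilingual judges in real-world applications.
Reference translation of the paper (metadata) (Sun, 18 May 2025 02:32:35 GMT)

A performance evaluation of LLM as a judge in multilingual settings. The results give the impression that even GPT-4o struggles. There are many interesting remarks, such as "we find that powerful open-source models, such as Qwen-2.5, achieve comparable performance to OpenAI models in multilingual judgment tasks." and "Aya fails to demonstrate noticeable improvements. This suggests that fine-tuning with multilingual data may not directly enhance a model’s ability to perform accurate multilingual judgments."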
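This post does not describe how the paper's ensemble strategy works. As a rough illustration of the general idea, the sketch below combines verdicts from several judges by majority vote; the voting rule, the tie-break, and the example verdicts are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict; ties go to the earliest one seen."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    top = max(counts.values())
    for verdict in verdicts:
        if counts[verdict] == top:
            return verdict


# Hypothetical verdicts from three judge models on the same item,
# each asked which of two responses ("A" or "B") is better.
verdicts_by_judge = {"judge_1": "A", "judge_2": "B", "judge_3": "A"}
print(majority_vote(list(verdicts_by_judge.values())))
```

Aggregating several judges can smooth out the cases where a single judge flips its verdict for a given language, at the cost of running every judge on every item.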

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [69.1]
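This work introduces J1, a reinforcement learning approach for training thinking LLM-as-a-Judge models. The method converts both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. The model is found to make better judgments by outlining evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.
Reference translation of the paper (metadata) (Thu, 15 May 2025 14:05:15 GMT)
A proposed reinforcement learning recipe for building Thinking-LLM-as-a-Judge models.
The authors report: "our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model."
Earlier work covered on this site, such as "Assessing Judging Bias in Large Reasoning Models: An Empirical Study", had already pointed out that applying LRMs to LLM as a judge tasks is effective, so this result appears consistent with those findings.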

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.8]
This paper introduces a benchmark for evaluating judges in test-time scaling. Judge performance is evaluated in three domains (reasoning, code generation, and instruction following) under three task settings. The benchmark shows that judges are competitive with outcome reward models in reranking, but consistently worse than process reward models in beam search.
Reference translation of the paper (metadata) (Mon, 21 Apr 2025 17:33:23 GMT)
Motivated by the question "we seek to understand the feasibility of using LLM-judges in place of typically used RMs in test-time compute procedures.", the authors propose a benchmark: "we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement." The results are fairly sobering: "We find that weak judges can help strong generators in easier tasks, such as instruction following, but not in reasoning-intensive tasks like coding or math. Larger judges bring the most benefit for math and instruction following tasks, but no evaluated judges are able to reliably improve generator performance for coding. Lastly, while natural language critiques are touted as a defining advantage of judges over RMs, we find that such critiques have significant room for improvement in terms of utility."
The repository is GitHub – SalesforceAIResearch/jetts-benchmark: Code repository for the paper “Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators”
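Of the three task settings, response reranking is the simplest to picture: the generator produces several candidates and the judge picks one. The sketch below is a generic best-of-N selection, not code from the jetts-benchmark repository; `toy_judge` is a hypothetical stand-in for a real LLM judge and simply counts word overlap with the prompt.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    prompt: str,
    candidates: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate the judge scores highest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: judge_score(prompt, response))


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical judge: number of words shared by prompt and response."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return float(len(prompt_words & response_words))


candidates = ["hello there", "use sorted on the list", "no idea"]
print(rerank_best_of_n("sort a list", candidates, toy_judge))
```

In a real test-time scaling setup, `judge_score` would call an LLM judge or a reward model, and the benchmark question is whether the selected response is actually better than a randomly chosen candidate.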

Assessing Judging Bias in Large Reasoning Models: An Empirical Study

Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.9]
Large reasoning models (LRMs) such as DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities. This paper presents a benchmark of judging biases in LLMs and LRMs on both subjective preference-alignment datasets and objective fact-based datasets.
Reference translation of the paper (metadata) (Mon, 14 Apr 2025 07:14:27 GMT)
An examination of biases that arise when LRMs act as judges.
Overall, LRMs perform well as judges, and the authors report: "Through investigation of bandwagon, authority, position, and distraction biases, we uncover four key findings: (1) despite their advanced reasoning capabilities, LRMs remain susceptible to the above biases; (2) LRMs demonstrate better robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit notable position bias, preferring options in later positions; and (4) we identify a novel “superficial reflection bias” where phrases mimicking reasoning (e.g., “wait, let me think…”) significantly influence model judgments."
The finding that "We identify a novel “superficial reflection bias” in LRMs, where phrases mimicking reasoning significantly influence judging outcomes, demonstrating how reasoning mechanisms can introduce new vulnerabilities in automated evaluation." is interesting, particularly because it presumably stems from the training process.
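Position bias of the kind described in finding (3) is commonly checked with a swap test: ask the judge the same pairwise question twice with the two responses in opposite order, and see whether it still prefers the same response. The sketch below is a generic version of that check, not the paper's protocol; the judge interface (returning the index 0 or 1 of the preferred response) and both toy judges are hypothetical.

```python
def position_consistency(judge, pairs):
    """Fraction of pairs where the same response wins in both orderings.

    `judge(question, first, second)` must return 0 if it prefers `first`
    and 1 if it prefers `second`.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for question, resp_a, resp_b in pairs:
        original = judge(question, resp_a, resp_b)
        swapped = judge(question, resp_b, resp_a)
        # The same underlying response wins only if the preferred index flips.
        if original != swapped:
            consistent += 1
    return consistent / len(pairs)


def always_last_judge(question, first, second):
    """Hypothetical fully position-biased judge: always picks the later option."""
    return 1


def longer_answer_judge(question, first, second):
    """Hypothetical content-based judge: picks the longer response."""
    return 0 if len(first) >= len(second) else 1


pairs = [("What is 2 + 2?", "4", "The answer is 4."), ("Capital of France?", "Paris, France", "Rome")]
print(position_consistency(always_last_judge, pairs))
print(position_consistency(longer_answer_judge, pairs))
```

A judge that only looks at position scores 0 on this check, while a judge that only looks at content scores 1; real judges fall somewhere in between.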

LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue

LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue [5.1]
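PRAISE is an interpretable framework for effective user satisfaction estimation. It operates through three modules and achieves state-of-the-art performance on three benchmarks for the user satisfaction estimation task.
Reference translation of the paper (metadata) (Thu, 06 Mar 2025 18:12:33 GMT)
A proposal of PRAISE (Plan and Retrieval Alignment for Interpretable Satisfaction Estimation), a framework for estimating user satisfaction. It takes an agentic approach and consists of a Strategy Planner, a Feature Retriever, and a Score Analyzer.
The results are interesting, but the LLMs (APIs) used seem slightly dated. I wonder what the results would look like with the latest APIs.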

Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models

Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.9]
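Retrieval-Augmented Generation (RAG) has proven effective at mitigating hallucinations in Large Language Models (LLMs). Existing automatic evaluation metrics cannot accurately assess the outputs generated by RAG models during training and evaluation. This paper proposes the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs so that RAG models can be evaluated more accurately.
Reference translation of the paper (metadata) (Wed, 26 Feb 2025 04:50:43 GMT)
A proposed evaluation method targeting RAG: "Judge-Consistency (ConsJudge), a method that enhances LLM-based judgment models to generate more accurate evaluations for RAG models in a self-improvement framework."
The repository is GitHub – OpenBMB/ConsJudge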

Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Judging the Judges: A Collection of LLM-Generated Relevance Judgements [37.1]
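This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024. It benchmarks 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments, produced by eight international teams.
Reference translation of the paper (metadata) (Wed, 19 Feb 2025 17:40:32 GMT)
The authors state: "This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed." In short, it is a summary of the various approaches that were examined.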
