June 9, 2025 – arXiv最新論文の紹介

Quantitative LLM Judges

Quantitative LLM Judges [48.7]
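This work proposes quantitative LLM judges, which align the evaluation scores of existing LLM judges with human scores in a given domain. The models are trained to improve on the original judge's score, using that judge's textual evaluation and score as input. Experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 14:44:23 GMT)
Quantitative evaluation built on the following approach: "We introduce quantitative judges, a family of LLM judges that disentangle qualitative reasoning from quantitative score prediction in LLM-as-a-judge. Our approach has two stages: the qualitative stage, where a frozen LLM judge generates an evaluation, and the quantitative stage, where these outputs are used by a lightweight model to predict a human score."
This seems like a practical design choice.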
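The quantitative stage described in the quote can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the simplest possible lightweight model, a one-variable least-squares fit that maps the frozen judge's numeric score to the human score, and it ignores the judge's textual evaluation, which the paper's models also use. The toy scores and the function names `fit_linear_calibrator` and `mse` are hypothetical.

```python
from statistics import mean


def fit_linear_calibrator(judge_scores, human_scores):
    """Least-squares fit of human ~ slope * judge + intercept."""
    x_bar = mean(judge_scores)
    y_bar = mean(human_scores)
    var = sum((x - x_bar) ** 2 for x in judge_scores)
    if var == 0:
        raise ValueError("judge scores must not all be identical")
    cov = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(judge_scores, human_scores))
    slope = cov / var
    intercept = y_bar - slope * x_bar
    return slope, intercept


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return mean((p - t) ** 2 for p, t in zip(predictions, targets))


# Hypothetical toy data: scores from a frozen judge and matching human scores.
judge_scores = [1, 2, 3, 4, 5, 5]
human_scores = [2, 2, 3, 3, 4, 5]

slope, intercept = fit_linear_calibrator(judge_scores, human_scores)
calibrated = [slope * s + intercept for s in judge_scores]

raw_mse = mse(judge_scores, human_scores)
calibrated_mse = mse(calibrated, human_scores)
print(f"raw MSE: {raw_mse:.3f}, calibrated MSE: {calibrated_mse:.3f}")
```

On the data it was fitted to, the calibrated error cannot exceed the raw error, because leaving the judge score unchanged (slope 1, intercept 0) is one of the candidate fits. Whether the gain carries over to held-out data is the empirical question the paper examines.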

How much do language models memorize? [104.2]
The authors separate memorization into two components: "unintended memorization" and "generalization". When generalization is completely eliminated, the total memorization can be computed, which provides an estimate of model capacity. They train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins and unintended memorization decreases as the models begin to generalize.
Reference translation of the paper (metadata) (Fri, 30 May 2025 17:34:03 GMT)
A report on memorization, a very important topic on the road to AGI. The authors state: "We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter."
The paper cites it, but work of this kind, such as Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws – arXiv最新論文の紹介, is really interesting.
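To get a feel for the quoted figure of approximately 3.6 bits per parameter, the arithmetic can be sketched as follows. This is only a back-of-the-envelope illustration: the helper names `capacity_bits` and `capacity_megabytes` and the example parameter counts are assumptions, and the 3.6 value is the paper's estimate for GPT-style models, not a general constant.

```python
# Estimate quoted from the paper for GPT-style models (approximate).
BITS_PER_PARAM = 3.6


def capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Estimated total memorization capacity in bits."""
    return n_params * bits_per_param


def capacity_megabytes(n_params, bits_per_param=BITS_PER_PARAM):
    """Same estimate in decimal megabytes (8 bits per byte, 1e6 bytes per MB)."""
    return capacity_bits(n_params, bits_per_param) / 8 / 1e6


# Hypothetical model sizes, chosen only for illustration.
for n_params in (1_000_000, 1_000_000_000):
    print(f"{n_params:>13,} params -> "
          f"{capacity_megabytes(n_params):,.2f} MB of estimated capacity")
```

Under this estimate, a model with one billion parameters would hold roughly 450 MB of memorized information.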

The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets [12.1]
The paper investigates a future scenario in which both consumers and merchants authorize AI agents to fully automate negotiations and transactions. Its findings show that AI-mediated deal-making is an inherently imbalanced game, in which different agents achieve significantly different outcomes for their users. Users should exercise caution when delegating business decisions to AI agents.
Reference translation of the paper (metadata) (Thu, 29 May 2025 17:41:39 GMT)
An AI-versus-AI study. "In this paper, we designed an experimental framework to investigate potential issues and risks in Agent-to-Agent negotiations and transactions. Our analysis reveals that Agent-to-Agent negotiation and transaction is naturally an imbalanced game where users using less capable agents will face significant financial loss against stronger agents." This is what one would expect, but as the paper itself points out, it is a result that could widen inequality.
The repository is GitHub – ShenzheZhu/A2A-NT: Official code of “The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets”

Community Moderation and the New Epistemology of Fact Checking on Social Media [124.3]
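Social media platforms have traditionally relied on independent fact-checking organizations to identify and flag misleading content. X (formerly Twitter) and Meta have shifted toward community-driven content moderation by launching their own versions of crowdsourced fact-checking. The paper examines current approaches to misinformation detection across major platforms, explores the emerging role of community-driven moderation, and critically evaluates both the promises and the challenges of crowd-checking at scale.
Reference translation of the paper (metadata) (Mon, 26 May 2025 14:50:18 GMT)
A survey and evaluation of the fact-checking (and similar checks) that communities actually carry out in practice.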