January 6, 2025 – Introducing the latest arXiv papers

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs [34.2]
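Large language models (LLMs) can correct their own generated responses, but a drop in accuracy after self-correction has also been observed. The paper decomposes self-correction capability into confidence (staying confident in answers that are already correct) and critique (turning wrong answers into correct ones). The proposed strategy outperforms vanilla SFT on both capabilities and achieves much higher accuracy after self-correction.
Reference translation of the paper (metadata) (Fri, 27 Dec 2024 08:09:11 GMT)
An analysis of confidence and critique, together with a proposed method for improving self-correction capability.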
The two are reported to be in a trade-off: "Confidence prompt/ICL example can lead higher CL and lower CS; critique prompt/ICL example can cause lower CL and higher CS." (CL is Confidence Level and CS is Critique Score.)
To improve both, the paper proposes "Critique Improvement Tuning (CCT), which can be divided into Confidence Level Improvement Tuning (CLT) and Critique Score Improvement Tuning (CST)."
The repository is GitHub – Zhe-Young/SelfCorrectDecompose: Code for paper “Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs”
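To make the confidence/critique split concrete, here is a minimal sketch in Python. It assumes, based only on the summary above, that Confidence Level is the fraction of initially correct answers that stay correct after self-correction, and that Critique Score is the fraction of initially wrong answers that become correct. The function name and the toy data are hypothetical and are not taken from the paper's repository.

```python
from typing import Sequence, Tuple


def confidence_and_critique(
    before: Sequence[bool], after: Sequence[bool]
) -> Tuple[float, float]:
    """Return (CL, CS) under the assumed definitions stated in the lead-in.

    before[i] / after[i] say whether answer i was correct before / after
    self-correction.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")

    # Correct answers that were kept correct (confidence side).
    kept = sum(1 for b, a in zip(before, after) if b and a)
    # Wrong answers that were turned into correct ones (critique side).
    fixed = sum(1 for b, a in zip(before, after) if (not b) and a)

    n_correct = sum(1 for b in before if b)
    n_wrong = len(before) - n_correct

    # NaN marks the case where a ratio is undefined (no such answers).
    cl = kept / n_correct if n_correct else float("nan")
    cs = fixed / n_wrong if n_wrong else float("nan")
    return cl, cs


# Toy data: 3 answers initially correct, 2 initially wrong.
before = [True, True, True, False, False]
after = [True, True, False, True, False]
cl, cs = confidence_and_critique(before, after)
```

On this toy data, two of the three initially correct answers survive and one of the two wrong answers is fixed. Under the assumed definitions, accuracy after self-correction equals (accuracy before) × CL + (1 − accuracy before) × CS, which is why raising one score at the expense of the other does not necessarily help.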

Large Concept Models: Language Modeling in a Sentence Representation Space [62.7]
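This paper explores an architecture that operates on an explicit, higher-level semantic representation, which the authors call a concept. Concepts are language- and modality-agnostic and represent a higher-level idea or action in a flow. The model shows notable zero-shot generalization performance across many languages.
Reference translation of the paper (metadata) (Sun, 15 Dec 2024 21:20:12 GMT)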
A proposal for a model that handles language at the level of concepts rather than tokens. The setting is: "In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space." Under it, the paper reports the interesting result that "The LCM outperforms Llama-3.1-8B-IT on English and on the average over foreign languages officially supported by the LLM." On the other hand, it also notes that "We acknowledge that there is still a long path to reach the performance of current flagship LLMs."
The repository is GitHub – facebookresearch/large_concept_model: Large Concept Models: Language modeling in a sentence representation space
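The idea of "autoregressive sentence prediction in an embedding space" can be illustrated with a small loop. This is only a sketch of the generation pattern, not the LCM itself: the names `generate_concepts` and `linear_extrapolation` are hypothetical, the predictor is a toy stand-in for a trained model, and the step of encoding sentences into vectors and decoding them back (done with SONAR in the paper) is omitted.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def generate_concepts(
    context: Sequence[Vector],
    predict_next: Callable[[Sequence[Vector]], Vector],
    steps: int,
) -> List[Vector]:
    """Autoregressively extend a sequence of sentence embeddings."""
    sequence = list(context)
    for _ in range(steps):
        # Predict the next embedding from everything generated so far,
        # then append it so the following step can condition on it.
        sequence.append(predict_next(sequence))
    return sequence[len(context):]


def linear_extrapolation(sequence: Sequence[Vector]) -> Vector:
    """Toy predictor (an assumption, NOT the LCM): continue the last step."""
    if len(sequence) < 2:
        return list(sequence[-1])
    prev, last = sequence[-2], sequence[-1]
    return [2 * b - a for a, b in zip(prev, last)]


# Two 2-dimensional "sentence embeddings" as context.
context = [[0.0, 1.0], [1.0, 1.0]]
generated = generate_concepts(context, linear_extrapolation, steps=2)
```

The point of the sketch is the unit of prediction: each step emits one whole sentence vector rather than one token, and a separate decoder would turn each generated vector back into text in the target language.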

StructTest: Benchmarking LLMs’ Reasoning through Compositional Structured Outputs [78.8]
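StructTest is a new benchmark that evaluates large language models on their ability to generate structured outputs. The authors show that StructTest is a good proxy for general reasoning ability.
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 22:08:40 GMT)
A benchmark for structured output: "programmatically verifiable benchmark for evaluating instruction-following capabilities through structured outputs."
The data does not appear to be publicly available yet...?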
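As a rough illustration of what "programmatically verifiable" can mean for structured outputs, the sketch below checks a model response against a format rule with plain code instead of a human or LLM judge. The rule itself (a numbered list with exactly the requested number of items) and the function name `check_numbered_list` are hypothetical; they are not taken from StructTest.

```python
import re


def check_numbered_list(output: str, expected_items: int) -> bool:
    """Return True if output is a list numbered 1..expected_items."""
    # Ignore blank lines and surrounding whitespace.
    lines = [line.strip() for line in output.strip().splitlines() if line.strip()]
    if len(lines) != expected_items:
        return False
    for index, line in enumerate(lines, start=1):
        # Each line must look like "<number>. <non-empty text>".
        match = re.fullmatch(r"(\d+)\.\s+(\S.*)", line)
        if match is None or int(match.group(1)) != index:
            return False
    return True


good = check_numbered_list("1. first point\n2. second point\n3. third point", 3)
bad_numbering = check_numbered_list("1. first point\n3. second point", 2)
bad_count = check_numbered_list("1. first point\n2. second point", 3)
```

Because the check is deterministic, the same response always gets the same verdict, which is the property that makes this kind of benchmark cheap to run and hard to dispute.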