staka – Page 50 – arXiv最新論文の紹介

Pitfalls in Evaluating Language Model Forecasters

Pitfalls in Evaluating Language Model Forecasters [45.4]
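We argue that, as a community, we should be careful about conclusions drawn from evaluating large language models as forecasters. We identify two categories of issues: (1) difficulty in trusting evaluation results due to temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting.
Reference translation of the paper (metadata) (Sat, 31 May 2025 21:49:17 GMT)
A paper summarizing pitfalls in evaluating LLMs as forecasters.
“We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims.” is how the paper sums things up, and evaluation really is hard.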
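The simplest form of the temporal leakage mentioned above can be illustrated with a small sketch: drop any forecasting question that resolved on or before the model's training cutoff, since the answer may already be in the training data. This is only an illustration of the idea, not the paper's methodology; the `Question` fields and the cutoff date are hypothetical, and the filter does not catch subtler leaks.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Question:
    text: str
    resolved: date  # date on which the outcome became known (hypothetical field)


def drop_leaked(questions, training_cutoff):
    """Keep only questions whose outcome became known after the training cutoff."""
    return [q for q in questions if q.resolved > training_cutoff]


if __name__ == "__main__":
    questions = [
        Question("Resolved before the cutoff", date(2024, 6, 1)),
        Question("Resolved after the cutoff", date(2025, 6, 1)),
    ]
    for q in drop_leaked(questions, training_cutoff=date(2024, 12, 31)):
        print(q.text)
```

Passing this filter is necessary but not sufficient: a question can clear the date check and still leak information through other channels.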

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [104.0]
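Model merging aims to combine multiple expert models into a single model, reducing storage and serving costs. Prior work has mainly focused on merging visual classification models, or Large Language Models (LLMs) for code and math tasks. This paper introduces a model merging benchmark for MLLMs that covers multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding.
Reference translation of the paper (metadata) (Mon, 26 May 2025 12:23:14 GMT)
An introduction to a benchmark for multimodal model merging.
The repository is GitHub – WalkerWorldPeace/MLLMerging: Official implementation of “Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging”.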
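As a rough illustration of what merging several expert models into one means, the sketch below applies task arithmetic (adding the scaled differences between each expert and a shared base model) to toy weights. This is a generic merging technique chosen for illustration and not necessarily one of the methods benchmarked in the paper; the function name, the `scale` parameter, and the dict-of-lists weight format are assumptions.

```python
def merge_task_arithmetic(base, experts, scale=1.0):
    """Merge experts fine-tuned from the same base model.

    base and each expert map a parameter name to a flat list of floats.
    Result: base + scale * sum(expert - base), computed element-wise.
    """
    merged = {}
    for name, base_weights in base.items():
        merged[name] = [
            b + scale * sum(expert[name][i] - b for expert in experts)
            for i, b in enumerate(base_weights)
        ]
    return merged


if __name__ == "__main__":
    base = {"layer.weight": [1.0, 2.0]}
    ocr_expert = {"layer.weight": [2.0, 2.0]}    # hypothetical expert
    chart_expert = {"layer.weight": [1.0, 4.0]}  # hypothetical expert
    print(merge_task_arithmetic(base, [ocr_expert, chart_expert], scale=0.5))
```

With real checkpoints the same loop would run over tensors rather than lists, and the scale would typically be tuned on held-out data.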

XToM: Exploring the Multilingual Theory of Mind for Large Language Models

XToM: Exploring the Multilingual Theory of Mind for Large Language Models [58.0]
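Existing evaluations of Theory of Mind (ToM) in LLMs are limited to English. XToM is a rigorously validated multilingual benchmark that evaluates ToM across five languages. The results reveal limits in the ability of LLMs to replicate human-like mentalizing across linguistic contexts.
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 05:23:25 GMT)
A multilingual comparison of LLMs, which concludes that “LLMs are equipped with multilingual understanding ability but fail in multilingual ToM reasoning tasks.” Differences between languages seem to remain at a deeper level (although the gap also appears to have narrowed compared with a while ago).
The repository is GitHub – HKUST-KnowComp/XToM: Data and Code for paper “X-ToM: Exploring the Multilingual Theory of Mind for Large Language Models”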

A Survey of LLM × DATA

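A Survey of LLM × DATA [72.0]
The integration of large language models (LLMs) and data management (DATA) is rapidly redefining both domains. On the one hand, DATA4LLM provides LLMs with the high quality, diversity, and timeliness of data required for stages such as pre-training, post-training, retrieval-augmented generation, and agentic workflows. On the other hand, LLMs are emerging as general-purpose engines for data management.
Reference translation of the paper (metadata) (Sat, 24 May 2025 01:57:12 GMT)
A survey centered on data.
There is a repository, GitHub – weAIDB/awesome-data-llm: Official Repository of “LLM × DATA” Survey Paper, which links to a large number of papers.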

Self-Challenging Language Model Agents

Self-Challenging Language Model Agents [98.6]
This paper proposes the Self-Challenging framework for training an agent on high-quality tasks that the agent generates itself. The framework achieves a two-fold improvement for Llama-3.1-8B-Instruct.
Reference translation of the paper (metadata) (Mon, 02 Jun 2025 14:23:33 GMT)
The paper states: “we present the Self-Challenging Agent (SCA) method for self-improvement of general multi-turn tool-use LLM agents. SCA can create its own tasks to challenge itself and learn from them. To do this, it utilizes the Code-as-Task (CaT) formulation which ensures high quality synthetic tasks. Through RL on these self-generated synthetic tasks, SCA can be used to train a Llama-3.1-8B model to achieve an average relative success rate improvement of 95.8% on existing test tasks across four different multi-turn tool-use environments.” A report that gives the sense of a future edging closer to AGI. (As the authors note, “While SCA serves as a preliminary step, there remains many research questions for building an effective self-improvement flywheel for general LLM agents.”, so in practice there are presumably still various barriers.)
The effective use of code generation is also interesting; I wonder whether the stage at which tasks expressible in a formal language can be solved will arrive sooner than expected.

OpenThoughts: Data Recipes for Reasoning Models

OpenThoughts: Data Recipes for Reasoning Models [215.2]
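The goal of the OpenThoughts project is to create open-source datasets for training reasoning models. The OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data. The project also produced the OpenThinker3-7B model.
Reference translation of the paper (metadata) (Wed, 04 Jun 2025 17:25:39 GMT)
An open dataset for building LRMs (large reasoning models). It is also a useful reference on directions for data augmentation.
The project site is Open Thoughts.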

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding [114.5]
We introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We train several part-centric LMMs and propose PLUM, a new segmenting LMM that uses span tagging instead of segmentation tokens. Our work opens new avenues toward fine-grained, grounded visual understanding in LMMs.
Reference translation of the paper (metadata) (Tue, 27 May 2025 06:03:56 GMT)
The authors note that “Unfortunately, Large Multimodal Models (LMMs), the backbones of today’s multimodal systems, lack strong part recognition abilities”, and propose a benchmark to test this along with an improved model, PLUM: Part-Level Understanding LMM.
The repository is GitHub – AnselBlume/partonomy: Repository for “Partonomy: Large Multimodal Models with Part-Level Visual Understanding”

Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes [26.7]
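Ctrl-Crash is a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. The proposed approach enables realistic scenario generation in which subtle changes in the input cause dramatic changes in the crash outcome.
Reference translation of the paper (metadata) (Fri, 30 May 2025 21:04:38 GMT)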
「we introduce Ctrl-Crash, a controllable video diffusion framework for generating realistic crash videos from a single initial frame. Our method operates with inputs and outputs in pixel space, as opposed to using computer graphics primitives and explicit models of physics.」
This seems useful for thinking through a wide range of situations.
The repository is Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Quantitative LLM Judges

Quantitative LLM Judges [48.7]
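This work proposes quantitative LLM judges, which align the evaluation scores of existing LLM judges with human scores in a given domain. The models are trained to improve on the original judge's score using that judge's textual evaluation and score. Experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 14:44:23 GMT)
Quantitative evaluation using the following approach: “We introduce quantitative judges, a family of LLM judges that disentangle qualitative reasoning from quantitative score prediction in LLM-as-a-judge. Our approach has two stages: the qualitative stage, where a frozen LLM judge generates an evaluation, and the quantitative stage, where these outputs are used by a lightweight model to predict a human score.”
It looks like a pragmatic design.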
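The quantitative stage described in the quote can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the frozen judge is replaced by pre-collected scores, the textual evaluation is ignored, and the lightweight model is assumed to be a one-feature least-squares fit. The names `fit_calibrator` and `predict_human_score` are hypothetical.

```python
def fit_calibrator(judge_scores, human_scores):
    """Fit human ~ slope * judge + intercept by ordinary least squares."""
    n = len(judge_scores)
    mean_x = sum(judge_scores) / n
    mean_y = sum(human_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in judge_scores)
    if var_x == 0:
        raise ValueError("judge scores must not all be identical")
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(judge_scores, human_scores)
    )
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_human_score(calibrator, judge_score):
    """Map a raw judge score to a predicted human score."""
    slope, intercept = calibrator
    return slope * judge_score + intercept


if __name__ == "__main__":
    # Toy data: the frozen judge systematically scores lower than humans do.
    judge = [1.0, 2.0, 3.0, 4.0]
    human = [2.0, 4.0, 6.0, 8.0]
    calibrator = fit_calibrator(judge, human)
    print(predict_human_score(calibrator, 5.0))
```

Because only the small calibrator is trained, the judge LLM itself stays frozen.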

How much do language models memorize?

How much do language models memorize? [104.2]
We separate memorization into two components: “unintended memorization” and “generalization”. When generalization is completely eliminated, we can compute the total memorization, which gives an estimate of model capacity. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point “grokking” begins and unintended memorization decreases as the models begin to generalize.
Reference translation of the paper (metadata) (Fri, 30 May 2025 17:34:03 GMT)
A report on memorization, which is very important when aiming for AGI: “We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter.”
The paper cites it as well, but research of this kind, such as Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws – arXiv最新論文の紹介, is really fascinating.
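To get a feel for the “approximately 3.6 bits per parameter” figure quoted above, here is a back-of-the-envelope sketch that converts a parameter count into an estimated capacity. Only the 3.6 figure comes from the paper's abstract; the function names and the example model size are assumptions for illustration.

```python
BITS_PER_PARAMETER = 3.6  # estimate quoted in the abstract for GPT-style models


def capacity_bits(num_parameters):
    """Estimated memorization capacity in bits."""
    return BITS_PER_PARAMETER * num_parameters


def capacity_megabytes(num_parameters):
    """Same estimate in megabytes (1 MB = 10**6 bytes, 8 bits per byte)."""
    return capacity_bits(num_parameters) / 8 / 1e6


if __name__ == "__main__":
    n = 1_000_000_000  # hypothetical 1B-parameter model
    print(f"{capacity_bits(n):.3e} bits, about {capacity_megabytes(n):.0f} MB")
```

Under this estimate, a hypothetical 1B-parameter model would have roughly 3.6 billion bits, or about 450 MB, of memorization capacity.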
