staka – Page 27 – Introducing the Latest arXiv Papers

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements [87.6]
A key capability for scientific progress is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark.
Reference translation of the paper (metadata) (Fri, 27 Jun 2025 17:44:32 GMT)
A somewhat surprising result: "We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints."
The repository is GitHub – facebookresearch/llm-speedrunner: The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in language modeling.

Large Language Models in Argument Mining: A Survey

Large Language Models in Argument Mining: A Survey [15.0]
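Argument Mining (AM) focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning. This survey systematically synthesizes recent advances in LLM-driven AM.
Reference translation of the paper (metadata) (Thu, 19 Jun 2025 15:12:58 GMT)
A survey of Argument Mining using LLMs.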

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent [53.8]
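We introduce MemAgent, a novel agent workflow that reads text in segments and updates its memory using an overwrite strategy. MemAgent can extrapolate from an 8K context trained on 32K text to a 3.5M QA task with a performance loss of under 5%, and achieves over 95% on the 512K RULER test.
Reference translation of the paper (metadata) (Thu, 03 Jul 2025 03:11:50 GMT)
A proposal for an agentic framework for handling long texts. Its features are said to be as follows (quoted from the project site):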
- 1 Novel memory mechanism: The agent reads text in segments and efficiently updates memory through an overwriting strategy. This design enables the model to process arbitrarily long inputs within a fixed context window, fundamentally overcoming the window length limitations of traditional Transformer architectures.
- 2 O(n) complexity: By decoupling computation from text length, the complexity of processing long texts is transformed from quadratic growth to linear growth.
- 3 RL-driven extrapolation: We enhance the DAPO algorithm to support multi-turn training over context-independent conversations. Based on this, the trained model exhibits unprecedented extrapolation performance.
The project site is MemAgent: Reshaping Long-Context LLM with Multi-Conv RL based Memory Agent
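
The segment-wise reading with an overwritten, fixed-size memory described in this entry can be illustrated with a minimal Python sketch. This is not MemAgent's implementation: the function names, the `memory_len` truncation, and the toy `keep_capitalized` updater (a stand-in for the LLM call that would rewrite the memory) are all assumptions made for illustration. It only shows why the number of model calls grows linearly with input length while each call sees a bounded amount of text.

```python
from typing import Callable, List

Tokens = List[str]


def split_into_segments(tokens: Tokens, segment_len: int) -> List[Tokens]:
    """Cut the input into consecutive segments of at most segment_len tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def read_with_overwrite_memory(
    tokens: Tokens,
    segment_len: int,
    memory_len: int,
    update_memory: Callable[[Tokens, Tokens], Tokens],
) -> Tokens:
    """Process a long input one segment at a time with a fixed-size memory."""
    memory: Tokens = []
    for segment in split_into_segments(tokens, segment_len):
        # Each step sees at most memory_len + segment_len tokens, and the
        # returned memory replaces (overwrites) the previous one.
        memory = update_memory(memory, segment)[:memory_len]
    return memory


def keep_capitalized(memory: Tokens, segment: Tokens) -> Tokens:
    """Hypothetical toy updater standing in for the model: keep capitalized tokens."""
    return memory + [tok for tok in segment if tok[:1].isupper()]


text = "the key is Alpha and later the code is Beta".split()
final_memory = read_with_overwrite_memory(
    text, segment_len=3, memory_len=4, update_memory=keep_capitalized
)
print(final_memory)
```

In this sketch one update call is made per segment, so doubling the input doubles the number of calls, while the per-call input stays capped at `memory_len + segment_len` tokens.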

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [27.2]
SPIRAL is a self-play framework in which models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis.
Reference translation of the paper (metadata) (Mon, 30 Jun 2025 17:58:13 GMT)
To reduce reliance on humans, the authors propose a framework and report confirming its effectiveness: "We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision." They state: "Key Findings. Training on zero-sum games produces reasoning capabilities that transfer broadly." It is somewhat surprising that it outperforms SFT: "Our empirical results show that training on Kuhn Poker alone improves mathematical reasoning by 8.7% average and Minerva Math by 18.1%, surpassing models trained on 25,000 expert demonstrations".
The repository is GitHub – spiral-rl/spiral: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities

Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities [117.5]
Data structurization can play a promising role by transforming complex, disorganized data into well-structured forms. This survey presents the first systematic review of how graphs can empower AI agents.
Reference translation of the paper (metadata) (Sun, 22 Jun 2025 12:59:12 GMT)
A survey on graphs and agents.
The repository is GitHub – YuanchenBei/Awesome-Graphs-Meet-Agents: A curated list of resources on graph-empowered agents and agent-facilitated graph learning (Graphs Meet Agents).

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.7]
Many agentic benchmarks have issues in their task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines synthesized from our benchmark-building experience.
Reference translation of the paper (metadata) (Thu, 03 Jul 2025 17:35:31 GMT)
A paper summarizing points to watch out for in agentic benchmarks, which are difficult to build.
Issues like "the issues found in τ-bench-Airline, some other example issues we found are: (1) an agent can score 100% on SWE-Lancer without resolving any tasks;" seem reasonably common, and "Based on ABC, we assessed ten widely used agentic benchmarks and identified significant evaluation issues that cases up to 100% errors (in relative terms) when estimating agents’ performance." is not exactly shocking either.
The repository is GitHub – uiuc-kang-lab/agentic-benchmarks

MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real

MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real [128.8]
MultiGen is a framework that integrates large-scale generative models into traditional physics simulators. It demonstrates effective zero-shot transfer to real-world pouring with containers and liquids.
Reference translation of the paper (metadata) (Thu, 03 Jul 2025 17:59:58 GMT)
A framework proposal: "In this work, we introduced MULTIGEN, a novel framework for integrating generative multimodal simulation into robot learning. By augmenting physics-based simulators with large-scale generative models, we demonstrated that sim-to-real policy learning can leverage rich sensory feedback beyond vision and proprioception."
The interesting part is that it also uses synthesized audio data.

ERNIE4.5, Kwai Keye-VL, Ovis-U1, GLM-4.1V-Thinking, Confucius3-Math

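Besides the arrival of ERNIE4.5 (GitHub – bigdavidone/ERNIE4_5: The official repository for ERNIE 4.5 and ERNIEKit – its industrial-grade development toolkit based on PaddlePaddle.), a variety of other open models have been released. Many of them achieve performance approaching that of commercial models through efficient architectures and a degree of specialization.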

ERNIE 4.5 Technical Report
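This report introduces ERNIE 4.5, a new family of large-scale multimodal models comprising 10 distinct variants. The models adopt a Mixture-of-Experts (MoE) architecture with 47B and 3B active parameters, improving performance on text-related tasks while strengthening multimodal understanding. All models are released under Apache 2.0, together with an open-source development toolkit intended to support research and development. Paper: Publication | ERNIE Blog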

Kwai Keye-VL Technical Report [80.5]
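We introduce Kwai Keye-VL, a multimodal foundation model for short-video understanding. The development of Keye-VL rests on two core pillars: a large-scale, high-quality dataset with a strong emphasis on video, and an innovative training recipe. To validate the effectiveness of our approach, we conduct extensive evaluations showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 17:57:28 GMT)
The project site is Kwai Keye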

Ovis-U1 Technical Report [17.2]
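We introduce Ovis-U1, a unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. In text-to-image generation, it scores 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on ImgEdit-Bench and GEdit-Bench-EN, respectively.
Reference translation of the paper (metadata) (Sun, 29 Jun 2025 00:40:17 GMT)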
GitHub – AIDC-AI/Ovis-U1: An unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning [112.5]
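GLM-4.1V-9B-Thinking is a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. To fully unlock the model's potential, we propose reinforcement learning with curriculum sampling. The open-source GLM-4.1V-9B-Thinking achieves state-of-the-art performance among models of comparable size.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 15:53:43 GMT)
A multimodal model in the GLM series. High performance.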
GitHub – THUDM/GLM-4.1V-Thinking: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.

Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning [4.6]
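Confucius3-Math is an open-source large language model with 14B parameters that runs efficiently on a single consumer-grade GPU. In this report, we share our development recipe, the challenges we encountered, and the techniques we developed to overcome them.
Reference translation of the paper (metadata) (Wed, 25 Jun 2025 10:49:23 GMT)
An example of achieving high performance through a degree of specialization.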
GitHub – netease-youdao/Confucius3-Math

LEDOM: An Open and Fundamental Reverse Language Model

LEDOM: An Open and Fundamental Reverse Language Model [100.5]
We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants. We present the reverse language model as a potential foundation model across general tasks, accompanied by a set of intriguing examples and insights. We also introduce Reverse Reward, a novel application based on LEDOM.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 03:52:00 GMT)
A reverse language model: "We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction." An interesting idea.
The following is also interesting: "Given a known answer and the corresponding supporting reasons, LEDOM can produce natural, well-formed questions. It is helpful for automatically creating QA datasets and educational content, where starting from answers or known concepts is often more practical than designing questions manually." In addition, "We propose Reverse reward, a novel strategy that uses LEDOM to guide forward model outputs via reranking, leading to consistent performance improvements in mathematical reasoning." suggests it is effective for some tasks.
Bidirectionality can be effective, as with the B in BERT, and my impression is that this seems useful as a form of double-checking.
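
The two ideas in the LEDOM entry above, previous-token prediction and reranking forward-model outputs with a reverse model, can be illustrated with a small Python sketch. It is not the paper's code: `previous_token_pairs`, `rerank_with_reverse_score`, the linear score mixing with `weight`, and the toy scores are all hypothetical names and choices made for illustration, and the actual Reverse Reward formulation may differ.

```python
from typing import Callable, List, Sequence, Tuple


def previous_token_pairs(tokens: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Build (context, target) pairs where the target is the PREVIOUS token.

    The sequence is reversed, then ordinary next-token pairs are taken, so the
    context always consists of tokens that come later in the original text.
    """
    rev = list(reversed(tokens))
    return [(rev[:i], rev[i]) for i in range(1, len(rev))]


def rerank_with_reverse_score(
    candidates: Sequence[str],
    forward_scores: Sequence[float],
    reverse_score: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Pick the candidate with the best mix of forward and reverse scores.

    The linear mixing rule is an assumption made for this sketch.
    """
    combined = [
        (1.0 - weight) * fwd + weight * reverse_score(cand)
        for cand, fwd in zip(candidates, forward_scores)
    ]
    best_index = max(range(len(candidates)), key=lambda i: combined[i])
    return candidates[best_index]


pairs = previous_token_pairs(["a", "b", "c"])

# Toy numbers: the forward model slightly prefers "x=2", while the
# hypothetical reverse scorer strongly prefers "x=3".
candidates = ["x=2", "x=3"]
forward_scores = [-1.0, -1.2]
best = rerank_with_reverse_score(
    candidates,
    forward_scores,
    reverse_score=lambda cand: 0.0 if cand == "x=3" else -2.0,
)
print(pairs, best)
```

The reranking step acts like the double-check mentioned above: a candidate must look plausible both when read forward and when scored from its end back toward the question.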

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies [125.4]
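We propose RoboArena, a scalable approach to evaluating generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose crowd-sourcing evaluations across a distributed network of evaluators. We instantiate the approach across a network of evaluators at seven academic institutions using the DROID robot platform.
Reference translation of the paper (metadata) (Sun, 22 Jun 2025 18:13:31 GMT)
A proposal for an evaluation framework focused on robot policies: "In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world."
The project site is RoboArena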
