staka – Page 12 – Introducing the Latest arXiv Papers

Large Language Models in Argument Mining: A Survey

Large Language Models in Argument Mining: A Survey [15.0]
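Argument Mining (AM) focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning. This survey systematically synthesizes recent advances in LLM-driven AM.
Paper reference translation (metadata) (Thu, 19 Jun 2025 15:12:58 GMT)
A survey of Argument Mining using LLMs.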

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent [53.8]
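We introduce MemAgent, a novel agent workflow that reads text in segments and updates its memory using an overwrite strategy. MemAgent can extrapolate from an 8K context trained on 32K text to a 3.5M QA task with a performance loss of less than 5%, and achieves over 95% on the 512K RULER test.
Paper reference translation (metadata) (Thu, 03 Jul 2025 03:11:50 GMT)
Proposes an agentic framework for handling long texts. Its features are reportedly as follows (quoted from the project site):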
- 1 Novel memory mechanism: The agent reads text in segments and efficiently updates memory through an overwriting strategy. This design enables the model to process arbitrarily long inputs within a fixed context window, fundamentally overcoming the window length limitations of traditional Transformer architectures.
- 2 O(n) complexity: By decoupling computation from text length, the complexity of processing long texts is transformed from quadratic growth to linear growth.
- 3 RL-driven extrapolation: We enhance the DAPO algorithm to support multi-turn training over context-independent conversations. Based on this, the trained model exhibits unprecedented extrapolation performance.
Project site: MemAgent: Reshaping Long-Context LLM with Multi-Conv RL based Memory Agent
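The segment-by-segment reading with an overwritten, fixed-size memory can be sketched roughly as below. This is only a toy illustration of the control flow, not the authors' implementation: `read_with_overwrite_memory`, `split_into_segments`, and `keep_capitalized` are hypothetical names, and the rule-based `keep_capitalized` stands in for the RL-trained LLM call that would actually rewrite the memory.

```python
from typing import Callable, List

Memory = List[str]
UpdateFn = Callable[[Memory, List[str]], Memory]


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Cut the input into consecutive chunks of at most segment_len tokens."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def read_with_overwrite_memory(
    tokens: List[str],
    segment_len: int,
    memory_len: int,
    update_memory: UpdateFn,
) -> Memory:
    """Process an arbitrarily long input with a bounded working set."""
    if memory_len <= 0:
        raise ValueError("memory_len must be positive")
    memory: Memory = []
    for segment in split_into_segments(tokens, segment_len):
        # Each call sees only the bounded memory plus one segment, so the
        # number of calls grows linearly with the input length.
        # The old memory is overwritten and capped at memory_len tokens.
        memory = update_memory(memory, segment)[-memory_len:]
    return memory


def keep_capitalized(memory: Memory, segment: List[str]) -> Memory:
    """Toy stand-in for the model (assumption): remember capitalized tokens."""
    return memory + [tok for tok in segment if tok[:1].isupper()]


if __name__ == "__main__":
    text = "the key is Alpha and then Beta appears before Gamma ends".split()
    print(read_with_overwrite_memory(text, 4, 2, keep_capitalized))
```

Because the per-call input is bounded by `segment_len + memory_len`, the total work is proportional to the number of segments, which is the linear growth the project site describes.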

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [27.2]
SPIRAL is a self-play framework in which models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis.
Paper reference translation (metadata) (Mon, 30 Jun 2025 17:58:13 GMT)
To reduce reliance on humans, the authors propose the following framework and report confirming its effectiveness: "We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision." They state: "Key Findings. Training on zero-sum games produces reasoning capabilities that transfer broadly." It is somewhat surprising that it outperforms SFT: "Our empirical results show that training on Kuhn Poker alone improves mathematical reasoning by 8.7% average and Minerva Math by 18.1%, surpassing models trained on 25,000 expert demonstrations".
Repository: GitHub – spiral-rl/spiral: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities

Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities [117.5]
Data structurization can play a promising role by transforming complex, unorganized data into well-structured forms. This survey presents the first systematic review of how graphs can empower AI agents.
Paper reference translation (metadata) (Sun, 22 Jun 2025 12:59:12 GMT)
A survey on graphs and agents.
Repository: GitHub – YuanchenBei/Awesome-Graphs-Meet-Agents: A curated list of resources on graph-empowered agents and agent-facilitated graph learning (Graphs Meet Agents).

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.7]
Many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines synthesized from our benchmark-building experience.
Paper reference translation (metadata) (Thu, 03 Jul 2025 17:35:31 GMT)
A paper summarizing the pitfalls of agentic benchmarks, which are hard to build.
Issues like "the issues found in τ-bench-Airline, some other example issues we found are: (1) an agent can score 100% on SWE-Lancer without resolving any tasks;" seem fairly common, and "Based on ABC, we assessed ten widely used agentic benchmarks and identified significant evaluation issues that cases up to 100% errors (in relative terms) when estimating agents’ performance." is not exactly shocking either.
Repository: GitHub – uiuc-kang-lab/agentic-benchmarks

MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real

MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real [128.8]
MultiGen is a framework that integrates large-scale generative models into traditional physics simulators. It demonstrates effective zero-shot transfer to real-world pouring with containers and liquids.
Paper reference translation (metadata) (Thu, 03 Jul 2025 17:59:58 GMT)
Proposes a framework: "In this work, we introduced MULTIGEN, a novel framework for integrating generative multimodal simulation into robot learning. By augmenting physics-based simulators with large-scale generative models, we demonstrated that sim-to-real policy learning can leverage rich sensory feedback beyond vision and proprioception."
The interesting point is that it also uses synthesized audio data.

ERNIE4.5, Kwai Keye-VL, Ovis-U1, GLM-4.1V-Thinking, Confucius3-Math

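Besides the arrival of ERNIE4.5 (GitHub – bigdavidone/ERNIE4_5: The official repository for ERNIE 4.5 and ERNIEKit – its industrial-grade development toolkit based on PaddlePaddle.), various other open models have been released. Many achieve performance approaching that of commercial models through efficient architectures and a degree of specialization.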

ERNIE 4.5 Technical Report
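This report introduces ERNIE 4.5, a new family of large-scale multimodal models comprising 10 distinct variants. The models adopt a Mixture-of-Experts (MoE) architecture with 47B and 3B active parameters, strengthening multimodal understanding while improving performance on text-related tasks. All models are released under Apache 2.0, and an open-source development toolkit is also provided to support research and development. Paper: Publication | ERNIE Blog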

Kwai Keye-VL Technical Report [80.5]
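We introduce Kwai Keye-VL, a multimodal foundation model for short-video understanding. The development of Keye-VL rests on two core pillars: a large-scale, high-quality dataset with a strong emphasis on video, and an innovative training recipe. To validate the approach, we conduct extensive evaluations showing that Keye-VL achieves state-of-the-art results on public video benchmarks while remaining highly competitive on general image-based tasks.
Paper reference translation (metadata) (Wed, 02 Jul 2025 17:57:28 GMT)
Project site: Kwai Keye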

Ovis-U1 Technical Report [17.2]
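We introduce Ovis-U1, a unified model that integrates multimodal understanding, text-to-image generation, and image editing. For text-to-image generation, it scores 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on ImgEdit-Bench and GEdit-Bench-EN, respectively.
Paper reference translation (metadata) (Sun, 29 Jun 2025 00:40:17 GMT)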
GitHub – AIDC-AI/Ovis-U1: An unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning [112.5]
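GLM-4.1V-9B-Thinking is a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. To unlock the model's full potential, the authors propose reinforcement learning with curriculum sampling. The open-source GLM-4.1V-9B-Thinking achieves state-of-the-art performance among models of comparable size.
Paper reference translation (metadata) (Wed, 02 Jul 2025 15:53:43 GMT)
A multimodal model in the GLM series. High performance.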
GitHub – THUDM/GLM-4.1V-Thinking: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.

Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning [4.6]
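Confucius3-Math is an open-source large language model with 14B parameters that runs efficiently on a single consumer-grade GPU. The report shares the development recipe, the challenges encountered, and the techniques developed to overcome them.
Paper reference translation (metadata) (Wed, 25 Jun 2025 10:49:23 GMT)
An example of achieving high performance through a degree of specialization.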
GitHub – netease-youdao/Confucius3-Math

LEDOM: An Open and Fundamental Reverse Language Model

LEDOM: An Open and Fundamental Reverse Language Model [100.5]
We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants. We present the reverse language model as a potential foundation model across general tasks, accompanied by a set of intriguing examples and insights. We also introduce Reverse Reward, a novel application based on LEDOM.
Paper reference translation (metadata) (Wed, 02 Jul 2025 03:52:00 GMT)
A reverse language model: "We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction." An interesting idea.
"Given a known answer and the corresponding supporting reasons, LEDOM can produce natural, well-formed questions. It is helpful for automatically creating QA datasets and educational content, where starting from answers or known concepts is often more practical than designing questions manually." is also interesting, and "We propose Reverse reward, a novel strategy that uses LEDOM to guide forward model outputs via reranking, leading to consistent performance improvements in mathematical reasoning." suggests it is effective for some tasks.
Bidirectionality can be useful, as with the B in BERT, and my impression is that this looks effective as a form of double-checking.
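As a rough picture of what "guide forward model outputs via reranking" could look like, here is a minimal sketch. The scoring rule is an assumption made for illustration only: the entry above does not say how the paper combines the two models, so the linear mix, the `reverse_weight` parameter, and the `Candidate` fields are all hypothetical, and the log-probabilities are made-up numbers rather than real model outputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    forward_logprob: float  # score from the usual left-to-right model
    reverse_logprob: float  # score from a reverse model such as LEDOM


def rerank(candidates: List[Candidate], reverse_weight: float = 0.5) -> List[Candidate]:
    """Sort candidates by a weighted mix of forward and reverse scores.

    The linear interpolation is an illustrative assumption, not the
    paper's formula.
    """
    if not 0.0 <= reverse_weight <= 1.0:
        raise ValueError("reverse_weight must be in [0, 1]")

    def combined(c: Candidate) -> float:
        return (1.0 - reverse_weight) * c.forward_logprob + reverse_weight * c.reverse_logprob

    return sorted(candidates, key=combined, reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("answer A", forward_logprob=-1.0, reverse_logprob=-5.0),
        Candidate("answer B", forward_logprob=-1.5, reverse_logprob=-2.0),
    ]
    # The forward model alone prefers A; adding the reverse score flips it.
    print([c.text for c in rerank(pool, reverse_weight=0.0)])
    print([c.text for c in rerank(pool, reverse_weight=0.5)])
```

The point of the toy numbers is the double-checking intuition: a candidate the forward model likes slightly more can still lose if the reverse model finds it implausible.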

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies [125.4]
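This paper proposes RoboArena, a scalable approach to evaluating generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, it proposes crowd-sourced evaluation across a distributed network of evaluators. The authors instantiate the approach with a network of evaluators at seven academic institutions using the DROID robot platform.
Paper reference translation (metadata) (Sun, 22 Jun 2025 18:13:31 GMT)
Proposes an evaluation framework focused on robot policies: "In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world."
Project site: RoboArena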

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games, How large language models judge and influence human cooperation

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games [87.6]
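Understanding how large language models balance self-interest and collective well-being is a key challenge for ensuring alignment, robustness, and safe deployment. The authors adapt a public goods game with institutional choice from behavioral economics, which lets them observe how different LLMs navigate social dilemmas. Surprisingly, reasoning LLMs such as the o1 series struggle significantly with cooperation.
Paper reference translation (metadata) (Sun, 29 Jun 2025 15:02:47 GMT)
An interesting result: "our findings reveal a surprising pattern: while traditional LLMs demonstrate robust cooperation comparable to human outcomes, reasoning-enhanced models frequently struggle to sustain cooperation." I am very curious whether this is because they are reasoning models or whether it is a matter of model size or training outcomes.
Repository: GitHub – davidguzmanp/SanctSim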
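For readers unfamiliar with the setup, the free-rider incentive in a textbook public goods game can be worked through in a few lines. This sketch uses the standard payoff rule (keep what you do not contribute, then split the multiplied pot equally); it does not model the institutional-choice or sanctioning mechanics of the paper, and the endowment, multiplier, and player count are illustrative assumptions.

```python
from typing import List


def public_goods_payoffs(
    contributions: List[float],
    endowment: float = 20.0,
    multiplier: float = 2.0,
) -> List[float]:
    """Standard one-round public goods game payoffs.

    Each player keeps (endowment - contribution) and receives an equal
    share of the multiplied common pot. All numbers here are illustrative.
    """
    n = len(contributions)
    if n == 0:
        return []
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("each contribution must be between 0 and the endowment")
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


if __name__ == "__main__":
    # Everyone cooperates fully.
    print(public_goods_payoffs([20, 20, 20, 20]))
    # One player free-rides on the other three.
    print(public_goods_payoffs([20, 20, 20, 0]))
    # Nobody contributes.
    print(public_goods_payoffs([0, 0, 0, 0]))
```

With four players and a multiplier of 2, the lone free-rider ends up with more than a full cooperator, while universal cooperation still beats universal defection. That tension is the social dilemma the models are being tested on.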

How large language models judge and influence human cooperation [82.1]
We evaluate how state-of-the-art language models judge cooperative actions. We observe remarkable agreement when they evaluate cooperation with good opponents. We show that the differences between models can significantly affect the prevalence of cooperation.
Paper reference translation (metadata) (Mon, 30 Jun 2025 09:14:42 GMT)
A paper examining whether LLMs behave cooperatively. The results make trends hard to analyze, but the authors report: "With some exceptions, most LLM families we tested tend to move from IS towards SS as versions and parameter size increases, indicating a shift towards a higher complexity social norm which makes use of more context, specifically assigned reputations. Moreover, different versions of the same family can have vastly distinct social norms, such as Claude 3.5 Haiku [47] and Claude 3.7 Sonnet [48], despite their similar ethical goals [49]." (IS: cooperating is good, defection is bad; SS: cooperating is always good, defecting against bad individuals is also good.)
I think this reflects reality: "These results highlight an important concern: LLMs are not explicitly designed with a given social norm in mind, instead emerging as a by-product of their training [4]. While these norms may occasionally align with those of humans, they are neither designed to maintain cooperation and minimize disagreement, nor are they co-created with communities from diverse cultures to reflect their norms and needs [3]." Still, there is a fair amount of talk about using LLMs for decision support, so caution is warranted.
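The difference between the IS and SS norms glossed above is easy to see when each is written as a judging function. This is a sketch based only on the one-line descriptions in this entry: the function names and the string labels are hypothetical, and treating defection against a good individual as bad under SS is an assumption inferred from the gloss rather than something stated here.

```python
from itertools import product
from typing import Callable, Dict, Tuple

Norm = Callable[[str, str], str]


def judge_is(action: str, recipient_reputation: str) -> str:
    """IS: cooperating is good, defecting is bad; reputation is ignored."""
    return "good" if action == "cooperate" else "bad"


def judge_ss(action: str, recipient_reputation: str) -> str:
    """SS: cooperating is always good; defecting against bad is also good.

    Defecting against a good individual is treated as bad (assumption).
    """
    if action == "cooperate":
        return "good"
    return "good" if recipient_reputation == "bad" else "bad"


def judgment_table(norm: Norm) -> Dict[Tuple[str, str], str]:
    """Enumerate the norm's verdict for every (action, reputation) pair."""
    cases = product(["cooperate", "defect"], ["good", "bad"])
    return {(a, r): norm(a, r) for a, r in cases}


if __name__ == "__main__":
    table_is = judgment_table(judge_is)
    table_ss = judgment_table(judge_ss)
    for case in table_is:
        print(case, "IS:", table_is[case], "SS:", table_ss[case])
```

The two tables differ in exactly one cell, defecting against an individual with a bad reputation, which is why SS needs the extra context of assigned reputations that the quoted passage mentions.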
