arXiv – Page 11 – Introducing the Latest arXiv Papers

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition [104.8]
This paper introduces LM Fight Arena, a new framework that evaluates large multimodal models in Mortal Kombat II. Unlike static evaluations, LM Fight Arena is fully automated and reproducible, and provides an objective assessment of an LMM's strategic reasoning capabilities.
Reference translation of the paper (metadata) (Fri, 10 Oct 2025 02:19:21 GMT)
The authors state: "Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM’s strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment." But why Mortal Kombat, of all things…
Claude 3.5 Sonnet is apparently very strong.

LightMem: Lightweight and Efficient Memory-Augmented Generation

LightMem: Lightweight and Efficient Memory-Augmented Generation [72.2]
We introduce LightMem, a new memory system that balances the performance and efficiency of memory systems. Inspired by the Atkinson–Shiffrin model of human memory, LightMem organizes memory into three complementary stages. In experiments on LongMemEval with GPT and Qwen backbones, LightMem outperforms strong baselines (gains of up to 10.9%) while reducing token usage by up to 117x.
Reference translation of the paper (metadata) (Tue, 21 Oct 2025 17:58:17 GMT)
A lightweight and efficient memory framework with a three-module structure: "Inspired by the Atkinson–Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference."
The repository is GitHub – zjunlp/LightMem: LightMem: Lightweight and Efficient Memory-Augmented Generation

ChatGPT Atlas, Ring-1T, DeepSeek OCR, olmOCR 2

Last week there was a lot of talk about ChatGPT Atlas. I have high hopes for agents that use a UI the way a person would, such as GUI agents (more precisely, browser agents).

Ring-1T is an LRM from Ant Group: a 1T-parameter MoE model with strong performance.

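DeepSeek OCR also drew a lot of attention. What is interesting is less its OCR performance than the effectiveness of using image data as context. On the OCR side, olmOCR 2 has also been released, so open-source activity is lively too.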

Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model [100.9]
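Ring-1T is the first open-source, state-of-the-art thinking model with trillion-scale parameters. It has 1 trillion total parameters and activates approximately 50 billion per token.
Reference translation of the paper (metadata) (Tue, 21 Oct 2025 17:46:14 GMT)
A large-scale LRM. Its sheer size is part of the story, but it claims performance exceeding existing open models such as DeepSeek V3.1.
The repository is GitHub – inclusionAI/Ring-V2: Ring-V2 is a reasoning MoE LLM provided and open-sourced by InclusionAI. The model is at inclusionAI/Ring-1T · Hugging Face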

DeepSeek-OCR: Contexts Optical Compression [15.6]
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M. Experiments show that when the number of text tokens is within 10 times the number of vision tokens, the model can achieve a decoding (OCR) accuracy of 97%.
Reference translation of the paper (metadata) (Tue, 21 Oct 2025 02:41:44 GMT)
An LLM setup that treats document images as context. The authors write: "In this technical report, we propose DeepSeek-OCR and preliminarily validate the feasibility of contexts optical compression through this model, demonstrating that the model can effectively decode text tokens exceeding 10 times the quantity from a small number of vision tokens. We believe this finding will facilitate the development of VLMs and LLMs in the future." So it appears to be efficient.
The repository is GitHub – deepseek-ai/DeepSeek-OCR: Contexts Optical Compression
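
To make the "within 10 times" condition concrete, the compression ratio can be read as text tokens divided by vision tokens. The following is a minimal sketch of that arithmetic only; the function names and the example token counts are assumptions for illustration and are not taken from the DeepSeek-OCR code or paper.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Number of text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def within_reported_regime(text_tokens: int, vision_tokens: int,
                           max_ratio: float = 10.0) -> bool:
    """True when the text is at most max_ratio times the vision tokens.

    max_ratio=10.0 mirrors the 'within 10 times' condition quoted above;
    it says nothing about accuracy outside that regime.
    """
    return compression_ratio(text_tokens, vision_tokens) <= max_ratio


if __name__ == "__main__":
    # Hypothetical page: 900 text tokens encoded into 100 vision tokens.
    print(compression_ratio(900, 100))
    print(within_reported_regime(900, 100))
    # Hypothetical denser page: 1500 text tokens in the same 100 vision tokens.
    print(within_reported_regime(1500, 100))
```

The check is deliberately a plain threshold: the summary only reports accuracy for ratios up to 10x, so the sketch does not try to model what happens beyond it.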

olmOCR 2: Unit Test Rewards for Document OCR [29.5]
olmOCR 2 is the latest in a family of powerful OCR systems for converting digitized print documents, such as PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a 7B vision language model (VLM) trained using reinforcement learning. We show that RL training against these unit-test cases yields state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark.
Reference translation of the paper (metadata) (Wed, 22 Oct 2025 17:53:02 GMT)
This one is OCR proper: version 2 of olmOCR. The approach makes use of synthetic data: "To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases."
The repository is GitHub – allenai/olmocr: Toolkit for linearizing PDFs for LLM datasets/training

A Definition of AGI

A Definition of AGI [208.3]
The lack of a concrete definition of Artificial General Intelligence obscures the gap between today's specialized AI and human-level cognition. This work proposes a quantitative framework to address this, defining AGI in terms of matching cognitive versatility and proficiency.
Reference translation of the paper (metadata) (Tue, 21 Oct 2025 01:28:35 GMT)
Defines AGI as having cognitive versatility and proficiency on par with a well-educated adult, and proposes a framework for quantifying it. "This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains—including reasoning, memory, and perception—and adapts established human psychometric batteries to evaluate AI systems."
Opinions on the definition and the scores (27% for GPT-4, 58% for GPT-5) will vary, but "Long-Term Memory Storage (MS): The capability to continually learn new information (associative, meaningful, and verbatim)." appears to be the biggest challenge, and on that point I agree.

FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis [110.6]
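HisRubric is a new evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric. FinDeepResearch is a benchmark consisting of 64 listed companies from 8 financial markets across 4 languages. We conduct extensive experiments on FinDeepResearch using 16 representative methods: 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only.
Reference translation of the paper (metadata) (Wed, 15 Oct 2025 17:21:56 GMT)
An evaluation of DeepResearch in the financial domain. o3 deepresearch performs strongly (only narrowly ahead of Grok4 and Gemini 2.5 Pro), but the authors note: "Our experiments suggest that even top-performing DR agents struggle to consistently balance a coherent analytical structure with factual accuracy. This imbalance remains the primary barrier to their deployment in high-stakes applications."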

Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain

Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain [11.9]
Large language models (LLMs) demonstrate human-level or even superior language abilities. A key question is whether the behavioral capabilities of LLMs stem from mechanisms similar to those in the human brain. The study examines models such as GPT-2, Gemma, Gemma 2, Llama 2, Llama 3.1, and GLM-4, and reports that the human brain relies on distinct cortical regions for different syntactic levels.
Reference translation of the paper (metadata) (Wed, 15 Oct 2025 08:04:49 GMT)
The authors write: "This study advances syntactic processing by introducing the Hierarchical Frequency Tagging Probe (HFTP), a unified framework for dissecting neuron-wise sentence and phrase representations in LLMs, population-level patterns in the human brain, and generalizing seamlessly to naturalistic text. The results reveal that while LLMs, such as GPT-2, Gemma, Llama 2, and others, exhibit hierarchical syntactic processing and alignment with left-hemisphere brain activity, the mechanisms underlying their representations diverge significantly from those in human cortical regions. Notably, newer models like Gemma 2 demonstrate improved alignment, whereas others, such as Llama 3.1, show weaker human-model correlations despite enhanced task performance." There are open questions, such as whether the similarity to the brain really exists (and whether information can be acquired and analyzed at a level that allows that to be judged), but it is interesting research.
The repository is GitHub – LilTiger/HFTP: Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain

UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.8]
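Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG, built from 70,000 real-world PDF pages. Experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
Reference translation of the paper (metadata) (Thu, 09 Oct 2025 05:30:23 GMT)
A benchmark for multimodal RAG. It is comprehensive and large-scale, as listed below (quoted from the repository's description):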
- 70,000 real-world PDF pages across 8 diverse domains
- 1,600 multimodal QA pairs with 20% expert validation
- Four query types: factual retrieval, comparison, summarization, and logical reasoning
- Unified evaluation protocol with standardized candidate pools, prompts, and metrics
The repository is GitHub – SalesforceAIResearch/UniDoc-Bench

Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks

Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks [23.2]
Large language models face challenges in long-horizon agentic tasks. Existing working-memory methods rely on external mechanisms that are decoupled from the agent's core policy. This paper proposes Memory-as-Action, a new framework in which the agent actively manages its working memory by executing explicit editing operations as part of a unified policy.
Reference translation of the paper (metadata) (Tue, 14 Oct 2025 15:29:57 GMT)
A framework proposal: "This work introduces Memory-as-Action, a framework that treats working memory management as an integral part of an agent’s decision-making process, rather than as an external module. By formalizing memory operations as explicit actions, a single policy can learn to interleave task reasoning with context curation." It argues that handling working-memory management and reasoning together is advantageous.
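
The idea of treating memory edits as ordinary actions can be illustrated with a toy loop. The following is only a sketch under stated assumptions: the action names (`append`, `delete`, `summarize`), the `WorkingMemory` class, and the hard-coded trajectory are all hypothetical, and a fixed list of actions stands in for the learned policy described in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical action names; the paper's actual action space may differ.
APPEND = "append"        # ordinary task step: add an observation or thought
DELETE = "delete"        # memory edit: drop one context entry by index
SUMMARIZE = "summarize"  # memory edit: replace a span of entries with a note


@dataclass
class WorkingMemory:
    entries: List[str] = field(default_factory=list)

    def apply(self, action: Tuple) -> None:
        """Apply one action; task steps and memory edits share one interface."""
        kind = action[0]
        if kind == APPEND:
            _, text = action
            self.entries.append(text)
        elif kind == DELETE:
            _, index = action
            del self.entries[index]
        elif kind == SUMMARIZE:
            _, start, end, note = action
            # Replace entries[start:end] with a single shorter note.
            self.entries[start:end] = [note]
        else:
            raise ValueError(f"unknown action: {kind}")


def run_episode(actions: List[Tuple]) -> List[str]:
    """Replay a sequence of actions and return the final context."""
    memory = WorkingMemory()
    for action in actions:
        memory.apply(action)
    return memory.entries


if __name__ == "__main__":
    # A fixed trajectory standing in for a policy's output.
    trajectory = [
        (APPEND, "search: flight prices"),
        (APPEND, "result: 40 rows of raw HTML"),
        (APPEND, "thought: cheapest is $120"),
        (SUMMARIZE, 0, 2, "note: searched flights"),
        (APPEND, "book the $120 flight"),
    ]
    for entry in run_episode(trajectory):
        print(entry)
```

The point of the sketch is that context curation and task steps pass through the same `apply` call, so a single policy could interleave them; how such a policy is trained is outside what this example shows.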

FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset

FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset [55.7]
We propose FastUMI-100K, a large-scale UMI-style multimodal demonstration dataset. FastUMI-100K offers a more scalable, flexible, and adaptable solution for meeting the diverse requirements of real-world robot demonstration data. Our dataset integrates multimodal streams including end-effector states, multi-view wrist-mounted fisheye images, and textual annotations.
Reference translation of the paper (metadata) (Thu, 09 Oct 2025 09:57:25 GMT)
The dataset is described as follows: "Utilizing the FastUMI data collection system [21], we integrated single-arm and dual-arm configurations with adaptable universal finger sleeves to conduct large-scale data collection. In this paper, we introduce the large-scale UMI-style multimodal dataset—FastUMI-100K, which incorporates the dataset of the pioneering work FastUMI and totally comprises over 100,000 demonstration trajectories, collected using both single-arm and dual-arm grippers on the FastUMI platform, equivalent to 600 hours of interactive data."
The repository is GitHub – MrKeee/FastUMI-100K

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.7]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We construct a benchmark consisting of 1,260 samples across 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
Reference translation of the paper (metadata) (Thu, 09 Oct 2025 17:53:58 GMT)
A proposal for a benchmark that requires long chains of thought involving trial, failure, and correction: "MM-HELIX contains 42 meticulously curated challenging tasks from diverse online sources, categorized into four domains: Algorithm, Graph, Puzzle, and Game. Each task requires the model to perform careful visual observation, develop a deep understanding of complex rules, and generate an extended chain-of-thought that necessitates reflection and backtracking." GPT-5 performs strongly, and the gap to OSS models is large.
The project site is MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
