staka – Page 2 – Introducing the Latest arXiv Papers

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.7]
MIRA is a new benchmark designed to evaluate models in scenarios where generating intermediate images is essential for successful reasoning. It contains 546 multimodal problems, each annotated with intermediate images and a final answer.
Paper reference translation (metadata) (Tue, 04 Nov 2025 18:00:51 GMT)
"To bridge this gap, we introduce MIRA (Multimodal Imagination for Reasoning Assessment), a benchmark designed to evaluate reasoning scenarios where generating or leveraging intermediate visual representations is essential. Each instance is constructed according to three principles: (1) requiring intermediate visual cues to answer the question, (2) pairing each instance with annotated step-wise visual clues to enable evaluation under a Visual-CoT setup, and (3) enforcing strict human annotation and cross-validation to guarantee data quality." This proposes a benchmark for reasoning that requires visual, image-based intermediate representations. The task is hard even for frontier models (though open models also appear to hold their own).
The project site is When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

AlphaResearch: Accelerating New Algorithm Discovery with Language Models

AlphaResearch: Accelerating New Algorithm Discovery with Language Models [60.5]
Large language models have made great progress on problems that are complex but easy to verify, yet they struggle to discover the unknown. The paper presents AlphaResearch, an autonomous research agent aimed at discovering new algorithms for open-ended problems.
Paper reference translation (metadata) (Wed, 12 Nov 2025 02:03:05 GMT)
The paper reports a surprising result: "The novel algorithms discovered by AlphaResearch not only surpass best-of-human performance but also significantly outperform the state-of-the-art results achieved by AlphaEvolve." It continues: "Our approach demonstrates the potential of employing LLM to discover unexplored research area, enabling language models to effectively tackle complex open-ended tasks. We construct AlphaResearchComp, including 8 open-ended algorithmic problems, where AlphaResearch outperforms human researchers in 2/8 algorithmic problems but lags behind in the remaining 6 problems." Evaluation is difficult, but we have reached a remarkable era in which it is no longer surprising for such a system to surpass humans.
The repository is GitHub – answers111/alpha-research: Repo for “AlphaResearch: Accelerating New Algorithm Discovery with Language Models”

GPT-5.1, ERNIE 5, Marble, SIMA2

Last week again brought a run of news, including the release of GPT-5.1 (GPT-5.1: A smarter, more conversational ChatGPT | OpenAI) and the release of ERNIE 5 (Baidu Inc. on X: "Here comes ERNIE 5.0 — our latest natively omni-modal foundational model. It excels in omni-modal understanding, creative writing, instruction following, and more. We will continue investing in and developing more cutting-edge models to push the boundaries of intelligence. https://t.co/S3L1Tlre2n" / X). Evaluation is still to come, but the speed at which these models are rolled out at scale is impressive.

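Attempts to build world models on top of generative models, such as video and 3D generation, are in vogue, and Marble: A Multimodal World Model | World Labs deserves close attention. SIMA 2: A Gemini-Powered AI Agent for 3D Virtual Worlds – Google DeepMind, also announced last week, mentions Genie 3 (Genie 3: A new frontier for world models – Google DeepMind), which suggests that world models are also useful as places for AI agents to learn. Their usefulness as an AI's inner, imagined world has also been pointed out, making this a hot area.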

TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models / Does TabPFN Understand Causal Structures? / TransactionGPT

TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models [76.5]
TabPFN-2.5 is built for datasets with up to 50,000 data points and 2,000 features. It substantially outperforms tuned tree-based models and matches the accuracy of AutoGluon 1.4. For production use, the paper introduces a new distillation engine that converts TabPFN-2.5 into a compact model or a tree ensemble.
Paper reference translation (metadata) (Thu, 13 Nov 2025 01:01:46 GMT)
A proposal for a foundation model for tabular data. On TabArena – a Hugging Face Space by TabArena, the authors claim high performance: "TabPFN-2.5 is now the leading method for the industry standard benchmark TabArena (which contains datasets with up to 100,000 training data points), substantially outperforming tuned tree-based models and matching the accuracy of AutoGluon 1.4, a complex four-hour tuned ensemble that even includes the previous TabPFNv2. Remarkably, default TabPFN-2.5 has a 100% win rate against default XGBoost on small to medium-sized classification datasets (≤10,000 data points, 500 features) and a 87% win rate on larger datasets up to 100K samples and 2K features (85% for regression)."
Prior Labs

Does TabPFN Understand Causal Structures? [40.2]
This study examines whether TabPFN encodes causal information in its internal representations. The authors develop an adaptor framework that uses a learnable decoder and causal tokens. The evaluation shows that TabPFN's embeddings contain causal information and that the framework outperforms traditional causal discovery algorithms.
Paper reference translation (metadata) (Mon, 10 Nov 2025 15:53:15 GMT)
The paper reports an interesting property: "We show that TabPFN’s embeddings contain causal information and that our adaptor framework outperforms traditional causal discovery algorithms when causal information is extracted from mid-range layers. This further promotes leveraging pre-trained tabular models for extracting causal structures, improving the interpretability of these models, and aiding in scientific discovery."

TransactionGPT [41.9]
TransactionGPT is a foundation model for consumer transaction data within one of the world's largest payment networks. The paper proposes a novel 3D-Transformer architecture to capture the complex dynamics of payment transaction data.
Paper reference translation (metadata) (Thu, 13 Nov 2025 01:20:09 GMT)
A foundation model from Visa Research. The paper describes "TransactionGPT (TGPT), a foundation model that captures complex consumer shopping dynamics from Multi-Modal-Temporal-Tabular (MMTT) data." and states: "Extensive experiments on large-scale, real-world payment data validate TGPT’s ability to learn meaningful transaction patterns, leading to significant performance improvements on critical downstream tasks. Furthermore, we quantify the benefits of several designs that enhance the TGPT’s efficiency and scalability."

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI [39.0]
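Large language model (LLM) queries are handled mainly by frontier models on centralized cloud infrastructure. Small LMs now achieve performance competitive with frontier models on many tasks. Can local inference viably redistribute demand away from centralized infrastructure? The paper proposes intelligence per watt (IPW) as a metric for evaluating the capability and efficiency of local inference.
Paper reference translation (metadata) (Wed, 12 Nov 2025 01:26:20 GMT)
A proposal for a metric called "Intelligence per Watt". The authors state: "we show that intelligence per watt has improved 5.3× from 2023-2025 through compounding advances in both model architectures (3.1×) and hardware accelerators (1.7×), with locally-serviceable query coverage increasing from 23.2% to 71.3%." The result also matches intuition.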
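As a quick sanity check on the quoted figures, the short Python sketch below multiplies the two reported factors. Treating the model and hardware gains as purely multiplicative is an assumption made here for illustration (the quote only says the advances "compound"), and the variable names are hypothetical. Under that assumption the product is about 5.27, which rounds to the reported 5.3×.

```python
# Figures quoted from the paper's abstract.
model_gain = 3.1       # improvement attributed to model architectures
hardware_gain = 1.7    # improvement attributed to hardware accelerators

# Assumption: the two gains compound multiplicatively.
combined_gain = model_gain * hardware_gain

# Locally-serviceable query coverage, in percent (also quoted above).
coverage_2023 = 23.2
coverage_2025 = 71.3
coverage_ratio = coverage_2025 / coverage_2023

print(f"combined IPW gain: {combined_gain:.2f}x")
print(f"coverage ratio:    {coverage_ratio:.2f}x")
```

The coverage ratio is a separate quantity from the IPW gain; it is printed only to show how much the share of locally serviceable queries grew over the same period.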

UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs

UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs [115.9]
UniLION efficiently processes large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences. It consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks.
Paper reference translation (metadata) (Mon, 03 Nov 2025 17:24:19 GMT)
A proposal for an RNN-based multimodal model: "We propose UniLION, a unified model that achieves both latent temporal fusion and multimodal fusion in UniLION backbone by the linear group RNN, generating the unified BEV features that serve all autonomous driving tasks, including perception, prediction, and planning." It is highly multimodal: "Unified Heterogeneous Inputs: Leveraging the superior long-range modeling capability and linear computational complexity of linear group RNNs, UniLION integrates multi-view images, LiDAR point clouds, and temporal information into a unified 3D backbone through direct token concatenation, eliminating hand-crafted fusion modules and providing a more elegant, scalable solution."
The repository is GitHub – happinesslz/UniLION

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents [46.3]
This paper introduces the OpenHands Software Agent SDK, a toolkit for implementing software development agents. For flexibility, it offers a simple interface for implementing agents that needs only a few lines of code in the default case. For security and reliability, it provides seamless local-to-remote execution portability and integrated REST/WebSocket services.
Paper reference translation (metadata) (Wed, 05 Nov 2025 18:16:44 GMT)
The OpenHands paper. "Unlike prior library-only SDKs (Anthropic, 2025a; OpenAI, 2024), OpenHands includes a built-in REST/WebSocket server for remote execution and a suite of interactive workspace interfaces—a browser-based VSCode IDE, VNC desktop, and persistent Chromium browser—for human inspection and control." It also excels as an integrated environment.
The repository is GitHub – OpenHands/software-agent-sdk: A clean, modular SDK for building AI agents with OpenHands V1.

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.2]
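The paper gives a comprehensive review of multimodal spatial reasoning tasks with large models. It reviews advances in embodied AI, including vision-language navigation and action models. It also considers emerging modalities such as audio and egocentric video, which contribute to spatial understanding through new sensors.
Paper reference translation (metadata) (Wed, 29 Oct 2025 17:55:43 GMT)
A survey of MLLMs.
The repository is GitHub – zhengxuJosh/Awesome-Multimodal-Spatial-Reasoning: This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs).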

Leveraging LLM-based agents for social science research: insights from citation network simulations

Leveraging LLM-based agents for social science research: insights from citation network simulations [132.4]
The paper introduces the CiteAgent framework, which generates citation networks based on human-behavior simulation. CiteAgent captures the predominant phenomena of real-world citation networks. It establishes two LLM-based research paradigms for social science, enabling existing theories to be validated and challenged.
Paper reference translation (metadata) (Wed, 05 Nov 2025 08:47:04 GMT)
"To further explore the social attributes of LLMs, we introduce the CiteAgent framework, designed to generate citation networks based on human-behavior simulation with LLM-based agents. CiteAgent successfully captures predominant phenomena in real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter." That said, it seems somewhat doubtful whether this is enough to establish the validity of this kind of LLM-based social simulation.
The repository is GitHub – Ji-Cather/CiteAgent: Official Implementation of CiteAgent Framework

Thought Branches: Interpreting LLM Reasoning Requires Resampling

Thought Branches: Interpreting LLM Reasoning Requires Resampling [11.0]
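The paper argues that studying a single sample is insufficient for understanding causal influence and the underlying computation. It presents case studies that use resampling to investigate model decisions.
Paper reference translation (metadata) (Fri, 31 Oct 2025 14:02:37 GMT)
A report on intervening in CoT and the effects of doing so: "we can measure a partial CoT’s impact by resampling only the subsequent text. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action?" Together with the prior work it builds on, this is an interesting behavioral analysis. The paper also pays attention to the distribution over CoT trajectories, for example "We address this by repeatedly resampling to remove sentences and by measuring resilience, the number of interventions required to erase a sentence’s content from a trace.", and uses methods that are computationally expensive but convincing.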
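The idea of measuring a partial CoT's impact by resampling the subsequent text can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the authors' implementation: `toy_model` is a hypothetical stand-in for an LLM, the hint sentence and its probabilities are invented for illustration, and the impact score (total variation distance between answer distributions with and without a sentence) is one simple choice of metric.

```python
import random
from collections import Counter

def toy_model(prefix, rng):
    """Hypothetical stand-in for an LLM: sample a final answer given a partial CoT."""
    # Toy behaviour (not from the paper): the hint sentence makes "A" more likely.
    p_a = 0.9 if "hint: pick A" in prefix else 0.3
    return "A" if rng.random() < p_a else "B"

def answer_distribution(prefix, n_samples, seed):
    """Resample the continuation many times and tabulate the final answers."""
    rng = random.Random(seed)
    counts = Counter(toy_model(prefix, rng) for _ in range(n_samples))
    return {answer: counts[answer] / n_samples for answer in ("A", "B")}

def sentence_impact(sentences, index, n_samples=2000, seed=0):
    """Total variation distance between answer distributions obtained by
    resampling after the prefix that includes sentence `index` and after
    the prefix that stops just before it."""
    with_sentence = answer_distribution(sentences[: index + 1], n_samples, seed)
    without_sentence = answer_distribution(sentences[:index], n_samples, seed + 1)
    return 0.5 * sum(abs(with_sentence[a] - without_sentence[a]) for a in with_sentence)

sentences = ["the problem has two options", "hint: pick A", "so the answer is"]
impacts = [sentence_impact(sentences, i) for i in range(len(sentences))]
for sentence, impact in zip(sentences, impacts):
    print(f"{impact:.2f}  {sentence}")
```

In this toy setup only the hint sentence shifts the answer distribution noticeably, while the other sentences score near zero. A single sampled trace could not show this, which is the motivation for resampling.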
