October 6, 2025 – arXiv最新論文の紹介

Sora 2, Claude Sonnet 4.5, GLM-4.6, DeepSeek v3.2-exp, HunyuanImage 3.0

Last week's big news was OpenAI's announcement of Sora 2 (Sora 2 is here | OpenAI). Video generation models have been noted for their potential to solve a variety of tasks (Video models are zero-shot learners and reasoners – arXiv最新論文の紹介) and to serve as world models (V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning – arXiv最新論文の紹介, SimVS: Simulating World Inconsistencies for Robust View Synthesis – arXiv最新論文の紹介, How Far is Video Generation from World Model: A Physical Law Perspective – arXiv最新論文の紹介, among others), and the news release touches on this as well.

Anthropic also announced Claude Sonnet 4.5 (Introducing Claude Sonnet 4.5 \ Anthropic). The results look like steady progress.

Updates to open models, such as GLM-4.6: Advanced Agentic, Reasoning and Coding Capabilities and deepseek-ai/DeepSeek-V3.2-Exp · Hugging Face, also deserve attention. For GitHub – Tencent-Hunyuan/HunyuanImage-3.0: HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation, a paper has been published on arXiv.

HunyuanImage 3.0 Technical Report [108.4]
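HunyuanImage 3.0 is a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework. The authors describe HunyuanImage 3.0 as the largest and most powerful open-source image generation model to date.
Paper reference translation (metadata) (Sun, 28 Sep 2025 16:14:10 GMT)
A very powerful open image model.
The model is available at tencent/HunyuanImage-3.0 · Hugging Face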

MuSLR: Multimodal Symbolic Logical Reasoning

MuSLR: Multimodal Symbolic Logical Reasoning [133.9]
Multimodal logical reasoning is important in high-stakes applications such as autonomous driving and diagnosis. We introduce MuSLR, the first benchmark for multimodal symbolic logical reasoning grounded in formal logical rules. We propose LogiCAM, a modular framework that improves the Chain-of-Thought performance of GPT-4.1 by 14.13%.
Paper reference translation (metadata) (Tue, 30 Sep 2025 06:42:20 GMT)
The paper builds MuSLR, a benchmark targeting multimodal symbolic logical reasoning, and proposes the modular LogiCAM as a baseline. The benchmark appears to be difficult even for current frontier models.
The remarks on directions for improvement are interesting: 「First, integrating dedicated symbolic modules is essential: the LogiCAM outperforms base VLMs precisely because it extracts multimodalities based on logic and embeds explicit symbolic reasoning steps. Second, existing VLMs struggle to align and fuse visual and textual information when performing formal logic; Future work should explore tighter multimodal integration, such as cross-modal architectures trained with logic-grounded objectives, to bridge this gap.」 Current models appear to struggle with formal processing.
The repository is MuSLR: Multimodal Symbolic Logical Reasoning

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents [79.8]
Ferret-UI Lite is a compact, end-to-end GUI agent that operates across a variety of platforms. Ferret-UI Lite achieves performance competitive with other small-scale GUI agents.
Paper reference translation (metadata) (Tue, 30 Sep 2025 17:13:56 GMT)
A report on a GUI agent from Apple. It is a small model: 「In this work, we present Ferret-UI Lite, a 3B multimodal LLM designed for GUI agentic tasks with a focus on lightweight, on-device settings. Through real and synthetic data curation, inference-time visual tool use, and a two-stage SFT–RL training strategy, Ferret-UI Lite achieves competitive grounding and navigation performance relative to larger models.」

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing [117.6]
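MinerU2.5 is a document parsing model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. The proposed approach adopts a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition.
Paper reference translation (metadata) (Mon, 29 Sep 2025 16:41:28 GMT)
The latest version of MinerU: An Open-Source Solution for Precise Document Content Extraction – arXiv最新論文の紹介, and a powerful 1.2B VLM. Its performance exceeds that of general-purpose models, commercial APIs, and specialized models.
The repository is GitHub – opendatalab/MinerU: Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows. A demo is also available at MinerU – a Hugging Face Space by opendatalab. It is fast and performs well.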
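
To make the coarse-to-fine idea concrete, here is a minimal toy sketch in Python. It is not MinerU's code or API: `downsample`, `analyze_layout`, `recognize`, and `parse` are hypothetical stand-ins, and the "page" is just a grid of 0/1 pixels. The only point is the split: layout is found on a cheap downsampled view, and recognition runs on native-resolution crops of the detected regions.

```python
from dataclasses import dataclass

Image = list[list[int]]  # toy page: rows of pixel values, 0 means blank


@dataclass
class Region:
    kind: str    # placeholder label; a real layout model would predict this
    top: int     # first row of the region, in native-resolution coordinates
    bottom: int  # one past the last row, in native-resolution coordinates


def downsample(page: Image, factor: int) -> Image:
    """Keep every `factor`-th row and column (stand-in for a real resize)."""
    return [row[::factor] for row in page[::factor]]


def analyze_layout(thumb: Image, factor: int) -> list[Region]:
    """Stage 1 (hypothetical): find bands of non-blank rows on the thumbnail."""
    regions: list[Region] = []
    start = None
    for i, row in enumerate(thumb + [[0]]):  # blank sentinel row closes the last band
        inked = any(row)
        if inked and start is None:
            start = i
        elif not inked and start is not None:
            # Map thumbnail rows back to native-resolution rows.
            regions.append(Region("text", start * factor, i * factor))
            start = None
    return regions


def recognize(crop: Image) -> str:
    """Stage 2 (hypothetical): stand-in for content recognition on one crop."""
    ink = sum(1 for row in crop for px in row if px)
    return f"{len(crop)} rows, {ink} inked pixels"


def parse(page: Image, factor: int = 2) -> list[tuple[Region, str]]:
    """Run layout on the thumbnail, then recognition on full-resolution crops."""
    thumb = downsample(page, factor)
    return [(r, recognize(page[r.top:r.bottom])) for r in analyze_layout(thumb, factor)]


# Toy page: two inked rows, two blank rows, four inked rows.
page = [[1] * 4] * 2 + [[0] * 4] * 2 + [[1] * 4] * 4
for region, text in parse(page):
    print(region, text)
```

In this toy version the first stage only ever sees a quarter of the pixels, while the second stage works at full resolution but only on the cropped regions, which is the efficiency argument the summary above attributes to the decoupled design.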

RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems

RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems [99.0]
Given a problem, we train a model that can propose multiple abstractions, followed by RL that incentivizes building a solution with them. The resulting RL training paradigm, called RLAD, jointly trains an abstraction generator and a solution generator. We show that, at large test budgets, allocating more test-time compute to generating abstractions benefits performance more than generating additional solutions.
Paper reference translation (metadata) (Thu, 02 Oct 2025 17:44:23 GMT)
The paper proposes reasoning abstractions (「We introduce reasoning abstractions: concise representations of procedural and factual knowledge that are expressed in natural language, as a means to broaden the reasoning strategies used by LLMs」) and confirms that performance improves when problems are solved through them. The results are interesting, and so is this remark: 「We tried training a single model to do both abstraction generation and solution generation, after a lightweight SFT on traces showing questions paired with abstractions and corresponding solutions, but we found this approach to very quickly lose the ability of proposing abstractions over the course of RL training.」 I wonder why that happens.
The project site is RLAD
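
The test-time trade-off described above can be pictured with a small toy sketch in Python. This is not the paper's implementation: `propose_abstraction` and `solve` are hypothetical callables standing in for the two generators, and aggregating by majority vote is an assumption made here for illustration. The sketch only shows how a fixed sampling budget splits into a number of abstractions times a number of solutions per abstraction.

```python
from collections import Counter
from typing import Callable


def solve_with_abstractions(
    problem: str,
    propose_abstraction: Callable[[str, int], str],
    solve: Callable[[str, str, int], int],
    n_abstractions: int,
    n_solutions_each: int,
) -> tuple[int, int]:
    """Sample abstractions, then solutions conditioned on each one.

    Returns the majority answer (an assumed aggregation rule) and the
    number of solution samples used.
    """
    answers: list[int] = []
    for i in range(n_abstractions):
        abstraction = propose_abstraction(problem, i)
        for j in range(n_solutions_each):
            answers.append(solve(problem, abstraction, j))
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, len(answers)


# Deterministic toy stand-ins so the sketch runs without any model.
def toy_propose(problem: str, i: int) -> str:
    hints = ["add the tens, then the ones", "round 25 up to 30 and correct", "guess"]
    return hints[i % len(hints)]


def toy_solve(problem: str, abstraction: str, j: int) -> int:
    # A useful hint leads to the right answer; "guess" drifts with the sample index.
    return 42 if abstraction != "guess" else 40 + j


answer, used = solve_with_abstractions("17 + 25", toy_propose, toy_solve, 3, 2)
print(answer, used)
```

With a budget of six samples, the same function can be called with (3, 2), (2, 3), or (1, 6) to compare breadth over abstractions against depth over solutions, which is the kind of allocation question the summary refers to.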
