March 10, 2026 – Introducing the latest arXiv papers

FireRed-OCR Technical Report

FireRed-OCR Technical Report [30.0]
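This paper introduces FireRed-OCR, a framework that turns general-purpose VLMs into pixel-precise experts at structural document parsing. To address the scarcity of high-quality structured data, the authors build a "Geometry + Semantics" Data Factory. They also propose a three-stage progressive training strategy that guides the model from pixel-level perception to logical structure generation.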
Reference translation of the paper (metadata) (Mon, 02 Mar 2026 13:19:23 GMT)
Announcements of OCR improvements keep coming. This paper strengthens an MLLM with the following approach: "This curriculum includes: (1) Multi-task Pre-alignment to ground the model’s understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax)."
The repository is FireRedTeam/FireRed-OCR on GitHub.
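To make the third stage more concrete, here is a minimal sketch of what a format-constrained reward and a group-relative advantage could look like. It is not taken from the paper or the repository: the function names `format_reward` and `group_relative_advantages`, the specific checks (HTML `<table>` closure, paired `$$` delimiters, balanced braces), the all-or-nothing 0/1 reward, and the use of the population standard deviation are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def format_reward(markdown: str) -> float:
    """Toy syntactic-validity reward (hypothetical): 1.0 if all checks pass, else 0.0."""
    # Every opened HTML table must have a matching close tag.
    tables_closed = markdown.count("<table>") == markdown.count("</table>")
    # Display-math delimiters "$$" must come in pairs.
    math_paired = markdown.count("$$") % 2 == 0
    # Curly braces must balance; a crude proxy for formula syntax.
    depth = 0
    braces_ok = True
    for ch in markdown:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                braces_ok = False
                break
    braces_ok = braces_ok and depth == 0
    return 1.0 if (tables_closed and math_paired and braces_ok) else 0.0


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    if sigma == 0:
        # All samples scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Four hypothetical outputs sampled for the same page image.
samples = [
    "<table><tr><td>1</td></tr></table>",  # valid
    "<table><tr><td>1</td></tr>",          # table never closed
    "$$x^{2}$$",                           # valid
    "$$x^{2}$",                            # unpaired math delimiter
]
rewards = [format_reward(s) for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # [1.0, -1.0, 1.0, -1.0]
```

In this sketch, outputs that break the format are pushed down relative to their well-formed siblings in the same group, which is the general idea behind using a group-relative signal for syntactic validity.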

ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments [135.0]
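This paper introduces ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation. The key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments. The authors therefore propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then trains domain-specialized experts, and finally reconciles them through data-free model merging.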
Reference translation of the paper (metadata) (Tue, 03 Mar 2026 17:53:45 GMT)
The paper states: "we introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model (MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain-agnostic foundation for cross-embodiment transfer." In other words, it names spatial intelligence as the common element that ties these embodiments together.
The project site is ACE-Brain Homepage.
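The "reconcile" step relies on data-free model merging, which combines expert weights without any further training data. The sketch below illustrates one simple form of this idea: averaging each expert's parameter delta from the shared base. This is a swapped-in, generic technique for illustration only; the summary above does not say which merging method the paper uses. The `merge_experts` function, the parameter name `w`, and the toy values are all hypothetical.

```python
def merge_experts(base, experts, scale=None):
    """Add the scaled sum of expert deltas (expert - base) to the shared base.

    `base` and each expert map parameter names to lists of floats.
    With scale=None the deltas are averaged (scale = 1 / number of experts).
    """
    if not experts:
        raise ValueError("need at least one expert")
    factor = 1.0 / len(experts) if scale is None else scale
    merged = {}
    for name, base_values in base.items():
        merged[name] = [
            b + factor * sum(expert[name][i] - b for expert in experts)
            for i, b in enumerate(base_values)
        ]
    return merged


# Toy weights: a shared spatial scaffold and two domain experts (hypothetical).
scaffold = {"w": [1.0, 2.0]}
driving_expert = {"w": [1.5, 2.0]}
manipulation_expert = {"w": [1.0, 3.0]}

merged = merge_experts(scaffold, [driving_expert, manipulation_expert])
print(merged)  # {'w': [1.25, 2.5]}
```

No training examples are touched here; only the weights themselves are combined, which is what makes this kind of merge data-free.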

A Very Big Video Reasoning Suite [155.7]
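Rapid progress in video models has focused largely on visual quality, leaving their reasoning capabilities underexplored. The Very Big Video Reasoning (VBVR) dataset is an unprecedentedly large-scale resource spanning 200 curated reasoning tasks. VBVR-Bench is a verifiable evaluation framework that goes beyond model-based judging by using rule-based, human-aligned scorers.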
Reference translation of the paper (metadata) (Tue, 24 Feb 2026 17:59:15 GMT)
The paper states: "we present the VBVR suite, centered on an unprecedentedly large-scale and continually growing dataset for video reasoning, VBVR-Dataset, together with a verifiable, human-aligned evaluation toolkit, VBVR-Bench." The scale is very large. On the benchmark, it reports: "Proprietary models perform better overall, led by Sora 2 (0.546) and Veo 3.1 (0.480), particularly in Abstraction and Transformation categories. Fine-tuning Wan2.2-I2V-A14B on VBVR-Dataset yields VBVR-Wan2.2, which achieves a new state-of-the-art with an overall score of 0.685, representing an 84.6% relative improvement over its base model." So fine-tuning appears to have a large effect.
The project site is A Very Big Video Reasoning Suite.
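The figures quoted above can be unpacked with a little arithmetic. The sketch below back-calculates the base model's overall score implied by "0.685" and "84.6% relative improvement", and compares VBVR-Wan2.2 with the Sora 2 score. The base score is not quoted in this summary, so the value here is an inference from rounded numbers, not a reported result. The helper names are hypothetical.

```python
def implied_base_score(new_score: float, relative_improvement: float) -> float:
    """Solve new = base * (1 + improvement) for base."""
    return new_score / (1.0 + relative_improvement)


def relative_gain(reference: float, new_score: float) -> float:
    """Relative improvement of new_score over reference."""
    return new_score / reference - 1.0


# Implied overall score of the base Wan2.2-I2V-A14B model (inferred, not quoted).
base = implied_base_score(0.685, 0.846)
print(round(base, 3))  # 0.371

# How far VBVR-Wan2.2 (0.685) sits above Sora 2 (0.546) on the quoted scores.
print(round(relative_gain(0.546, 0.685), 3))  # 0.255
```

Under these numbers, the fine-tuned model starts from roughly 0.37, below both proprietary models, and ends about 25% above the strongest one quoted.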