Document Understanding – Introducing the Latest arXiv Papers

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.1]
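MonkeyOCR v1.5 is a unified vision-language framework that strengthens both layout understanding and content recognition through a two-stage parsing pipeline. To handle complex table structures, the authors propose a reinforcement learning scheme based on visual consistency, which evaluates recognition quality through render-and-compare alignment. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables that contain embedded images and reconstruction of tables that span pages or columns.
Reference translation of the paper (metadata) (Fri, 14 Nov 2025 01:48:44 GMT)
An update to MonkeyOCR. The authors report: "Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios."
The repository is GitHub – Yuliang-Liu/MonkeyOCR: A lightweight LMM-based Document Parsing Model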

OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation

OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation [39.3]
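OmniLayout-1M is the first million-scale dataset of document layouts. We introduce OmniLayout-LLM, a 0.5B model with a designed two-stage learning paradigm. Our code, models, and dataset will be publicly released.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 07:39:54 GMT)
Proposes OmniLayout-1M, a document layout dataset, and OmniLayout-LLM.
The authors state: "Our code, models, and dataset will be publicly released."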

ChatGPT Atlas, Ring-1T, DeepSeek OCR, olmOCR 2

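Last week there was a lot of talk about ChatGPT Atlas. I have high hopes for agents that, like GUI agents (more precisely, browser agents), use a UI the way a person would.

Ring-1T is an LRM from Ant Group: a 1T-parameter MoE model with strong performance.

DeepSeek OCR also went viral. What is interesting is less its OCR performance than the effectiveness of using image data as context. On the OCR side, v2 of olmOCR has also been released, and open-source activity is lively.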

Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model [100.9]
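Ring-1T is the first open-source, state-of-the-art thinking model with trillion-scale parameters. It has 1 trillion total parameters and activates roughly 50 billion per token.
Reference translation of the paper (metadata) (Tue, 21 Oct 2025 17:46:14 GMT)
A large-scale LRM. Scale is part of the story, but the authors claim performance exceeding existing open models such as DeepSeek V3.1.
The repository is GitHub – inclusionAI/Ring-V2: Ring-V2 is a reasoning MoE LLM provided and open-sourced by InclusionAI. The model is at inclusionAI/Ring-1T · Hugging Face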
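As a quick back-of-the-envelope check on the figures quoted above (1 trillion total parameters, roughly 50 billion activated per token), the sketch below computes the share of the model that is active for each token. The function name is illustrative and does not come from the paper or the repository.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Return the share of parameters used per token in an MoE model."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


# Figures quoted in the abstract: 1T total, about 50B active per token.
ring_1t = active_fraction(1e12, 50e9)
print(f"{ring_1t:.1%} of parameters active per token")  # prints "5.0% ..."
```

In other words, only about one twentieth of the weights take part in any single forward step, which is what makes a model of this total size practical to serve.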

DeepSeek-OCR: Contexts Optical Compression [15.6]
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M. Experiments show that when the number of text tokens is within 10 times the number of vision tokens, the model can achieve a decoding (OCR) precision of 97%.
Reference translation of the paper (metadata) (Tue, 21 Oct 2025 02:41:44 GMT)
An LLM setup that treats document images as context. The report states: "In this technical report, we propose DeepSeek-OCR and preliminarily validate the feasibility of contexts optical compression through this model, demonstrating that the model can effectively decode text tokens exceeding 10 times the quantity from a small number of vision tokens. We believe this finding will facilitate the development of VLMs and LLMs in the future." The approach appears to be efficient.
The repository is GitHub – deepseek-ai/DeepSeek-OCR: Contexts Optical Compression
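The "within 10 times" condition above can be read as a bound on a compression ratio: text tokens divided by vision tokens. The following is a minimal sketch of that arithmetic only. The function names and the example token counts are assumptions made for illustration; they are not part of the DeepSeek-OCR code base.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def within_reported_regime(text_tokens: int, vision_tokens: int,
                           max_ratio: float = 10.0) -> bool:
    """True if the ratio falls in the range where the abstract reports
    about 97% decoding precision (text tokens within 10x vision tokens)."""
    return compression_ratio(text_tokens, vision_tokens) <= max_ratio


# Hypothetical page: 900 text tokens encoded into 100 vision tokens.
print(compression_ratio(900, 100))        # 9.0
print(within_reported_regime(900, 100))   # True
print(within_reported_regime(2000, 100))  # False (ratio 20.0)
```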

olmOCR 2: Unit Test Rewards for Document OCR [29.5]
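olmOCR 2 is the latest in a family of powerful OCR systems for converting digitized print documents, such as PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a 7B vision-language model (VLM) trained using reinforcement learning. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark.
Reference translation of the paper (metadata) (Wed, 22 Oct 2025 17:53:02 GMT)
This one is OCR proper: version 2 of olmOCR. The authors write: "To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases." An approach that leverages synthetic data.
The repository is GitHub – allenai/olmocr: Toolkit for linearizing PDFs for LLM datasets/training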
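To make the idea of "unit test rewards" concrete, here is a minimal sketch in which an OCR output is scored by the fraction of simple pass/fail checks it satisfies. This is an illustration under stated assumptions: the two check types (`text_present`, `appears_before`) and the averaging rule are hypothetical stand-ins, and the reward actually used to train olmOCR 2 is not described in this digest.

```python
from typing import Callable, List

# A unit test is any function that inspects OCR output and passes or fails.
UnitTest = Callable[[str], bool]


def text_present(snippet: str) -> UnitTest:
    """Check that a snippet from the ground truth appears in the output."""
    return lambda output: snippet in output


def appears_before(first: str, second: str) -> UnitTest:
    """Check reading order: `first` must occur before `second`."""
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def unit_test_reward(output: str, tests: List[UnitTest]) -> float:
    """Hypothetical reward: the fraction of unit tests that pass."""
    if not tests:
        return 0.0
    return sum(test(output) for test in tests) / len(tests)


tests = [
    text_present("Quarterly Report"),
    text_present("Revenue"),
    appears_before("Quarterly Report", "Revenue"),
    text_present("Appendix"),  # missing from the output below
]
ocr_output = "Quarterly Report\nRevenue grew this quarter."
print(unit_test_reward(ocr_output, tests))  # 0.75
```

With synthetic documents whose ground-truth HTML is known, checks of this kind can be generated automatically, which is what allows the test set to scale.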

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing [117.6]
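MinerU2.5 is a document parsing model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. The approach adopts a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition.
Reference translation of the paper (metadata) (Mon, 29 Sep 2025 16:41:28 GMT)
The latest version of MinerU ("MinerU: An Open-Source Solution for Precise Document Content Extraction – Introducing the Latest arXiv Papers"), a strong 1.2B VLM. It reportedly outperforms general-purpose models, commercial APIs, and specialized models.
The repository is GitHub – opendatalab/MinerU: Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows. A demo is also available at MinerU – a Hugging Face Space by opendatalab; it is fast and performs well.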

Hunyuan3D-Omni, Qwen3-Omni, LongCat-Flash-Thinking, EmbeddingGemma, Logics-Parsing

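Development of open models is very active, and last week Qwen3 Omni seemed to be the most talked-about release. On arXiv, promising models other than Qwen3 Omni have also been announced one after another.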

Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets [34.7]
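Hunyuan3D-Omni is a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. Our model unifies all signals in a single cross-modal architecture. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness in production workflows.
Reference translation of the paper (metadata) (Thu, 25 Sep 2025 14:39:17 GMT)
An implementation focused on 3D.
The repository is GitHub – Tencent-Hunyuan/Hunyuan3D-Omni: Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets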

Qwen3-Omni Technical Report [105.1]
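Qwen3-Omni is a single multimodal model that maintains state-of-the-art performance across text, image, audio, and video. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages.
Reference translation of the paper (metadata) (Mon, 22 Sep 2025 13:26:24 GMT)
A multimodal model in the Qwen family.
The repository is GitHub – QwenLM/Qwen3-Omni: Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.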

LongCat-Flash-Thinking Technical Report [116.8]
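LongCat-Flash-Thinking is an open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process. LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks.
Reference translation of the paper (metadata) (Tue, 23 Sep 2025 10:25:48 GMT)
An MoE-based LRM that claims SoTA among open-source models.
The repository is meituan-longcat/LongCat-Flash-Thinking · Hugging Face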

EmbeddingGemma: Powerful and Lightweight Text Representations [42.4]
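EmbeddingGemma is a new lightweight, open text embedding model based on the Gemma 3 language model family. It uses a spread-out regularizer to improve model robustness and expressiveness. We release EmbeddingGemma to the community to promote further research.
Reference translation of the paper (metadata) (Wed, 24 Sep 2025 17:56:51 GMT)
A small but powerful embedding model.
The repository is EmbeddingGemma – a google Collection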

Logics-Parsing Technical Report [9.0]
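We propose Logics-Parsing, an end-to-end LVLM-based model augmented with reinforcement learning. The model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. LogicsParsingBench is a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories.
Reference translation of the paper (metadata) (Wed, 24 Sep 2025 04:54:37 GMT)
An LVLM that is effective for Document Understanding.
The repository is GitHub – alibaba/Logics-Parsing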

DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding

DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding [97.4]
This paper introduces DocR1, an MLLM trained with a new RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy. We show that DocR1 achieves state-of-the-art performance on multi-page tasks while maintaining strong results on single-page benchmarks.
Reference translation of the paper (metadata) (Sun, 10 Aug 2025 12:03:45 GMT)
Proposes a framework for reading comprehension over documents with many pages.
The authors explain: "When engaging in multi-page reading comprehension, humans typically begin by identifying the pages likely to contain the answer, and then focus on locating the specific regions that correspond to the question and answer within those pages. Inspired by this “coarse-to-fine” reading strategy, EviGRPO mimics the human approach by first selecting a small set of potentially relevant pages at a coarse level, followed by fine-grained reasoning over the selected content." That said, I wonder whether this kind of domain-specific (task-specific) approach is still effective.
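The coarse-to-fine reading strategy quoted above (first pick a few likely pages, then look closely inside them) can be sketched as a two-stage function pipeline. This is only a toy illustration of the control flow: the word-overlap scoring is a stand-in chosen for simplicity, not the EviGRPO reward or the DocR1 model, and all function names are hypothetical.

```python
from typing import List, Tuple


def overlap(question: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def select_pages(question: str, pages: List[str], top_k: int = 2) -> List[int]:
    """Coarse stage: keep the indices of the top_k most relevant pages."""
    ranked = sorted(range(len(pages)),
                    key=lambda i: overlap(question, pages[i]),
                    reverse=True)
    return sorted(ranked[:top_k])


def find_evidence(question: str, pages: List[str],
                  page_ids: List[int]) -> Tuple[int, str]:
    """Fine stage: pick the best-matching sentence within selected pages."""
    best_page, best_sentence, best_score = -1, "", -1
    for i in page_ids:
        for sentence in pages[i].split("."):
            sentence = sentence.strip()
            if not sentence:
                continue
            score = overlap(question, sentence)
            if score > best_score:
                best_page, best_sentence, best_score = i, sentence, score
    return best_page, best_sentence


pages = [
    "Introduction to the annual report",
    "Total revenue in 2024 was 12 million. Costs were stable",
    "Contact information and legal notices",
]
question = "what was total revenue in 2024"
selected = select_pages(question, pages)
print(selected)                                  # [0, 1]
print(find_evidence(question, pages, selected))  # (1, 'Total revenue in 2024 was 12 million')
```

The design point is that the expensive fine-grained step only runs over the pages that survive the coarse filter, which matters when a document has many pages.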

Docopilot: Improving Multimodal Models for Document-Level Understanding

Docopilot: Improving Multimodal Models for Document-Level Understanding [87.6]
To support in-depth understanding of multimodal documents, we propose Doc-750K, a high-quality document-level dataset. The dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop Docopilot, a native multimodal model that can accurately handle document-level dependencies without relying on RAG.
Reference translation of the paper (metadata) (Sat, 19 Jul 2025 16:03:34 GMT)
Builds a large-scale multimodal dataset for document understanding and an InternVL2-based model. The paper reports improved performance: "The proposed Docopilot-8B shows a notable improvement over baseline models [73], achieving a +19.9% accuracy gain compared to InternVL2-8B and surpassing InternVL2-26B with less than 31% of the inference latency. Additionally, Docopilot-2B uses fewer parameters (less than 10%) while exhibiting comparable performance to the 10× larger InternVL2-26B."
The repository is OpenGVLab/Docopilot: [CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends [11.4]
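Visually-Rich Document Understanding (VRDU) has emerged as an important field because of the need to automatically process documents that contain complex visual, textual, and layout information. This survey reviews recent advances in MLLM-based VRDU, highlighting three core components.
Reference translation of the paper (metadata) (Mon, 14 Jul 2025 02:10:31 GMT)
A survey of Document Understanding, including the handling of figures and layout.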

DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning

DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning [39.1]
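Document image segmentation is essential for document analysis and recognition. Existing methods handle these tasks separately, which results in limited generalization and wasted resources. This paper introduces DocSAM, a transformer-based unified framework designed for a variety of document image segmentation tasks.
Reference translation of the paper (metadata) (Sat, 05 Apr 2025 07:14:53 GMT)
On document image segmentation, which remains important even in the current heyday of MLLMs, the paper proposes an approach in which "DocSAM integrates layout analysis, multi-grained text segmentation, and table structure decomposition into a single model, reducing the need for specialized models and enhancing efficiency."
The repository is GitHub – xhli-git/DocSAM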

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction [4.2]
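This paper proposes PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. Beyond advancing the state of the art in document layout analysis, this work also provides a robust solution for constructing high-quality training data.
Reference translation of the paper (metadata) (Fri, 21 Mar 2025 15:20:47 GMT)
Proposes a layout recognition model that can handle diverse data: "we present PPDocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats."
The repository is PaddleX/README_en.md at release/3.0-rc · PaddlePaddle/PaddleX · GitHub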
