MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.1]
MonkeyOCR v1.5は、2段階の解析パイプラインを通じてレイアウト理解とコンテンツ認識の両方を強化する、統一されたビジョン言語フレームワークである。複雑なテーブル構造に対処するために,レンダリング・アンド・コンペアアライメントによる認識品質の評価を行う視覚的一貫性に基づく強化学習手法を提案する。組込み画像を含むテーブルの信頼性の高い解析と、ページや列を横断するテーブルの再構築を可能にするために、2つの特別なモジュール、Image-Decoupled Table ParsingとType-Guided Table Mergingが導入されている。
論文参考訳（メタデータ） (Fri, 14 Nov 2025 01:48:44 GMT)
MonkeyOCRのアップデート、「Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.」とのこと。
リポジトリはGitHub – Yuliang-Liu/MonkeyOCR: A lightweight LMM-based Document Parsing Model

コメントを残す

コメントを残す コメントをキャンセル