November 18, 2025 – Introducing the latest arXiv papers

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.1]
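MonkeyOCR v1.5 is a unified vision-language framework that strengthens both layout understanding and content recognition through a two-stage parsing pipeline. To handle complex table structures, the authors propose a visual-consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables that span pages or columns.
Reference translation of the paper (metadata) (Fri, 14 Nov 2025 01:48:44 GMT)
An update to MonkeyOCR. According to the authors, "Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios."
The repository is GitHub – Yuliang-Liu/MonkeyOCR: A lightweight LMM-based Document Parsing Model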
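
To make the render-and-compare idea above more concrete, here is a minimal sketch. It is not the paper's method: the abstract describes a visual comparison, while this toy swaps in a purely structural stand-in that expands a predicted table and a reference table into dense cell grids and scores the fraction of matching positions. The `Cell` tuple format and the function names `render_grid` and `consistency_reward` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cell format: (text, rowspan, colspan). Not taken from the paper.
Cell = Tuple[str, int, int]


def render_grid(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand row/column spans into a dense grid of cell texts."""
    occupied: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already covered by a span from an earlier row.
            while (r, c) in occupied:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = text
            c += colspan
    if not occupied:
        return []
    n_rows = max(r for r, _ in occupied) + 1
    n_cols = max(c for _, c in occupied) + 1
    return [[occupied.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def _cell_at(grid: List[List[str]], r: int, c: int) -> Optional[str]:
    """Return the grid entry at (r, c), or None when out of bounds."""
    if r < len(grid) and c < len(grid[r]):
        return grid[r][c]
    return None


def consistency_reward(pred: List[List[Cell]], ref: List[List[Cell]]) -> float:
    """Fraction of grid positions where the two rendered tables agree."""
    g_pred, g_ref = render_grid(pred), render_grid(ref)
    n_rows = max(len(g_pred), len(g_ref))
    n_cols = max(len(g_pred[0]) if g_pred else 0, len(g_ref[0]) if g_ref else 0)
    if n_rows * n_cols == 0:
        return 1.0  # both tables are empty
    matches = 0
    for r in range(n_rows):
        for c in range(n_cols):
            a = _cell_at(g_pred, r, c)
            if a is not None and a == _cell_at(g_ref, r, c):
                matches += 1
    return matches / (n_rows * n_cols)


reference = [[("A", 1, 2)], [("b", 1, 1), ("c", 1, 1)]]
missed_span = [[("A", 1, 1)], [("b", 1, 1), ("c", 1, 1)]]
print(consistency_reward(reference, reference))    # identical tables
print(consistency_reward(missed_span, reference))  # header colspan was missed
```

A score of this kind could serve as a scalar reward signal; a real system would presumably compare rendered images rather than text grids.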

Music Flamingo: Scaling Music Understanding in Audio Language Models [98.9]
Music Flamingo is a new large audio-language model designed to advance music understanding in foundational audio models. MF-Skills is a dataset labeled through a multi-stage pipeline, yielding rich captions and question-answer pairs that cover harmony, structure, timbre, lyrics, and cultural context. MF-Think is a new chain-of-thought dataset grounded in music theory; training on it is followed by GRPO-based reinforcement learning with custom rewards.
Reference translation of the paper (metadata) (Fri, 14 Nov 2025 01:43:47 GMT)
"Unlike speech or environmental sounds, music is inherently layered, expressive, and structured, combining surface-level acoustic attributes (tempo, key, timbre) with mid-level organization (harmony, form, rhythm) and higher-level dimensions (lyrics, style, affect, cultural context). Capturing this multi-faceted nature of music requires models that can move beyond surface-level recognition toward reasoning and interpretation more akin to a trained musician." The paper proposes a model for music understanding, which is a very difficult task.
The project site is Music Flamingo: Scaling Music Understanding in Audio Language Models – NVIDIA ADLR
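
For readers unfamiliar with the GRPO step mentioned in the Music Flamingo abstract, the sketch below shows the group-relative advantage commonly used in GRPO: each sampled response's reward is normalized by the mean and standard deviation of its group. The reward values are made up, the function name `group_relative_advantages` is an assumption, and the paper's custom rewards are not reproduced here. Implementations also differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group of sampled responses."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some code uses sample std
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Responses that score above the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to form a baseline.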

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.7]
MIRA is a new benchmark designed to evaluate models in scenarios where generating intermediate images is essential to successful reasoning. It contains 546 multimodal problems, each annotated with intermediate images and a final answer.
Reference translation of the paper (metadata) (Tue, 04 Nov 2025 18:00:51 GMT)
"To bridge this gap, we introduce MIRA (Multimodal Imagination for Reasoning Assessment), a benchmark designed to evaluate reasoning scenarios where generating or leveraging intermediate visual representations is essential. Each instance is constructed according to three principles: (1) requiring intermediate visual cues to answer the question, (2) pairing each instance with annotated step-wise visual clues to enable evaluation under a Visual-CoT setup, and (3) enforcing strict human annotation and cross-validation to guarantee data quality." The paper proposes a benchmark for reasoning that requires visual, image-based intermediate representations. The tasks are hard even for frontier models (although open models also appear to hold their own).
The project site is When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

AlphaResearch: Accelerating New Algorithm Discovery with Language Models [60.5]
Large language models have made great progress on complex but easy-to-verify problems, yet they still struggle to discover the unknown. The paper presents AlphaResearch, an autonomous research agent aimed at discovering new algorithms for open-ended problems.
Reference translation of the paper (metadata) (Wed, 12 Nov 2025 02:03:05 GMT)
The authors report a surprising result: "The novel algorithms discovered by AlphaResearch not only surpass best-of-human performance but also significantly outperform the state-of-the-art results achieved by AlphaEvolve." They add: "Our approach demonstrates the potential of employing LLM to discover unexplored research area, enabling language models to effectively tackle complex open-ended tasks. We construct AlphaResearchComp, including 8 open-ended algorithmic problems, where AlphaResearch outperforms human researchers in 2/8 algorithmic problems but lags behind in the remaining 6 problems." Evaluation is difficult, but we are in a remarkable era in which it is no longer surprising for such systems to surpass humans.
The repository is GitHub – answers111/alpha-research: Repo for “AlphaResearch: Accelerating New Algorithm Discovery with Language Models”