January 24, 2025 – Introducing the latest arXiv papers

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning [72.6]
We introduce OCRBench v2, a large-scale bilingual text-centric benchmark for text recognition. The results show that 20 of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations.
Reference translation of the paper (metadata) (Tue, 31 Dec 2024 07:32:35 GMT)
An OCR benchmark targeting MLLMs. The authors report: "After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 36 out of 38 LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, finegrained perception, layout perception, complex element parsing, and logical reasoning."
The repository is at https://github.com/YuliangLiu/MultimodalOCR

Benchmarking Large and Small MLLMs [71.8]
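Large multimodal language models (MLLMs) have made remarkable progress in understanding and generating multimodal content. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. Small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offer a promising alternative with faster inference, lower deployment costs, and the ability to handle domain-specific scenarios.
Reference translation of the paper (metadata) (Sat, 04 Jan 2025 07:44:49 GMT)
A comprehensive evaluation of MLLMs.
The authors report that "GPT-4o establishes a new standard for multimodal understanding and reasoning across diverse input types, setting a benchmark in versatility and cognitive capacity." They also note that "Although LLaVA-NeXT and Phi-3-Vision excel in specialized recognition tasks, they exhibit limitations in advanced reasoning and temporal sequence processing."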
Since this is a study from Microsoft, an update covering Phi-4 is also something to look forward to. microsoft/phi-4 · Hugging Face

Foundations of Large Language Models [50.0]
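This book is organized into four main chapters, each exploring a key area: pre-training, generative models, prompting techniques, and alignment methods. It is intended for college students, professionals, and practitioners in natural language processing and related fields.
Reference translation of the paper (metadata) (Thu, 16 Jan 2025 01:03:56 GMT)
At over 200 pages, it is effectively a textbook on LLMs.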
Note that the license is Deed – Attribution-NonCommercial 4.0 International – Creative Commons, so commercial use is not permitted.