April 2, 2025 – Introducing the latest arXiv papers

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction [4.2]
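This paper proposes PP-DocLayout, which achieves high accuracy and efficiency in recognizing 23 types of layout regions across diverse document formats. Beyond advancing the state of the art in document layout analysis, the work also provides a robust solution for constructing high-quality training data.
Reference translation of the paper (metadata) (Fri, 21 Mar 2025 15:20:47 GMT)
A proposal for a layout recognition model that can handle diverse data: "we present PPDocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats."
The repository is PaddleX/README_en.md at release/3.0-rc · PaddlePaddle/PaddleX · GitHub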

AdaWorld: Learning Adaptable World Models with Latent Actions [76.5]
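We propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. Latent actions are extracted from unlabeled videos, and an autoregressive world model conditioned on these latent actions is then developed.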
Reference translation of the paper (metadata) (Mon, 24 Mar 2025 17:58:15 GMT)
A proposal for AdaWorld: "We present AdaWorld, an autoregressive world model that is highly adaptable across various environments. It can readily transfer actions to different contexts and allows efficient adaptation with limited interactions." Its structure is described as follows: "AdaWorld consists of two key components: a latent action autoencoder that extracts actions from unlabeled videos, and an autoregressive world model that takes the extracted actions as conditions."
The repository is AdaWorld
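
The two-component structure quoted above can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: both learned networks are replaced by hand-written toy functions, frames are plain lists of numbers, and every name here (`encode_latent_action`, `predict_next_frame`, `extract_actions`, `rollout`) is hypothetical. It only shows the data flow, in which actions are inferred from consecutive frames of an unlabeled video and then used as conditions for step-by-step prediction in a different context.

```python
from typing import List

Frame = List[float]  # toy "frame": a flat list of pixel-like values


def encode_latent_action(frame: Frame, next_frame: Frame) -> float:
    """Toy stand-in for the latent action encoder (an assumption, not the paper's model).

    Summarizes the change between two consecutive frames as one number:
    the mean per-element difference.
    """
    return sum(b - a for a, b in zip(frame, next_frame)) / len(frame)


def predict_next_frame(frame: Frame, latent_action: float) -> Frame:
    """Toy stand-in for the world model: predicts the next frame
    from the current frame, conditioned on a latent action."""
    return [x + latent_action for x in frame]


def extract_actions(video: List[Frame]) -> List[float]:
    """Infer one latent action per pair of consecutive frames; no action labels needed."""
    return [encode_latent_action(a, b) for a, b in zip(video, video[1:])]


def rollout(start: Frame, actions: List[float]) -> List[Frame]:
    """Autoregressive rollout: each prediction is fed back in as the next input."""
    frames = [start]
    for action in actions:
        frames.append(predict_next_frame(frames[-1], action))
    return frames


# Extract actions from one unlabeled "video" and replay them in a new context.
source_video = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
actions = extract_actions(source_video)
transferred = rollout([10.0, 20.0], actions)
```

In the real system both functions would be learned networks and the latent action would be a vector rather than a scalar; the sketch keeps only the interface between the two components.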

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models [101.7]
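Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. Existing benchmarks for multimodal models either mainly evaluate the helpfulness of these models or focus only on limited perspectives such as fairness and privacy. To comprehensively evaluate the safety and trustworthiness of MMFMs, we propose MMDT (Multimodal DecodingTrust), the first unified platform.
Reference translation of the paper (metadata) (Wed, 19 Mar 2025 01:59:44 GMT)
A proposal for a framework to evaluate the trustworthiness of multimodal foundation models. The main perspectives covered are safety, hallucination, fairness, privacy, adversarial robustness, and out-of-distribution (OOD) robustness. Because the target is MMFMs, both T2I and I2T are included.
The project site is MMDecodingTrust Benchmark, and a leaderboard is also available at MMDecodingTrust Benchmark. Commercial models appear to score higher on average than openly released models, but it is interesting that the picture differs greatly depending on the evaluation axis.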