Multimodal – Introducing the Latest arXiv Papers

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation [89.7]
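MultiFinBen is the first multilingual and multimodal benchmark tailored to the global financial domain. We introduce two new tasks, EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks. We propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 22:01:49 GMT)
A multimodal, multilingual benchmark for the financial domain. It appears to include Japanese data as well.
The repository is GitHub – xueqingpeng/MultiFinBen, and the data is published on Hugging Face (e.g., TheFinAI/PolyFiQA-Easy · Datasets at Hugging Face).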

Holmes: Automated Fact Check with Large Language Models

Holmes: Automated Fact Check with Large Language Models [31.8]
This study uses Large Language Models (LLMs) for automated disinformation detection. We propose Holmes, an end-to-end framework featuring a novel evidence retrieval method. The proposed method (1) uses LLM-powered summarization to extract key information from open sources and (2) proposes a new algorithm and metrics for evaluating evidence quality.
Reference translation of the paper (metadata) (Tue, 06 May 2025 03:19:51 GMT)
A paper on fact-checking; its careful exposition and findings are very informative.
- "Finding 1: LLMs CANNOT accurately verify the truthfulness of the claim directly.", "Finding 2: LLMs have shortcomings in searching for claim-relevant public information and their responses may include hallucinated links that weaken result trustworthiness.", "Finding 3: Human-written evidence enhances LLMs’ ability to verify multimodal claims and generate coherent justifications."
Holmes was designed on the basis of these findings, and the authors report confirming its effectiveness.

Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions

Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions [39.2]
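The challenge of improving user experience in complex systems involving search and recommendation (S&R) has attracted significant attention from both academia and industry. This paper proposes a new multimodal information retrieval dataset, Qilin. The dataset is collected from Xiaohongshu, which has 300 million monthly active users and an average search penetration rate above 70%.
Reference translation of the paper (metadata) (Sat, 01 Mar 2025 14:15:00 GMT)
A dataset targeting multimodal search and recommendation.
The repository is GitHub – RED-Search/Qilin: Resources and code for the Qilin dataset.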

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [63.2]
Chain-of-Thought (CoT) has significantly improved the reasoning capabilities of Large Language Models (LLMs). We introduce MME-CoT, a specialized benchmark for evaluating the CoT reasoning performance of LMMs. We conduct an in-depth analysis of state-of-the-art LMMs and uncover several key insights.
Reference translation of the paper (metadata) (Thu, 13 Feb 2025 18:59:46 GMT)
A benchmark described as: "we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes."
The project site is MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency. Surprisingly, the top entry on the leaderboard is Kimi k1.5, ahead of GPT-4o.

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.4]
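This work introduces a new benchmark called MMDocIR, which comprises two distinct tasks: page-level and layout-level retrieval. The MMDocIR benchmark consists of a rich dataset with annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions.
Reference translation of the paper (metadata) (Wed, 15 Jan 2025 14:30:13 GMT)
A retrieval benchmark for long, multimodal documents. Its distinguishing feature is that it covers two tasks: document page-level and layout-level retrieval.
The repository is MMDocIR (MMDocIR).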
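
As a rough illustration of how page-level retrieval can be scored, the sketch below computes recall@k over a ranked list of page indices. The choice of metric, the function name `recall_at_k`, and the page numbers are assumptions made for illustration; they are not taken from the MMDocIR paper or its repository.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_pages: Sequence[int], relevant_pages: Iterable[int], k: int) -> float:
    """Fraction of relevant pages that appear among the top-k retrieved pages."""
    relevant = set(relevant_pages)
    if not relevant:
        raise ValueError("relevant_pages must not be empty")
    top_k = set(ranked_pages[:k])
    return len(relevant & top_k) / len(relevant)


# Hypothetical example: a retriever ranks page indices of one long document
# for a single question whose evidence sits on pages 3 and 5.
ranked = [12, 3, 47, 8, 5]
gold = [3, 5]

print(recall_at_k(ranked, gold, k=1))  # 0.0
print(recall_at_k(ranked, gold, k=3))  # 0.5
print(recall_at_k(ranked, gold, k=5))  # 1.0
```

Layout-level retrieval could be scored the same way by replacing page indices with identifiers of layout elements, again as an assumption rather than the benchmark's stated protocol.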

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.0]
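Our objective is to translate continuous sign language into spoken-language text. We incorporate the signing video together with additional contextual cues. We show that the contextual approach significantly improves translation quality.
Reference translation of the paper (metadata) (Thu, 16 Jan 2025 18:59:03 GMT)
A proposal for a machine translation approach for sign language that leverages contextual information, as in: "(i) we propose a new LLM-based model that integrates visual signing and text features with contextual information, including video background descriptions and previous sentence translations;"
The repository is Lost in Translation, Found in Context.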

M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs

M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs [66.8]
We introduce M$^3$oralBench, the first MultiModal Moral Benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in Moral Foundations Vignettes (MFVs) and uses the text-to-image diffusion model SD3.0 to create the corresponding scenario images. It conducts moral evaluation across the six moral foundations of Moral Foundations Theory (MFT) and includes tasks in moral judgement, moral classification, and moral response.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 05:18:55 GMT)
A multimodal moral benchmark built on six moral foundations: "Care/Harm (dislike for suffering of others), Fairness/Cheating (proportional fairness), Loyalty/Betrayal (group loyalty), Authority/Subversion (respect for authority and tradition), Sanctity/Degradation (concerns for purity and contamination), Liberty/Oppression (concerns on oppression and coercion)".
The repository is GitHub – BeiiiY/M3oralBench: The official Github page for “M³oralBench: A MultiModal Moral Benchmark for LVLMs”

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey [93.7]
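Next Token Prediction (NTP) is a versatile training objective for machine learning tasks. This survey introduces a comprehensive taxonomy that unifies understanding and generation in multimodal learning. The proposed taxonomy covers five key aspects: multimodal tokenization, MMNTP model architectures, unified task representation, datasets and evaluation, and open challenges.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 03:00:30 GMT)
A survey of next token prediction, which has become a standard technique, with a focus on multimodal learning.
The repository is GitHub – LMM101/Awesome-Multimodal-Next-Token-Prediction: Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey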

AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities

AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities [5.8]
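We propose AnySat, a multimodal model based on JEPA and resolution-adaptive spatial encoders. To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets. We then train a single powerful model on these diverse datasets simultaneously.
Reference translation of the paper (metadata) (Wed, 18 Dec 2024 18:11:53 GMT)
A proposal for a foundation model that can handle diverse Earth observation data in a unified way. The authors write, "We have presented AnySat, a versatile architecture designed to address the diversity of EO data in terms of resolutions, scales, and modalities.", and its effectiveness has also been validated.
The repository is GitHub – gastruc/AnySat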

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action [103.6]
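We propose TACO, a multi-modal large action model aimed at improving performance on complex, multi-step, multi-modal tasks. During inference, TACO generates chains-of-thought-and-action (CoTA), executing intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator. Training on synthetic CoTA data enables TACO to learn complex reasoning and action paths and to outperform existing models trained on tuning data that contains only direct answers.
Reference translation of the paper (metadata) (Sat, 07 Dec 2024 00:42:04 GMT)
A proposal for a model described as: "Our TACO model is able to output a Chain-of-Thought-and-Action (CoTA) and answer challenging questions based on the thoughts and action outputs". It is a multimodal model with actions, and it reportedly leverages synthetic data built with GPT-4o and other models.
The project site is TACO.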
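
To make the idea of a chain-of-thought-and-action more concrete, here is a minimal sketch of executing a sequence of (thought, tool, argument) steps and collecting each tool's output. Everything in it is hypothetical: `fake_ocr`, `calculator`, `run_chain`, and the hard-coded chain are illustrative stand-ins, not TACO's actual tools, data format, or API. In a real model each step would be generated conditioned on the earlier observations rather than fixed in advance.

```python
import ast
import operator
from typing import Callable, Dict, List, Tuple

# Arithmetic operators the toy calculator accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression without using eval()."""
    def evaluate(node: ast.AST):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    return str(evaluate(ast.parse(expression, mode="eval").body))


def fake_ocr(image_id: str) -> str:
    """Hypothetical stand-in for an OCR tool: returns canned text per image id."""
    return {"receipt.png": "3 items at 4 each"}.get(image_id, "")


Step = Tuple[str, str, str]  # (thought, tool name, tool argument)


def run_chain(steps: List[Step], tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Run each action in order and return the observation produced by each tool."""
    observations = []
    for _thought, tool_name, argument in steps:
        observations.append(tools[tool_name](argument))
    return observations


tools = {"ocr": fake_ocr, "calculator": calculator}
chain = [
    ("Read the text on the receipt.", "ocr", "receipt.png"),
    ("Multiply the quantity by the unit price.", "calculator", "3 * 4"),
]
print(run_chain(chain, tools))  # ['3 items at 4 each', '12']
```

The final answer would then be derived from the last observation; keeping tools behind a plain name-to-function mapping makes it easy to swap a toy tool for a real one.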
