Introducing the latest arXiv papers

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? [36.8]
CLAIMCHECK is an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews extracted from OpenReview. CLAIMCHECK is richly annotated by ML experts for the weaknesses stated in the reviews and the paper claims that they dispute, as well as for fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark LLMs on three claim-centric tasks supported by CLAIMCHECK: (1) associating weaknesses with the claims they dispute, (2) predicting fine-grained labels for weaknesses and rewriting the weaknesses to enhance their specificity, and (3) verifying a paper's claims with grounded reasoning.
Reference translation of the paper (metadata) (Thu, 27 Mar 2025 17:29:45 GMT)
A benchmark proposal: "This work has introduced CLAIMCHECK—a benchmark of reviewer-identified weaknesses in NeurIPS 2023 and 2024 submissions, richly annotated with descriptive labels by experts and grounded in the claims that they dispute in the reviewed papers. Further, we benchmark various LLMs on three novel tasks enabled by CLAIMCHECK—Weakness Labeling and Editing (WLE), Claim Association (CA), and Claim Verification (CV)—all aimed at assisting reviewers during the peer review process." The tasks remain difficult for current LLMs.
The repository is reportedly at https://github.com/JHU-CLSP/CLAIMCHECK

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities [101.8]
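Recent advances in Large Reasoning Models (LRMs) have shown remarkable performance on specialized reasoning tasks. We show that acquiring deliberative reasoning capabilities significantly degrades the foundational capabilities of LRMs. We also show that adaptive reasoning (Zero-Thinking, Less-Thinking, and Summary-Thinking) can effectively mitigate these drawbacks.
Reference translation of the paper (metadata) (Sun, 23 Mar 2025 08:18:51 GMT)
The results in Table 5, "The overall results of different LRMs under the Zero-Thinking, Summary-Thinking and Summary-Thinking-Plus mode for the evaluation of foundational capabilities.", are very interesting. They make clear that simply spending more compute on reasoning is not enough and that adaptive strategies matter.
The repository is GitHub – SCIR-SC-Qiaoban-Team/FreeEvalLM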

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction [4.2]
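This paper presents PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. Beyond advancing the state of the art in document layout analysis, this work also provides a robust solution for constructing high-quality training data.
Reference translation of the paper (metadata) (Fri, 21 Mar 2025 15:20:47 GMT)
A proposal for a layout detection model that can handle diverse data: "we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats."
The repository is PaddleX/README_en.md at release/3.0-rc · PaddlePaddle/PaddleX · GitHub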

AdaWorld: Learning Adaptable World Models with Latent Actions

AdaWorld: Learning Adaptable World Models with Latent Actions [76.5]
We propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. We then develop an autoregressive world model conditioned on latent actions.
Reference translation of the paper (metadata) (Mon, 24 Mar 2025 17:58:15 GMT)
A proposal for AdaWorld: "We present AdaWorld, an autoregressive world model that is highly adaptable across various environments. It can readily transfer actions to different contexts and allows efficient adaptation with limited interactions." Its architecture is described as "AdaWorld consists of two key components: a latent action autoencoder that extracts actions from unlabeled videos, and an autoregressive world model that takes the extracted actions as conditions."
The repository is AdaWorld

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models [101.7]
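Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. Existing benchmarks for multimodal models primarily assess the helpfulness of these models or focus only on limited perspectives such as fairness and privacy. We present MMDT (Multimodal DecodingTrust), the first unified platform for comprehensively evaluating the safety and trustworthiness of MMFMs.
Reference translation of the paper (metadata) (Wed, 19 Mar 2025 01:59:44 GMT)
A proposal for a trustworthiness evaluation framework for multimodal foundation models. The main targets are safety, hallucination, fairness, privacy, adversarial robustness, and out-of-distribution (OOD) robustness. Because it covers MMFMs, both T2I and I2T models are included.
The project site is MMDecodingTrust Benchmark, and a leaderboard also exists (MMDecodingTrust Benchmark). Commercial models appear to score higher than open models on average, but it is interesting that the picture differs greatly depending on the evaluation axis.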

Can LLMs Automate Fact-Checking Article Writing?

Can LLMs Automate Fact-Checking Article Writing? [69.9]
We argue for the need to extend the typical fact-checking pipeline with the automatic generation of full fact-checking articles. We develop QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers.
Reference translation of the paper (metadata) (Sat, 22 Mar 2025 07:56:50 GMT)
Rather than ordinary fact-checking, this proposes the QRAFT framework, described as "QRAFT as a multi-agent collaboration that mimics the fact-checking article writing process of human experts".
It performs better than other methods, but it is a pity that "Our evaluation shows that while QRAFT outperforms several previously proposed text-generation approaches, it lags considerably behind expert-written articles."

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [51.3]
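Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advances in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains.
Reference translation of the paper (metadata) (Thu, 20 Mar 2025 17:59:38 GMT)
A survey on preventing overthinking and on efficient reasoning.
The repository is GitHub – Eclipsess/Awesome-Efficient-Reasoning-LLMs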

Survey on Evaluation of LLM-based Agents

Survey on Evaluation of LLM-based Agents [28.9]
The emergence of LLM-based agents represents a paradigm shift in AI. This paper presents the first comprehensive survey of evaluation methodologies for these agents.
Reference translation of the paper (metadata) (Thu, 20 Mar 2025 17:59:23 GMT)
A survey on agent evaluation: "We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents."

Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation

Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation [10.6]
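We introduce MusiCoT, a novel chain-of-thought (CoT) prompting technique tailored for music generation. MusiCoT empowers the AR model to first outline the overall music structure before generating audio tokens. Experimental results show that MusiCoT consistently achieves superior performance on both objective and subjective metrics.
Reference translation of the paper (metadata) (Tue, 25 Mar 2025 12:51:21 GMT)
"this paper presents MusiCoT, a novel chain-of-thought prompting technique that enhances high-fidelity music generation by aligning the creative processes of AR models with musical thought." So CoT comes to music generation as well…
The repository is MusiCoT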

Gemini 2.5, Deepseek V3, MCP …

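The weekly stream of LLM news continues. Gemini 2.5 is Google DeepMind's latest model and performs very well (Gemini 2.5: Our newest Gemini model with thinking). It is impressive that performance keeps rising even on very hard datasets, with 18.8% on Humanity’s Last Exam. Deepseek V3 has also received an update that improves on the initial version (DeepSeek-V3-0324 Release | DeepSeek API Docs, deepseek-ai/DeepSeek-V3-0324 · Hugging Face). The technical reports for Gemma 3 and Qwen2.5 Omni also deserve attention.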

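Beyond the LLMs themselves, there is plenty of buzzworthy news, such as OpenAI's MCP support (Model context protocol (MCP) – OpenAI Agents SDK) and image generation AI (Introducing 4o Image Generation | OpenAI). There are also new moves such as Reve AI | Next-Gen AI Image Generator with Reve Image 1.0; this is a truly active field.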

Gemma 3 Technical Report [198.3]
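Gemma 3 is a multimodal addition to the Gemma family of lightweight open models. This version introduces vision understanding abilities, wider language coverage, and longer context. We also change the model architecture to reduce the KV-cache memory that tends to explode with long context.
Reference translation of the paper (metadata) (Tue, 25 Mar 2025 15:52:34 GMT)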
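To make the KV-cache point concrete, here is a minimal sketch of why cache memory grows with context length and how restricting some layers to a short attention window can shrink it. The summary above does not say which architectural change was made, so the interleaving of sliding-window ("local") and full-context ("global") layers is an assumption for illustration. The layer counts, head sizes, and window length are hypothetical and are not Gemma 3's actual configuration.

```python
def kv_cache_bytes(cached_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of a KV cache holding `cached_tokens` positions.

    The factor 2 accounts for one key and one value vector
    per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_value


def interleaved_kv_cache_bytes(seq_len: int, n_global_layers: int,
                               n_local_layers: int, window: int,
                               n_kv_heads: int, head_dim: int,
                               bytes_per_value: int = 2) -> int:
    """KV cache size when only some layers attend to the full context.

    Global layers cache every position; local (sliding-window) layers
    cache at most `window` positions. This layout is an assumption
    made for illustration.
    """
    global_part = kv_cache_bytes(seq_len, n_global_layers, n_kv_heads,
                                 head_dim, bytes_per_value)
    local_part = kv_cache_bytes(min(seq_len, window), n_local_layers,
                                n_kv_heads, head_dim, bytes_per_value)
    return global_part + local_part


if __name__ == "__main__":
    # Hypothetical model: 48 layers, 8 KV heads of dimension 128, 16-bit values.
    seq_len = 128 * 1024
    full = kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8, head_dim=128)
    mixed = interleaved_kv_cache_bytes(seq_len, n_global_layers=8,
                                       n_local_layers=40, window=1024,
                                       n_kv_heads=8, head_dim=128)
    print(f"all layers global: {full / 2**30:.2f} GiB")
    print(f"8 global + 40 local layers: {mixed / 2**30:.2f} GiB")
```

With all layers global, the cache grows linearly with sequence length. In the interleaved layout only the global layers keep growing, because each local layer's cache is capped at the window length.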

Qwen2.5-Omni Technical Report [31.0]
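This report presents an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses. Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench.
Reference translation of the paper (metadata) (Wed, 26 Mar 2025 04:17:55 GMT)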
