May 2025 – Introducing the latest arXiv papers

DD-Ranking: Rethinking the Evaluation of Dataset Distillation

DD-Ranking: Rethinking the Evaluation of Dataset Distillation [223.3]
This paper proposes DD-Ranking, a unified evaluation framework, together with new general evaluation metrics that reveal the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research.
Reference translation of the paper (metadata) (Mon, 19 May 2025 16:19:50 GMT)
A benchmark proposal for dataset distillation. The authors state: "It aims to provide a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data. Under the finding that the test accuracy no longer fits the need for fair and comprehensive evaluation, we design new metrics for both the label representation and data augmentation." One of the observations motivating the work is also interesting: "DD-Ranking demonstrate that previous performance improvements commonly originate from the enhanced model training techniques instead of the distilled dataset."
Repository: GitHub – NUS-HPC-AI-Lab/DD-Ranking: Data distillation benchmark

Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection

Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection [48.2]
The paper introduces a new, realistic, large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. WikiDYK leverages recently added, human-written facts from Wikipedia's "Did You Know…" entries. WikiDYK contains 12,290 facts and 77,180 questions.
Reference translation of the paper (metadata) (Sun, 18 May 2025 08:39:05 GMT)
The authors state: "Our extensive experiments reveal a critical limitation: under continued pre-training, Causal Language Models (CLMs) exhibit significantly weaker knowledge memorization compared to Bidirectional Language Models (BiLMs). To address this gap, we proposed a modular collaborative framework that integrates BiLMs as dynamic external knowledge repositories with LLMs." Causal LMs dominate at the moment, but could BiLMs find a practical use? Perhaps it depends on the speed issue...?
Repository: GitHub – zhang-yu-wei/WikiDYK

Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

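Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.8]
This task defines three QA subsets for testing audio models on interactive question answering over diverse acoustic scenes. Preliminary results on the development set are compared and show strong variation across models and subsets. The challenge aims to advance the audio understanding and reasoning capabilities of audio models toward human level.
Reference translation of the paper (metadata) (Mon, 12 May 2025 09:04:16 GMT)
An Audio Question Answering benchmark and a description of the DCASE 2025 Challenge. It goes one step beyond the audio captioning task, and I think it is a task that will grow in importance.
Repository: PeacefulData/2025_DCASE_AudioQA_Official · Datasets at Hugging Face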

Generative AI for Autonomous Driving: Frontiers and Opportunities

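Generative AI for Autonomous Driving: Frontiers and Opportunities [145.6]
This survey provides a comprehensive synthesis of the role of GenAI across the autonomous driving stack. It begins by distilling the principles and trade-offs of modern generative modeling, including VAEs, GANs, diffusion models, and large language models. It then categorizes practical applications such as synthetic data generalization, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI.
Reference translation of the paper (metadata) (Tue, 13 May 2025 17:59:20 GMT)
A survey of generative AI and autonomous driving. This is an area with many players and many tasks.
Repository: GitHub – taco-group/GenAI4AD: a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack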

Visual Planning: Let’s Think Only with Images

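Visual Planning: Let’s Think Only with Images [30.7]
The authors argue that language is not always the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometric information. They therefore propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is carried out through a sequence of images that encode step-by-step reasoning in the visual domain.
Reference translation of the paper (metadata) (Fri, 16 May 2025 16:17:22 GMT)
The authors state: "By enabling models to operate entirely through visual state transitions without textual mediation, we demonstrate that purely visual representations can lead to more effective and intuitive planning," Text is powerful but not a cure-all, and it makes sense that for some tasks images are effective at the planning level. Very interesting. As I also thought with GRIT, approaches that exploit the power of images look very promising.
Repository: GitHub – yix8/VisualPlanning: Visual Planning: Let’s Think Only with Images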

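GRIT: Teaching MLLMs to Think with Images [22.7]
Grounded Reasoning with Images and Texts (GRIT) is a new method for training MLLMs to think with images. GRIT generates reasoning chains that interleave natural language with explicit bounding box coordinates. GRIT achieves exceptional data efficiency, requiring only 20 image-question-answer triplets from existing datasets.
Reference translation of the paper (metadata) (Wed, 21 May 2025 17:54:49 GMT)
Project site: GRIT: Teaching MLLMs to Think with Images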
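To make the idea of a reasoning chain that interleaves natural language with bounding box coordinates more concrete, here is a minimal Python sketch that pulls boxes out of such a chain. The `[x1, y1, x2, y2]` notation, the sample text, and the function name `extract_boxes` are assumptions made for illustration; they are not taken from GRIT, whose actual output format may differ.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

# Assumed notation (not GRIT's confirmed format): four comma-separated
# integers in square brackets, read as [x1, y1, x2, y2].
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(reasoning: str) -> List[Box]:
    """Return every well-formed box found in the reasoning text, in order."""
    boxes: List[Box] = []
    for match in BOX_PATTERN.finditer(reasoning):
        x1, y1, x2, y2 = (int(group) for group in match.groups())
        # Keep only boxes with positive width and height.
        if x1 < x2 and y1 < y2:
            boxes.append((x1, y1, x2, y2))
    return boxes


# Hypothetical reasoning chain mixing text and coordinates.
chain = (
    "The cup is on the left of the table [12, 40, 96, 150]. "
    "The laptop sits next to it [110, 30, 300, 160], "
    "so the cup is closer to the edge."
)
print(extract_boxes(chain))
```

Keeping the coordinates inline in the text means the grounding can be checked with nothing more than a parser like this one.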

Think Only When You Need with Large Hybrid-Reasoning Models

Think Only When You Need with Large Hybrid-Reasoning Models [121.6]
Large Hybrid-Reasoning Models (LHRMs) are models that can adaptively decide whether or not to think based on the contextual information of the user query. Experiments show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type.
Reference translation of the paper (metadata) (Wed, 21 May 2025 05:17:34 GMT)
A proposal for a hybrid LLM/LRM approach. It is interesting that it inserts a first stage, described as: "We begin with a hybrid-formatted supervised fine-tuning stage named Hybrid Fine-Tuning (HFT) that integrates both reasoning-intensive (Thinking) and direct-answer (No-Thinking) data. This approach mitigates the instability often observed in cold-start scenarios [GYZ+25], and establishes a robust initialization for next stage reinforcement learning."
I am a little curious whether the abbreviation LHRM will catch on.
Repository: Advancing AI for Humanity

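Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.1]
Large reasoning models (LRMs) have significantly improved their reasoning capabilities by generating long chains of thought. This performance gain comes at the cost of a substantial increase in redundant reasoning during generation. The paper proposes Self-Braking Tuning (SBT), a new framework that tackles overthinking from the perspective of allowing the model to regulate its own reasoning process.
Reference translation of the paper (metadata) (Tue, 20 May 2025 16:53:40 GMT)
The authors state: "we propose a novel endogenous approach, Self-Braking Tuning (SBT), to mitigating overthinking in large language models." In terms of saving tokens, this is close to the hybrid-reasoning work above.
Repository: GitHub – ZJU-REAL/Self-Braking-Tuning: Let LLMs Break Free from Overthinking via Self-Braking Tuning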

XRAG: Cross-lingual Retrieval-Augmented Generation

XRAG: Cross-lingual Retrieval-Augmented Generation [21.5]
XRAG is designed to evaluate the generation abilities of LLMs. It is constructed from recent news articles, which ensures that external knowledge is required to answer the questions.
Reference translation of the paper (metadata) (Thu, 15 May 2025 08:47:55 GMT)
A RAG benchmark for the cross-lingual setting. It is built so that LLMs cannot answer from their internal knowledge.
The authors report: "(3) We find that in the monolingual retrieval setting, all evaluated LLMs face issues with Response Language Correctness, an issue that has received little attention from the research community. (4) In the multilingual retrieval setting, the primary challenge for LLMs does not lie in non-English generation, but in reasoning over retrieved information across languages." The benchmark is harder than expected, and the results are interesting.
I would like to take a look at the data.

NExT-Search: Rebuilding User Feedback Ecosystem for Generative AI Search

NExT-Search: Rebuilding User Feedback Ecosystem for Generative AI Search [108.4]
NExT-Search is a next-generation paradigm designed to reintroduce fine-grained, process-level feedback into generative AI search. NExT-Search integrates two complementary modes.
Reference translation of the paper (metadata) (Tue, 20 May 2025 17:59:13 GMT)
A proposal for how feedback should work for search in the era of generative AI ("it disrupts the feedback-driven improvement loop that has historically powered the evolution of traditional Web search.").

Large Language Models for Computer-Aided Design: A Survey

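Large Language Models for Computer-Aided Design: A Survey [33.4]
Large language models (LLMs) have advanced rapidly in recent years. As modern designs grow more complex, the potential for LLMs to streamline computer-aided design (CAD) and make it more efficient is increasing. This paper presents the first systematic survey exploring the intersection of LLMs and CAD.
Reference translation of the paper (metadata) (Tue, 13 May 2025 00:19:04 GMT)
A survey of LLMs and CAD.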

How Reliable is Multilingual LLM-as-a-Judge?

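How Reliable is Multilingual LLM-as-a-Judge? [11.6]
The paper evaluates five models from different model families on five diverse tasks covering 25 languages. Consistency varies greatly across languages, and performance is particularly poor for low-resource languages. The authors propose an ensemble strategy that improves the consistency of multilingual judgments in real-world applications.
Reference translation of the paper (metadata) (Sun, 18 May 2025 02:32:35 GMT)
A performance evaluation of LLM-as-a-judge in a multilingual setting. The results give the impression that even GPT-4o struggles. There are many interesting statements, such as "we find that powerful open-source models, such as Qwen-2.5, achieve comparable performance to OpenAI models in multilingual judgment tasks." and "Aya fails to demonstrate noticeable improvements. This suggests that fine-tuning with multilingual data may not directly enhance a model’s ability to perform accurate multilingual judgments."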
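As a rough illustration of what cross-language consistency and ensembling of judges can look like, the following Python sketch combines verdicts from several judge models by majority vote and then measures how often the per-language verdicts agree. The text above does not say which ensemble strategy or consistency measure the paper uses, so majority voting, the `agreement_rate` measure, and all data below are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(verdicts: List[str]) -> str:
    """Return the most common verdict; ties go to the verdict seen first."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    counts = Counter(verdicts)
    best = max(counts.values())
    for verdict in verdicts:
        if counts[verdict] == best:
            return verdict
    raise AssertionError("unreachable")


def agreement_rate(verdict_by_language: Dict[str, str]) -> float:
    """Fraction of languages whose verdict matches the most common verdict."""
    verdicts = list(verdict_by_language.values())
    top = majority_vote(verdicts)
    return sum(v == top for v in verdicts) / len(verdicts)


# Hypothetical verdicts ("A" or "B" wins) from three judge models for the
# same comparison, presented in three languages.
judgments = {
    "en": ["A", "A", "B"],
    "ja": ["A", "B", "B"],
    "sw": ["B", "A", "A"],
}

# Ensemble the judges within each language, then check cross-language agreement.
ensembled = {lang: majority_vote(votes) for lang, votes in judgments.items()}
print(ensembled)
print(agreement_rate(ensembled))
```

A judge that is consistent across languages would give the same verdict for every translation of the same input, which corresponds to an agreement rate of 1.0 in this sketch.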
