Introducing the latest arXiv papers

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? [32.7]
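SWE-Perf is the first benchmark designed to evaluate LLMs (Large Language Models) on code performance optimization tasks within authentic repository contexts. SWE-Perf consists of 140 carefully curated instances, each derived from a performance-improving pull request in a popular GitHub repository.
Reference translation of the paper (metadata) (Wed, 16 Jul 2025 17:05:17 GMT)
A proposal for a benchmark that measures performance optimization ability. The ranking is Claude-4-sonnet > Gemini-2.5-pro > OpenAI-o3, but results are poor across the board.
The project site is SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?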

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research [33.8]
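AbGen is the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. We also develop AbGen-Eval, a meta-evaluation benchmark that assesses the reliability of commonly used automated evaluation systems.
Reference translation of the paper (metadata) (Thu, 17 Jul 2025 17:09:22 GMT)
A proposal for a benchmark that tests whether models can generate ablation studies and whether they can evaluate them. All current LLMs perform poorly.
The repositories are yale-nlp/AbGen · Datasets at Hugging Face and GitHub – yale-nlp/AbGen: Data and code for the ACL 2025 paper “AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research”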

NeoBabel: A Multilingual Open Tower for Visual Generation

NeoBabel: A Multilingual Open Tower for Visual Generation [32.8]
We introduce NeoBabel, a novel multilingual image generation framework. It supports six languages: English, Chinese, Dutch, French, Hindi, and Persian. It achieves state-of-the-art multilingual performance while retaining strong English capability.
Reference translation of the paper (metadata) (Tue, 08 Jul 2025 16:19:45 GMT)
A proposal for a multilingual image generation model that does not go through translation: “This paper introduces NeoBabel, a novel multilingual image generation framework that represents the first scalable solution for direct text-to-image synthesis across six languages. Through meticulous curation of high-quality multilingual vision-language datasets and end-to-end training, NeoBabel establishes direct cross-lingual mappings between textual descriptions and visual outputs across all supported languages.” Culture-specific words are hard to translate, so models like this matter.
The repository is NeoBabel: A Multilingual Open Tower for Visual Generation

Robust Multimodal Large Language Models Against Modality Conflict

Robust Multimodal Large Language Models Against Modality Conflict [94.1]
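Multimodal large language models (MLLMs) are prone to hallucination in real-world scenarios. We study the inherent conflicts between inputs from different modalities that place MLLMs in a dilemma and lead directly to hallucination. Three methods are proposed to mitigate hallucination caused by modality conflict.
Reference translation of the paper (metadata) (Wed, 09 Jul 2025 11:18:38 GMT)
An overview of countermeasures against hallucinations specific to MLLMs (those related to inconsistency between modalities). The authors also build a dataset called “Multimodal Modality Conflict (MMMC)” and evaluate on it. The evaluation tries prompt engineering, SFT, and reinforcement learning as ways to reduce hallucination, and reports: “Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance.”
The repository is GitHub – zmzhang2000/MMMC: Official repository for Robust Multimodal Large Language Models Against Modality Conflict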

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs [45.8]
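Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad range of tasks. However, they apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones. This survey presents a comprehensive review of efficient test-time compute strategies, which aim to improve the computational efficiency of LLM reasoning.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 18:27:42 GMT)
A survey: “This survey presents a comprehensive review of efficient test-time compute (TTC) strategies, which aim to improve the computational efficiency of LLM reasoning. We introduce a two-tiered taxonomy that distinguishes between L1 controllability—methods that operate under fixed compute budgets—and L2 adaptiveness—methods that dynamically scale inference based on input difficulty or model confidence.”
Hybrid approaches are also popular in commercial models, and I suspect this is an area where developers are struggling in many ways.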
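To make the L1/L2 distinction from the quoted abstract concrete, here is a minimal sketch using majority voting over sampled answers. It is an illustration under stated assumptions, not a method taken from the survey: the `Sampler` interface, the toy samplers, and the `threshold` and `min_samples` values are all hypothetical stand-ins for a real model call and real tuning.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: maps a sample index to one answer string.
# In practice this would wrap a call to a reasoning model.
Sampler = Callable[[int], str]


def majority(answers: List[str]) -> str:
    """Return the most frequent answer seen so far."""
    return Counter(answers).most_common(1)[0][0]


def fixed_budget_vote(sample: Sampler, budget: int) -> Tuple[str, int]:
    """L1-style control: always spend exactly `budget` samples, then vote."""
    answers = [sample(i) for i in range(budget)]
    return majority(answers), budget


def adaptive_vote(sample: Sampler, max_budget: int,
                  min_samples: int = 3, threshold: float = 0.8) -> Tuple[str, int]:
    """L2-style adaptiveness: stop early once one answer clearly dominates."""
    answers: List[str] = []
    for i in range(max_budget):
        answers.append(sample(i))
        top_count = Counter(answers).most_common(1)[0][1]
        agreement = top_count / len(answers)  # crude proxy for model confidence
        if len(answers) >= min_samples and agreement >= threshold:
            break
    return majority(answers), len(answers)


# Toy samplers (assumptions for illustration only).
HARD_ANSWERS = ["7", "9", "7", "8", "7", "9", "7", "7"]


def easy_sampler(i: int) -> str:
    return "42"  # the model always agrees with itself


def hard_sampler(i: int) -> str:
    return HARD_ANSWERS[i % len(HARD_ANSWERS)]  # the model disagrees with itself
```

With these toy samplers and a maximum budget of 8, the adaptive variant stops after 3 samples on the easy input and uses all 8 on the hard one, while the fixed-budget variant spends 8 on both.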

Predicting thinking time in Reasoning models [42.6]
Reasoning models produce long, hidden chains of thought. Users have little insight into how much time the model will spend reasoning before it returns an answer.
Reference translation of the paper (metadata) (Sun, 29 Jun 2025 15:01:01 GMT)
A report on predicting reasoning time in LRMs.
“In this paper, we explore methods for online prediction of thinking time in reasoning models. Our experiments demonstrate that current models encode a notion of progress in their internal representations, with an mlp probe achieving 45% accuracy over 10 classes, moreover the errors appear highly local (MAE 1).”
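The quoted result combines two metrics: exact-match accuracy over 10 progress classes and mean absolute error (MAE) measured in classes. The sketch below shows how the two relate. The bin labels are made-up toy data, not results from the paper, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_mae(true_bins: Sequence[int],
                     pred_bins: Sequence[int]) -> Tuple[float, float]:
    """Exact-match accuracy and mean absolute error over ordinal class labels."""
    if len(true_bins) != len(pred_bins) or len(true_bins) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_bins)
    exact = sum(t == p for t, p in zip(true_bins, pred_bins))
    abs_err = sum(abs(t - p) for t, p in zip(true_bins, pred_bins))
    return exact / n, abs_err / n


# Toy labels (assumption): 10 progress bins, with every miss landing in a neighbouring bin.
true_bins = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
pred_bins = [0, 2, 2, 4, 4, 6, 5, 7, 9, 8]

acc, mae = accuracy_and_mae(true_bins, pred_bins)
```

On this toy data the accuracy is 0.4 and the MAE is 0.6. A probe can miss the exact class more often than not and still have a small MAE when its errors fall in adjacent bins, which is what "highly local" errors means here.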

VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots

VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots [45.0]
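This work proposes an architecture that automatically verifies task plans before they are executed in a simulator or a real environment. The module uses the reasoning capabilities of Large Language Models to assess logical consistency and identify potential gaps in a plan. We contribute to improving the reliability and efficiency of task planning and address the need for robust pre-execution verification in autonomous systems.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 15:31:36 GMT)
For task plan verification, the paper proposes an approach that also uses a formal language: “In this paper, we propose an architecture for automatically verifying high-level task plans before their execution in simulator or real-world environments. Leveraging Large Language Models (LLMs), our approach consists of two key steps: first, the conversion of natural language instructions into Linear Temporal Logic (LTL), followed by a comprehensive analysis of action sequences.”
The repository is VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots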
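As a rough illustration of what checking an action sequence against an LTL formula can look like, here is a minimal sketch. It is not the paper's implementation. The formula is hand-written rather than produced by an LLM, the action names are hypothetical, and the evaluator assumes finite, non-empty traces in which each step is the set of propositions that hold at that step.

```python
from typing import FrozenSet, List, Tuple

Formula = Tuple  # nested tuples such as ("until", f, g); a hypothetical encoding
Trace = List[FrozenSet[str]]


def holds(f: Formula, trace: Trace, i: int = 0) -> bool:
    """Evaluate formula `f` at position `i` of a finite, non-empty trace."""
    op = f[0]
    if op == "atom":
        return f[1] in trace[i]
    if op == "not":
        return not holds(f[1], trace, i)
    if op == "and":
        return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == "eventually":  # F f: f holds at some position from i onward
        return any(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "always":  # G f: f holds at every position from i onward
        return all(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "until":  # f U g: g eventually holds, and f holds at every step before that
        for j in range(i, len(trace)):
            if holds(f[2], trace, j):
                return True
            if not holds(f[1], trace, j):
                return False
        return False
    raise ValueError(f"unknown operator: {op}")


def to_trace(plan: List[str]) -> Trace:
    """Treat each action in a plan as the only proposition true at its step."""
    return [frozenset({action}) for action in plan]


# Hypothetical spec: (not pour_water) U open_bottle, and F pour_water.
# In words: do not pour before the bottle is open, and do pour at some point.
SPEC: Formula = (
    "and",
    ("until", ("not", ("atom", "pour_water")), ("atom", "open_bottle")),
    ("eventually", ("atom", "pour_water")),
)

good_plan = ["grasp_bottle", "open_bottle", "pour_water"]
bad_plan = ["grasp_bottle", "pour_water", "open_bottle"]
```

Here `holds(SPEC, to_trace(good_plan))` is true, and `holds(SPEC, to_trace(bad_plan))` is false because pouring happens before the bottle is opened.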

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [112.5]
We introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. Using this dataset, we develop a suite of models that generate motion gestures and facial expressions aligned with human speech.
Reference translation of the paper (metadata) (Fri, 27 Jun 2025 18:09:49 GMT)
A dataset: “we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools.”
The repository is GitHub – facebookresearch/seamless_interaction: Foundation Models and Data for Human-Human and Human-AI interactions.

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents [105.4]
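VLM2Vec-V2 is a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 00:51:57 GMT)
A proposal for “MMEB-V2, an advanced multimodal embedding dataset designed to train and evaluate embedding models across three key visual modalities: images, videos, and visual documents.” and for VLM2Vec-V2, an embedding model that makes use of it. A fairly general-purpose 2vec.
The project site is VLM2Vec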

ReTimeCausal: EM-Augmented Additive Noise Models for Interpretable Causal Discovery in Irregular Time Series

ReTimeCausal: EM-Augmented Additive Noise Models for Interpretable Causal Discovery in Irregular Time Series [32.2]
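This paper studies causal discovery in irregularly sampled time series in high-stakes domains such as finance, healthcare, and climate science. ReTimeCausal is a novel integration of Additive Noise Models (ANM) and Expectation-Maximization (EM) that unifies physics-guided data imputation with sparse causal inference.
Reference translation of the paper (metadata) (Fri, 04 Jul 2025 05:39:50 GMT)
A report on causal discovery for irregularly sampled time series data: “we propose ReTimeCausal (Recovery for Irregular Time-series Causal Discovery). ReTimeCausal integrates Additive Noise Models (ANMs) with an Expectation-Maximization (EM) framework to jointly perform noise-aware data imputation and causal structure learning.”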

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents [45.5]
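Recent advances in large language models (LLMs) have catalyzed the rise of autonomous AI agents. These large-model agents mark a paradigm shift from static inference systems to interactive, memory-augmented entities.
Reference translation of the paper (metadata) (Mon, 30 Jun 2025 13:34:34 GMT)
A survey on AI agents and security risks.
There are many points to consider.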
