Introducing the Latest arXiv Papers

WideSearch: Benchmarking Agentic Broad Info-Seeking

WideSearch: Benchmarking Agentic Broad Info-Seeking [22.3]
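We introduce WideSearch, a new benchmark designed to evaluate agent reliability on large-scale collection tasks. The benchmark comprises 200 manually curated questions from more than 15 diverse domains, grounded in real user queries. We benchmark more than 10 state-of-the-art agentic search systems, including single-agent frameworks, multi-agent frameworks, and end-to-end commercial systems.
Reference translation of the paper (metadata) (Mon, 11 Aug 2025 14:03:09 GMT)
A proposed benchmark for LLM agents, specifically for information-gathering tasks. OpenAI o3 scores high, but Kimi K2 also performs well.
The project site is WideSearch: Benchmarking Agentic Broad Info-Seeking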

Deep Think with Confidence

Deep Think with Confidence [33.2]
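We introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that improves both reasoning efficiency and performance at test time. DeepConf dynamically filters out low-quality reasoning traces during generation, reducing the number of generated tokens while maintaining or improving accuracy. In evaluations, DeepConf reaches up to 99.9% accuracy on tasks such as AIME 2025 and cuts generated tokens by up to 84.7% compared with conventional methods.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 05:48:38 GMT)
A proposed method that uses the model's internal confidence to control reasoning. It is reportedly simple yet powerful.
The repository is Deep Think with Confidence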
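The idea of filtering reasoning traces by confidence can be illustrated with a minimal Python sketch. This is not the authors' implementation: it filters completed traces offline rather than during generation, and it assumes that the mean token log-probability is an acceptable stand-in for the model's internal confidence. The names `Trace`, `trace_confidence`, `filtered_vote`, and `keep_ratio` are hypothetical and introduced only for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    answer: str                  # final answer extracted from one reasoning trace
    token_logprobs: List[float]  # log-probability of each generated token


def trace_confidence(trace: Trace) -> float:
    """Assumed confidence proxy: mean token log-probability (closer to 0 is more confident)."""
    if not trace.token_logprobs:
        return float("-inf")
    return sum(trace.token_logprobs) / len(trace.token_logprobs)


def filtered_vote(traces: List[Trace], keep_ratio: float = 0.5) -> str:
    """Keep the most confident fraction of traces, then take a confidence-weighted vote."""
    if not traces:
        raise ValueError("at least one trace is required")
    ranked = sorted(traces, key=trace_confidence, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    votes = defaultdict(float)
    for trace in ranked[:n_keep]:
        # exp(mean log-prob) is the geometric mean of the token probabilities.
        votes[trace.answer] += math.exp(trace_confidence(trace))
    return max(votes, key=votes.get)


# Hypothetical data: two confident traces answer "42", three unsure traces answer "17".
demo_traces = [
    Trace("42", [-0.1, -0.2]),
    Trace("42", [-0.3, -0.1]),
    Trace("17", [-2.0, -1.5]),
    Trace("17", [-1.8, -2.2]),
    Trace("17", [-2.5, -1.9]),
]
print(filtered_vote(demo_traces, keep_ratio=0.5))
```

On this toy data a plain majority vote would pick "17" (three of five traces), while the confidence-filtered vote keeps only the two most confident traces and returns "42". Discarding low-confidence traces early is also where the token savings would come from in an online setting, which this offline sketch does not model.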

A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models

A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models [50.0]
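This survey covers parallel text generation techniques that aim to break the bottleneck of token-by-token generation. It categorizes existing approaches into AR-based and non-AR-based paradigms, and assesses their theoretical trade-offs in terms of speed, quality, and efficiency.
Reference translation of the paper (metadata) (Tue, 12 Aug 2025 07:56:04 GMT)
A survey of Parallel Text Generation, aimed mainly at speeding up generation.
It covers both AR-based and Non-AR-based approaches.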

LLM-Driven Self-Refinement for Embodied Drone Task Planning

LLM-Driven Self-Refinement for Embodied Drone Task Planning [29.2]
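SRDrone is a new system designed for self-refining task planning in industrial drones. It incorporates a continuous state evaluation method to determine task outcomes robustly and accurately. It also implements a hierarchical Behavior Tree (BT) modification model.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 12:29:01 GMT)
The system generates action plans for drones. Its distinguishing features are self-evolving BTs (behavior trees), continuous state evaluation during mission execution, and plan revision through fine-grained behavior tree (BT) modifications.
The repository is GitHub – ZXiiiC/SRDrone: Implementation of SRDrone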

MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models [108.6]
MME-Emotion is a systematic benchmark that evaluates both the emotional understanding and the reasoning capabilities of MLLMs. It contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It also incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework.
Reference translation of the paper (metadata) (Mon, 11 Aug 2025 03:14:55 GMT)
A proposed benchmark focused on emotion: "In this paper, we introduced MME-Emotion, a comprehensive multi-task benchmark for evaluating emotional intelligence in MLLMs, accompanied by a holistic evaluation suite. The assessment process was fully automated within a multi-agent system framework and thoroughly validated by human experts."
The project site is https://mme-emotion.github.io/.

INTIMA: A Benchmark for Human-AI Companionship Behavior

INTIMA: A Benchmark for Human-AI Companionship Behavior [7.4]
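AI companionship, in which users form emotional bonds with AI, is attracting attention, and the quality of the relationship with the user is considered especially important. The newly proposed INTIMA has a taxonomy of 31 behavior categories and provides a way to evaluate AI responses. This evaluation points to the need for a consistent approach to emotionally charged interactions with AI, and highlights the importance of boundary setting and emotional support in contributing to user well-being.
Reference translation of the paper (metadata) (Mon, 04 Aug 2025 08:25:38 GMT)
The benchmark is described as follows: "To evaluate how language models respond to emotionally and relationally charged user behaviors, we introduce INTIMA: the Interactions and Machine Attachment Benchmark. INTIMA contains 368 benchmark prompts and is designed to assess whether LLMs reinforce, resist, or misinterpret companionship-seeking interactions, based on empirical patterns from real-world user data from Reddit and grounded in psychological and social science theory." It is interesting, though lately I am surprised that models have advanced far enough that this kind of task needs to be measured.
The repository is AI-companionship/INTIMA · Datasets at Hugging Face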

Command A Reasoning, DeepSeek V3.1, Gemma 3 270M, Nemotron Nano 2, Dream 7B

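There is truly a lot of news about LLMs and LRMs. Last week's big news was the release of Cohere’s Command A Reasoning Model | Cohere (the model is at Cohere’s Command A Reasoning Model | Cohere, CC-BY-NC) and the release of DeepSeek V3.1 (DeepSeek-V3.1 Release | DeepSeek API Docs; the model is at deepseek-ai/DeepSeek-V3.1 · Hugging Face). The public release of frontier or near-frontier models is highly significant. A technical report for Intern-S1 has also been published.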

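Among small models, Gemma 3 270M (Introducing Gemma 3 270M: The compact model for hyper-efficient AI – Google Developers Blog; the model is at google/gemma-3-270m · Hugging Face) is interesting for being ultra-small. Although its performance is questionable, there seem to be situations where it can be used, such as post-training it for specialized uses. NVIDIA's Nemotron Nano 2 is also noteworthy (it has 9B parameters despite the name "Nano").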

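From Huawei came a paper on Dream 7B, a diffusion-based model. Its performance looks strong: it surpasses LLaDA and appears to hold its own against autoregressive models of the same scale.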

Intern-S1: A Scientific Multimodal Foundation Model [185.4]
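Intern-S1 is a specialized generalist with general understanding and reasoning capabilities. It undergoes offline and online reinforcement learning (RL) in InternBootCamp. Intern-S1 shows competitive performance on general reasoning tasks among open-source models.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 17:58:00 GMT)
The technical report for a model covered in Qwen3-Coder, Intern-S1, Step-Audio2, TeleChat2 – arXiv最新論文の紹介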

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model [176.4]
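Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads. It builds on the Nemotron-H architecture, in which most of the self-attention layers of the common Transformer architecture are replaced with Mamba-2 layers.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 04:18:04 GMT)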
nvidia/NVIDIA-Nemotron-Nano-9B-v2 · Hugging Face

Dream 7B: Diffusion Large Language Models [85.3]
We introduce Dream 7B, the most powerful open diffusion large language model to date. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 12:09:58 GMT)
According to the paper, "Dream 7B achieves competitive performance with Qwen 2.5 on standard benchmarks (general language understanding, mathematical reasoning, and code generation) while exhibiting superior planning abilities and novel inference flexibility features that naturally emerge from the diffusion modeling paradigm."
The repository is GitHub – DreamLM/Dream: Dream 7B, a large diffusion language model; the models are at Dream 7B – a Dream-org Collection

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction [84.4]
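FutureX is the largest and most diverse live benchmark for future prediction. It supports real-time daily updates and eliminates data contamination through an automated pipeline for question gathering and answer collection. The authors evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools.
Reference translation of the paper (metadata) (Sat, 16 Aug 2025 08:54:08 GMT)
A live benchmark for future prediction. The domain coverage is also broad: "we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is built upon a semi-automated pipeline that continuously collects future-oriented questions from 195 diverse websites, curated from a pool of 2,008 sites covering areas such as politics, economics, technology, sports, healthcare, and more."
The overall finding is that "LLM agents still lag behind humans", but it is interesting that some agents outperform humans at Level 2. (Also, the level division feels slightly off.)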
- The Basic tier (Level 1) contains single-choice events with options fewer than 4.
- The Wide Search tier (Level 2) comprises multi-choice events with several correct answers.
- The Deep Search tier (Level 3) contains open-ended events whose underlying facts are relatively stable (with low volatility).
- The Super Agent tier (Level 4) covers high-volatility, open-ended events.
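To make the four tier definitions above concrete, here is a small Python sketch that maps an event to a tier. It only encodes the rules as quoted; the `Event` fields (`kind`, `num_options`, `high_volatility`) and the function `assign_tier` are hypothetical names, not part of FutureX, and cases the quoted rules do not cover (for example, a single-choice event with four or more options) return `None`.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    BASIC = 1        # Level 1
    WIDE_SEARCH = 2  # Level 2
    DEEP_SEARCH = 3  # Level 3
    SUPER_AGENT = 4  # Level 4


@dataclass
class Event:
    kind: str                      # "single_choice", "multi_choice", or "open_ended" (assumed labels)
    num_options: int = 0           # number of answer options, for choice events
    high_volatility: bool = False  # whether the underlying facts change quickly


def assign_tier(event: Event) -> Optional[Tier]:
    """Map an event to a tier using only the four quoted rules."""
    if event.kind == "single_choice":
        # Level 1: single-choice events with fewer than 4 options.
        return Tier.BASIC if event.num_options < 4 else None
    if event.kind == "multi_choice":
        # Level 2: multi-choice events with several correct answers.
        return Tier.WIDE_SEARCH
    if event.kind == "open_ended":
        # Level 3 is low volatility; Level 4 is high volatility.
        return Tier.SUPER_AGENT if event.high_volatility else Tier.DEEP_SEARCH
    return None


print(assign_tier(Event("open_ended", high_volatility=True)).name)
```

Written this way, the tiers are separated by question format for Levels 1 and 2 and by volatility for Levels 3 and 4, so the levels do not follow a single difficulty axis.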

Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance

Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance [211.1]
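This study proposes a comprehensive framework that integrates technical and societal dimensions, built around three pillars: intrinsic security, derivative security, and social ethics. We identify three challenges: (1) the generalization gap, where defenses fail against evolving threats; (2) inadequate evaluation protocols that overlook real-world risks; and (3) fragmented regulations that lead to inconsistent oversight. Our framework offers researchers, engineers, and policymakers actionable guidance for developing AI systems that are not only robust and secure but also ethically aligned and worthy of public trust.
Reference translation of the paper (metadata) (Tue, 12 Aug 2025 09:42:56 GMT)
"This paper offers a comprehensive overview of AI governance, addressing challenges across intrinsic security, derivative security, and social ethics." A paper that gives a well-organized overview of governance. It is nice that there is also a repository (though the paper list in the repository seems to be still under construction?).
The repository is GitHub – ZTianle/Awesome-AI-SG: Awesome papers and resources related to the AI Safety and Governance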

Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback

Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback [81.0]
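This paper proposes a structured approach to automated novelty assessment that models expert reviewer behavior through three stages. The method is informed by a large-scale analysis of human-written novelty reviews. Evaluated on 182 ICLR 2025 submissions, it achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions.
Reference translation of the paper (metadata) (Thu, 14 Aug 2025 16:18:37 GMT)
A proposed framework for assessing the novelty of papers and similar documents, organized into three stages: "document processing and content extraction, related work retrieval and ranking, and structured novelty assessment."
The repository is Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback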
