Introducing the Latest arXiv Papers

HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking

HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking [109.1]
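The proposed HyperTree Planning (HTP) is a new reasoning paradigm that constructs hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP: it achieves state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, a 3.6x performance improvement over o1-preview.
Reference translation of the paper (metadata) (Mon, 05 May 2025 02:38:58 GMT)
A proposal for action planning using a HyperTree, which is characterized as follows: "Compared to previous tree planning methods such as ToT (Yao et al., 2024) and RAP (Hao et al., 2023), HTP introduces structural innovations that enable each edge to connect multiple child nodes, making it suitable for a divide-and-conquer strategy."
It appears to be highly effective. A hypertree is surely a more powerful structure than an ordinary tree, but the interesting point is that LLMs also find it easy to handle. Perhaps it resembles natural language, in that it can express many things...?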
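
To make the structural point in the quote concrete, here is a minimal sketch of a hypertree in which one edge connects a parent to several children at once, so that a task is split into sub-tasks in a divide-and-conquer manner. The class name `Node`, the method `decompose`, the helper `leaves`, and the travel-planning labels are all hypothetical illustrations; this is not the HTP algorithm from the paper, only the data structure idea.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A hypertree node (hypothetical illustration, not the paper's code)."""
    label: str
    # Each hyperedge is ONE edge that connects this node to SEVERAL children.
    hyperedges: List[List["Node"]] = field(default_factory=list)

    def decompose(self, *labels: str) -> List["Node"]:
        """Attach a single hyperedge that splits this node into sub-tasks."""
        children = [Node(label) for label in labels]
        self.hyperedges.append(children)
        return children


def leaves(node: Node) -> List[str]:
    """Collect leaf labels depth-first, i.e. the concrete sub-tasks."""
    if not node.hyperedges:
        return [node.label]
    result: List[str] = []
    for edge in node.hyperedges:
        for child in edge:
            result.extend(leaves(child))
    return result


# Assumed example: divide a trip plan, then divide one branch further.
root = Node("plan trip")
transport, lodging, dining = root.decompose(
    "transportation", "accommodation", "dining"
)
transport.decompose("outbound flight", "return flight")

print(leaves(root))
# ['outbound flight', 'return flight', 'accommodation', 'dining']
```

In an ordinary tree each edge leads to exactly one child; here the single hyperedge under `root` groups three sub-tasks that must all be solved together.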

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.7]
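This survey examines the core challenges that the rise of large language models poses for evaluation. (i) From task-specific to capability-based evaluation, which reorganizes benchmarks around core capabilities such as knowledge, reasoning, instruction following, multimodal understanding, and safety. It examines this problem, and the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics.
Reference translation of the paper (metadata) (Sat, 26 Apr 2025 07:48:52 GMT)
A survey on benchmarks. "Fig6 Illustration of capability-based benchmark taxonomy involving: knowledge, reasoning, instruction following, multimodal, and safety." is visually very easy to understand.
The repository is GitHub – ALEX-nlp/Benchmark-of-core-capabilities.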

$\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge

$\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge [6.1]
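We introduce $\textit{New News}$, a dataset composed of hypothetical yet plausible news spanning multiple domains. We explore a suite of self-play data generation protocols for distilling knowledge from the model with context into the weights of the model without context. The results show that the self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news.
Reference translation of the paper (metadata) (Sat, 03 May 2025 12:49:35 GMT)
An analysis of the gap between ICL and FT, together with a proposal of a method called Sys2-FT. The authors state: "Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models’ in-weight learning of the news."
The difference between ICL and FT is very interesting and also important in practice.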

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [108.7]
Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, but struggle with structured cross-modal reasoning. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs.
Reference translation of the paper (metadata) (Wed, 07 May 2025 17:59:49 GMT)
A proposal of a framework for building multimodal reasoning models: "we adopt the Group Relative Policy Optimization (GRPO) reinforcement learning framework to the task of audio-image multiple-choice question answering in multimodal large language models (MLLMs)"
The repository is GitHub – HarryHsing/EchoInk: EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥The Exploration of R1 for General Audio-Visual Reasoning with Qwen2.5-Omni]
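
As a reminder of what "group relative" means in GRPO, the sketch below computes advantages by normalizing each sampled response's reward against the other responses sampled for the same question. It is a minimal illustration under assumptions: the function name `group_relative_advantages`, the 0/1 correctness reward for a multiple-choice answer, and the use of the population standard deviation are choices made here, not details confirmed by the EchoInk-R1 paper or repository.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is credited only relative to its siblings in the group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four sampled answers to one multiple-choice question,
# with a hypothetical reward of 1.0 for a correct choice and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
# [1.0, -1.0, -1.0, 1.0]
```

When every response in a group gets the same reward, all advantages are zero, so that group contributes no learning signal.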

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions [55.2]
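Memory is a fundamental component of AI systems, underpinning large language model (LLM)-based agents. The survey introduces six fundamental memory operations: consolidation, updating, indexing, forgetting, retrieval, and compression. It provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI.
Reference translation of the paper (metadata) (Thu, 01 May 2025 17:31:33 GMT)
A survey of memory, which is important for LLMs and agents.
The framing is as follows: "In this survey, we first categorize memory representations into parametric, contextual structured, and contextual unstructured and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression."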
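
To give the six operation names a concrete shape, here is a toy sketch of a contextual unstructured memory (plain text entries) that exposes one method per operation. Everything in it is hypothetical: the class `ToyMemory`, its method names, the keyword index, and the truncation-based compression are simplifications chosen for illustration and are not taken from the survey.

```python
from typing import Dict, List, Set


class ToyMemory:
    """Toy text memory with one method per operation (illustrative only)."""

    def __init__(self) -> None:
        self.entries: Dict[int, str] = {}
        self.keyword_index: Dict[str, Set[int]] = {}
        self._next_id = 0

    def consolidate(self, text: str) -> int:
        """Consolidation: store a new piece of information persistently."""
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = text
        self.index(entry_id)
        return entry_id

    def update(self, entry_id: int, text: str) -> None:
        """Updating: overwrite an existing entry with newer information."""
        self._unindex(entry_id)
        self.entries[entry_id] = text
        self.index(entry_id)

    def index(self, entry_id: int) -> None:
        """Indexing: register the entry's words for later lookup."""
        for word in self.entries[entry_id].lower().split():
            self.keyword_index.setdefault(word, set()).add(entry_id)

    def forget(self, entry_id: int) -> None:
        """Forgetting: remove an entry and its index references."""
        self._unindex(entry_id)
        del self.entries[entry_id]

    def retrieve(self, word: str) -> List[str]:
        """Retrieval: return entries that contain the given word."""
        ids = self.keyword_index.get(word.lower(), set())
        return [self.entries[i] for i in sorted(ids)]

    def compress(self, max_words: int) -> None:
        """Compression: keep only the first max_words words of each entry."""
        for entry_id, text in list(self.entries.items()):
            self.update(entry_id, " ".join(text.split()[:max_words]))

    def _unindex(self, entry_id: int) -> None:
        for ids in self.keyword_index.values():
            ids.discard(entry_id)


memory = ToyMemory()
food = memory.consolidate("User prefers vegetarian food")
home = memory.consolidate("User lives in Tokyo")
memory.update(home, "User lives in Osaka")
print(memory.retrieve("osaka"))       # ['User lives in Osaka']
memory.forget(food)
print(memory.retrieve("vegetarian"))  # []
```

A real system would replace the keyword index with embeddings and the truncation with summarization; the sketch only shows how the six operations relate to one store.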

Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models [75.9]
SAGE (Sentient Agent as a Judge) is an evaluation framework for large language models. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts. SAGE provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
Reference translation of the paper (metadata) (Thu, 01 May 2025 19:06:10 GMT)
A proposal of an evaluation framework called SAGE (Sentient Agent as a Judge): "SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts." The authors also report that "rankings produced by SAGE diverge markedly from Arena results, confirming that social cognition is orthogonal to generic helpfulness."
The repository is digitalhuman/SAGE at main · Tencent/digitalhuman · GitHub

Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications

Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications [25.4]
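Large language models (LLMs) are increasingly used in human-centered tasks. Assessing their psychological traits is essential for understanding their social impact and ensuring trustworthy AI alignment. This study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 06:09:40 GMT)
A review organized around "(1) assessment tools; (2) LLM-specific datasets; (3) evaluation metrics (consistency and stability); (4) empirical findings; (5) personality simulation methods; and (6) LLM-based behavior simulation."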

HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights

HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights [72.8]
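HiPerRAG is a workflow for indexing and retrieving knowledge from more than 3.6 million scientific papers. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and on two new benchmarks introduced in this work.
Reference translation of the paper (metadata) (Wed, 07 May 2025 22:50:23 GMT)
"Despite the widespread adoption of RAG, it faces three significant technical challenges that hinder its ability to scale to millions of documents." is exactly right, and this paper is a useful reference for building large-scale RAG.
The authors also do some fairly elaborate things. I wonder whether, depending on the field, this kind of approach will become necessary in practice...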

Holmes: Automated Fact Check with Large Language Models

Holmes: Automated Fact Check with Large Language Models [31.8]
This study uses Large Language Models (LLMs) for automated disinformation detection. We propose Holmes, an end-to-end framework featuring a novel evidence retrieval method. The proposed approach (1) extracts key information from open sources using LLM-based summarization and (2) proposes a new algorithm and metrics for evaluating evidence quality.
Reference translation of the paper (metadata) (Tue, 06 May 2025 03:19:51 GMT)
A paper on fact-checking; its careful write-up and findings are very informative.
- "Finding 1: LLMs CANNOT accurately verify the truthfulness of the claim directly.", "Finding 2: LLMs have shortcomings in searching for claim-relevant public information and their responses may include hallucinated links that weaken result trustworthiness.", "Finding 3: Human-written evidence enhances LLMs’ ability to verify multimodal claims and generate coherent justifications."
Holmes was designed on the basis of the above, and the authors report confirming its effectiveness.

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation [90.8]
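RoBridge is a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM). It unlocks the procedural skills of reinforcement learning and effectively bridges the gap between cognition and execution.
Reference translation of the paper (metadata) (Sat, 03 May 2025 06:17:18 GMT)
An architecture for robotic manipulation centered on a large-scale VLM. It gives the impression of VLM-based real agents.
The project site is RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation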
