August 2025 – Page 4 – Introducing the Latest arXiv Papers

Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning

Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning [14.3]
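Large language models (LLMs) have made remarkable progress through preference-based fine-tuning. This paper introduces Refine-n-Judge, an automated iterative method that uses a single LLM as both refiner and judge to improve dataset quality. The authors demonstrate the effectiveness of Refine-n-Judge on public datasets spanning five corpora.
Reference translation of the paper (metadata) (Sun, 03 Aug 2025 01:56:03 GMT)
A proposal for a quality-improvement framework: “Bringing these capabilities together, we propose Refine-n-Judge, a fully automated dataset curation pipeline, summarized in Figure 2. In this framework, an LLM model serves as both the refiner - generating improved outputs - and the judge - comparing the refined output against the original and selecting the preferred version.”
It is interesting that the method was not sufficiently effective without the judge step. Perhaps the key is having the LLM solve judging as a task distinct from refinement.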
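The refine-then-judge loop described in the quote can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `refine` and `judge` are hypothetical stand-ins for LLM calls, and the stopping rule (stop as soon as the judge prefers the current output, or after `max_rounds`) is an assumption made for the sketch.

```python
from typing import Callable, List


def refine_n_judge(
    initial: str,
    refine: Callable[[str], str],
    judge: Callable[[str, str], bool],
    max_rounds: int = 5,
) -> List[str]:
    """Return the chain of accepted outputs, earliest first.

    refine(current) proposes a revised output (hypothetical LLM call).
    judge(candidate, current) returns True if the candidate is preferred
    (hypothetical LLM call).
    """
    chain = [initial]
    for _ in range(max_rounds):
        candidate = refine(chain[-1])
        if not judge(candidate, chain[-1]):
            break  # assumed stopping rule: the judge prefers the current output
        chain.append(candidate)
    return chain


# Toy stand-ins so the sketch runs without a model.
def toy_refine(text: str) -> str:
    return text + "!"


def toy_judge(candidate: str, current: str) -> bool:
    # Prefer the longer text, but only up to 7 characters.
    return len(current) < len(candidate) <= 7


chain = refine_n_judge("hello", toy_refine, toy_judge)
print(chain)
```

Each adjacent pair in the returned chain is ordered by the judge's preference, which is what makes the chain usable as preference data for fine-tuning.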

Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking

Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking [31.7]
Retrieval-augmented generation (RAG) is important for reducing hallucinations and incorporating external knowledge into large language models (LLMs). T$^2$RAG is a new framework that operates on a simple, graph-free knowledge base of atomic triplets. Experimental results show that T$^2$RAG significantly outperforms state-of-the-art multi-round and graph RAG methods.
Reference translation of the paper (metadata) (Mon, 04 Aug 2025 13:50:44 GMT)
A proposal for a triplet-based RAG approach: “We introduce a novel RAG framework that leverages triplets as the fundamental unit for indexing, retrieval, and reasoning, moving beyond the limitations of chunk-based and explicit graph-based approaches”. It is somewhat surprising that it outperforms graph-structured methods. On the component side, the authors report that “both the iterative process and the use of chunks are important. The iterative reasoning module proves to be a critical component.”, so perhaps the simplicity of the design also works in its favor.
The repository is rockcor/T2RAG: Official code of paper “Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking”

CoAct-1: Computer-using Agents with Coding as Actions

CoAct-1: Computer-using Agents with Coding as Actions [95.0]
CoAct-1 is a new multi-agent system that combines GUI-based control with direct programmatic execution. The authors evaluate the system on the OSWorld benchmark, where CoAct-1 achieves a state-of-the-art success rate of 60.76%.
Reference translation of the paper (metadata) (Tue, 05 Aug 2025 21:33:36 GMT)
A proposal for a GUI agent that makes good use of code generation: “CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary.” It claims SoTA on OSWorld.
The project site is CoAct-1.

MLP Memory: Language Modeling with Retriever-pretrained External Memory

MLP Memory: Language Modeling with Retriever-pretrained External Memory [26.0]
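This work proposes decoupling memorization from the decoder by using a pretrainable external memory. The architecture shows strong perplexity and downstream task performance. It also performs well on three hallucination benchmarks and nine memory-intensive tasks.
Reference translation of the paper (metadata) (Sun, 03 Aug 2025 16:40:53 GMT)
A proposal for an architecture that separates out the memory (knowledge) component: “In this work, we propose an external memory for LLM that is pretrained to mimic a retriever on the entire pretraining dataset. Specifically, following the RAG setting in kNN-LM [27], this memory learns to map the LLM hidden state at a certain step to a vocabulary distribution matching the output of the kNN retriever. During inference, the LLM’s native output is interpolated with the retriever-pretrained output from the external memory.”
This would be interesting if it works well, but I am somewhat doubtful that knowledge and reasoning can really be separated, and I wonder how it affects the reasoning and generation side.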
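The inference-time interpolation mentioned in the quote can be illustrated with a small sketch. It assumes the linear mixing used in kNN-LM-style models, p = lam * p_mem + (1 - lam) * p_lm; the function name, the weight `lam`, and the toy distributions are all hypothetical and are not taken from the paper.

```python
from typing import List


def interpolate(p_lm: List[float], p_mem: List[float], lam: float) -> List[float]:
    """Mix two next-token distributions over the same vocabulary.

    p_lm:  distribution from the LLM itself.
    p_mem: distribution from the external memory.
    lam:   weight on the memory distribution, in [0, 1] (assumed hyperparameter).
    """
    if len(p_lm) != len(p_mem):
        raise ValueError("distributions must cover the same vocabulary")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return [lam * m + (1.0 - lam) * l for l, m in zip(p_lm, p_mem)]


# Toy three-token vocabulary; the numbers are purely illustrative.
p_lm = [0.7, 0.2, 0.1]
p_mem = [0.1, 0.8, 0.1]
mixed = interpolate(p_lm, p_mem, lam=0.5)
best_token = max(range(len(mixed)), key=mixed.__getitem__)
print(mixed, best_token)
```

Because both inputs sum to one and the weights sum to one, the mixture is again a valid distribution; in this toy case the memory shifts the top token from index 0 to index 1.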

Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges [22.1]
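This paper introduces key concepts through a taxonomy of tabular input representations and an introduction to table understanding tasks. Tables are two-dimensional and come in formats that serve different purposes, from structured database tables to complex multi-layered spreadsheets. The authors highlight several important gaps in the field that call for further research.
Reference translation of the paper (metadata) (Thu, 31 Jul 2025 23:41:31 GMT)
A survey of how LLMs handle tabular data.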

RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems

RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems [30.5]
This paper introduces RoboMemory, a brain-inspired multi-memory framework. It addresses challenges in real-world environments such as continual learning, multi-module memory latency, capturing task correlations, and mitigating infinite loops in closed-loop planning.
Reference translation of the paper (metadata) (Sat, 02 Aug 2025 15:39:42 GMT)
Memory built with an agentic approach: “Inspired by the brain’s unified memory mechanisms, we design a lifelong embodied memory system with four parallel modules (Spatial, Temporal, Episodic, Semantic) under a unified framework. This framework supports parallelized update and retrieval across modules, mitigating latency accumulation in complex systems while facilitating coherent knowledge integration for lifelong learning.”
For now I think an agentic approach is the realistic choice, but I wonder at what stage one should start changing the model architecture itself.

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents [88.4]
MMBench-GUI is a hierarchical benchmark for evaluating GUI automation agents across Windows, Linux, macOS, iOS, Android, and Web platforms. It consists of four levels (GUI content understanding, element grounding, task automation, and task collaboration) that cover the skills GUI agents need.
Reference translation of the paper (metadata) (Fri, 25 Jul 2025 17:59:26 GMT)
A benchmark for evaluating GUI agents, with four levels: “(1) GUI Content Understanding, (2) GUI Element Grounding, (3) GUI Task Automation, and (4) GUI Task Collaboration.” The findings “Finding 1: General-purpose language models excel at task decomposition, planning, and self-reflection but struggle with fine-grained visual interactions.” and “Finding 2: Accurate visual grounding significantly determines the success rate of GUI task execution.” are consistent with the current direction of GUI agent development.
The repository is open-compass/MMBench-GUI: Official repo of “MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents”. It can be used to evaluate a GUI agent with a hierarchical manner across multiple platforms, including Windows, Linux, macOS, iOS, Android and Web.

The Missing Parts: Augmenting Fact Verification with Half-Truth Detection

The Missing Parts: Augmenting Fact Verification with Half-Truth Detection [8.1]
Many real-world claims are half-truths: factually correct, yet misleading because critical context is missing. The authors introduce the task of half-truth detection and propose PolitiFact-Hidden, a new benchmark of 15k political claims annotated with sentence-level evidence alignment and inferred claim intent. The proposed TRACER is a modular re-assessment framework that identifies omission-based misinformation by aligning evidence, inferring implied intent, and estimating the causal impact of hidden content.
Reference translation of the paper (metadata) (Fri, 01 Aug 2025 10:06:38 GMT)
A proposal for a new task, together with a benchmark for it: “half-truth detection as a new task in fact verification, targeting claims that omit critical context while remaining factually correct.”
In addition, the authors propose “TRACER (Truth ReAssessment with Critical hidden Evidence reasoning)”, which has three stages: “(1) evidence alignment, to classify retrieved evidence as presented or hidden; (2) intent generation, to recover the claim’s implicit message; and (3) causality analysis, to determine whether the hidden evidence undermines the inferred intent.”
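The three stages above can be wired together as in the rough sketch below. It is only a structural illustration, not TRACER itself: the three callables are hypothetical placeholders for model calls, the final rule (flag the claim if any hidden evidence undermines the inferred intent) is an assumption, and the example claim and evidence are invented toy data.

```python
from typing import Callable, List


def detect_half_truth(
    claim: str,
    evidence: List[str],
    is_presented: Callable[[str, str], bool],
    infer_intent: Callable[[str], str],
    undermines: Callable[[str, str], bool],
) -> bool:
    """Return True if the claim looks like a half-truth under this sketch."""
    # Stage 1: evidence alignment (presented vs. hidden).
    hidden = [e for e in evidence if not is_presented(claim, e)]
    # Stage 2: intent generation (the claim's implicit message).
    intent = infer_intent(claim)
    # Stage 3: causality analysis (assumed rule: any undermining hidden evidence).
    return any(undermines(e, intent) for e in hidden)


# Toy stand-ins and invented data, for illustration only.
claim = "unemployment fell last year"
evidence = [
    "unemployment fell last year",
    "many workers left the labor force",
]
result = detect_half_truth(
    claim,
    evidence,
    is_presented=lambda c, e: e in c,
    infer_intent=lambda c: "the job market improved",
    undermines=lambda e, intent: "left the labor force" in e,
)
print(result)
```

Keeping each stage as a separate callable mirrors the modular structure described in the quote, so any one stage can be swapped out without touching the others.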

Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration [59.4]
This paper examines whether structured multi-agent discussion can surpass solitary ideation. It proposes a collaborative multi-agent framework for producing research proposals, and adopts a comprehensive evaluation protocol that combines agent-based scoring with human review across dimensions such as novelty, strategic vision, and integration depth.
Reference translation of the paper (metadata) (Wed, 06 Aug 2025 15:59:18 GMT)
“This work challenges the dominant paradigm of solitary AI-driven ideation and provides strong empirical evidence that collaborative multi-agent systems generate higher-quality scientific proposals. Through systematic simulation and evaluation, we identify three actionable principles for building more effective ideation systems: (1) Structured, leader-guided discussions enhance coherence and strategic focus; (2) Cognitive diversity from interdisciplinary or mixed-seniority teams drives originality; (3) Expertise is essential, as collaboration amplifies existing knowledge but cannot replace it.” This is a very interesting result, but I do wonder whether expertise can really be controlled with this kind of prompting (or whether many other things change along with it).
The project site is Research Proposal Evaluator, and the repository is NuoJohnChen/Idea2Proposal.

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation [65.3]
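The authors introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation. GE integrates policy learning, evaluation, and simulation into a single video generation framework.
Reference translation of the paper (metadata) (Thu, 07 Aug 2025 17:59:44 GMT)
A proposal for a framework with video generation at its core: “we introduce Genie Envisioner (GE), a unified platform that collapses robot sensing, policy learning, and evaluation into a single closed-loop video generative world model”. Some argue that this kind of learning requires embodiment, so I am very curious whether the problem can be solved with video generation as the main approach.
The repository is Genie Envisioner.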
