2025年9月16日 – arXiv最新論文の紹介

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games [56.8]
我々はFlashAdventureを紹介した。これは、フルストーリーのアーク補完をテストするために設計された、34のFlashベースのアドベンチャーゲームのベンチマークである。また,ゲームプレイの自動評価装置であるCUA-as-a-Judgeと,長期記憶を利用したエージェントフレームワークであるCOASTを提案する。実験では、現在のGUIエージェントがフルストーリーのアークに苦しむのに対して、COASTは観察と振る舞いのギャップを埋めることでマイルストーンの完了を改善する。
論文参考訳（メタデータ） (Mon, 01 Sep 2025 01:33:16 GMT)
アドベンチャーゲームを利用したベンチマークと「We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves mile- stone completion by bridging the observation- behavior gap.」という評価システムの提案。現状のSuccess Rateはとても低いが今後どのくらいの速度で改善していくかが楽しみ。
プロジェクトサイトはFlashAdventure

Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet [93.0]
テストタイムスケーリングは、モデルが長い推論チェーンを生成することによって、推論時間計算を増加させる。本手法は,知識集約型タスクにおいて,高い事実的精度と低幻覚率が不可欠である場合において,まだ有効ではないことを示す。以上の結果から,テスト時間計算の増大は必ずしも精度の向上には至らず,多くの場合において幻覚の増大につながることが示唆された。
論文参考訳（メタデータ） (Mon, 08 Sep 2025 16:28:25 GMT)
「To summarize, while test-time scaling in reasoning models has led to strong performance in many domains, it is not yet effective for knowledge-intensive tasks. Increasing inference time does not consistently improve factual accuracy, and contrary to expectations, it can even increase hallucinations.」とのこと。LRMを使っていて感じていることと整合的。
リポジトリはGitHub – XuZhao0/tts-knowledge: Code and data for “Test-time scaling in reasoning models is not effective for knowledge-intensive tasks yet”

Understanding the Influence of Synthetic Data for Text Embedders [52.0]
まず,Wangらによって提案された合成データの再生と公開を行った。合成データがモデル一般化をどのように改善するかを批判的に検討する。本研究は, 汎用インバータ構築における, 現在の合成データ手法の限界を浮き彫りにしたものである。
論文参考訳（メタデータ） (Sun, 07 Sep 2025 19:28:52 GMT)
合成データの効果についてEmbeddingモデルの観点で検証した論文。「we find that training on synthetic examples designed for a particular task can degrade the performance of other tasks, challenging the notion that training on more diverse synthetic data is strictly better. Moreover, we observe that synthetic data leads to sparse improvement across tasks, showing no statistically significant improvement on a majority of MTEB tasks.」とのこと。
リポジトリはGitHub – jakespringer/open-synthetic-embeddings

<think> So let’s replace this phrase with insult… </think> Lessons learned from generation of toxic texts with LLMs [60.2]
本稿では, 人為的データに代わる合成毒性データを用いた脱毒訓練モデルの可能性について検討する。実験によると、合成データに微調整されたモデルは、人間のデータで訓練されたモデルよりも一貫してパフォーマンスが悪くなっている。根本原因は、致命的な語彙の多様性のギャップとして認識される: LLMは、小さな反復的な侮辱の語彙を用いて、人間の毒性のニュアンスや多様性を捉えるのに失敗する有毒な内容を生成する。
論文参考訳（メタデータ） (Wed, 10 Sep 2025 07:48:24 GMT)
こちらも合成データに関する記載があり「Models trained on fully synthetic data significantly underperform those trained on humanannotated data.」としている。モデル崩壊の報告でも合成データのみでは良くない結果を招いていて、これはそうなのだろうと思う。

Language Self-Play For Data-Free Training [37.2]
大規模言語モデル(LLM)は,近年,大規模,高品質なトレーニングデータ,強化学習によって急速に進歩している。しかし、この進歩は根本的なボトルネックに直面している。我々は、追加データなしでモデルの改善を可能にすることで、この依存を取り除く強化学習手法を提案する。
論文参考訳（メタデータ） (Tue, 09 Sep 2025 05:51:34 GMT)
「Language Self-Play agent operates under two modes: Challenger and Solver. Challenger generates instructions that Solver follows. While Solver learns to improve its responses to the prompts, Challenger learns to make them more difficult. Both modes are instantiated by one model and thus enable perpetual training on increasingly higher-quality self-generated data.」というLanguage Self-Play (LSP)フレームワークの提案。
R-Zero: Self-Evolving Reasoning LLM from Zero Data – arXiv最新論文の紹介に似ている？

Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration [58.4]
ハンドオブジェクトモーションキャプチャ(MoCap)は、大規模でコンタクトに富んだデモと、器用なロボットスコープの約束を提供する。 Dexploreは、リポジトリとトラッキングを実行し、MoCapから直接ロボット制御ポリシーを学習する、統一された単一ループ最適化である。
論文参考訳（メタデータ） (Thu, 11 Sep 2025 17:59:07 GMT)
「(I) Our DEXPLORE is a unified single-loop optimization that learns dexterous manipulation directly from human MoCap by treating demonstrations as soft references within adaptive spatial scopes, without explicit retargeting and residual correction. (II) We distill the learned state-based tracker into a vision-based, skill-conditioned generative control policy that maps single-view depth and proprioception, together with a latent skill code, to low-level actions. (III) We demonstrate successful real-world deployment on a dexterous hand using only single-view depth sensing.」とのこと。
プロジェクトサイトはDexplore