Introducing the Latest arXiv Papers

CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation [70.3]
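CowPilot is a framework that supports both autonomous and human-agent collaborative web navigation. The agent proposes the next step, which reduces the number of steps the human has to perform, while the user can still pause, reject, or take an alternative action. CowPilot serves as a convenient tool for data collection and agent evaluation across websites.
Reference translation of the paper (metadata) (Tue, 28 Jan 2025 00:56:53 GMT)
A proposal for a framework premised on humans and agents working together. The reported result, "We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps.", looks like it could lead to realistic efficiency gains. (That said, note that for many tasks full automation and collaborative automation mean very different things.)
The project site is CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation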
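The suggest / accept / override loop described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the names `run_episode`, `human_step_ratio`, `demo_suggest`, and `demo_review` are hypothetical and are not CowPilot's API, and the scripted plan stands in for a real agent and a real user. It shows how the share of human-performed steps (the kind of figure quoted as 15.2% in the paper) could be measured.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepRecord:
    action: str
    actor: str  # "agent" if the suggestion was accepted, "human" if overridden


def run_episode(
    suggest: Callable[[List[StepRecord]], Optional[str]],
    review: Callable[[str], Optional[str]],
    max_steps: int = 20,
) -> List[StepRecord]:
    """Run one collaborative episode.

    suggest(history) returns the agent's next action, or None when it is done.
    review(action) returns None to accept, or a replacement action to override.
    """
    history: List[StepRecord] = []
    for _ in range(max_steps):
        suggestion = suggest(history)
        if suggestion is None:
            break
        override = review(suggestion)
        if override is None:
            history.append(StepRecord(suggestion, "agent"))
        else:
            history.append(StepRecord(override, "human"))
    return history


def human_step_ratio(history: List[StepRecord]) -> float:
    """Fraction of executed steps that the human performed."""
    if not history:
        return 0.0
    return sum(1 for step in history if step.actor == "human") / len(history)


# Hypothetical scripted agent: proposes the next item of a fixed plan.
PLAN = ["click search box", "type 'laptop'", "click wrong link", "click add to cart"]


def demo_suggest(history: List[StepRecord]) -> Optional[str]:
    return PLAN[len(history)] if len(history) < len(PLAN) else None


# Hypothetical user: rejects one bad suggestion and substitutes an action.
def demo_review(action: str) -> Optional[str]:
    return "click first result" if action == "click wrong link" else None


if __name__ == "__main__":
    episode = run_episode(demo_suggest, demo_review)
    for step in episode:
        print(f"{step.actor:5s} {step.action}")
    print(f"human step ratio: {human_step_ratio(episode):.2f}")
```

In this toy episode the human overrides one of four steps. A real setup would also need pause handling and a success check, which are omitted here.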

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training [127.5]
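Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used as post-training techniques for foundation models. This paper examines how SFT and RL differ with respect to generalization and memorization. It shows that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variations.
Reference translation of the paper (metadata) (Tue, 28 Jan 2025 18:59:44 GMT)
This paper feels like exactly the information needed right now. According to the authors, "Through extensive experiments on the GeneralPoints and V-IRL tasks, we demonstrated that RL exhibits superior performance in learning generalizable knowledge, while SFT tends to merely memorize the training data, across both the rule and visual variations."
In addition, "SFT is necessary for RL training when the backbone model does not follow instructions." is very interesting. That the effective training strategy depends on the base model's capability is something that seems to come up often in other cases as well (and it is intuitively plausible), so this looks like important practical know-how.
The project site is SFT Memorizes, RL Generalizes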

International AI Safety Report

International AI Safety Report [229.3]
The report was mandated by the nations attending the AI Safety Summit held in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report's Expert Advisory Panel. In total, 100 AI experts contributed, representing diverse perspectives and disciplines.
Reference translation of the paper (metadata) (Wed, 29 Jan 2025 17:47:36 GMT)
A report summarizing the risks of advanced AI; very informative.
The chair, Yoshua Bengio, gives an overview in a thread on X: "Today, we are publishing the first-ever International AI Safety Report, backed by 30 countries and the OECD, UN, and EU. It summarises the state of the science on AI capabilities and risks, and how to mitigate those risks. 🧵 Link to full Report: https://t.co/k9ggxL7i66 1/16 https://t.co/68Gcm4iYH5"

o3-mini vs DeepSeek-R1: Which One is Safer?

o3-mini vs DeepSeek-R1: Which One is Safer? [6.1]
DeepSeek-R1 is highly unsafe compared to OpenAI's o3-mini. DeepSeek-R1 answered unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.
Reference translation of the paper (metadata) (Thu, 30 Jan 2025 15:45:56 GMT)
A safety evaluation of DeepSeek-R1 and OpenAI o3-mini. Even though it uses an existing framework, the paper came out remarkably fast. (There is a footnote: "The team conducting the study was part of the early access safety testing program of OpenAI: https://openai.com/index/early-access-for-safety-testing/".)
The conclusion: "Our results suggests that OpenAI’s o3-mini LLM is a much safer model than DeepSeek-R1, which answered unsafely to almost 12% of the executed unsafe prompts."
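To put the two percentages side by side, here is a small worked calculation. It is only a sketch: the helper names `relative_rate` and `expected_unsafe` are hypothetical, the batch size of 1,000 prompts is an arbitrary illustration rather than the paper's actual test-set size, and it assumes both rates were measured over the same set of prompts.

```python
def relative_rate(rate_a_pct: float, rate_b_pct: float) -> float:
    """How many times larger rate A is than rate B (both in percent)."""
    if rate_b_pct <= 0:
        raise ValueError("rate_b_pct must be positive")
    return rate_a_pct / rate_b_pct


def expected_unsafe(num_prompts: int, unsafe_pct: float) -> float:
    """Expected number of unsafe answers for a given prompt count and rate."""
    return num_prompts * unsafe_pct / 100.0


# Rates as quoted in the summary above.
DEEPSEEK_R1_UNSAFE_PCT = 11.98
O3_MINI_UNSAFE_PCT = 1.19

if __name__ == "__main__":
    ratio = relative_rate(DEEPSEEK_R1_UNSAFE_PCT, O3_MINI_UNSAFE_PCT)
    print(f"DeepSeek-R1 unsafe rate is about {ratio:.1f}x that of o3-mini")

    # Hypothetical batch size, chosen only for illustration.
    batch = 1000
    for name, pct in [("DeepSeek-R1", DEEPSEEK_R1_UNSAFE_PCT),
                      ("o3-mini", O3_MINI_UNSAFE_PCT)]:
        print(f"{name}: ~{expected_unsafe(batch, pct):.0f} unsafe answers per {batch} prompts")
```

The quoted rates differ by roughly a factor of ten, which is what "a much safer model" amounts to under this particular test suite.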

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.2]
This paper introduces UI-TARS, a native GUI agent model. On the OSWorld benchmark, UI-TARS scores 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 17:48:10 GMT)
A proposal for UI-TARS, a GUI agent, claiming SOTA on a variety of tasks. "UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines." This gives the strong impression that everything feasible has been packed in.
The repository is GitHub – bytedance/UI-TARS

A Survey of Embodied AI in Healthcare: Techniques, Applications, and Opportunities

A Survey of Embodied AI in Healthcare: Techniques, Applications, and Opportunities [31.2]
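EmAI in healthcare spans diverse fields such as algorithms, robotics, and biomedicine. The survey gives a comprehensive overview of the "brain" of EmAI for healthcare, introducing AI algorithms for perception, actuation, planning, and memory. It discusses technical barriers, explores ethical considerations, and looks ahead to the future of EmAI in healthcare.
Reference translation of the paper (metadata) (Mon, 13 Jan 2025 16:35:52 GMT)
A survey of embodied AI in healthcare. It is very broad in scope and cites more than 800 references.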

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.4]
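This work introduces a new benchmark called MMDocIR, which covers two distinct tasks: page-level and layout-level retrieval. The MMDocIR benchmark comprises a rich dataset with annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions.
Reference translation of the paper (metadata) (Wed, 15 Jan 2025 14:30:13 GMT)
A retrieval benchmark for long, multimodal documents. Its distinguishing feature is that it has two tasks: document page-level and layout-level retrieval.
The repository is MMDocIR (MMDocIR)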

RECALL: Library-Like Behavior In Language Models is Enhanced by Self-Referencing Causal Cycles

RECALL: Library-Like Behavior In Language Models is Enhanced by Self-Referencing Causal Cycles [18.1]
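The paper introduces the concept of self-referencing causal cycles (RECALL), which makes it possible to bypass the limitations of unidirectional causality. RECALL is driven by what the authors designate as cycle tokens.
Reference translation of the paper (metadata) (Thu, 23 Jan 2025 09:14:07 GMT)
A proposal for self-referencing causal cycles, RECALL: "a mechanism that enables large language models (LLMs) to bypass the limitations of unidirectional causality, which underlies a phenomenon known as the reversal curse." It is interesting as a response to a problem commonly seen in causal language models.
https://github.com/samunaai/remember is said to be the repository, but it currently returns a 404.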

Harnessing Large Language Models for Disaster Management: A Survey

Harnessing Large Language Models for Disaster Management: A Survey [57.0]
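Large language models (LLMs) have revolutionized scientific research with their exceptional capabilities and transformed various fields. This work aims to guide the professional community in developing advanced LLMs for disaster management, in order to enhance resilience against natural disasters.
Reference translation of the paper (metadata) (Sun, 12 Jan 2025 21:00:50 GMT)
A survey on applying LLMs to disasters, organized along the axes of Mitigation, Preparedness, Response, and Recovery.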

GPS as a Control Signal for Image Generation

GPS as a Control Signal for Image Generation [95.4]
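The paper shows that the GPS tags contained in image metadata are a useful control signal for image generation. The authors train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city.
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 18:59:46 GMT)
According to the authors, "Our work suggests that GPS coordinates are a useful signal for controllable image generation." Intuitively this does seem effective, and there are likely many cases where GPS provides clear contextual information.
The project site is GPS as a Control Signal for Image Generation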
