October 23, 2024 – Introducing the latest arXiv papers

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance [95.0]
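We tackle the challenge of developing proactive agents that can anticipate and initiate tasks without explicit human instructions. First, we collect real-world human activities and generate proactive task predictions. These predictions are labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 08:24:09 GMT)
This work develops agents that act without instructions, in the setting "we investigate a new scenario where the agent autonomously predicts tasks users might assign, aiming to offer assistance proactively". The authors build a benchmark called ProactiveBench and evaluate on it. Fine-tuning appears to be very effective; is that because the task is so specialized?
The repository is GitHub – thunlp/ProactiveAgent: A LLM-based Agent that predict its tasks proactively.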

Harnessing Webpage UIs for Text-Rich Visual Understanding [112.0]
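We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). These instructions are paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 17:48:54 GMT)
This work builds a dataset, described as "We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.", and then builds MLLMs using that data.
The project site is MultiUI, and the repository is GitHub – neulab/MultiUI: Code for Paper: Harnessing Webpage Uis For Text Rich Visual Understanding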

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition [111.3]
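ActionAtlas is a video question-answering benchmark featuring short videos across various sports. The dataset contains 934 videos showing 580 unique actions across 56 sports, with a total of 1896 actions appearing among the answer choices. We evaluate open and proprietary foundation models on this benchmark and find that the best model, GPT-4o, achieves 45.52% accuracy.
Reference translation of the paper (metadata) (Tue, 08 Oct 2024 07:55:09 GMT)
A dataset in which "The question pinpoints specific individuals, asking which choice “best” describes their action within a certain temporal context." It looks very difficult.
The project site is ActionAtlas (mrsalehi.github.io)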

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration [33.9]
Vision-language foundation models (such as CLIP) have shown their capability in transfer learning thanks to large-scale image-text pre-training. In this paper, we propose a general and concise TransAgent framework that transfers the knowledge of isolated agents in a unified manner. Our TransAgent achieves state-of-the-art performance on 11 visual recognition datasets.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 03:01:44 GMT)
This work integrates agentic models. According to the authors, "By adaptively integrating the external knowledge of agents from different modalities via MoA gating mechanism, TransAgent achieves state-of-the-art performance on 11 datasets under the low-shot scenarios."
The repository is GitHub – markywg/transagent: [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration