staka – Page 41 – Introducing the Latest arXiv Papers

How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs

How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs [69.6]
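This paper identifies numerical precision as a key factor affecting how well Transformer-based large language models perform on mathematical tasks. The results show that Transformers operating with low numerical precision cannot handle arithmetic tasks such as iterated addition and integer multiplication. In contrast, Transformers with standard numerical precision can handle these tasks efficiently with significantly smaller model sizes.
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 17:59:35 GMT)
The paper points out that "Our results show that Transformers operating with low numerical precision fail to address arithmetic tasks, such as iterated addition and integer multiplication, unless the model size grows super-polynomially with respect to the input length."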
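As a rough illustration of why low precision hurts iterated addition, the sketch below rounds a running sum to IEEE 754 half precision after every step. It is not the paper's construction, which concerns the expressiveness of Transformers; it only shows the underlying rounding effect. The helper names `to_fp16` and `iterated_add` are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The struct format "e" stores a float as binary16, so a round trip
    # through pack/unpack applies half-precision rounding.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def iterated_add(values, low_precision: bool) -> float:
    """Sum values left to right, optionally rounding to fp16 after each add."""
    total = 0.0
    for v in values:
        total = total + v
        if low_precision:
            total = to_fp16(total)
    return total


ones = [1.0] * 3000
exact = iterated_add(ones, low_precision=False)   # 3000.0
rounded = iterated_add(ones, low_precision=True)  # stalls at 2048.0
print(exact, rounded)
```

Half precision has an 11-bit significand, so above 2048 the representable values are 2 apart. Adding 1 to 2048 rounds back to 2048, and the running sum never reaches 3000.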

Mamba in Vision: A Comprehensive Survey of Techniques and Applications

Mamba in Vision: A Comprehensive Survey of Techniques and Applications [3.5]
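Mamba has emerged as a new approach to overcoming the challenges that convolutional neural networks (CNNs) and Vision Transformers (ViTs) face in computer vision. Mamba addresses these limitations by leveraging Selective Structured State Space Models to capture long-range dependencies effectively with linear computational complexity.
Reference translation of the paper (metadata) (Fri, 04 Oct 2024 02:58:49 GMT)
A survey of how Mamba is used for images.
The repository is GitHub – maklachur/Mamba-in-Computer-Vision: Mamba in Vision: A Comprehensive Survey of Techniques and Applications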
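The linear-complexity claim can be made concrete with a toy recurrence. The sketch below is a scalar-state simplification of a selective state space layer, written under stated assumptions; it is not code from the survey or from any Mamba implementation. The function name `selective_scan` and the parameters `a`, `w_delta`, `w_b`, and `w_c` are hypothetical scalars chosen for illustration.

```python
import math


def softplus(z: float) -> float:
    """Smooth positive function, used here to keep the step size above zero."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Run a scalar-state selective SSM over xs in one left-to-right pass.

    Real models learn these parameters and use vector-valued states;
    the scalars here are assumptions that keep the recurrence readable.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # discretized decay, in (0, 1) when a < 0
        b_bar = delta * (w_b * x)      # input-dependent input projection
        h = a_bar * h + b_bar * x      # constant work per step
        ys.append((w_c * x) * h)       # input-dependent readout
    return ys


outputs = selective_scan([0.5, -0.2, 0.1, 0.9])
print(len(outputs))
```

Each step touches only the previous state, so the total cost grows linearly with sequence length, whereas full self-attention compares every pair of positions. The "selective" part is that the step size, input projection, and readout all depend on the current input.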

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance [95.0]
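We tackle the challenge of developing proactive agents that can anticipate and initiate tasks without human instructions. First, we collect real-world human activities and generate proactive task predictions. Human annotators label each prediction as either accepted or rejected. The labeled data is then used to train a reward model that simulates human judgment.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 08:24:09 GMT)
Development of agents that act without instructions, in the setting where "we investigate a new scenario where the agent autonomously predicts tasks users might assign, aiming to offer assistance proactively". The authors build a benchmark called ProactiveBench and evaluate on it. Fine-tuning appears to be very effective; could that be due to the specialized nature of the task?
The repository is GitHub – thunlp/ProactiveAgent: A LLM-based Agent that predict its tasks proactively.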

Harnessing Webpage UIs for Text-Rich Visual Understanding

Harnessing Webpage UIs for Text-Rich Visual Understanding [112.0]
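We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). These instructions are paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 17:48:54 GMT)
Construction of the dataset described as "We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.", followed by MLLMs trained on that data.
The project site is MultiUI; the repository is GitHub – neulab/MultiUI: Code for Paper: Harnessing Webpage Uis For Text Rich Visual Understanding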

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition [111.3]
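ActionAtlas is a video question-answering benchmark featuring short videos of various sports. The dataset includes 934 videos showing 580 unique actions across 56 sports, with a total of 1896 actions across the answer choices. We evaluate open and proprietary foundation models on this benchmark and find that the best model, GPT-4o, achieves 45.52% accuracy.
Reference translation of the paper (metadata) (Tue, 08 Oct 2024 07:55:09 GMT)
A dataset in which "The question pinpoints specific individuals, asking which choice “best” describes their action within a certain temporal context." It looks very difficult.
The project site is ActionAtlas (mrsalehi.github.io)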

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration [33.9]
Vision-language foundation models (such as CLIP) have demonstrated their capability in transfer learning thanks to large-scale image-text pre-training. This paper proposes TransAgent, a general and concise framework that transfers the knowledge of isolated agents in a unified manner. TransAgent achieves state-of-the-art performance on 11 visual recognition datasets.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 03:01:44 GMT)
Integration of agentic models. The paper states: "By adaptively integrating the external knowledge of agents from different modalities via MoA gating mechanism, TransAgent achieves state-of-the-art performance on 11 datasets under the low-shot scenarios."
The repository is GitHub – markywg/transagent: [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

Jailbreaking LLM-Controlled Robots

Jailbreaking LLM-Controlled Robots [82.0]
Large language models (LLMs) have revolutionized robotics by enabling contextual reasoning and intuitive human-robot interaction. LLMs are vulnerable to jailbreaking attacks, in which malicious prompts elicit harmful text by bypassing LLM safety guardrails. We introduce RoboPAIR, an algorithm for jailbreaking LLM-controlled robots.
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 15:55:36 GMT)
Jailbreaking attacks on LLM-controlled robots. The paper considers three settings: "(i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog." It reports that "In each scenario and across three new datasets of harmful robotic actions, we demonstrate that ROBOPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates." This could become a serious threat.
The project site is RoboPAIR

DocLayout-YOLO

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.3]
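We introduce DocLayout-YOLO, a new approach that improves accuracy while retaining a speed advantage. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm. For model optimization, we propose a global-to-local controllable receptive module.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 14:50:47 GMT)
Proposes the Mesh-candidate BestFit methodology, a technique for synthesizing diverse layout data, together with DocLayout-YOLO, a fast and high-performing model built with it.
The repository is GitHub – opendatalab/DocLayout-YOLO: DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception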

Latent Action Pretraining from Videos

Latent Action Pretraining from Videos [156.9]
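We introduce Latent Action Pretraining for general Action models (LAPA). LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. The paper proposes a way to learn from internet-scale videos that have no robot action labels.
Reference translation of the paper (metadata) (Tue, 15 Oct 2024 16:28:09 GMT)
Proposes a method for building VLAs from video data of the kind found on the internet. The paper states: "Across three benchmarks spanning both simulation and real-world robot experiments, we show that our method significantly improves transfer to downstream tasks compared to existing approaches."
The project site is LAPA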

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models [115.2]
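We introduce MRAG-Bench, a multimodal retrieval-augmented generation benchmark. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions. The results show that all large vision-language models (LVLMs) improve more when augmented with images than when augmented with textual knowledge.
Reference translation of the paper (metadata) (Thu, 10 Oct 2024 17:55:02 GMT)
A benchmark for multimodal RAG. The table of scores for various models is also very informative.
The repository is MRAG-Bench (mragbench.github.io)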
