arXiv – Page 10 – Introducing the Latest arXiv Papers

Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [49.4]
This paper examines the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques show the potential to improve LLM reasoning without additional training. OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification.
Reference translation of the paper (metadata) (Tue, 18 Feb 2025 04:11:29 GMT)
An examination of the currently popular topic of inference-time computation. I found this statement particularly interesting: "Language models rely on retrieval rather than true understanding. Despite advancements in reasoning abilities with LRMs such as O1 and O1-Mini, they still appear to be pattern matching rather than genuine reasoning."
The repository is GitHub – divelab/Sys2Bench: Sys2Bench is a benchmarking suite designed to evaluate reasoning and planning capabilities of large language models across algorithmic, logical, arithmetic, and common-sense reasoning tasks.

PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.8]
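This paper proposes PC-Agent, a hierarchical agent framework. On the perception side, it devises an Active Perception Module (APM) to overcome the inadequate ability of current MLLMs to perceive screenshot content. On the decision-making side, it proposes a hierarchical multi-agent collaboration architecture to handle complex user instructions and interdependent subtasks more effectively.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 05:41:55 GMT)
A proposal for a framework characterized by (1) an Active Perception Module, (2) Hierarchical Multi-agent Collaboration, and (3) Reflection-based Dynamic Decision-making. The authors also built a benchmark for evaluation, and they claim an advantage over UFO and Agent-S.
It is a multi-agent configuration consisting of a Manager Agent, a Progress Agent, a Decision Agent, and a Reflection Agent.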

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? [0.0]
SWE-Lancer is a benchmark of more than 1,400 freelance software engineering tasks from Upwork. Independent tasks are graded with end-to-end tests that were triple-verified by experienced software engineers. The authors evaluate model performance and find that frontier models are still unable to solve the majority of tasks.
Reference translation of the paper (metadata) (Mon, 17 Feb 2025 18:41:16 GMT)
A benchmark whose results can be converted into a monetary amount: "SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations —".
The repository is GitHub – openai/SWELancer-Benchmark: This repo contains the dataset and code for the paper “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”
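To make the monetary conversion concrete, here is a minimal sketch of how a dollar figure could be derived from per-task results. It assumes that a task's payout counts as earned only when its end-to-end tests pass. The record layout, the field names `payout_usd` and `passed`, and the sample tasks are all hypothetical and are not taken from the SWE-Lancer repository.

```python
def total_earnings(results):
    """Sum the payouts of tasks whose end-to-end tests passed (assumed scoring rule)."""
    return sum(r["payout_usd"] for r in results if r["passed"])


# Hypothetical results for three tasks; the payouts are illustrative only.
results = [
    {"task": "bug fix", "payout_usd": 50, "passed": True},
    {"task": "feature implementation", "payout_usd": 32_000, "passed": False},
    {"task": "UI tweak", "payout_usd": 250, "passed": True},
]

earned = total_earnings(results)
available = sum(r["payout_usd"] for r in results)
print(f"earned ${earned:,} of ${available:,}")
```

Under this assumed rule, one unsolved high-value task outweighs many solved small ones, which is why a dollar total can tell a different story from a plain pass rate.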

Towards an AI co-scientist, Grok-3, Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling

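The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building on prior evidence and aligned with the research objectives and guidance that scientists provide. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, accelerated by scaling test-time compute. The key contributions are (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling and (2) a tournament evolution process for self-improving hypothesis generation. The system proposes promising candidates for validation, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations.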
Google Research launches new scientific research tool, AI co-scientist　ai_coscientist.pdf

A proposal from Google for using AI to support scientists. It claims that "Its ability to generate novel testable hypotheses across diverse scientific and biomedical domains, some supported by experimental findings, along with the capacity for recursive self-improvement with increasing compute, demonstrates the promise of meaningfully accelerating scientists’ endeavours to resolve grand challenges in human health, medicine and science." The pipeline design (and the multi-agent design) is also elaborate. Applications are reportedly accepted through the Google AI co-scientist Trusted Tester Program.

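xAI's announcement of Grok-3 and DeepSearch (Grok 3 Beta — The Age of Reasoning Agents) and NVIDIA's Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling | NVIDIA Technical Blog also seem to suggest that AI is being built into tasks such as research and is becoming indispensable. Many different attempts are under way, including open efforts, and I look forward to what comes next.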

How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation

How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation [30.7]
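This paper introduces BehaviorChain, the first benchmark for evaluating the ability of digital twins to simulate continuous human behavior. BehaviorChain consists of diverse, high-quality persona-based behavior chains, with 15,846 distinct behaviors across 1,001 unique personas. Comprehensive evaluation results show that even state-of-the-art models struggle to accurately simulate continuous human behavior.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 15:29:32 GMT)
A benchmark for predicting continuous behavior, which should be possible if a digital twin of a person can really be built. The dataset is described as follows: "BEHAVIORCHAIN instance is composed of four key components: a persona profile p, a historical narrative h, a behavior chain B = {b1,b2,…,bn} of the specific persona, and the contextual setting for each behavior C = {c1,c2,…,cn}." and "BEHAVIORCHAIN comprises 1,001 high-quality, persona-based behavior chains, each containing 10–20 context-behavior nodes, automatically extracted from fiction and biographical literature." The task appears to be hard even for GPT-4o, while Llama performs surprisingly well. The effect of data leakage is a concern, but it is an interesting task.
The repository is GitHub – O-L1RU1/BehaviorChain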
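The four components quoted above map naturally onto a small record type. The following is a minimal sketch only: the class name, field names, and sample values are hypothetical and are not the schema used in the BehaviorChain repository. The one structural assumption is that each behavior b_i is paired with exactly one context c_i.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BehaviorChainInstance:
    """One benchmark instance: persona p, history h, behaviors B, contexts C."""
    persona_profile: str       # p
    historical_narrative: str  # h
    behaviors: List[str]       # B = {b1, ..., bn}
    contexts: List[str]        # C = {c1, ..., cn}

    def __post_init__(self) -> None:
        # Assumption: contexts and behaviors are aligned one-to-one.
        if len(self.behaviors) != len(self.contexts):
            raise ValueError("each behavior needs exactly one context")

    def nodes(self) -> List[Tuple[str, str]]:
        """Return the chain as ordered (context, behavior) nodes."""
        return list(zip(self.contexts, self.behaviors))


# Hypothetical two-node example (the quoted description says real chains have 10-20 nodes).
example = BehaviorChainInstance(
    persona_profile="a cautious ship's doctor",
    historical_narrative="has just joined a long voyage",
    behaviors=["checks the medical supplies", "reports a shortage to the captain"],
    contexts=["first night aboard", "the morning after the inventory"],
)
print(example.nodes())
```

A model under evaluation would presumably see the persona, the history, and the earlier nodes, and then be asked for the next behavior given its context.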

Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey

Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey [92.4]
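Retrieval-Augmented Generation (RAG) is an advanced technique designed to address the challenges of Artificial Intelligence-Generated Content (AIGC). RAG provides reliable and up-to-date external knowledge, reduces hallucinations, and ensures relevant context across a wide range of tasks. Despite RAG's success and potential, recent studies show that the RAG paradigm also introduces new risks, including privacy concerns, adversarial attacks, and accountability issues.
Reference translation of the paper (metadata) (Sat, 08 Feb 2025 06:50:47 GMT)
A survey on trustworthiness in RAG. Granted that there are many practical considerations, I am a little surprised that a survey from this angle has become necessary.
The repository is GitHub – Arstanley/Awesome-Trustworthy-Retrieval-Augmented-Generation, where a paper list is published.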

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention [32.5]
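We present NSA, a natively trainable sparse attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision.
Reference translation of the paper (metadata) (Sun, 16 Feb 2025 11:53:44 GMT)
A proposal from DeepSeek for hierarchical, sparse attention. It is reported to be several times faster, or more, than a standard implementation.
The experiments use the following configuration: "Following the common practice in state-of-the-art LLMs, our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters." On the Average score, quality is also at least on par with full attention.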
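The idea of combining coarse-grained compression with fine-grained selection can be illustrated with a toy, single-query example. This is a simplified sketch and not the NSA implementation: mean pooling as the compression step, raw dot products as block scores, and all function names are assumptions, and there is no training or hardware optimization here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Scaled dot-product attention for one query over lists of vectors."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def mean_pool(vectors):
    """Compress a block of vectors into one vector (assumed compression)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def coarse_then_fine(q, keys, values, block_size, top_k):
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]
    # Coarse branch: one compressed key/value per block gives a global view.
    k_c = [mean_pool([keys[i] for i in b]) for b in blocks]
    v_c = [mean_pool([values[i] for i in b]) for b in blocks]
    coarse_out = attend(q, k_c, v_c)
    # Fine branch: keep only the top_k highest-scoring blocks, at full resolution.
    ranked = sorted(range(len(blocks)), key=lambda b: dot(q, k_c[b]), reverse=True)
    chosen = sorted(ranked[:top_k])
    idx = [i for b in chosen for i in blocks[b]]
    fine_out = attend(q, [keys[i] for i in idx], [values[i] for i in idx])
    return coarse_out, fine_out, chosen


q = [1.0, 0.0]
keys = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0], [1.0], [3.0], [5.0]]
coarse_out, fine_out, chosen = coarse_then_fine(q, keys, values, block_size=2, top_k=1)
print(chosen, fine_out)
```

With `top_k` equal to the number of blocks, the fine branch reduces to ordinary full attention, so the savings come entirely from how few blocks are kept.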

Generative AI and Creative Work: Narratives, Values, and Impacts

Generative AI and Creative Work: Narratives, Values, and Impacts [37.2]
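We review online media outlets and analyze the dominant narratives around AI's impact on creative work that they convey. The discourse promotes creativity that is freed from its material realization through human labor. It tends to correspond to the dominant techno-positivist vision and to assert power over the creative economy and culture.
Reference translation of the paper (metadata) (Thu, 06 Feb 2025 10:26:56 GMT)
"In this article, we review online media outlets and analyze the dominant narratives around AI’s impact on creative work that they convey."
Whether lower barriers to entry are a good thing, and whether it is desirable for ideas to carry more weight relative to execution, are questions on which people differ, but the spread of technology cannot be stopped. That said, I do wonder whether "For example, we believe that five years ago, narratives of generative AI in art emphasized the replacement of artists by technology, whereas current narratives focus more on augmentation and collaboration." is really true.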

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models [51.9]
SelfCite is a self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation. Its effectiveness is demonstrated by increasing citation F1 by up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.
Reference translation of the paper (metadata) (Thu, 13 Feb 2025 18:55:13 GMT)
The reward is computed as follows: "First, the full context is used to generate a response. Then, the framework evaluates the probability of generating the same response after (1) removing the cited sentences from the context and (2) using only the cited sentences in the context. The probability drop and hold are computed from these probability differences, and their sum is used as the final reward." The authors report that preference optimization with SimPO gave good results.
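The quoted reward can be written down in a few lines. The sketch below assumes that the "probability differences" are differences of log-probabilities of the same response under the three contexts, and that the sign conventions are as commented. Both points, like the function and argument names, are assumptions for illustration and are not taken from the authors' code.

```python
def selfcite_reward(logp_full, logp_without_cited, logp_only_cited):
    """Sum of probability drop and probability hold (log-probability differences assumed)."""
    # Drop: how much less likely the response becomes once the cited sentences are removed.
    prob_drop = logp_full - logp_without_cited
    # Hold: how well the likelihood is kept when only the cited sentences remain.
    prob_hold = logp_only_cited - logp_full
    return prob_drop + prob_hold


# Hypothetical log-probabilities of one response under the three contexts.
good_citation = selfcite_reward(
    logp_full=-10.0, logp_without_cited=-25.0, logp_only_cited=-11.0)
poor_citation = selfcite_reward(
    logp_full=-10.0, logp_without_cited=-10.5, logp_only_cited=-24.0)
print(good_citation, poor_citation)
```

Under these assumptions the full-context term cancels in the sum, so the reward compares the "only cited sentences" context directly against the "cited sentences removed" context: citations that are both necessary and sufficient score highest.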

Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks

Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.8]
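Recent ML security literature focuses on attacks against aligned large language models (LLMs). This paper analyzes security and privacy vulnerabilities that are unique to LLM agents. The authors carry out a series of illustrative attacks on popular open-source and commercial agents and demonstrate the immediate practical implications of their vulnerabilities.
Reference translation of the paper (metadata) (Wed, 12 Feb 2025 17:19:36 GMT)
A proposal of attack methods against LLM-based agents. The authors state: "In this paper, we argue that LLM-powered agents, especially those that have the ability to communicate with the outside world via web access or external-facing databases, already pose a massive danger to their users which has largely been overlooked by the ML security and privacy community." I was a little surprised that phishing against agents appears to work as well as it does. Opinions differ on whether Reddit can be trusted, but it was unexpected that attacks on current agents are this effective. As the paper also notes, precisely because automation is advancing, the response arrangements on the developer side seem important.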
