February 24, 2025 – Introducing the latest arXiv papers

Towards an AI co-scientist, Grok-3, Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling

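The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building on prior evidence and aligned with the research goals and guidance that scientists provide. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, accelerated by scaling test-time compute. The main contributions are (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling, and (2) a tournament evolution process for self-improving hypothesis generation. The system proposes promising candidates for validation, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations.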
Google Research launches new scientific research tool, AI co-scientist　ai_coscientist.pdf

A proposal from Google for using AI to support scientists. The authors claim: "Its ability to generate novel testable hypotheses across diverse scientific and biomedical domains, some supported by experimental findings, along with the capacity for recursive self-improvement with increasing compute, demonstrates the promise of meaningfully accelerating scientists’ endeavours to resolve grand challenges in human health, medicine and science." The pipeline design (and its multi-agent structure) is also elaborate. Applications are reportedly open through the Google AI co-scientist Trusted Tester Program.
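To make the "generate, debate, and evolve" idea with a tournament more concrete, here is a minimal sketch. It is not the paper's implementation: the round-robin win count, the `judge` and `mutate` functions, and the toy `quality` score are all hypothetical stand-ins (in a real system the judge and the mutation step would presumably be LLM agents). Hypotheses are plain floats here so the loop can run on its own.

```python
import itertools
import random

def round_robin_wins(pool, judge):
    """Compare every pair once; return the number of wins per candidate."""
    wins = [0] * len(pool)
    for i, j in itertools.combinations(range(len(pool)), 2):
        winner = i if judge(pool[i], pool[j]) else j
        wins[winner] += 1
    return wins

def rank(pool, judge):
    """Return the pool sorted from most to fewest tournament wins."""
    wins = round_robin_wins(pool, judge)
    order = sorted(range(len(pool)), key=lambda i: wins[i], reverse=True)
    return [pool[i] for i in order]

def evolve(initial, judge, mutate, rounds=5, keep=3):
    """Rank, keep the top `keep`, add one mutated copy of each, repeat.

    More rounds means more comparisons, i.e. more test-time compute.
    """
    pool = list(initial)
    for _ in range(rounds):
        survivors = rank(pool, judge)[:keep]
        pool = survivors + [mutate(s) for s in survivors]
    return rank(pool, judge)[0]

# --- Toy stand-ins (assumptions, not from the paper) ---
def quality(x):
    """Hidden score of a toy 'hypothesis'; the best possible value is x = 3."""
    return -(x - 3.0) ** 2

def judge(a, b):
    """Pairwise 'debate': True if a beats b."""
    return quality(a) > quality(b)

rng = random.Random(0)

def mutate(x):
    """'Evolve' a hypothesis by perturbing it slightly."""
    return x + rng.gauss(0.0, 0.5)

initial = [0.0, 1.0, 5.5, -2.0]
best = evolve(initial, judge, mutate)
print(best, quality(best))
```

Because the strongest candidate always survives a round in this sketch, the result is never worse than the best starting hypothesis; whether the actual system has that property is not stated in the abstract.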

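xAI's announcement of Grok-3 and DeepSearch (Grok 3 Beta — The Age of Reasoning Agents) and NVIDIA's Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling | NVIDIA Technical Blog also seem to suggest that AI is being built into tasks such as research and is becoming indispensable for them. Many attempts are under way, including open efforts, and I look forward to what comes next.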

How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation

How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation [30.7]
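This paper introduces BehaviorChain, the first benchmark for evaluating the ability of digital twins to simulate continuous human behavior. BehaviorChain consists of diverse, high-quality, persona-based behavior chains, with 15,846 distinct behaviors across 1,001 unique personas. Comprehensive evaluation results show that even state-of-the-art models struggle to accurately simulate continuous human behavior.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 15:29:32 GMT)
A benchmark for predicting sequential behavior, which should be possible if a digital twin of a person can really be built. The dataset is described as "BEHAVIORCHAIN instance is composed of four key components: a persona profile p, a historical narrative h, a behavior chain B = {b1,b2,…,bn} of the specific persona, and the contextual setting for each behavior C = {c1,c2,…,cn}." and "BEHAVIORCHAIN comprises 1,001 high-quality, persona-based behavior chains, each containing 10–20 context-behavior nodes, automatically extracted from fiction and biographical literature." The task appears to be hard even for GPT-4o, while Llama performs surprisingly well. The possible effect of data leakage is a concern, but it is an interesting task.
The repository is GitHub – O-L1RU1/BehaviorChain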
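The four components quoted above (p, h, B, C) map naturally onto a small record type. The sketch below is only an illustration: the field names, the `Predictor` signature, exact-match scoring, and the choice to reveal the gold behavior before the next step are all assumptions, not the format or metric used in the BehaviorChain repository.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BehaviorChainInstance:
    persona_profile: str        # p
    historical_narrative: str   # h
    contexts: List[str]         # C = {c1, ..., cn}
    behaviors: List[str]        # B = {b1, ..., bn}

    def __post_init__(self):
        if not self.behaviors:
            raise ValueError("a chain needs at least one behavior")
        if len(self.contexts) != len(self.behaviors):
            raise ValueError("each behavior needs exactly one context")

# Hypothetical predictor signature:
# (profile, narrative, past contexts, past behaviors, current context) -> behavior
Predictor = Callable[[str, str, List[str], List[str], str], str]

def chain_accuracy(instance: BehaviorChainInstance, predict: Predictor) -> float:
    """Walk the chain in order and return the fraction of exact matches."""
    past_contexts: List[str] = []
    past_behaviors: List[str] = []
    correct = 0
    for context, gold in zip(instance.contexts, instance.behaviors):
        guess = predict(
            instance.persona_profile,
            instance.historical_narrative,
            list(past_contexts),
            list(past_behaviors),
            context,
        )
        correct += int(guess == gold)
        # Assumption: the true behavior is revealed before the next step.
        past_contexts.append(context)
        past_behaviors.append(gold)
    return correct / len(instance.behaviors)

# Toy usage with an invented persona and a trivial predictor.
example = BehaviorChainInstance(
    persona_profile="A shy librarian",
    historical_narrative="Has avoided public events for years.",
    contexts=["Invited to a party", "Asked to give a speech"],
    behaviors=["declines", "accepts"],
)

def always_declines(profile, narrative, past_contexts, past_behaviors, context):
    return "declines"

print(chain_accuracy(example, always_declines))  # prints 0.5
```

Feeding the predictor its own earlier guesses instead of the gold behaviors would test error accumulation along the chain, which is a different and harder setting.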

Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey

Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey [92.4]
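Retrieval-Augmented Generation (RAG) is an advanced technique designed to address the challenges of AI-Generated Content (AIGC). RAG supplies reliable, up-to-date external knowledge, reduces hallucinations, and ensures relevant context across a wide range of tasks. Despite RAG's success and potential, recent studies show that the RAG paradigm also introduces new risks, including privacy concerns, adversarial attacks, and accountability issues.
Reference translation of the paper (metadata) (Sat, 08 Feb 2025 06:50:47 GMT)
A survey of trustworthy RAG. Although there are many practical considerations, I am slightly surprised that a survey from this perspective is already needed.
The repository is GitHub – Arstanley/Awesome-Trustworthy-Retrieval-Augmented-Generation, where a paper list is published.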

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention [32.5]
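We introduce NSA, a natively trainable sparse attention mechanism that integrates algorithmic innovation with hardware optimization. NSA employs a dynamic hierarchical sparse strategy that combines coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision.
Reference translation of the paper (metadata) (Sun, 16 Feb 2025 11:53:44 GMT)
A proposal from DeepSeek for hierarchical, sparse attention. It is reported to be more than several times faster than a standard implementation.
The experiments use the configuration "Following the common practice in state-of-the-art LLMs, our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters.", and on average the quality is at or above that of full attention.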
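As a rough illustration of combining coarse-grained compression with fine-grained selection, the following pure-Python sketch handles a single query vector. It is not NSA itself: mean pooling as the compressor, the fixed `block_size` and `top_k`, and the equal-weight mix of the two branches are assumptions chosen for brevity, and there is no causal masking, training, or hardware-aware kernel here. It only shows which tokens each branch reads.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sparse_attend(query, keys, values, block_size=4, top_k=2):
    """Return (output vector, indices of the blocks read at full resolution)."""
    # Split the sequence into contiguous blocks.
    starts = range(0, len(keys), block_size)
    key_blocks = [keys[s:s + block_size] for s in starts]
    value_blocks = [values[s:s + block_size] for s in starts]

    # Coarse branch: one compressed key/value per block (mean pooling is
    # an assumption), giving a cheap view of the whole context.
    coarse_keys = [mean_vector(b) for b in key_blocks]
    coarse_values = [mean_vector(b) for b in value_blocks]
    coarse_out = attend(query, coarse_keys, coarse_values)

    # Fine branch: score blocks via their compressed keys, keep the top_k
    # blocks, and attend over the original tokens inside them only.
    block_scores = [dot(query, k) for k in coarse_keys]
    chosen = sorted(range(len(key_blocks)),
                    key=lambda i: block_scores[i], reverse=True)[:top_k]
    chosen.sort()
    fine_keys = [k for i in chosen for k in key_blocks[i]]
    fine_values = [v for i in chosen for v in value_blocks[i]]
    fine_out = attend(query, fine_keys, fine_values)

    # Equal-weight mix of the two branches (an assumption of this sketch).
    output = [(c + f) / 2.0 for c, f in zip(coarse_out, fine_out)]
    return output, chosen

# Toy input: 8 tokens in 2 blocks; the query aligns with the second block.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
values = [[1.0]] * 4 + [[2.0]] * 4
output, chosen = sparse_attend([0.0, 5.0], keys, values, block_size=4, top_k=1)
print(chosen, output)
```

In this sketch the fine branch touches only `top_k * block_size` tokens instead of the full sequence, which is where a sparse scheme saves work; the speedups reported for NSA depend on its hardware-aligned implementation, which this example does not attempt to reproduce.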
