staka – Page 26 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

Provable In-Context Vector Arithmetic via Retrieving Task Concepts

Provable In-Context Vector Arithmetic via Retrieving Task Concepts [53.7]
We show how nonlinear residual transformers, trained by gradient descent on the cross-entropy loss, perform factual-recall ICL tasks via vector arithmetic. These results elucidate the advantages of transformers over their static-embedding predecessors.
Reference translation of the paper (metadata) (Wed, 13 Aug 2025 13:54:44 GMT)
A theoretical study that assumes task vectors: "We develop an optimization theory demonstrating that transformers with nonlinear softmax attention, MLP, layer normalization, and residual connections, trained via Gradient Descent (GD) with cross-entropy loss, can effectively perform factual-recall ICL in a vector arithmetic manner, grounded in empirically motivated data modeling. Our analysis shows that the transformer retrieves the high-level task/function concept through attention-MLP, which, when combined with any embedded query vector within the same high-level task concept, yields the correct corresponding answer vector."
There still seem to be many open questions, but I look forward to further progress in theoretical research.
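To make the "vector arithmetic" idea concrete, here is a minimal toy sketch. It is not the paper's construction: the embeddings are random, the task vector is hypothetical, and averaging the demonstration differences merely stands in for the attention-MLP retrieval that the paper analyzes in trained transformers.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical toy data: every answer embedding is its query embedding
# plus one shared task vector (think "country -> capital") plus small noise.
task_vector = rng.normal(size=dim)
queries = rng.normal(size=(5, dim))
answers = queries + task_vector + 0.01 * rng.normal(size=(5, dim))

# "Retrieve" the task concept from four in-context demonstrations
# by averaging the (answer - query) differences.
estimated_task = (answers[:4] - queries[:4]).mean(axis=0)

# Add the estimated task vector to the held-out query.
predicted = queries[4] + estimated_task


def nearest(vec, candidates):
    """Index of the candidate row with the highest cosine similarity to vec."""
    sims = candidates @ vec / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(vec)
    )
    return int(np.argmax(sims))


best = nearest(predicted, answers)
print(best)
```

In this toy setting the prediction lands closest to the held-out answer (index 4), because the only structure in the data is the shared offset between queries and answers.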

Don’t Overthink It: A Survey of Efficient R1-style Large Reasoning Models

Don’t Overthink It: A Survey of Efficient R1-style Large Reasoning Models [49.6]
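Large Reasoning Models (LRMs) have gradually become a research hotspot thanks to their strong performance on complex tasks. However, as these models have been widely applied, the problem of overthinking has gradually emerged. Various efficient reasoning methods have been proposed that aim to shorten reasoning paths without compromising model performance or reasoning capability.
Reference translation of the paper (metadata) (Mon, 04 Aug 2025 06:54:31 GMT)
A survey on making reasoning more efficient. I was surprised that there are already so many different approaches and so much research output.
The repository is yuelinan/Awesome-Efficient-R1-style-LRMs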

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory [11.7]
This paper introduces M3-Agent, a novel framework equipped with long-term memory. M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. We also developed M3-Bench, a long-video question answering benchmark.
Reference translation of the paper (metadata) (Wed, 13 Aug 2025 12:03:03 GMT)
This one also proposes an agent framework with long-term memory. The authors report the following gains: "Compared to the strongest baseline, Gemini-GPT4o-Hybrid, which implements M3-Agent framework by prompting Gemini-1.5-Pro [41] for memorization and GPT-4o [15] for control, M3-Agent improves accuracy by 6.7%, 7.7%, and 5.3% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our ablation study demonstrates the importance of semantic memory: removing it reduces accuracy by 17.1%, 19.2% and 13.1% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively."
The project site is Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Memp: Exploring Agent Procedural Memory

Memp: Exploring Agent Procedural Memory [72.4]
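Agents based on Large Language Models (LLMs) handle a wide variety of tasks, yet they suffer from brittle procedural memory that is either manually engineered or entangled in static parameters. This paper proposes Memp, which distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions. We show that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks.
Reference translation of the paper (metadata) (Fri, 08 Aug 2025 16:20:56 GMT)
Introducing memory into agents. The authors state: "Empirical results on housework automation and information-seeking benchmarks show that leveraging procedural memory significantly boosts task success rates and efficiency. Beyond improving individual episodes, Memp supports continual learning and robust generalization, marking a step toward self-improving, resilient agents."
Memory management appears to be kept simple.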
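As a rough picture of what a simple procedural memory could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `ProceduralMemory` and `MemoryEntry` names, the word-overlap retrieval, the "first word of each step" abstraction, and the drop-on-failure update rule are hypothetical and are not taken from Memp.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryEntry:
    task: str
    steps: List[str]   # fine-grained, step-by-step trajectory
    script: str        # higher-level, script-like abstraction
    successes: int = 0
    failures: int = 0


class ProceduralMemory:
    """Hypothetical store with add / retrieve / update operations."""

    def __init__(self) -> None:
        self.entries: List[MemoryEntry] = []

    def add(self, task: str, steps: List[str]) -> MemoryEntry:
        # Crude abstraction (assumption): keep only the first word of each step.
        script = " -> ".join(step.split()[0] for step in steps)
        entry = MemoryEntry(task, list(steps), script)
        self.entries.append(entry)
        return entry

    def retrieve(self, task: str) -> Optional[MemoryEntry]:
        # Word-level Jaccard overlap stands in for embedding similarity.
        words = set(task.lower().split())

        def score(entry: MemoryEntry) -> float:
            other = set(entry.task.lower().split())
            return len(words & other) / len(words | other)

        return max(self.entries, key=score, default=None)

    def update(self, entry: MemoryEntry, succeeded: bool) -> None:
        # Track outcomes and drop entries that fail more often than they succeed.
        if succeeded:
            entry.successes += 1
        else:
            entry.failures += 1
        if entry.failures > entry.successes:
            self.entries.remove(entry)


mem = ProceduralMemory()
egg = mem.add("heat an egg in the microwave",
              ["open microwave", "place egg inside", "start microwave"])
paper = mem.add("find a paper on arXiv",
                ["open browser", "search arXiv", "read abstract"])

hit = mem.retrieve("heat a potato in the microwave")
print(hit.script)

mem.update(paper, succeeded=False)  # one failure, no successes: entry is dropped
```

The point of the sketch is only that a memory entry can carry both granularities side by side, and that refinement can be as plain as counting outcomes.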

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.6]
The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multimodal) large language models ((M)LLMs), this dream is closer to reality. This survey consolidates the current state of OS Agents research and offers guidance for both academic inquiry and industrial development.
Reference translation of the paper (metadata) (Wed, 06 Aug 2025 14:33:45 GMT)
A survey that opens with: "The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multimodal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced."
The repository is OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use (ACL 2025)

AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock / AgroBench: Vision-Language Model Benchmark in Agriculture

AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock [78.0]
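Crops, fisheries and livestock form the backbone of global food production and are essential to feeding the ever-growing global population. Addressing the challenges facing these sectors requires efficient, accurate, and scalable technological solutions, which highlights the importance of artificial intelligence (AI). This survey systematically and thoroughly reviews more than 200 research works, covering conventional machine learning approaches, advanced deep learning techniques, and recent vision-language foundation models.
Reference translation of the paper (metadata) (Tue, 29 Jul 2025 17:59:48 GMT)
A survey of AI applications in agriculture.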

AgroBench: Vision-Language Model Benchmark in Agriculture [25.5]
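AgroBench is a benchmark for evaluating vision-language models (VLMs) across seven agricultural topics. AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities.
Reference translation of the paper (metadata) (Mon, 28 Jul 2025 04:58:29 GMT)
This one is a benchmark for the agricultural domain.
The repository is AgroBench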

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [194.6]
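GLM-4.5 is an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding tasks. We release both GLM-4.5 (355B parameters) and GLM-4.5-Air (106B parameters) to advance research in reasoning and agentic AI systems.
Reference translation of the paper (metadata) (Fri, 08 Aug 2025 17:21:06 GMT)
The paper for GLM-4.5 (GLM-4.5, Step-3, Falcon-H1, HunyuanWorld – arXiv最新論文の紹介). It has few parameters, especially active parameters, for its level of performance. It is hard to say anything definite without a detailed comparison, but I am curious how it compares with GPT-OSS.
The repository is GitHub – zai-org/GLM-4.5: GLM-4.5: An open-source large language model designed for intelligent agents by Z.ai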

Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning

Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning [14.3]
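Large Language Models (LLMs) have made remarkable progress through preference-based fine-tuning. This paper introduces Refine-n-Judge, an automated iterative approach that uses a single LLM as both refiner and judge to improve dataset quality. We demonstrate the effectiveness of Refine-n-Judge on public datasets spanning five corpora.
Reference translation of the paper (metadata) (Sun, 03 Aug 2025 01:56:03 GMT)
A proposal for a framework that raises data quality: "Bringing these capabilities together, we propose Refine-n-Judge, a fully automated dataset curation pipeline, summarized in Figure 2. In this framework, an LLM model serves as both the refiner - generating improved outputs - and the judge - comparing the refined output against the original and selecting the preferred version."
It is interesting that the approach was not sufficiently effective without the judge part. Perhaps the key point is having the LLM solve judging as a task distinct from refinement.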
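The refine-then-judge loop described in the quote can be sketched as follows. This is a minimal sketch under stated assumptions: `refine` and `judge` are placeholders for LLM calls, the toy stand-ins below are purely illustrative, and stopping as soon as the judge prefers the current answer is my assumption rather than a detail confirmed above.

```python
from typing import Callable, List, Tuple


def refine_n_judge(
    prompt: str,
    initial: str,
    refine: Callable[[str, str], str],
    judge: Callable[[str, str, str], bool],
    max_rounds: int = 5,
) -> List[str]:
    """Build a chain of answers in which each one was preferred over the last.

    refine(prompt, answer) returns a candidate improvement.
    judge(prompt, candidate, current) returns True if the candidate is preferred.
    """
    chain = [initial]
    current = initial
    for _ in range(max_rounds):
        candidate = refine(prompt, current)
        if not judge(prompt, candidate, current):
            break  # assumed stopping rule: the judge prefers the current answer
        chain.append(candidate)
        current = candidate
    return chain


def preference_pairs(chain: List[str]) -> List[Tuple[str, str]]:
    """(chosen, rejected) pairs from consecutive chain elements."""
    return list(zip(chain[1:], chain[:-1]))


# Toy stand-ins for the LLM calls (hypothetical).
def toy_refine(prompt: str, answer: str) -> str:
    return answer + "!"


def toy_judge(prompt: str, candidate: str, current: str) -> bool:
    return len(candidate) <= 5


chain = refine_n_judge("Say ok", "ok", toy_refine, toy_judge)
pairs = preference_pairs(chain)
print(chain)
print(pairs)
```

Dropping the `judge` call would accept every refinement unconditionally, which is the ablation the commentary above finds interesting.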

Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking

Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking [31.7]
Retrieval-augmented generation (RAG) is critical for reducing hallucinations and incorporating external knowledge into Large Language Models (LLMs). T$^2$RAG is a novel framework that operates on a simple, graph-free knowledge base of atomic triplets. Experimental results show that T$^2$RAG significantly outperforms state-of-the-art multi-round and Graph RAG methods.
Reference translation of the paper (metadata) (Mon, 04 Aug 2025 13:50:44 GMT)
A proposal for a triplet-based RAG approach: "We introduce a novel RAG framework that leverages triplets as the fundamental unit for indexing, retrieval, and reasoning, moving beyond the limitations of chunk-based and explicit graph-based approaches". It is somewhat surprising that it outperforms graph structures. On the component side, the paper notes that "both the iterative process and the use of chunks are important. The iterative reasoning module proves to be a critical component.", so perhaps the simplicity of the design also works in its favor.
The repository is rockcor/T2RAG: Official code of paper “Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking”
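To illustrate what retrieval over a flat, graph-free set of atomic triplets with an iterative step might look like, here is a toy sketch. It is not the T2RAG implementation: the knowledge base, the `"?"` and `"$prev"` placeholder conventions, and the exact-match lookup (standing in for embedding-based retrieval) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Triplet = Tuple[str, str, str]

# Hypothetical flat knowledge base of atomic (subject, relation, object) facts.
KB: List[Triplet] = [
    ("Marie Curie", "born in", "Warsaw"),
    ("Warsaw", "capital of", "Poland"),
    ("Paris", "capital of", "France"),
]


def resolve(pattern: Triplet, kb: List[Triplet]) -> Optional[Triplet]:
    """Return the first fact matching the pattern; '?' matches any value."""
    for fact in kb:
        if all(p == "?" or p == f for p, f in zip(pattern, fact)):
            return fact
    return None


def answer(patterns: List[Triplet], kb: List[Triplet]) -> Optional[str]:
    """Resolve patterns one at a time.

    Each pattern is assumed to contain exactly one '?' slot. The value found
    for that slot replaces '$prev' in the next pattern.
    """
    prev = None
    for pattern in patterns:
        filled = tuple(prev if slot == "$prev" else slot for slot in pattern)
        fact = resolve(filled, kb)
        if fact is None:
            return None
        prev = fact[filled.index("?")]
    return prev


# "Which country is Marie Curie's birthplace the capital of?"
result = answer(
    [("Marie Curie", "born in", "?"), ("$prev", "capital of", "?")],
    KB,
)
print(result)
```

No graph is ever built here: each hop is just another lookup over the same flat list, with the previous result substituted into the next pattern.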

CoAct-1: Computer-using Agents with Coding as Actions

CoAct-1: Computer-using Agents with Coding as Actions [95.0]
CoAct-1 is a novel multi-agent system that combines GUI-based control with direct programmatic execution. We evaluate our system on the OSWorld benchmark, where CoAct-1 achieves a state-of-the-art success rate of 60.76%.
Reference translation of the paper (metadata) (Tue, 05 Aug 2025 21:33:36 GMT)
A proposal for a GUI agent that makes good use of code generation: "CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary." It claims SoTA on OSWorld.
The project site is CoAct-1
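The delegation pattern in the quote can be sketched in a few lines. This is only an illustration: in CoAct-1 the Orchestrator is an agent that decides dynamically, whereas the keyword heuristic, the `CODE_HINTS` list, and the string-returning handlers below are hypothetical stand-ins.

```python
from typing import Callable, List


def gui_operator(subtask: str) -> str:
    # Stand-in for an agent that clicks and types through the GUI.
    return f"[GUI] {subtask}"


def programmer(subtask: str) -> str:
    # Stand-in for an agent that writes and runs a Python or Bash script.
    return f"[CODE] {subtask}"


# Hypothetical routing hints; the real Orchestrator is not keyword-based.
CODE_HINTS = ("file", "rename", "csv", "data", "batch")


def route(subtask: str) -> Callable[[str], str]:
    """Pick the Programmer for file/data-style work, the GUI Operator otherwise."""
    lowered = subtask.lower()
    return programmer if any(hint in lowered for hint in CODE_HINTS) else gui_operator


def orchestrate(subtasks: List[str]) -> List[str]:
    return [route(subtask)(subtask) for subtask in subtasks]


results = orchestrate([
    "Rename all files in ~/reports",
    "Click the Save button in the dialog",
])
print(results)
```

The benefit the authors describe comes from the first branch: a single script can replace a long sequence of GUI actions for file management and data processing.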
