February 27, 2026 – Introducing the latest arXiv papers

What Makes a Good LLM Agent for Real-world Penetration Testing? [37.6]
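We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations on three benchmarks of increasing complexity. We show that Type B failures share a root cause that is largely independent of the underlying LLM: agents lack real-time task difficulty estimation. Excalibur is a penetration testing agent that couples strong tooling with difficulty-aware planning.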
Reference translation of the paper (metadata) (Thu, 19 Feb 2026 18:42:40 GMT)
Applying LLM agents to penetration testing.
The reported result: "PENTESTGPT V2 achieves 91% task completion on CTF benchmarks (49% improvement over baselines) and compromises 4 of 5 hosts on the GOAD Active Directory environment versus 2 for prior systems". Working with AI is becoming essential in this field too, which makes sense (and at the same time feels a little scary).

Towards a Science of AI Agent Reliability [9.6]
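AI agents are increasingly deployed to carry out important tasks. While rising accuracy on standard benchmarks suggests rapid progress, many agents continue to fail in practice. The paper proposes twelve metrics that decompose agent reliability along four key dimensions (consistency, robustness, predictability, and safety).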
Reference translation of the paper (metadata) (Wed, 18 Feb 2026 18:05:44 GMT)
A benchmark comparison along four reliability axes (consistency, robustness, predictability, safety) rather than ordinary performance. According to the paper: "14 models across two complementary benchmarks. Our results show that 18 months of rapid capability gains have produced only small improvements in reliability: models that are substantially more accurate remain inconsistent across runs, brittle to prompt rephrasings, and often fail to understand when they are likely to succeed."
The project site is the HAL Reliability Dashboard.