guard – Introducing the latest arXiv papers

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security [126.5]
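Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. The paper proposes a unified three-dimensional taxonomy that classifies agentic risks by source (where), failure mode (how), and consequence (what). It also introduces a new agent safety benchmark (ATBench) and a diagnostic guardrail framework for agent safety and security (AgentDoG).
Reference translation of the paper (metadata) (Mon, 26 Jan 2026 13:45:41 GMT)
“AgentDoG provides fine-grained and contextual monitoring across agents’ trajectories, including malicious tool execution and prompt injection. More crucially, AgentDoG provides a more transparent perspective to understand why an agent takes a particular action in an unsafe or seemingly safe but unreasonable way,” A proposal for a high-performance guardrail that evaluates agents at the trajectory level. I think it is also a good paper for getting an overview of this field.
The repository is GitHub – AI45Lab/AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security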
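
The three-dimensional taxonomy (source, failure mode, consequence) can be pictured as a small record attached to each judged trajectory. The following is a minimal sketch, not AgentDoG's actual output schema: the class name `RiskDiagnosis`, its fields, and the example label strings are all hypothetical and exist only to illustrate how a diagnosis can carry more than a binary safe/unsafe label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskDiagnosis:
    """Hypothetical diagnosis record; not AgentDoG's real output format."""

    safe: bool
    source: Optional[str] = None        # where the risk originates
    failure_mode: Optional[str] = None  # how the agent fails
    consequence: Optional[str] = None   # what harm results

    def summary(self) -> str:
        """Render the diagnosis as a single readable line."""
        if self.safe:
            return "safe"
        return (
            f"unsafe: source={self.source}, "
            f"failure_mode={self.failure_mode}, "
            f"consequence={self.consequence}"
        )


# The label strings below are invented for illustration only.
diagnosis = RiskDiagnosis(
    safe=False,
    source="tool_output",
    failure_mode="prompt_injection_followed",
    consequence="data_exfiltration",
)
print(diagnosis.summary())
```

Keeping the three axes as separate fields means a reviewer can filter trajectories by any one of them, for example all cases whose risk came from a tool output regardless of the resulting harm.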

GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents [38.4]
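GUIs expose richer and more accessible private information, and privacy risks depend on interaction trajectories across sequential scenes. This paper proposes a three-stage framework for privacy-preserving GUI agents: privacy recognition, privacy protection, and task execution under protection. The results highlight privacy recognition as a critical bottleneck for GUI agents.
Reference translation of the paper (metadata) (Mon, 26 Jan 2026 11:33:40 GMT)
A proposal for a privacy protection framework and benchmark for GUI agents, which are promising but also carry large risks. “these results underscore privacy recognition as a critical and unresolved bottleneck in GUI privacy protection pipelines, limiting the reliability of subsequent protection mechanisms.” I think this is right, and it will need to be solved going forward.
The project site is GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents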
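
The three stages (privacy recognition, privacy protection, task execution under protection) can be sketched as a simple pipeline. This is an illustration under loud assumptions, not GUIGuard's implementation: the function names are hypothetical, a regular expression for email addresses stands in for a real recognizer, and the "agent" is a stub. It does show why recognition is the bottleneck: anything stage 1 misses reaches the agent unmasked.

```python
import re
from typing import List, Tuple

# Stand-in recognizer: only email addresses count as private data here.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def recognize(screen_text: str) -> List[Tuple[int, int]]:
    """Stage 1 (privacy recognition): return spans of private data."""
    return [m.span() for m in EMAIL.finditer(screen_text)]


def protect(screen_text: str, spans: List[Tuple[int, int]]) -> str:
    """Stage 2 (privacy protection): mask every recognized span."""
    pieces, last = [], 0
    for start, end in spans:
        pieces.append(screen_text[last:start])
        pieces.append("[MASKED]")
        last = end
    pieces.append(screen_text[last:])
    return "".join(pieces)


def execute(task: str, protected_text: str) -> str:
    """Stage 3 (task execution under protection): stub agent that
    only ever sees the protected text."""
    return f"{task} | observed: {protected_text}"


screen = "Contact: alice@example.com, Order #1234"
protected = protect(screen, recognize(screen))
print(execute("check order status", protected))
```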

YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models [36.1]
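The paper introduces YuFeng-XGuard, a reasoning-centric guardrail model for large language models (LLMs). Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions that include explicit risk categories and confidence scores. It introduces a dynamic policy mechanism that decouples risk perception from policy enforcement, so safety policies can be adjusted without retraining the model.
Reference translation of the paper (metadata) (Thu, 22 Jan 2026 02:23:18 GMT)
“Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process.” A guardrail that reports this level of detail.
The model is Alibaba-AAIG/YuFeng-XGuard-Reason-8B · Hugging Face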
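
The idea of decoupling risk perception from policy enforcement can be sketched as follows. This is a minimal illustration, not YuFeng-XGuard's API: `RiskPrediction`, `enforce`, the category names, and the threshold values are all hypothetical. The model's structured prediction is computed once (stubbed here), and the policy is plain data applied afterwards, so changing the policy requires no retraining.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RiskPrediction:
    """Hypothetical structured output of a guardrail model."""

    scores: Dict[str, float]  # risk category -> confidence in [0, 1]
    explanation: str


def enforce(prediction: RiskPrediction, policy: Dict[str, float]) -> List[str]:
    """Return the sorted categories whose confidence meets the policy
    threshold. Categories absent from the policy are not enforced."""
    return sorted(
        category
        for category, score in prediction.scores.items()
        if category in policy and score >= policy[category]
    )


# Perception step, stubbed: in practice this would come from the model.
prediction = RiskPrediction(
    scores={"violence": 0.35, "privacy": 0.80},
    explanation="Stub explanation for illustration.",
)

# Two policies applied to the same prediction.
strict_policy = {"violence": 0.30, "privacy": 0.50}
lenient_policy = {"violence": 0.90, "privacy": 0.90}

print(enforce(prediction, strict_policy))   # ['privacy', 'violence']
print(enforce(prediction, lenient_policy))  # []
```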

DynaGuard: A Dynamic Guardrail Model With User-Defined Policies [40.6]
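The paper proposes dynamic guardian models that evaluate text based on user-defined policies. The models can be used for fast detection of policy violations, or with chain-of-thought reasoning that articulates and justifies the model's outputs.
Reference translation of the paper (metadata) (Tue, 02 Sep 2025 17:57:56 GMT)
“Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors.” The paper builds such guardian models (specifically, ones that can dynamically handle user-supplied policies); they are Qwen3-based and perform strongly.
The repository is GitHub – montehoover/DynaGuard: Code for “DynaGuard: A Dynamic Guardrail Model With User-Defined Policies.”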