safety – Introducing the Latest arXiv Papers

Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs

Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs [61.0]
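Large vision-language models (LVLMs) show remarkable capabilities on cross-modal tasks but face significant safety challenges. Existing benchmarks are hampered by labor-intensive construction processes, static complexity, and limited discriminative power. We propose VLSafetyBencher, the first automated system for LVLM safety benchmarking.
Reference translation of the paper (metadata) (Tue, 27 Jan 2026 11:51:30 GMT)
A safety evaluation benchmark for LVLMs. According to the paper: "Experiments validates that VLSafetyBencher can construct high-quality safety benchmarks within one week at a minimal cost. The generated benchmark effectively distinguish safety, with a safety rate disparity of 70% between the most and least safe models."
In benchmarks of this kind, GPT-family models often stand out, but in this paper Claude-Sonnet-4 comes out on top. Perhaps that is because the models are evaluated as LVLMs.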

When Should We Introduce Safety Interventions During Pretraining?

When Should We Introduce Safety Interventions During Pretraining? [100.4]
Prior work has shown that pretraining interventions, such as rephrasing harmful content, can substantially improve the safety of the resulting models. Introducing the interventions generally yields more robust models with no increase in over-refusal rates. We also see clear benefits in the steerability of models toward safer generations.
Reference translation of the paper (metadata) (Sun, 11 Jan 2026 22:38:17 GMT)
According to the paper: "Our experiments show that incorporating safety pretraining interventions indeed help, and the clearest result is that there is much improved robustness after benign finetuning when pretraining interventions are introduced earlier (e.g., at 0% or 20% of the pretraining tokens). This also manifests into impacts on the model’s underlying representation geometry; incorporating interventions and metadata earlier in pretraining leads to greater separation of safe vs unsafe content."
It is surprising that timing makes such a sizable difference.

International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management

International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management [115.9]
The second key update to the 2025 International AI Safety Report assesses new developments in general-purpose AI risk management over the past year. It examines how researchers, public institutions, and AI developers are approaching risk management for general-purpose AI.
Reference translation of the paper (metadata) (Tue, 25 Nov 2025 03:12:56 GMT)
The latest edition of the AI Safety Report. The highlights are very informative, and the statement "Open-weight models lag less than a year behind leading closed-weight models, shifting the risk landscape." seems particularly important.
On the attack side, the situation is that "tests show that sophisticated attackers can still bypass safeguards around half of the time when given 10 attempts." and "As few as 250 malicious documents inserted into training data can allow attackers to trigger undesired model behaviours with specific prompts. Some research shows that such data poisoning attacks require relatively few resources to carry out, regardless of model size." Even so, the pace of progress is also interesting: "The number of AI companies with Frontier AI Safety Frameworks more than doubled in 2025: at least 12 companies now have such frameworks."
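As a rough back-of-the-envelope reading of the "around half of the time when given 10 attempts" figure, one can ask what per-attempt success rate it would imply. The sketch below assumes that the 10 attempts are independent and equally likely to succeed, which the report does not state; the function name `per_attempt_rate` is hypothetical.

```python
def per_attempt_rate(overall_rate: float, attempts: int) -> float:
    """Solve 1 - (1 - p)**attempts == overall_rate for p.

    Assumes independent, identically distributed attempts. This is an
    assumption made for illustration, not something stated in the report.
    """
    if not 0.0 <= overall_rate < 1.0:
        raise ValueError("overall_rate must be in [0, 1)")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - overall_rate) ** (1.0 / attempts)


p = per_attempt_rate(0.5, 10)
print(f"{p:.3f}")  # 0.067
```

Under that assumption, a success rate of roughly 7% per attempt compounds to about 50% over 10 tries, so the number of attempts allowed matters a great deal when reading such results.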

SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models

SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.9]
We introduce SafeRBench, the first benchmark to evaluate LRM safety end to end. We are the first to incorporate risk categories and levels into input design. We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
Reference translation of the paper (metadata) (Thu, 20 Nov 2025 03:41:06 GMT)
A safety benchmark evaluation targeting LRMs.
The relationship with model size is interesting: "For small models (e.g., Qwen-3-0.6B), Thinking increases risk, consistent with prior observations that reasoning traces can introduce hazards. For mid-scale models, however, Thinking yields safer behavior—lower risk and execution levels and higher refusal rates—suggesting that structured reasoning can be leveraged to reduce exposure when model capacity is sufficient. At very large scale, this pattern reverses: the MoE-based Qwen-235B shows higher risk levels under Thinking, reflecting an “always-help” tendency that makes unsafe responses more actionable. In short, reasoning improves safety up to a point; beyond that, greater capability without stronger alignment can raise exposure."

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs [35.2]
Large language models (LLMs) have shown strong performance in autonomously invoking various tools in external environments. This paper aims to evaluate the safety of LLM tool utilization so as to avoid the irreversible harm that can result from executing tools directly. We propose SafeToolBench, the first benchmark to comprehensively evaluate tool utilization security. We also propose SafeInstructTool, a novel framework that aims to enhance LLMs' awareness of tool utilization security from three perspectives.
Reference translation of the paper (metadata) (Tue, 09 Sep 2025 01:31:25 GMT)
A benchmark for evaluating security in LLM tool use. According to the paper: "we further propose SafeInstructTool, the first framework to evaluate risks across these three perspectives from nine dimensions: User Instruction Perspective (Data Sensitivity, Harmfulness of the Instruction, Urgency of the Instruction, Frequency of Tool Utilization in the Instruction), Tool Itself Perspective (Key Sensitivity, Type of Operation, Impact Scope of the Operation) and Joint Instruction-Tool Perspective (Alignment Between Instruction and Tool, Value Sensitivity). Thus, it can enhance LLMs’ awareness of tool utilization safety, leading to more safer and trustworthy language agents."
The repository is GitHub – BITHLP/SafeToolBench: [2025 EMNLP Findings] SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs
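The three perspectives and nine dimensions listed in the quotation can be written down as a small lookup table. This is only an illustrative sketch: the perspective and dimension names come from the quoted passage, while the dictionary layout and the helper `perspective_of` are hypothetical and are not taken from the SafeInstructTool code.

```python
# Illustrative only: names come from the quoted passage; the data layout
# and the helper below are assumptions, not the SafeInstructTool code.
SAFETY_DIMENSIONS = {
    "User Instruction": [
        "Data Sensitivity",
        "Harmfulness of the Instruction",
        "Urgency of the Instruction",
        "Frequency of Tool Utilization in the Instruction",
    ],
    "Tool Itself": [
        "Key Sensitivity",
        "Type of Operation",
        "Impact Scope of the Operation",
    ],
    "Joint Instruction-Tool": [
        "Alignment Between Instruction and Tool",
        "Value Sensitivity",
    ],
}


def perspective_of(dimension: str) -> str:
    """Return the perspective that a given dimension belongs to."""
    for perspective, dimensions in SAFETY_DIMENSIONS.items():
        if dimension in dimensions:
            return perspective
    raise KeyError(f"unknown dimension: {dimension}")


total = sum(len(dims) for dims in SAFETY_DIMENSIONS.values())
print(total)  # 9
print(perspective_of("Key Sensitivity"))  # Tool Itself
```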

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment [291.0]
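This paper introduces the concept of "full-stack" safety, which systematically examines safety issues across the entire process of LLM training, deployment, and commercialization. Our work exhaustively reviews more than 800 papers, ensuring comprehensive coverage and a systematic organization of security issues. This study identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems.
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 05:02:49 GMT)
A comprehensive survey on safety.
The repository also looks very promising: bingreeky/full-stack-llm-safety · GitHub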
