SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models
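SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.9] We introduce SafeRBench, the first benchmark that assesses the safety of large reasoning models (LRMs) end-to-end. We pioneer the incorporation of risk categories and levels into input design. We introduce a micro-thought chunking mechanism that segments long reasoning traces into semantically coherent units. Reference translation of the paper (metadata) (Thu, 20 Nov 2025 03:41:06 GMT)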
A safety benchmark for evaluating LRMs.
"For small models (e.g., Qwen-3-0.6B), Thinking increases risk, consistent with prior observations that reasoning traces can introduce hazards. For mid-scale models, however, Thinking yields safer behavior—lower risk and execution levels and higher refusal rates—suggesting that structured reasoning can be leveraged to reduce exposure when model capacity is sufficient. At very large scale, this pattern reverses: the MoE-based Qwen-235B shows higher risk levels under Thinking, reflecting an 'always-help' tendency that makes unsafe responses more actionable. In short, reasoning improves safety up to a point; beyond that, greater capability without stronger alignment can raise exposure." The relationship this describes between safety and model size is interesting.