When Should We Introduce Safety Interventions During Pretraining?
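[100.4] Prior work has shown that pretraining interventions, such as changing how harmful content is expressed, can substantially improve the safety of the resulting models. Introducing the interventions generally yields more robust models without an accompanying increase in overrefusal rates. The authors also see clear benefits for the steerability of models toward safer generations. Reference translation of the paper (metadata) (Sun, 11 Jan 2026 22:38:17 GMT)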
The paper states: "Our experiments show that incorporating safety pretraining interventions indeed help, and the clearest result is that there is much improved robustness after benign finetuning when pretraining interventions are introduced earlier (e.g., at 0% or 20% of the pretraining tokens). This also manifests into impacts on the model’s underlying representation geometry; incorporating interventions and metadata earlier in pretraining leads to greater separation of safe vs unsafe content."
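To make "introduced at 0% or 20% of the pretraining tokens" concrete, here is a minimal Python sketch of a token-budget schedule. It is not the paper's implementation: the function name `intervention_active` and its parameters are hypothetical. It only assumes that an intervention is switched on once a given fraction of the total token budget has been consumed.

```python
def intervention_active(tokens_seen: int, total_tokens: int, start_fraction: float) -> bool:
    """Return True once training has consumed at least `start_fraction` of the budget.

    All names here are illustrative assumptions, not taken from the paper.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 <= start_fraction <= 1.0:
        raise ValueError("start_fraction must lie in [0, 1]")
    if tokens_seen < 0:
        raise ValueError("tokens_seen must be non-negative")
    # Compare the consumed share of the budget against the configured start point.
    return tokens_seen / total_tokens >= start_fraction


if __name__ == "__main__":
    budget = 1_000  # hypothetical total number of pretraining tokens
    for start in (0.0, 0.2):
        for seen in (0, 150, 250):
            state = "on" if intervention_active(seen, budget, start) else "off"
            print(f"start={start:.0%} tokens_seen={seen}: intervention {state}")
```

With a start fraction of 0.0 the intervention applies from the first token. With 0.2 the first fifth of the budget is trained without it, which is the kind of early-versus-later contrast the quoted passage describes.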