The paper proposes a method for manipulating MoE routing to steer the model toward desirable (or undesirable) behaviors. On the negative side, I think the following claim is exactly right: "Critically, we are also exposing a novel dimension of "Alignment Faking" in LLMs (Greenblatt et al., 2024; Wang et al., 2024), where alignment is concentrated in a subset of experts, neglecting alternate routing paths that can catastrophically bypass alignment when triggered. We argue that, just as safety alignment must extend beyond the first few tokens (Qi et al., 2025), it must also go deeper than just a few expert pathways, ensuring robustness across the entire model routing topology."
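
To make the failure mode concrete, here is a minimal sketch (my own illustration, not the paper's actual method) of how biasing an MoE router's logits could redirect tokens away from a small set of experts where alignment behavior is hypothetically concentrated. All names (`SimpleMoELayer`, `steer_bias`, the expert indices) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    """Toy top-k token routing over a pool of expert MLPs."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x, steer_bias=None):
        # x: (tokens, d_model); steer_bias: (n_experts,) added to router logits.
        logits = self.router(x)
        if steer_bias is not None:
            # Positive bias pulls tokens toward an expert, negative pushes away.
            logits = logits + steer_bias
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out, idx


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = SimpleMoELayer()
    tokens = torch.randn(16, 64)

    _, routed = layer(tokens)
    print("default routing:", routed.flatten().bincount(minlength=8).tolist())

    # Hypothetical scenario: suppose refusal behavior is concentrated in
    # experts 0 and 1. A strongly negative bias on their logits makes the
    # router avoid them, so tokens flow through alternate paths that never
    # received the alignment signal.
    bias = torch.zeros(8)
    bias[[0, 1]] = -1e9
    _, steered = layer(tokens, steer_bias=bias)
    print("steered routing:", steered.flatten().bincount(minlength=8).tolist())
```

The point of the sketch is only that routing is a controllable degree of freedom: if safety behavior lives in a few experts, a small perturbation of the gate can route around it, which is why alignment needs to cover the full routing topology rather than a few expert pathways.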