Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Counterfactual Simulation Training for Chain-of-Thought Faithfulness [46.3]
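We introduce a training method called Counterfactual Simulation Training (CST). CST rewards CoTs that enable a simulator to accurately predict the model's outputs on counterfactual inputs. Experiments with models of up to 235B parameters show that CST can substantially improve accuracy on cue-based counterfactuals.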
Reference translation of the paper (metadata) (Tue, 24 Feb 2026 09:15:30 GMT)
To improve CoT faithfulness, the paper proposes the following approach: "we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model’s outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT." I agree that controlling the reasoning process itself is also important.
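The reward idea in the quote can be illustrated with a small toy sketch. This is only an illustration under stated assumptions, not the paper's implementation or the code in its repository: the names `simulation_reward` and `toy_simulator`, the "Hint:" prompt format, and the exact-match scoring rule are all hypothetical. The sketch scores a CoT by the fraction of counterfactual inputs on which a simulator, reading only that CoT, predicts the model's actual output.

```python
from typing import Callable, Sequence

# Hypothetical simulator interface: (cot, counterfactual_input) -> predicted model answer.
Simulator = Callable[[str, str], str]


def simulation_reward(
    cot: str,
    counterfactual_inputs: Sequence[str],
    model_outputs: Sequence[str],
    simulator: Simulator,
) -> float:
    """Fraction of counterfactual inputs where the simulator's prediction
    (made from the CoT alone) matches the model's actual output."""
    if len(counterfactual_inputs) != len(model_outputs):
        raise ValueError("inputs and outputs must have the same length")
    if not counterfactual_inputs:
        return 0.0
    hits = sum(
        simulator(cot, x) == y
        for x, y in zip(counterfactual_inputs, model_outputs)
    )
    return hits / len(counterfactual_inputs)


def toy_simulator(cot: str, counterfactual_input: str) -> str:
    """Toy stand-in for a learned simulator (an assumption for this sketch).
    If the CoT admits to following the hint, predict the hinted answer;
    otherwise predict the default answer "A"."""
    text = counterfactual_input.lower()
    if "hint" in cot.lower() and "hint:" in text:
        return text.split("hint:")[1].strip().upper()[:1]
    return "A"


# Cue-based counterfactuals: the same question with a different hinted answer.
inputs = ["Q: What is 2+2? Hint: B", "Q: What is 2+2? Hint: C"]
# Assumed behavior of the model under study: it simply follows the hint.
outputs = ["B", "C"]

faithful_cot = "I followed the hint in the prompt."
unfaithful_cot = "I computed the answer from first principles."

faithful_score = simulation_reward(faithful_cot, inputs, outputs, toy_simulator)
unfaithful_score = simulation_reward(unfaithful_cot, inputs, outputs, toy_simulator)
print(faithful_score, unfaithful_score)
```

In this toy setting, the CoT that admits to relying on the cue lets the simulator predict the model's behavior on both counterfactuals, while the CoT that hides the cue does not. That gap is the signal a simulation-based reward is meant to provide.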
The repository is GitHub – peterbhase/counterfactual-simulation-training: Codebase for paper: “Counterfactual Simulation Training for Chain-of-Thought Faithfulness”
