More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

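The "guidance-on-demand" approach broadens exploration while preserving the value of self-discovery. Experiments show that AMPO substantially outperforms strong baselines. With four peer-sized teachers, it achieves results comparable to methods that rely on a single, more powerful teacher.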
Reference translation of the paper (metadata) (Thu, 02 Oct 2025 17:14:00 GMT)
The paper proposes the following framework: "we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel Mixed-Policy RL framework. Instead of relying on a single stronger teacher (e.g., GPT4o or DeepSeek-R1), AMPO leverages the collective intelligence of multiple peer models. It operates on a “guidance-on-demand” principle: external guidance from diverse teachers replaces on-policy failures only when the student model is unable to solve a problem, thus maximizing the value of self-exploration. Furthermore, AMPO employs a comprehension-based guidance selection mechanism." It is interesting that the teacher side can be several small models rather than one powerful model.
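As a rough illustration of the "guidance-on-demand" idea in the quote, here is a minimal Python sketch. It is not taken from the AMPO repository: the `Rollout` type, the `build_group` function, the `num_guidance` parameter, and the `comprehension_score` callback are all hypothetical names, and the rule used here (replace failures only when every student rollout is wrong, then pick the correct teacher responses that score highest) is an assumption based only on the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    text: str
    correct: bool
    source: str  # "student" or a (hypothetical) teacher name


def build_group(
    student_rollouts: Sequence[Rollout],
    teacher_rollouts: Sequence[Rollout],
    comprehension_score: Callable[[Rollout], float],
    num_guidance: int,
) -> List[Rollout]:
    """Assemble the rollout group used for one policy update on one prompt."""
    group = list(student_rollouts)

    # Guidance on demand (assumed rule): if the student solved the problem
    # at least once, keep the group purely on-policy.
    if any(r.correct for r in group):
        return group

    # Otherwise consider only teacher responses that are correct, ranked by
    # a caller-supplied score standing in for "comprehension-based" selection.
    candidates = [t for t in teacher_rollouts if t.correct]
    candidates.sort(key=comprehension_score, reverse=True)
    chosen = candidates[: min(num_guidance, len(group))]

    # Replace the last len(chosen) failed student rollouts with the guidance,
    # so the group size stays constant.
    return group[: len(group) - len(chosen)] + chosen
```

In a real training loop the score would presumably come from the student model itself (for example, how likely the student finds the teacher's answer), but that detail is a guess here and should be checked against the paper and repository.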
The repository is GitHub – SII-Enigma/AMPO: Official Repository of “More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration”
