B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

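Self-improvement has emerged as a primary method for enhancing performance. This paper identifies and proposes methods to monitor two pivotal factors in this iterative process. B-STaR is a self-taught reasoning framework that adjusts its configuration across iterations to balance exploration and exploitation.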
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 03:58:34 GMT)
The paper states: "In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model’s ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation)." It then proposes a method that monitors these two factors and keeps them in balance.
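To make the two monitored factors concrete, here is a minimal Python sketch. It is not taken from the B-STaR repository: the `Candidate` record, the function names, the proxy metrics (at-least-one-correct rate and distinct-answer ratio for exploration, precision of reward-selected candidates for exploitation), and the reward threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled response for a query (hypothetical record for this sketch)."""
    answer: str     # final answer extracted from the response
    reward: float   # score assigned by an external reward signal
    correct: bool   # whether the answer matches the reference


def exploration_metrics(batches):
    """Return (solve_rate, diversity) over a list of per-query candidate lists.

    solve_rate: fraction of queries with at least one correct candidate.
    diversity:  mean ratio of distinct answers to sampled candidates.
    Both are illustrative proxies for exploration, not the paper's metrics.
    """
    solve_rate = sum(any(c.correct for c in cands) for cands in batches) / len(batches)
    diversity = sum(len({c.answer for c in cands}) / len(cands) for cands in batches) / len(batches)
    return solve_rate, diversity


def exploitation_metric(batches, threshold):
    """Fraction of candidates with reward >= threshold that are actually correct.

    A high value suggests the reward separates good candidates from bad ones.
    Returns 0.0 when no candidate passes the threshold.
    """
    selected = [c for cands in batches for c in cands if c.reward >= threshold]
    if not selected:
        return 0.0
    return sum(c.correct for c in selected) / len(selected)


if __name__ == "__main__":
    # Toy data: two queries, four sampled candidates each.
    batches = [
        [Candidate("4", 0.9, True), Candidate("5", 0.2, False),
         Candidate("4", 0.8, True), Candidate("3", 0.7, False)],
        [Candidate("7", 0.6, False), Candidate("7", 0.4, False),
         Candidate("8", 0.1, False), Candidate("7", 0.3, False)],
    ]
    print(exploration_metrics(batches))
    print(exploitation_metric(batches, threshold=0.5))
```

Tracking both numbers after each self-improvement iteration would show whether sampling is collapsing onto a few answers (falling exploration) or whether the reward is letting incorrect candidates through (falling exploitation). How the framework actually adjusts its configuration in response is described in the paper and repository, not here.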
The repository is hkust-nlp/B-STaR on GitHub.
