Manifold Bandits: Bayesian Curriculum Learning Based on the Latent Geometry of Large Language Models

Abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), and its training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty and treat problem selection as a standard bandit problem with independent arms, thereby overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type awareness into problem sampling.
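
To make the idea of Bayesian sampling over a hierarchical task tree more concrete, the following is a minimal illustrative sketch and not the BMC algorithm itself, whose details the abstract does not specify. It assumes a hypothetical `Node` class, a Beta posterior per node over whether a rollout is "productive" (for example, whether it yields a non-zero learning signal), Thompson sampling to descend the tree, and an exponential discount as one simple way to handle non-stationarity. All names and parameters here are assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical task-tree node; leaf nodes hold problem ids."""
    name: str
    children: list["Node"] = field(default_factory=list)
    problems: list[str] = field(default_factory=list)
    alpha: float = 1.0  # Beta pseudo-count for productive outcomes (prior = 1)
    beta: float = 1.0   # Beta pseudo-count for unproductive outcomes (prior = 1)


def sample_problem(root: Node, rng: random.Random) -> tuple[list[Node], str]:
    """Thompson-sample a root-to-leaf path, then pick a leaf problem uniformly."""
    path, node = [root], root
    while node.children:
        # Draw one posterior sample per child and follow the largest draw.
        node = max(node.children, key=lambda c: rng.betavariate(c.alpha, c.beta))
        path.append(node)
    return path, rng.choice(node.problems)


def update(path: list[Node], productive: bool, discount: float = 0.99) -> None:
    """Decay old evidence toward the prior, then add the new observation."""
    for node in path:
        node.alpha = 1.0 + discount * (node.alpha - 1.0) + (1.0 if productive else 0.0)
        node.beta = 1.0 + discount * (node.beta - 1.0) + (0.0 if productive else 1.0)


# Usage: a two-leaf tree with made-up task names and problem ids.
rng = random.Random(0)
algebra = Node("algebra", problems=["a1", "a2"])
geometry = Node("geometry", problems=["g1"])
root = Node("math", children=[algebra, geometry])
path, problem_id = sample_problem(root, rng)
update(path, productive=True)
```

Sharing evidence along the whole sampled path is what distinguishes this sketch from an independent-arm bandit: an observation on one problem also shifts the posterior of its ancestors, and therefore the sampling probability of related problems.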