Manifold Bandits: Bayesian Curriculum Learning on the Latent Geometry of Large Language Models

Abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.
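
To make the framing above concrete, here is a toy illustration of Bayesian sampling over a task tree. It is not BMC itself: the abstract does not specify how the tree is built or how posteriors are updated. The sketch assumes a hand-specified two-level tree (clusters of related problems), a binary "productive" outcome per sampled problem, Beta posteriors with Thompson sampling at both levels, and exponential discounting as a stand-in for tracking non-stationarity. All class, method, and parameter names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class BetaArm:
    """Beta posterior over the chance that sampling here yields a useful learning signal."""
    alpha: float = 1.0
    beta: float = 1.0

    def draw(self, rng: random.Random) -> float:
        # One Thompson draw from the current posterior.
        return rng.betavariate(self.alpha, self.beta)

    def update(self, productive: bool, discount: float = 1.0) -> None:
        # Assumption: shrink old evidence toward the Beta(1, 1) prior before adding
        # the new observation, so the posterior can follow a drifting signal.
        self.alpha = 1.0 + discount * (self.alpha - 1.0) + (1.0 if productive else 0.0)
        self.beta = 1.0 + discount * (self.beta - 1.0) + (0.0 if productive else 1.0)


@dataclass
class TaskTreeSampler:
    """Two-level tree: pick a cluster, then a problem inside it (hypothetical structure)."""
    clusters: dict[str, list[str]]
    discount: float = 0.95
    seed: int = 0

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)
        self.cluster_arms = {c: BetaArm() for c in self.clusters}
        self.problem_arms = {p: BetaArm() for ps in self.clusters.values() for p in ps}

    def sample(self) -> tuple[str, str]:
        # Thompson sampling at each level: draw from every posterior, take the argmax.
        cluster = max(self.clusters, key=lambda c: self.cluster_arms[c].draw(self.rng))
        problem = max(self.clusters[cluster], key=lambda p: self.problem_arms[p].draw(self.rng))
        return cluster, problem

    def update(self, cluster: str, problem: str, productive: bool) -> None:
        # Evidence is shared upward: the problem's outcome also informs its cluster,
        # which is what distinguishes this from a bandit with independent arms.
        self.cluster_arms[cluster].update(productive, self.discount)
        self.problem_arms[problem].update(productive, self.discount)


if __name__ == "__main__":
    sampler = TaskTreeSampler({"algebra": ["a1", "a2"], "geometry": ["g1", "g2"]})
    cluster, problem = sampler.sample()
    # In RL training, `productive` might mean the rollouts for this prompt were
    # neither all correct nor all wrong (an assumption, not the paper's definition).
    sampler.update(cluster, problem, productive=True)
    print(cluster, problem)
```

Because outcomes propagate to the cluster level, a productive problem raises the sampling probability of its siblings, which is the simplest form of the structure-awareness the abstract argues for. A sampler like this optimizes only the productivity axis; the diversity and utility tradeoffs described above would need additional terms.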