Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
December 19, 2025
Authors: Rujiao Long, Yang Li, Xingyao Zhang, Weixun Wang, Tianqianjin Lin, Xi Zhao, Yuchi Xu, Wenbo Su, Junchi Yan, Bo Zheng
cs.AI
Abstract
Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. The latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies before output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable, controllable modulation of the (vision-) language model's strategic behavior, yielding consistent performance gains over standard RL methods.
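
To make the mechanism concrete, below is a minimal PyTorch sketch of the latent-contextualization step described in the abstract, assuming a standard Gaussian VAE and soft-token prefixes. All module and parameter names (ReasoningPaletteVAE, latent_dim, prefix_len, contextualized_inputs) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of latent contextualization: a VAE infers a latent
# "reasoning context" from a mean-pooled QA embedding, and the latent is
# decoded into a soft token prefix prepended to the prompt embeddings.
# Names and dimensions here are illustrative, not the paper's implementation.
import torch
import torch.nn as nn


class ReasoningPaletteVAE(nn.Module):
    def __init__(self, hidden_size: int, latent_dim: int = 64, prefix_len: int = 8):
        super().__init__()
        self.hidden_size = hidden_size
        self.prefix_len = prefix_len
        # Encoder: mean-pooled QA embedding -> Gaussian posterior over z.
        self.to_mu = nn.Linear(hidden_size, latent_dim)
        self.to_logvar = nn.Linear(hidden_size, latent_dim)
        # Decoder: latent z -> learnable soft-token prefix embeddings.
        self.to_prefix = nn.Linear(latent_dim, prefix_len * hidden_size)

    def encode(self, qa_hidden: torch.Tensor, mask: torch.Tensor):
        # Mean-pool the token embeddings of the question-answer pair,
        # ignoring padding positions indicated by `mask` (batch, seq_len).
        pooled = (qa_hidden * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.to_mu(pooled), self.to_logvar(pooled)

    def reparameterize(self, mu: torch.Tensor, logvar: torch.Tensor):
        # Each sample of z potentially encodes a distinct reasoning context.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # Decode z into a (batch, prefix_len, hidden_size) soft prefix.
        return self.to_prefix(z).view(-1, self.prefix_len, self.hidden_size)


def contextualized_inputs(palette: ReasoningPaletteVAE,
                          prompt_embeds: torch.Tensor,
                          z: torch.Tensor | None = None) -> torch.Tensor:
    """Prepend a decoded latent prefix to the prompt embeddings.

    At inference time (no QA pair to encode), z is drawn from the prior,
    so different draws steer the model toward different reasoning modes.
    """
    if z is None:
        z = torch.randn(prompt_embeds.size(0), palette.to_mu.out_features,
                        device=prompt_embeds.device)
    prefix = palette.decode(z)
    return torch.cat([prefix, prompt_embeds], dim=1)
```

In this sketch, the returned embeddings would be fed to the backbone through an embedding-level input path (e.g. an `inputs_embeds` argument), so that each draw of z conditions the entire generated response, matching the abstract's description of per-sample modulation of reasoning style and structure.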