TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
October 18, 2025
Authors: Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen
cs.AI
Abstract
Large language models (LLMs) have shown remarkable progress in complex
reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that
allocate additional compute during inference. Among these, external TTS
(particularly the Best-of-N selection paradigm) yields scalable performance
improvements by selecting from multiple independently generated reasoning
trajectories. However, this approach faces two key limitations: (i) the high
computational overhead of deploying process reward models, and (ii) the
underutilization of the LLM's intrinsic latent representations. We introduce
TrajSelector, an efficient and effective Best-of-N framework that exploits the
hidden states of the sampler LLM for process-level scoring. A lightweight
verifier (with only 0.6B parameters) evaluates the quality of each step in a
trajectory and then aggregates these scores to identify the optimal reasoning
trajectory. Our framework employs a fully data-driven, end-to-end training
recipe that eliminates reliance on massive step-level annotations. Experimental
results across five benchmarks demonstrate that TrajSelector delivers
consistent performance gains. In the Best-of-32 setting, it surpasses majority
voting by 4.61% in accuracy and outperforms existing process reward models by
4.31% to 12.21%, all while maintaining lower inference costs.
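
To make the selection step concrete, the following is a minimal Python sketch contrasting majority voting with Best-of-N selection over aggregated step scores. It is an illustration under stated assumptions, not the authors' implementation: the `Trajectory` class, the function names, the example scores, and the use of the mean as the aggregation rule are all hypothetical, since the abstract does not specify how step scores are combined. In TrajSelector the per-step scores would come from the lightweight verifier reading the sampler LLM's hidden states; here they are hard-coded.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Trajectory:
    """One sampled reasoning trajectory (hypothetical container)."""
    answer: str               # final answer extracted from the trajectory
    step_scores: List[float]  # per-step quality scores (made-up values here)


def majority_vote(trajectories: List[Trajectory]) -> str:
    """Baseline: return the most frequent final answer."""
    counts = Counter(t.answer for t in trajectories)
    return counts.most_common(1)[0][0]


def trajectory_score(t: Trajectory) -> float:
    """Aggregate step scores into one trajectory score.

    Assumption: arithmetic mean. The abstract only says scores are aggregated.
    """
    return mean(t.step_scores)


def best_of_n(trajectories: List[Trajectory]) -> Trajectory:
    """Select the trajectory with the highest aggregated score."""
    return max(trajectories, key=trajectory_score)


# Three sampled trajectories with illustrative, hard-coded step scores.
candidates = [
    Trajectory("12", [0.4, 0.3, 0.5]),
    Trajectory("12", [0.6, 0.2, 0.4]),
    Trajectory("15", [0.9, 0.8, 0.9]),
]

print(majority_vote(candidates))     # most frequent answer
print(best_of_n(candidates).answer)  # answer of the highest-scoring trajectory
```

In this toy input the two strategies disagree: majority voting returns the answer that appears most often, while score-based selection returns the single trajectory whose steps were rated highest. That difference is the kind of case where a process-level scorer can help.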