TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
October 18, 2025
Authors: Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen
cs.AI
Abstract
Large language models (LLMs) have shown remarkable progress in complex
reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that
allocate additional compute during inference. Among these, external TTS
(particularly the Best-of-N selection paradigm) yields scalable performance
improvements by selecting from multiple independently generated reasoning
trajectories. However, this approach faces key limitations: (i) the high
computational overhead of deploying process reward models, (ii) the
underutilization of the LLM's intrinsic latent representations. We introduce
TrajSelector, an efficient and effective Best-of-N framework that exploits the
hidden states in the sampler LLM for process-level scoring. A lightweight
verifier (with only 0.6B parameters) evaluates the quality of each step in a
trajectory and then aggregates these step-level scores to identify the optimal reasoning
trajectory. Our framework employs a fully data-driven, end-to-end training
recipe that eliminates reliance on massive step-level annotations. Experimental
results across five benchmarks demonstrate that TrajSelector delivers
consistent performance gains. In Best-of-32 settings, it surpasses majority
voting by 4.61% in accuracy and outperforms existing process reward models by
4.31% to 12.21%, all while maintaining lower inference costs.
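To make the selection procedure concrete, below is a minimal PyTorch sketch of a Best-of-N loop in the spirit of what the abstract describes: a small verifier head scores each reasoning step from the sampler LLM's hidden states, and the per-step scores are aggregated (here by a simple mean, one possible choice) to rank trajectories. All names and shapes (`StepVerifier`, `best_of_n`, pre-pooled per-step hidden states, the 1024-dim hidden size) are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: the class/function names, mean aggregation, and the
# assumption of pre-pooled per-step hidden states are NOT from the paper.
import torch
import torch.nn as nn


class StepVerifier(nn.Module):
    """Lightweight head mapping a step's pooled hidden state to a quality score."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, step_hidden: torch.Tensor) -> torch.Tensor:
        # step_hidden: (num_steps, hidden_dim), one pooled vector per reasoning step
        return self.head(step_hidden).squeeze(-1)  # -> (num_steps,)


def best_of_n(trajectories, verifier):
    """Return the trajectory with the highest aggregated step score.

    `trajectories` is a list of (text, step_hidden) pairs, where `step_hidden`
    has shape (num_steps, hidden_dim) and is reused from the sampler LLM's own
    forward pass, so no separate reward-model pass over the text is needed.
    """
    best_text, best_score = None, float("-inf")
    with torch.no_grad():
        for text, step_hidden in trajectories:
            step_scores = verifier(step_hidden)     # per-step quality scores
            traj_score = step_scores.mean().item()  # mean aggregation (one option)
            if traj_score > best_score:
                best_text, best_score = text, traj_score
    return best_text, best_score


# Hypothetical usage with random hidden states (hidden_dim = 1024 assumed):
verifier = StepVerifier(hidden_dim=1024)
candidates = [("solution A", torch.randn(5, 1024)),
              ("solution B", torch.randn(7, 1024))]
print(best_of_n(candidates, verifier))
```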