TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
October 18, 2025
Authors: Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen
cs.AI
Abstract
Large language models (LLMs) have shown remarkable progress in complex
reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that
allocate additional compute during inference. Among these, external TTS
(particularly the Best-of-N selection paradigm) yields scalable performance
improvements by selecting from multiple independently generated reasoning
trajectories. However, this approach faces key limitations: (i) the high
computational overhead of deploying process reward models, (ii) the
underutilization of the LLM's intrinsic latent representations. We introduce
TrajSelector, an efficient and effective Best-of-N framework that exploits the
hidden states in the sampler LLM for process-level scoring. A lightweight
verifier (with only 0.6B parameters) evaluates the quality of each step in a
trajectory and then aggregates these step-level scores to identify the optimal reasoning
trajectory. Our framework employs a fully data-driven, end-to-end training
recipe that eliminates reliance on massive step-level annotations. Experimental
results across five benchmarks demonstrate that TrajSelector delivers
consistent performance gains. In Best-of-32 settings, it surpasses majority
voting by 4.61% in accuracy and outperforms existing process reward models by
4.31% to 12.21%, all while maintaining lower inference costs.
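To make the selection procedure concrete, below is a minimal PyTorch sketch of a Best-of-N loop in the spirit of what the abstract describes: a small verifier head scores each reasoning step from the sampler LLM's hidden states, and the per-step scores are aggregated (here by a simple mean, one possible choice) to rank trajectories. All names and shapes (`StepVerifier`, `best_of_n`, pre-pooled per-step hidden states, the 1024-dim hidden size) are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: the class/function names, mean aggregation, and the
# assumption of pre-pooled per-step hidden states are NOT from the paper.
import torch
import torch.nn as nn


class StepVerifier(nn.Module):
    """Lightweight head mapping a step's pooled hidden state to a quality score."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, step_hidden: torch.Tensor) -> torch.Tensor:
        # step_hidden: (num_steps, hidden_dim), one pooled vector per reasoning step
        return self.head(step_hidden).squeeze(-1)  # -> (num_steps,)


def best_of_n(trajectories, verifier):
    """Return the trajectory with the highest aggregated step score.

    `trajectories` is a list of (text, step_hidden) pairs, where `step_hidden`
    has shape (num_steps, hidden_dim) and is reused from the sampler LLM's own
    forward pass, so no separate reward-model pass over the text is needed.
    """
    best_text, best_score = None, float("-inf")
    with torch.no_grad():
        for text, step_hidden in trajectories:
            step_scores = verifier(step_hidden)     # per-step quality scores
            traj_score = step_scores.mean().item()  # mean aggregation (one option)
            if traj_score > best_score:
                best_text, best_score = text, traj_score
    return best_text, best_score


# Hypothetical usage with random hidden states (hidden_dim = 1024 assumed):
verifier = StepVerifier(hidden_dim=1024)
candidates = [("solution A", torch.randn(5, 1024)),
              ("solution B", torch.randn(7, 1024))]
print(best_of_n(candidates, verifier))
```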