

The Art of Scaling Test-Time Compute for Large Language Models

December 1, 2025
作者: Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty
cs.AI

Abstract

Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we distill a practical recipe for selecting the best TTS strategy given problem difficulty, model type, and compute budget, offering a guide to effective inference-time scaling.
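The abstract does not enumerate the TTS strategies studied; purely as an illustration of what "allocating compute at inference time" can mean, below is a minimal sketch of one widely used strategy, best-of-N sampling with majority voting (self-consistency). The `generate` callable is a hypothetical stand-in for an LLM sampling call, not part of the paper.

```python
from collections import Counter
from typing import Callable, List

def best_of_n(generate: Callable[[str], str], prompt: str, n: int) -> str:
    """Test-time scaling via self-consistency: sample n candidate
    answers for the same prompt and return the most frequent one.
    Increasing n spends more inference compute for (typically)
    higher answer accuracy."""
    answers: List[str] = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a stochastic LLM call: cycles through canned samples.
_canned = iter(["42", "41", "42", "42", "40"])
answer = best_of_n(lambda p: next(_canned), "What is 6*7?", n=5)
print(answer)  # majority vote over the 5 samples
```

Other strategies compared in work of this kind (e.g., sequential revision or search over reasoning traces) trade off compute differently, which is why, per the paper's first finding, no single strategy dominates.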