Efficient Test-Time Scaling via Self-Calibration

Abstract

Increasing test-time computation is a straightforward way to improve the response quality of Large Language Models (LLMs). While Best-of-N sampling and Self-Consistency with majority voting are simple and effective, they require a fixed number of sampled responses for each query, regardless of its complexity. This can waste computation on simpler questions and leave more challenging ones insufficiently explored. In this work, we argue that the model's confidence in its responses can be used to improve the efficiency of test-time scaling. Unfortunately, LLMs are known to be overconfident and to provide unreliable confidence estimates. To address this limitation, we introduce Self-Calibration, which distills Self-Consistency-derived confidence into the model itself. This enables reliable confidence estimation at test time with a single forward pass. We then design confidence-based efficient test-time scaling methods to handle queries of varying difficulty, such as Early Stopping for Best-of-N and Self-Consistency with calibrated confidence. Experiments on three LLMs across six datasets demonstrate the effectiveness of our approach. Specifically, applying confidence-based Early Stopping to Best-of-N improves MathQA accuracy from 81.0 to 83.6 with a sample budget of 16 responses, indicating the efficacy of confidence-based sampling strategies at inference time.
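The two confidence-based strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `sample` callable, the `(answer, confidence)` pair it returns, and the default `threshold` of 0.9 are hypothetical choices made for illustration. Only the sample budget of 16 comes from the abstract.

```python
from collections import defaultdict
from typing import Callable, Iterable, Tuple

# Hypothetical sampler: draws one response and returns (answer, confidence),
# where confidence is assumed to be a calibrated value in [0, 1].
Sampler = Callable[[], Tuple[str, float]]


def early_stopping_best_of_n(
    sample: Sampler, budget: int = 16, threshold: float = 0.9
) -> Tuple[str, int]:
    """Sample up to `budget` responses, stopping once one is confident enough.

    Returns the highest-confidence answer seen and the number of samples used.
    The threshold value is an assumption, not taken from the paper.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    best_answer, best_conf = "", float("-inf")
    used = 0
    for _ in range(budget):
        answer, conf = sample()
        used += 1
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:
            break  # easy query: stop early and save the remaining budget
    return best_answer, used


def confidence_weighted_vote(responses: Iterable[Tuple[str, float]]) -> str:
    """Self-Consistency variant: each answer's vote is weighted by its confidence."""
    scores = defaultdict(float)
    for answer, conf in responses:
        scores[answer] += conf
    return max(scores, key=scores.get)
```

With a sampler like this, an easy query may finish after one or two samples, while a hard query uses the full budget and falls back to the most confident response seen.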
