

Inverse Scaling in Test-Time Compute

July 19, 2025
Authors: Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez
cs.AI

Abstract

We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.
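Although no code accompanies the abstract, the underlying measurement it describes is straightforward: fix a set of evaluation tasks, vary the model's reasoning budget, and record accuracy at each budget; inverse scaling appears when accuracy drops as the budget grows. The following is a minimal, hypothetical Python sketch of such an evaluation loop, not the paper's actual harness: `query_model`, the budget values, and the toy task are illustrative assumptions.

```python
# Hypothetical sketch (not from the paper): measure accuracy as a function of
# test-time reasoning budget to look for inverse scaling.
from typing import Callable


def query_model(prompt: str, reasoning_budget: int) -> str:
    """Stand-in for an LRM API call that caps the model's reasoning length.
    Replace with a real call that enforces `reasoning_budget` tokens."""
    return "placeholder answer"


def accuracy_at_budget(
    tasks: list[tuple[str, str]],                 # (prompt, gold answer) pairs
    budget: int,
    ask: Callable[[str, int], str] = query_model,
) -> float:
    """Fraction of tasks answered correctly at a given reasoning budget."""
    correct = sum(ask(prompt, budget).strip() == gold for prompt, gold in tasks)
    return correct / len(tasks)


if __name__ == "__main__":
    # Toy counting-with-distractors style item, purely illustrative.
    tasks = [
        ("You have an apple and an orange, but your friend mentions 50% of "
         "bananas are yellow. How many fruits do you have?", "2"),
    ]
    # Sweep reasoning budgets; falling accuracy with rising budget = inverse scaling.
    for budget in (256, 1024, 4096, 16384):
        print(f"budget={budget:>6}  accuracy={accuracy_at_budget(tasks, budget):.2f}")
```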