Learning to Discover at Test Time
January 22, 2026
Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
cs.AI
Abstract
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions were reviewed by experts or the competition organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.
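The abstract's core idea, searching while adapting at test time and optimizing for the single best solution to one fixed problem, can be illustrated with a minimal sketch. This is not the paper's algorithm (which trains an LLM with reinforcement learning through the Tinker API); the `propose` and `reward` functions and the scalar `anchor` state are placeholders I introduce purely for illustration, standing in for the model being adapted and the problem's continuous reward.

```python
import random

def ttt_discover_sketch(propose, reward, steps=50, pool_size=4, seed=0):
    """Illustrative test-time search loop: keep the single best solution
    found so far, and bias future proposals toward the most promising
    candidate instead of averaging over all of them."""
    rng = random.Random(seed)
    best_x, best_r = None, float("-inf")
    anchor = 0.0  # stand-in for the state being adapted at test time
    for _ in range(steps):
        # sample a small pool of candidate solutions around the current anchor
        pool = [propose(anchor, rng) for _ in range(pool_size)]
        scored = [(reward(x), x) for x in pool]
        r, x = max(scored)  # prioritize the most promising candidate
        if r > best_r:
            best_r, best_x = r, x
        anchor = x  # adapt: continue from the best recent experience
    return best_x, best_r

if __name__ == "__main__":
    # toy continuous-reward problem: maximize -(x - 3)^2
    propose = lambda a, rng: a + rng.gauss(0, 1)
    reward = lambda x: -(x - 3.0) ** 2
    x, r = ttt_discover_sketch(propose, reward)
    print(x, r)
```

The design choice mirrored here is the abstract's distinction from prior test-time scaling: rather than sampling many independent attempts from a frozen model, the search state itself is updated from the experience gathered on this one problem.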