Learning to Discover at Test Time

Abstract

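How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We instead perform reinforcement learning at test time, so the LLM continues to train, now on experience specific to the test problem. This form of continual learning is unusual in two ways: its goal is to produce one great solution rather than many good ones on average, and to solve this particular problem rather than generalize to others. Our learning objective and search subroutine are therefore designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets a new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions were reviewed by experts or the organizers. All our results were achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results, which required closed frontier models. Our test-time training runs were performed using Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.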
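The difference between aiming for one great solution and aiming for a good average can be illustrated with a toy sketch. This is not the TTT-Discover algorithm. It swaps the LLM for a hypothetical one-dimensional Gaussian sampler, uses a made-up continuous reward, and updates the sampler with a simple softmax-weighted mean. The names `reward`, `discover`, and `temperature` are assumptions introduced only for illustration.

```python
import math
import random


def reward(x: float) -> float:
    """Hypothetical continuous reward with its peak at x = 3 (illustrative only)."""
    return -(x - 3.0) ** 2


def discover(steps: int = 50, batch: int = 16, temperature: float = 0.5,
             seed: int = 0) -> tuple[float, float, list[float]]:
    """Toy test-time loop: sample candidates, keep the best one, bias sampling toward top rewards."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # toy stand-in for a policy: a Gaussian over candidates
    best_x, best_r = mu, reward(mu)
    history = []  # best reward seen so far, recorded after every step

    for _ in range(steps):
        xs = [rng.gauss(mu, sigma) for _ in range(batch)]
        rs = [reward(x) for x in xs]

        # The quantity of interest is the single best solution, not the batch average.
        for x, r in zip(xs, rs):
            if r > best_r:
                best_x, best_r = x, r

        # Softmax weights concentrate the update on the most promising samples.
        # Subtracting the batch maximum keeps exp() numerically stable.
        top = max(rs)
        ws = [math.exp((r - top) / temperature) for r in rs]
        mu = sum(w * x for w, x in zip(ws, xs)) / sum(ws)

        history.append(best_r)

    return best_x, best_r, history


if __name__ == "__main__":
    best_x, best_r, history = discover()
    print(f"best candidate: {best_x:.3f}, best reward: {best_r:.5f}")
```

Lowering `temperature` pushes the update toward the single top sample in each batch, while raising it moves the update closer to a plain batch average. That knob is the toy analogue of prioritizing the most promising solutions.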

