Learning to Discover at Test Time

Abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We instead perform reinforcement learning at test time, so the LLM continues to train, but now on experience specific to the test problem. This form of continual learning is quite special: its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than to generalize to other problems. Our learning objective and search subroutine are therefore designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets a new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions have been reviewed by experts or by the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results, which required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.
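To make the contrast between optimizing for the average and optimizing for a single best solution concrete, the following toy sketch may help. It is not the TTT-Discover objective, which the abstract does not specify. It assumes a hypothetical one-dimensional `reward` function, a Gaussian "policy" standing in for the LLM, and a reward-weighted update with softmax weights chosen purely for illustration. The policy is updated on samples from this one problem only, the update concentrates on the highest-reward samples, and the quantity returned is the best solution found rather than the average quality.

```python
import math
import random


def reward(x: float) -> float:
    """Hypothetical continuous reward with a single peak at x = 3."""
    return -(x - 3.0) ** 2


def test_time_search(steps=200, batch=16, lr=0.05, temperature=1.0, seed=0):
    """Toy stand-in for training a policy on one test problem.

    All names and hyperparameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # Gaussian "policy" over candidate solutions
    best_x, best_r = None, -math.inf

    for _ in range(steps):
        # Sample candidate solutions for this single problem and score them.
        xs = [rng.gauss(mu, sigma) for _ in range(batch)]
        rs = [reward(x) for x in xs]

        # Keep the single best solution seen so far: this is the deliverable.
        for x, r in zip(xs, rs):
            if r > best_r:
                best_x, best_r = x, r

        # Softmax weights put most of the mass on the highest-reward samples,
        # instead of weighting every sample equally.
        top = max(rs)
        ws = [math.exp((r - top) / temperature) for r in rs]
        total = sum(ws)
        ws = [w / total for w in ws]

        # Weighted score-function step on the mean:
        # d/dmu log N(x; mu, sigma) = (x - mu) / sigma**2.
        grad = sum(w * (x - mu) / sigma ** 2 for w, x in zip(ws, xs))
        mu += lr * grad

    return best_x, best_r, mu


if __name__ == "__main__":
    best_x, best_r, mu = test_time_search()
    print(f"best solution: {best_x:.4f}, reward: {best_r:.6f}, policy mean: {mu:.3f}")
```

Lowering `temperature` in this sketch sharpens the weights toward the single best sample in each batch, while a very large value approaches a uniform average over samples.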
