Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation

Abstract

Test-time training (TTT) adapts language models through gradient-based updates at inference time. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks: domains such as GPU kernel optimization, where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps). Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set, while TTT's best checkpoint reaches only 30.6% (3-seed mean). TTT's "equivalent K" falls below 1, meaning it performs worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: selecting the highest-surprisal (lowest-confidence) correct sample yields 80% success versus 50% for most-confident selection, an improvement of 30 percentage points. Extending to surprisal-guided-top3 reaches 100%, matching oracle performance. This zero-cost strategy is validated through length-controlled analysis. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.
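The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each sample carries per-token log-probabilities from the sampler and a correctness verdict from the evaluator, and that surprisal is the summed negative log-probability of the sample. The names `Sample`, `surprisal`, and `surprisal_guided_top_k` are hypothetical, as is the optional length normalization, which is only one possible way to control for length.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One generated candidate. All field names are illustrative assumptions."""
    code: str
    token_logprobs: Sequence[float]  # per-token log-probabilities from the sampler
    is_correct: bool                 # verdict of the deterministic evaluator


def surprisal(sample: Sample, length_normalize: bool = False) -> float:
    """Negative log-likelihood of a sample; higher means lower model confidence."""
    total = -sum(sample.token_logprobs)
    if length_normalize and len(sample.token_logprobs) > 0:
        # Assumed length control: mean surprisal per token.
        return total / len(sample.token_logprobs)
    return total


def surprisal_guided_top_k(
    samples: List[Sample], k: int = 1, length_normalize: bool = False
) -> List[Sample]:
    """Return up to k correct samples, ordered from highest to lowest surprisal."""
    correct = [s for s in samples if s.is_correct]
    ranked = sorted(
        correct, key=lambda s: surprisal(s, length_normalize), reverse=True
    )
    return ranked[:k]


# Toy data: made-up log-probabilities, not results from the paper.
samples = [
    Sample("kernel_a", [-0.1, -0.2], True),
    Sample("kernel_b", [-1.5, -2.0], True),
    Sample("kernel_c", [-3.0, -3.0], False),  # most surprising, but incorrect
    Sample("kernel_d", [-0.5, -0.5], True),
]

best = surprisal_guided_top_k(samples, k=1)
top3 = surprisal_guided_top_k(samples, k=3)
```

Under these assumptions, selection only reuses log-probabilities already produced during sampling, which is consistent with the zero-cost description above. Incorrect samples are filtered out before ranking, so a high-surprisal failure such as `kernel_c` is never selected.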
