Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation

Abstract

Test-time training (TTT) adapts language models through gradient-based updates at inference time. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks: domains such as GPU kernel optimization, where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps). Across the full KernelBench L1 eval set, Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64, while TTT's best checkpoint reaches only 30.6% (3-seed mean); TTT's "equivalent K" falls below 1, meaning it performs worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: selecting the highest-surprisal (lowest-confidence) correct sample yields 80% success versus 50% for most-confident selection, a 30-percentage-point improvement. Extending to surprisal-guided-top3 reaches 100%, matching oracle performance. This zero-cost strategy is validated through length-controlled analysis. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.
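
The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the `Sample` fields and all function names are hypothetical, surprisal is taken to be the negative summed token log-probability (the abstract does not say whether it is length-normalized), and the top-k variant is assumed to pick the candidate with the best evaluator reward among the k highest-surprisal correct samples.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Sample:
    """One generated candidate. All field names are hypothetical."""
    code: str
    token_logprobs: Sequence[float]  # per-token log-probabilities from the sampler
    is_correct: bool                 # verdict from the deterministic evaluator
    speedup: float = 0.0             # evaluator reward; read only when top_k > 1


def surprisal(sample: Sample) -> float:
    """Sequence surprisal, assumed here to be the negative summed log-probability."""
    return -sum(sample.token_logprobs)


def select_by_surprisal(samples: List[Sample], top_k: int = 1) -> Optional[Sample]:
    """Rank correct samples by surprisal (highest first).

    With top_k == 1, return the single highest-surprisal correct sample.
    With top_k > 1, return the best-reward sample among the top_k
    (an assumed reading of "surprisal-guided-top3").
    """
    correct = [s for s in samples if s.is_correct]
    if not correct:
        return None
    ranked = sorted(correct, key=surprisal, reverse=True)
    if top_k <= 1:
        return ranked[0]
    return max(ranked[:top_k], key=lambda s: s.speedup)


def select_most_confident(samples: List[Sample]) -> Optional[Sample]:
    """Baseline: the lowest-surprisal (most confident) correct sample."""
    correct = [s for s in samples if s.is_correct]
    return min(correct, key=surprisal) if correct else None


# Toy candidates with made-up numbers, for illustration only.
candidates = [
    Sample("kernel_a", [-0.1, -0.2], is_correct=True, speedup=1.0),
    Sample("kernel_b", [-2.0, -1.5], is_correct=True, speedup=1.5),
    Sample("kernel_c", [-3.0, -3.0], is_correct=False),
    Sample("kernel_d", [-1.0, -1.0], is_correct=True, speedup=2.0),
]
picked = select_by_surprisal(candidates)
picked_top3 = select_by_surprisal(candidates, top_k=3)
baseline = select_most_confident(candidates)
```

In this toy setup the incorrect candidate is filtered out before ranking, so a high-surprisal but failing sample is never selected. Under the assumptions above, the `top_k=1` rule needs no signal beyond the correctness verdicts and the log-probabilities already produced during sampling.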
