Sleep-time Compute: Beyond Inference Scaling at Test-time

Abstract

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but it comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute required at test time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks: Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by roughly 5x on Stateful GSM-Symbolic and Stateful AIME, and that scaling sleep-time compute can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, and find that the predictability of the user query is well correlated with the efficacy of sleep-time compute. Finally, we conduct a case study applying sleep-time compute to a realistic agentic SWE task.
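
The amortization argument can be made concrete with a small cost model. What follows is a minimal sketch, not the paper's actual accounting: the function name `avg_cost_per_query`, the price ratio between test-time and sleep-time tokens, and every token count are hypothetical values chosen only for illustration.

```python
def avg_cost_per_query(sleep_tokens, test_tokens_per_query, n_queries,
                       test_to_sleep_price_ratio=10.0):
    """Average cost per query, measured in sleep-time token units.

    Hypothetical cost model (an assumption, not taken from the source):
    sleep-time tokens are spent once per context and shared by all queries
    about that context, while test-time tokens are spent on every query and
    are priced `test_to_sleep_price_ratio` times higher because they sit on
    the latency-critical path.
    """
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    # One-off offline cost, spread over all queries on the same context.
    amortized_sleep = sleep_tokens / n_queries
    # Per-query online cost, converted to sleep-token units.
    test_cost = test_to_sleep_price_ratio * test_tokens_per_query
    return amortized_sleep + test_cost


# Illustrative numbers only (not measurements from the paper).
baseline = avg_cost_per_query(sleep_tokens=0, test_tokens_per_query=1000, n_queries=10)
single_query = avg_cost_per_query(sleep_tokens=4000, test_tokens_per_query=200, n_queries=1)
ten_queries = avg_cost_per_query(sleep_tokens=4000, test_tokens_per_query=200, n_queries=10)
print(baseline, single_query, ten_queries)
```

Under these made-up numbers, the offline cost is paid once per context, so the average cost per query falls as more related queries share that context.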
