Sleep-time Compute: Beyond Inference Scaling at Test-time

Abstract

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but it comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks: Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by roughly 5x on Stateful GSM-Symbolic and Stateful AIME, and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case study of applying sleep-time compute to a realistic agentic SWE task.
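
The amortization argument can be illustrated with a little arithmetic. The sketch below is not the authors' cost model: the function name, the optional weight on test-time tokens, and all token counts are hypothetical and chosen only to show how a single sleep-time pass is spread over several queries about the same context.

```python
def average_cost_per_query(sleep_tokens, test_tokens_per_query, num_queries,
                           test_token_weight=1.0):
    """Average cost per query when one sleep-time pass is shared by all queries.

    All arguments are hypothetical illustration parameters:
      sleep_tokens          -- tokens spent offline on the context, paid once
      test_tokens_per_query -- tokens spent at test time for each query
      num_queries           -- number of related queries about the same context
      test_token_weight     -- relative cost of a test-time token (assumed 1.0)
    """
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    total = sleep_tokens + test_token_weight * test_tokens_per_query * num_queries
    return total / num_queries


# Made-up numbers, not results from the paper.
for n in (1, 2, 5, 10):
    print(n, average_cost_per_query(2000, 200, n))
```

With these made-up numbers, the one-off sleep-time cost dominates when there is a single query and shrinks toward the per-query test-time cost as more queries share the same context.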
