TimeBill: Time-Budgeted Inference for Large Language Models
December 26, 2025
Authors: Qi Fan, An Zou, Yehan Ma
cs.AI
Abstract
Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or degraded response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances inference efficiency and response performance. Specifically, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Building on these, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on the execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.
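To illustrate the core idea, here is a minimal sketch of budget-aware eviction-ratio selection. It is not the paper's implementation: the latency model, its constants, and the function names (`estimate_exec_time`, `pick_eviction_ratio`) are all illustrative assumptions standing in for the paper's RLP/ETE components, which are learned rather than hand-specified.

```python
def estimate_exec_time(prompt_len, pred_resp_len, evict_ratio,
                       t_prefill_per_tok=0.5e-3, t_decode_base=5e-3,
                       t_decode_per_kv=1e-6):
    """Toy stand-in for the ETE: prefill cost plus a per-step decode
    cost that grows with the retained KV cache size. All timing
    constants are illustrative, not measured values."""
    prefill = prompt_len * t_prefill_per_tok
    decode = 0.0
    for step in range(pred_resp_len):
        # KV cache length at this decode step, after eviction.
        kv_len = (prompt_len + step) * (1.0 - evict_ratio)
        decode += t_decode_base + kv_len * t_decode_per_kv
    return prefill + decode

def pick_eviction_ratio(prompt_len, pred_resp_len, time_budget,
                        ratios=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5)):
    """Choose the smallest eviction ratio whose estimated end-to-end
    time fits the budget (less eviction preserves more context and
    thus response quality); fall back to the largest ratio if none
    fits. pred_resp_len would come from the RLP in the full system."""
    for r in ratios:
        if estimate_exec_time(prompt_len, pred_resp_len, r) <= time_budget:
            return r
    return ratios[-1]
```

Under this toy model, a generous budget yields no eviction, while a tight budget pushes the selector toward more aggressive eviction, trading response quality for completion within the deadline.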