

TimeBill: Time-Budgeted Inference for Large Language Models

December 26, 2025
Authors: Qi Fan, An Zou, Yehan Ma
cs.AI

Abstract

Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.
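The core adaptive mechanism described above can be sketched as follows. This is a minimal illustration, not TimeBill's actual implementation: it assumes a simple linear model in which each decode step's latency grows with the fraction of KV cache retained, and it stands in for the paper's RLP (which would supply `pred_len`) and ETE (which the `estimate_exec_time` helper loosely approximates). All function names, parameters, and the candidate-ratio search are hypothetical.

```python
# Hedged sketch of time-budgeted KV-cache eviction (assumptions, not the
# authors' method): given a predicted response length and a per-step cost
# model, pick the smallest eviction ratio whose estimated end-to-end time
# fits the time budget, so response quality is degraded no more than needed.

def estimate_exec_time(prefill_time, pred_len, base_decode_time,
                       per_token_kv_cost, eviction_ratio):
    """Estimate end-to-end time as prefill plus pred_len decode steps.

    Assumes (illustratively) that each decode step costs a fixed base
    amount plus a term proportional to the retained KV-cache fraction.
    """
    retained = 1.0 - eviction_ratio
    per_step = base_decode_time + per_token_kv_cost * retained
    return prefill_time + pred_len * per_step


def pick_eviction_ratio(time_budget, prefill_time, pred_len,
                        base_decode_time, per_token_kv_cost,
                        candidates=(0.0, 0.1, 0.2, 0.3, 0.5, 0.7)):
    """Return the smallest candidate eviction ratio meeting the budget.

    Returns None when even the most aggressive candidate cannot fit,
    signaling that an overrun strategy (e.g. early stopping) must apply.
    """
    for ratio in sorted(candidates):
        est = estimate_exec_time(prefill_time, pred_len,
                                 base_decode_time, per_token_kv_cost, ratio)
        if est <= time_budget:
            return ratio
    return None
```

Because the smallest feasible ratio is chosen, a generous budget leaves the cache untouched (ratio 0.0), while a tight one triggers progressively heavier eviction; the `None` return models the case where the framework must fall back to an overrun strategy.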