Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

Abstract

In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. In addition, the long, unfiltered responses returned by external tools inflate the KV-Cache, so each decode step spends more time loading the growing cache and becomes steadily slower as context length increases. However, existing efficiency metrics such as token counts and tool-call counts fail to capture actual model inference latency. To address this, we introduce PTE (Prefill Token Equivalents), a hardware-aware TIR-efficiency metric that unifies internal reasoning and external tool-use costs while explicitly accounting for non-reusable KV-Cache and long-tool-response scenarios. Validation in a high-concurrency industrial setting indicates that PTE aligns significantly better with wall-clock latency than standard token counts, while maintaining consistent efficiency rankings across diverse hardware profiles. We conduct extensive experiments across five TIR benchmarks, quantify their PTE costs, and identify four inefficiency patterns that appear in TIR. We also find that trajectories with higher PTE costs tend to have lower reasoning correctness, indicating that simply using more tools does not improve answer quality.
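
To make the two cost sources concrete, the following is a minimal toy sketch in Python. It is not the PTE formula from the paper, which the abstract does not spell out. The names `Turn` and `toy_cost` and the parameter `decode_weight` are hypothetical, and the model assumes that (a) each decoded token costs an amount proportional to the current context length and (b) an evicted KV-Cache forces the whole context to be prefilled again after every tool call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    reasoning_tokens: int      # tokens the model decodes before calling a tool
    tool_response_tokens: int  # tokens the tool returns into the context


def toy_cost(prompt_tokens: int, turns: List[Turn], cache_evicted: bool,
             decode_weight: float = 1e-3) -> float:
    """Toy cost in prefill-token units (an illustrative assumption, not PTE)."""
    context = prompt_tokens
    cost = float(prompt_tokens)  # initial prefill of the prompt
    for turn in turns:
        # Each decode step reads the whole cache, so its cost grows with context.
        for _ in range(turn.reasoning_tokens):
            cost += decode_weight * context
            context += 1
        # The tool response always has to be prefilled. If the cache was
        # evicted during the tool call, the existing context is prefilled again.
        if cache_evicted:
            cost += context
        cost += turn.tool_response_tokens
        context += turn.tool_response_tokens
    return cost


if __name__ == "__main__":
    short = [Turn(100, 50), Turn(100, 50)]
    long_ = [Turn(100, 5000), Turn(100, 5000)]
    print("short responses, cache kept:   ", toy_cost(1000, short, False))
    print("short responses, cache evicted:", toy_cost(1000, short, True))
    print("long responses, cache kept:    ", toy_cost(1000, long_, False))
```

In this sketch the decoded token count and the number of tool calls are identical across the three runs, yet the modeled cost differs, which is the gap between token counting and latency that the abstract describes.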
