Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

Abstract

In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. In addition, long, unfiltered responses returned by external tools inflate the KV-Cache, so each decode step spends more time loading the growing cache and becomes steadily slower as context length increases. However, existing efficiency metrics such as token counts and tool-call counts fail to capture actual model inference latency. To address this, we introduce PTE (Prefill Token Equivalents), a hardware-aware TIR efficiency metric that unifies internal reasoning and external tool-use costs while explicitly accounting for non-reusable KV-Cache and long tool-response scenarios. Validation in a high-concurrency industrial setting indicates that PTE aligns significantly better with wall-clock latency than standard token counts, while maintaining consistent efficiency rankings across diverse hardware profiles. We conduct extensive experiments across five TIR benchmarks, quantify their PTE costs, and identify four inefficiency patterns that appear in TIR. We also find that trajectories with higher PTE costs tend to have lower reasoning correctness, indicating that simply using more tools does not improve answer quality.
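
The two effects described above (re-prefilling after cache eviction, and decode steps that slow down as the cache grows) can be illustrated with a toy cost model. The sketch below is not the paper's PTE definition, which the abstract does not spell out. The `Turn` structure, the `toy_cost` function, and the `decode_weight` parameter are hypothetical names and values chosen only to show why a raw token count can miss these costs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One LLM request in a tool-integrated trajectory (hypothetical structure)."""
    generated_tokens: int      # tokens the model decodes in this turn
    tool_response_tokens: int  # tokens returned by the tool afterwards (0 if none)


def toy_cost(turns: List[Turn], prompt_tokens: int,
             cache_reused: bool, decode_weight: float = 1e-3) -> float:
    """Toy cost in prefill-token-like units. NOT the paper's PTE formula.

    Illustrative assumptions:
      - prefilling one token costs 1 unit;
      - decoding one token costs decode_weight * current context length,
        mimicking a decode step that reads the whole KV cache;
      - without cache reuse, the full context is prefilled again at the
        start of every turn; with reuse, only the new tokens are.
    """
    context = 0              # tokens currently in the context
    pending = prompt_tokens  # new tokens not yet prefilled
    cost = 0.0
    for turn in turns:
        # Prefill: new tokens only, or everything if the cache was evicted.
        cost += pending if cache_reused else context + pending
        context += pending
        # Decode: each step gets slightly more expensive as context grows.
        for _ in range(turn.generated_tokens):
            cost += decode_weight * context
            context += 1
        pending = turn.tool_response_tokens
    return cost


if __name__ == "__main__":
    trajectory = [Turn(200, 3000), Turn(200, 3000), Turn(200, 0)]
    print("cache reused :", toy_cost(trajectory, 500, cache_reused=True))
    print("cache evicted:", toy_cost(trajectory, 500, cache_reused=False))
```

Both calls process the same trajectory with the same token counts, yet under these assumptions the evicted-cache case costs more because earlier context is prefilled again at every turn. Any real weighting between prefill and decode would depend on the hardware and serving setup.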
