Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators

Abstract

Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g., enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that use prefix trees (tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at https://github.com/youtube/static-constraint-decoding.
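To make the central idea concrete, the following is a minimal NumPy sketch of flattening a prefix tree into CSR-style arrays and computing a batched mask of allowed next tokens with array operations instead of pointer chasing. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_csr_trie`, `gather_edges`, `allowed_token_mask`, `advance`), the padding of each node's edge list to a fixed `max_degree`, and the use of NumPy rather than an accelerator framework are all hypothetical choices made for this example.

```python
import numpy as np

def build_csr_trie(sequences):
    """Flatten a prefix tree over token sequences into CSR-style arrays.

    Returns (indptr, tokens, targets): the outgoing edges of node n are
    stored at positions indptr[n]:indptr[n + 1]; tokens holds the token
    on each edge and targets holds the child node it leads to.
    """
    children = [{}]  # children[node]: token -> child node; node 0 is the root
    for seq in sequences:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                children[node][tok] = len(children)
                children.append({})
            node = children[node][tok]
    indptr = np.zeros(len(children) + 1, dtype=np.int64)
    tokens, targets = [], []
    for node, kids in enumerate(children):
        for tok in sorted(kids):
            tokens.append(tok)
            targets.append(kids[tok])
        indptr[node + 1] = len(tokens)
    return indptr, np.array(tokens, dtype=np.int64), np.array(targets, dtype=np.int64)

def gather_edges(indptr, nodes, max_degree):
    """For each node in the batch, gather max_degree edge slots plus a validity flag."""
    start = indptr[nodes][:, None]
    end = indptr[nodes + 1][:, None]
    offsets = start + np.arange(max_degree)[None, :]
    valid = offsets < end
    # Padding slots point at edge 0; they are ignored via `valid`.
    return np.where(valid, offsets, 0), valid

def allowed_token_mask(indptr, tokens, nodes, vocab_size, max_degree):
    """Boolean (batch, vocab) mask of tokens that keep each prefix inside the trie."""
    offsets, valid = gather_edges(indptr, nodes, max_degree)
    rows = np.broadcast_to(np.arange(nodes.shape[0])[:, None], valid.shape)
    mask = np.zeros((nodes.shape[0], vocab_size), dtype=bool)
    mask[rows[valid], tokens[offsets][valid]] = True
    return mask

def advance(indptr, tokens, targets, nodes, chosen, max_degree):
    """Move each batch element to the child reached by its chosen (allowed) token."""
    offsets, valid = gather_edges(indptr, nodes, max_degree)
    hit = valid & (tokens[offsets] == chosen[:, None])
    edge = offsets[np.arange(nodes.shape[0]), hit.argmax(axis=1)]
    return targets[edge]

# Usage: constrain one decoding step for a batch of two beams.
item_ids = [(1, 2, 3), (1, 2, 4), (1, 5, 0), (2, 0, 1)]  # hypothetical item token sequences
indptr, tokens, targets = build_csr_trie(item_ids)
max_degree = int(np.diff(indptr).max())
nodes = np.zeros(2, dtype=np.int64)  # both beams start at the root
mask = allowed_token_mask(indptr, tokens, nodes, vocab_size=6, max_degree=max_degree)
logits = np.zeros((2, 6))
masked_logits = np.where(mask, logits, -np.inf)  # disallowed tokens can never be selected
```

Because the arrays are built once and every step is a fixed-shape gather and scatter, the per-step work has no data-dependent control flow, which is the property that makes this kind of layout friendly to accelerators. In this sketch `advance` assumes the chosen token is one the mask allowed.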
