

Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators

February 26, 2026
Authors: Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han
cs.AI

Abstract

Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at https://github.com/youtube/static-constraint-decoding.
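The core idea of flattening a prefix tree into a CSR matrix can be illustrated with a small sketch. The code below is a minimal, illustrative reconstruction based only on the abstract, not the paper's released implementation: it builds a pointer-based trie over a toy item catalog, flattens it into CSR arrays (`indptr`, `tokens`, `children` are assumed names), and computes the allowed-token mask for one constrained-decoding step. The per-beam loop stands in for the fully vectorized gather/scatter kernel the paper would run on TPUs/GPUs.

```python
import numpy as np

# Hypothetical vocabulary size and item catalog (token-ID sequences);
# names and values are illustrative, not from the paper.
VOCAB = 10
items = [[1, 4, 7], [1, 4, 8], [1, 5, 7], [2, 6, 9]]

# 1) Build an ordinary pointer-based trie over the item sequences.
trie = {0: {}}  # node_id -> {token: child_node_id}; node 0 is the root
next_id = 1
for seq in items:
    node = 0
    for tok in seq:
        if tok not in trie[node]:
            trie[node][tok] = next_id
            trie[next_id] = {}
            next_id += 1
        node = trie[node][tok]

# 2) Flatten into a CSR-style layout: row i lists node i's outgoing edges.
n_nodes = next_id
indptr = np.zeros(n_nodes + 1, dtype=np.int32)
tokens, children = [], []
for i in range(n_nodes):
    for tok, child in sorted(trie[i].items()):
        tokens.append(tok)
        children.append(child)
    indptr[i + 1] = len(tokens)
tokens = np.array(tokens, dtype=np.int32)
children = np.array(children, dtype=np.int32)

# 3) One constrained-decoding step: for each beam's current trie node,
#    mark which vocabulary entries are valid continuations. On an
#    accelerator this per-beam loop becomes a vectorized gather/scatter
#    over the static CSR arrays.
def allowed_token_mask(node_ids: np.ndarray) -> np.ndarray:
    mask = np.zeros((len(node_ids), VOCAB), dtype=bool)
    for b, n in enumerate(node_ids):
        lo, hi = indptr[n], indptr[n + 1]
        mask[b, tokens[lo:hi]] = True
    return mask

# Two beams, both at the root: only the catalog's first tokens are allowed.
mask = allowed_token_mask(np.array([0, 0]))
```

In a decoding loop, the mask would be applied to the model's logits (e.g. setting disallowed entries to negative infinity) before sampling or beam expansion, and each beam's node ID advanced through `children` after a token is chosen. Because `indptr`, `tokens`, and `children` are static dense arrays, the whole step compiles to regular array ops rather than irregular pointer chasing.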