Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
February 26, 2026
Authors: Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han
cs.AI
Abstract
Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at https://github.com/youtube/static-constraint-decoding.
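To make the core idea concrete, below is a minimal CPU-side sketch of constrained decoding against a trie flattened into a Compressed Sparse Row matrix. It uses NumPy/SciPy purely for illustration and is not the paper's TPU/GPU implementation; the encoding T[node, token] = child_node + 1 and the helper names build_csr_trie, constrained_step, and advance are assumptions made for this example.

```python
# Illustrative sketch (assumed encoding, not the paper's code): flatten a token-level
# prefix tree over the allowed item IDs into a CSR matrix, then run each constrained
# decoding step as vectorized sparse row lookups instead of per-example tree traversal.
import numpy as np
from scipy.sparse import csr_matrix

def build_csr_trie(item_token_ids, vocab_size):
    """Build a CSR matrix T where T[node, token] = child_node + 1 (0 means no edge)."""
    children = {}          # (node, token) -> child node id
    next_node = 1          # node 0 is the root
    for seq in item_token_ids:
        node = 0
        for tok in seq:
            key = (node, tok)
            if key not in children:
                children[key] = next_node
                next_node += 1
            node = children[key]
    rows, cols, vals = zip(*[(n, t, c + 1) for (n, t), c in children.items()])
    return csr_matrix((vals, (rows, cols)), shape=(next_node, vocab_size))

def constrained_step(T, logits, cur_nodes):
    """Mask logits to the tokens reachable from each sequence's current trie node."""
    allowed = T[cur_nodes].toarray() > 0          # [batch, vocab] via one batched row gather
    return np.where(allowed, logits, -np.inf)

def advance(T, cur_nodes, chosen_tokens):
    """Move every sequence in the batch to the child selected by its chosen token."""
    vals = np.asarray(T[cur_nodes, chosen_tokens]).ravel()
    return vals.astype(np.int64) - 1              # -1 would indicate an illegal transition

# Usage: three allowed items, each serialized as a short sequence of semantic-ID tokens.
items = [[5, 2, 7], [5, 2, 9], [3, 1, 4]]
T = build_csr_trie(items, vocab_size=16)
cur = np.zeros(2, dtype=np.int64)                 # a batch of 2 sequences, both at the root
logits = np.random.randn(2, 16).astype(np.float32)
masked = constrained_step(T, logits, cur)
chosen = masked.argmax(axis=-1)                   # greedy pick among the allowed tokens only
cur = advance(T, cur, chosen)
```

With this layout, each decoding step reduces to a batched sparse row gather (for the vocabulary mask) and a batched element lookup (for the node transition), operations that map naturally onto accelerator-friendly gathers and scatters rather than per-example pointer chasing over an irregular tree.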