Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators

Abstract

Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g., enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that rely on prefix trees (tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking large efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step, or 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at https://github.com/youtube/static-constraint-decoding.
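The idea of flattening a prefix tree into CSR arrays and replacing per-beam tree walks with batched gathers can be illustrated with a small sketch. This is not the released STATIC code: the function names (build_csr_trie, allowed_mask, advance), the array layout, and the toy item IDs are all assumptions made for illustration, and NumPy on CPU stands in for the accelerator kernels the abstract refers to.

```python
import numpy as np

def build_csr_trie(sequences):
    """Flatten a prefix tree over token-ID sequences into CSR-style arrays (hypothetical layout)."""
    children = [{}]  # children[node]: token -> child node; node 0 is the root
    for seq in sequences:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                children[node][tok] = len(children)
                children.append({})
            node = children[node][tok]
    row_ptr = np.zeros(len(children) + 1, dtype=np.int64)
    col_tok, nxt = [], []
    for node, kids in enumerate(children):
        for tok in sorted(kids):
            col_tok.append(tok)     # token labelling this outgoing edge
            nxt.append(kids[tok])   # node the edge leads to
        row_ptr[node + 1] = len(col_tok)
    return row_ptr, np.asarray(col_tok, dtype=np.int64), np.asarray(nxt, dtype=np.int64)

def _gather_edges(row_ptr, col_tok, nodes):
    """Gather each node's outgoing edges into a padded [batch, max_degree] block."""
    starts, ends = row_ptr[nodes], row_ptr[nodes + 1]
    max_deg = int(np.diff(row_ptr).max())  # fixed once the trie is built
    offsets = starts[:, None] + np.arange(max_deg)[None, :]
    valid = offsets < ends[:, None]        # False marks padding slots
    offsets = np.minimum(offsets, len(col_tok) - 1)  # keep padded reads in bounds
    return offsets, col_tok[offsets], valid

def allowed_mask(row_ptr, col_tok, nodes, vocab_size):
    """Boolean [batch, vocab] mask of tokens that extend each beam's prefix."""
    _, toks, valid = _gather_edges(row_ptr, col_tok, nodes)
    mask = np.zeros((len(nodes), vocab_size), dtype=bool)
    rows = np.broadcast_to(np.arange(len(nodes))[:, None], toks.shape)
    mask[rows[valid], toks[valid]] = True
    return mask

def advance(row_ptr, col_tok, nxt, nodes, tokens):
    """Move every beam along its chosen token; -1 flags a token with no matching edge."""
    offsets, toks, valid = _gather_edges(row_ptr, col_tok, nodes)
    hit = valid & (toks == tokens[:, None])
    edge = offsets[np.arange(len(nodes)), hit.argmax(axis=1)]
    return np.where(hit.any(axis=1), nxt[edge], -1)

# Toy catalogue of four items, each a 3-token ID over a vocabulary of 8 (made-up data).
items = [(3, 1, 4), (3, 1, 5), (3, 2, 6), (7, 0, 2)]
row_ptr, col_tok, nxt = build_csr_trie(items)

beams = np.array([0, 1, 7])  # trie nodes the three beams currently sit on
mask = allowed_mask(row_ptr, col_tok, beams, vocab_size=8)
logits = np.zeros((3, 8))
masked_logits = np.where(mask, logits, -np.inf)  # disallowed tokens can never be sampled
```

Every step here is a gather or comparison over fixed-shape arrays, with no per-beam pointer chasing, which is the property that makes such a layout friendly to accelerators. The sketch pads to the maximum node degree for simplicity and does not handle beams already flagged with -1; a production system would need to treat both more carefully.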
