

APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding

February 8, 2025
Authors: Xinyu Yang, Tianqi Chen, Beidi Chen
cs.AI

Abstract

Context-augmented generation (CAG) techniques, including RAG and ICL, require the efficient combination of multiple contexts to generate responses to user queries. Directly inputting these contexts as a sequence introduces a considerable computational burden, since the combined selection of contexts must be re-encoded for every request. To address this, we explore the promising potential of parallel encoding, which independently pre-computes and caches each context's KV states. This approach enables cached states to be loaded directly during inference while accommodating more contexts through position reuse across contexts. However, due to misalignment in the attention distribution, directly applying parallel encoding results in a significant performance drop. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding (APE), which introduces a shared prefix, an attention temperature, and a scaling factor to align the distribution of parallel encoding with that of sequential encoding. Results on RAG and ICL tasks demonstrate that APE preserves 98% and 93% of sequential encoding performance on the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE achieves an end-to-end 4.5× speedup by reducing prefilling time by 28× for a 128K-length context.
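
To make the alignment idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of single-query attention over independently cached context KV states: tokens of a shared prefix are attended to normally, while context tokens receive an extra attention temperature and a scaling factor before one merged softmax. The function name, tensor layout, and default values of `temperature` and `scale` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ape_style_attention(query, prefix_k, prefix_v, ctx_ks, ctx_vs,
                        temperature=0.9, scale=0.9):
    """Hypothetical sketch: attention over a shared prefix plus
    independently pre-encoded contexts (parallel encoding).

    query:             (1, d) query vector for the current decoding step
    prefix_k/prefix_v: (P, d) cached KV states of the shared prefix
    ctx_ks/ctx_vs:     lists of (T_i, d) KV states, one per cached context
    temperature:       sharpens scores on context tokens (illustrative value)
    scale:             re-weights total context attention mass vs. the prefix
    """
    d = query.shape[-1]
    # Prefix scores use only the standard sqrt(d) scaling.
    prefix_scores = query @ prefix_k.T / d**0.5
    # Context scores: divide by an extra temperature and add log(scale),
    # which is equivalent to multiplying their softmax weights by `scale`.
    ctx_scores = torch.cat(
        [query @ k.T / (d**0.5 * temperature) for k in ctx_ks], dim=-1
    ) + torch.log(torch.tensor(scale))
    # One softmax over prefix + all contexts approximates the distribution
    # that sequential encoding would produce over the concatenated input.
    probs = F.softmax(torch.cat([prefix_scores, ctx_scores], dim=-1), dim=-1)
    values = torch.cat([prefix_v] + list(ctx_vs), dim=0)
    return probs @ values

# Toy usage: three contexts of different lengths share one cached prefix.
if __name__ == "__main__":
    d = 64
    q = torch.randn(1, d)
    prefix_k, prefix_v = torch.randn(8, d), torch.randn(8, d)
    ctx_ks = [torch.randn(t, d) for t in (16, 32, 24)]
    ctx_vs = [torch.randn(t, d) for t in (16, 32, 24)]
    out = ape_style_attention(q, prefix_k, prefix_v, ctx_ks, ctx_vs)
    print(out.shape)  # torch.Size([1, 64])
```

Because each context is encoded and cached without seeing the others, its KV states can be loaded as-is at inference time; the temperature and scaling terms only adjust how the single merged softmax distributes its mass, which is where the alignment with sequential encoding happens in this sketch.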

