Inference Scaling for Long-Context Retrieval Augmented Generation
October 6, 2024
Authors: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky
cs.AI
Abstract
The scaling of inference computation has unlocked the potential of
long-context large language models (LLMs) across diverse settings. For
knowledge-intensive tasks, the increased compute is often allocated to
incorporate more external knowledge. However, without effectively utilizing
such knowledge, solely expanding context does not always enhance performance.
In this work, we investigate inference scaling for retrieval augmented
generation (RAG), exploring strategies beyond simply increasing the quantity of
knowledge. We focus on two inference scaling strategies: in-context learning
and iterative prompting. These strategies provide additional flexibility to
scale test-time computation (e.g., by increasing retrieved documents or
generation steps), thereby enhancing LLMs' ability to effectively acquire and
utilize contextual information. We address two key questions: (1) How does RAG
performance benefit from the scaling of inference computation when optimally
configured? (2) Can we predict the optimal test-time compute allocation for a
given budget by modeling the relationship between RAG performance and inference
parameters? Our observations reveal that increasing inference computation leads
to nearly linear gains in RAG performance when optimally allocated, a
relationship we describe as the inference scaling laws for RAG. Building on
this, we further develop the computation allocation model to estimate RAG
performance across different inference configurations. The model predicts
optimal inference parameters under various computation constraints, which align
closely with the experimental results. By applying these optimal
configurations, we demonstrate that scaling inference compute on long-context
LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.
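
Read as a statement about the optimally allocated budget, the "nearly linear" relationship can be sketched with the following illustrative formula; the symbols (P^*, C_max, theta, a, b) and the log-scaled compute axis are assumptions made here for exposition, not the paper's exact notation:

P^*(C_{\max}) \;=\; \max_{\theta :\, \mathrm{cost}(\theta) \le C_{\max}} P(\theta) \;\approx\; a \log C_{\max} + b

Here \theta collects the inference parameters mentioned in the abstract (number of retrieved documents, in-context demonstrations, and iterative generation steps), and \mathrm{cost}(\theta) is the test-time compute they consume.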
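As a concrete illustration of question (2), the sketch below enumerates hypothetical inference configurations (retrieved documents, in-context demonstrations, iterative generation steps), filters them by a token budget used as a proxy for test-time compute, and returns the configuration scored highest by a stand-in surrogate model. All names, token costs, and the surrogate formula are assumptions for illustration; this is not the paper's computation allocation model.

import math
from itertools import product

# Hypothetical knobs for scaling test-time compute in RAG (illustrative values):
DOCS = [5, 10, 20, 50, 100]   # retrieved documents per query
SHOTS = [0, 1, 2, 4, 8]       # in-context RAG demonstrations
STEPS = [1, 2, 4, 8]          # iterative prompting / generation steps

# Rough token-cost assumptions used as a proxy for inference compute.
TOKENS_PER_DOC = 150
TOKENS_PER_SHOT = 1200
TOKENS_PER_STEP_OVERHEAD = 200

def effective_tokens(docs, shots, steps):
    """Proxy for test-time compute: context tokens consumed across all steps."""
    per_step = docs * TOKENS_PER_DOC + shots * TOKENS_PER_SHOT + TOKENS_PER_STEP_OVERHEAD
    return per_step * steps

def surrogate_performance(docs, shots, steps):
    """Stand-in performance estimate (NOT the paper's fitted allocation model):
    gains grow with log-compute, with small extra terms per knob."""
    c = effective_tokens(docs, shots, steps)
    return 0.1 * math.log(c) + 0.02 * math.log1p(shots) + 0.03 * math.log1p(steps)

def best_configuration(budget_tokens):
    """Return the highest-scoring configuration that fits the compute budget."""
    feasible = (
        (surrogate_performance(d, s, k), d, s, k)
        for d, s, k in product(DOCS, SHOTS, STEPS)
        if effective_tokens(d, s, k) <= budget_tokens
    )
    return max(feasible, default=None)

if __name__ == "__main__":
    for budget in (8_000, 32_000, 128_000):
        result = best_configuration(budget)
        if result is not None:
            score, d, s, k = result
            print(f"budget={budget:>7} tokens -> docs={d}, shots={s}, steps={k}, score={score:.3f}")

In the setting the abstract describes, the surrogate would be replaced by a computation allocation model that estimates RAG performance across inference configurations, so that its predicted optimum under each budget can be compared against the experimentally best configuration.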