

Inference Scaling for Long-Context Retrieval Augmented Generation

October 6, 2024
Authors: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky
cs.AI

Abstract

The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring strategies beyond simply increasing the quantity of knowledge. We focus on two inference scaling strategies: in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs' ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance benefit from the scaling of inference computation when optimally configured? (2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? Our observations reveal that increasing inference computation leads to nearly linear gains in RAG performance when optimally allocated, a relationship we describe as the inference scaling laws for RAG. Building on this, we further develop the computation allocation model to estimate RAG performance across different inference configurations. The model predicts optimal inference parameters under various computation constraints, which align closely with the experimental results. By applying these optimal configurations, we demonstrate that scaling inference compute on long-context LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.
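
The allocation idea described in the abstract can be illustrated with a small sketch: given a test-time compute budget, enumerate feasible RAG inference configurations (e.g., number of retrieved documents and number of iterative generation steps), score each with a performance model, and select the best affordable one. The cost proxy, the surrogate performance function, and all parameter names below are illustrative assumptions, not the paper's fitted computation allocation model.

```python
# Hypothetical sketch: budget-constrained allocation of test-time compute
# across RAG inference parameters. The cost and performance models here are
# illustrative placeholders, not the paper's computation allocation model.

from itertools import product
import math

def effective_context_tokens(num_docs: int, num_steps: int,
                             tokens_per_doc: int = 512,
                             tokens_per_step: int = 256) -> int:
    """Rough proxy for test-time compute: total context tokens processed."""
    return num_steps * (num_docs * tokens_per_doc + tokens_per_step)

def surrogate_performance(num_docs: int, num_steps: int) -> float:
    """Placeholder surrogate with diminishing returns in both parameters."""
    return math.log1p(num_docs) + 0.5 * math.log1p(num_steps)

def best_allocation(budget_tokens: int,
                    doc_choices=range(1, 65),
                    step_choices=range(1, 9)):
    """Return the (num_docs, num_steps) pair with the highest predicted
    performance whose estimated cost fits within the budget."""
    feasible = [
        (surrogate_performance(d, s), d, s)
        for d, s in product(doc_choices, step_choices)
        if effective_context_tokens(d, s) <= budget_tokens
    ]
    if not feasible:
        return None
    _, d, s = max(feasible)
    return {"num_docs": d, "num_steps": s}

print(best_allocation(budget_tokens=32_000))
```

In the paper's framing, the surrogate would be replaced by a model fitted to observed RAG performance across inference configurations, so the same search predicts near-optimal parameters for any given budget.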

