Inference Scaling for Long-Context Retrieval Augmented Generation

Abstract

The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring strategies beyond simply increasing the quantity of knowledge. We focus on two inference scaling strategies: in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs' ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance benefit from the scaling of inference computation when optimally configured? (2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? Our observations reveal that increasing inference computation leads to nearly linear gains in RAG performance when optimally allocated, a relationship we describe as the inference scaling laws for RAG. Building on this, we further develop the computation allocation model to estimate RAG performance across different inference configurations. The model predicts optimal inference parameters under various computation constraints, and these predictions align closely with the experimental results. By applying these optimal configurations, we demonstrate that scaling inference compute on long-context LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.
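To make the idea of allocating test-time compute under a budget concrete, the following is a minimal sketch and not the paper's computation allocation model. It assumes three inference parameters (retrieved documents, in-context demonstrations, and iterative prompting steps), a rough token-cost formula, and a caller-supplied scoring function. The names `Config`, `context_cost`, `best_config`, the token sizes, and the candidate grids are all hypothetical choices made for illustration.

```python
from itertools import product
from typing import Callable, NamedTuple, Optional, Sequence


class Config(NamedTuple):
    """Hypothetical inference configuration (names are illustrative)."""
    num_docs: int   # retrieved documents per step
    num_shots: int  # in-context demonstrations
    num_steps: int  # iterative prompting steps


def context_cost(cfg: Config, doc_tokens: int = 1000, shot_tokens: int = 2000) -> int:
    """Assumed cost model: each step re-reads all documents and demonstrations."""
    per_step = cfg.num_docs * doc_tokens + cfg.num_shots * shot_tokens
    return cfg.num_steps * per_step


def best_config(
    budget: int,
    score_fn: Callable[[Config], float],
    docs: Sequence[int] = (1, 5, 10),
    shots: Sequence[int] = (0, 1, 2),
    steps: Sequence[int] = (1, 2, 4),
) -> Optional[Config]:
    """Return the highest-scoring configuration whose cost fits the budget.

    score_fn stands in for measured or predicted task performance; it must be
    supplied by the caller. Returns None when no configuration is feasible.
    """
    candidates = (Config(*c) for c in product(docs, shots, steps))
    feasible = [c for c in candidates if context_cost(c) <= budget]
    return max(feasible, key=score_fn, default=None)


if __name__ == "__main__":
    # Placeholder score with no empirical meaning; replace with real measurements.
    def toy_score(cfg: Config) -> float:
        return cfg.num_docs + 2 * cfg.num_shots + 2 * cfg.num_steps

    for budget in (500, 4_000, 10_000):
        print(budget, best_config(budget, toy_score))
```

An exhaustive grid search is used here only because the candidate space is tiny. A fitted predictive model of performance, as the abstract describes, would take the place of `score_fn` so that untested configurations can be ranked without running them.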
