Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations

Abstract

Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In-Context Learning (ICL) with few-shot examples may be an appealing solution to enhance LLM performance in this scenario. However, naively adding ICL examples with long context introduces challenges, including substantial token overhead for each few-shot example and context mismatch between the demonstrations and the target query. In this work, we propose to automatically generate few-shot examples for long-context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations leverage the same context as the target query while adding only a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method to multiple LLMs and obtain substantial improvements (+23% on average across models) on various QA datasets with long context, especially when the answer lies in the middle of the context. Surprisingly, despite introducing only single-hop ICL examples, LLMs also successfully generalize to multi-hop long-context QA using our approach.
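
As a rough illustration of the prompt layout the abstract describes, the sketch below places the long context once, follows it with demonstrations drawn from that same context, and has each demonstration name its supporting paragraph before giving the answer. The `Demo` class, the `build_prompt` function, and the exact template wording are hypothetical choices made for this example and are not taken from the paper. The step that generates the demonstrations with a model is not shown, so the demonstrations here are hand-written placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Demo:
    """One demonstration tied to a paragraph of the shared context (hypothetical structure)."""
    question: str
    paragraph_id: int  # 1-based index of the paragraph that supports the answer
    answer: str


def build_prompt(paragraphs: list[str], demos: list[Demo], query: str) -> str:
    """Assemble a prompt in which the long context appears exactly once.

    The template below is an assumption for illustration, not the authors' template.
    """
    lines = ["Context:"]
    for i, text in enumerate(paragraphs, start=1):
        lines.append(f"[{i}] {text}")
    lines.append("")

    # Demonstrations reuse the context above instead of carrying their own copy,
    # so each one costs only a question, a paragraph reference, and an answer.
    for demo in demos:
        lines.append(f"Question: {demo.question}")
        lines.append(f"Relevant paragraph: [{demo.paragraph_id}]")
        lines.append(f"Answer: {demo.answer}")
        lines.append("")

    # The target query ends with the same cue, so the model is prompted to
    # identify a paragraph before answering.
    lines.append(f"Question: {query}")
    lines.append("Relevant paragraph:")
    return "\n".join(lines)


# Placeholder data, invented purely to show the layout.
paragraphs = [
    "The river Alba rises in the northern hills.",
    "The town of Brin was founded in 1204.",
    "Brin's main export is salt.",
]
demos = [
    Demo("Where does the river Alba rise?", 1, "In the northern hills."),
    Demo("What is Brin's main export?", 3, "Salt."),
]
prompt = build_prompt(paragraphs, demos, "When was Brin founded?")
```

With this layout, adding a demonstration grows the prompt by a few short lines rather than by another copy of the 1-3k token context, which is the token saving the abstract points to.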
