Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction

Abstract

Large language models are widely used in real-world applications, where they are often tasked with reasoning over large volumes of documents. An exciting development in this space is models with extended context capabilities, some of which accommodate over 2 million tokens. How well such long-context models perform in production systems remains uncertain, which motivates the need to benchmark them on real-world use cases. We address this challenge by proposing SWiM, an evaluation framework that addresses the limitations of standard tests. Testing the framework on eight long-context models, we find that even strong models such as GPT-4 and Claude 3 Opus degrade in performance when the relevant information sits in the middle of the context window (the lost-in-the-middle effect). In addition to the benchmark, we propose medoid voting, a simple but effective training-free approach that helps alleviate this effect: the model generates a response several times, each time with the documents in the context randomly permuted, and the medoid answer is selected. We evaluate medoid voting on single-document QA tasks, achieving up to a 24% lift in accuracy.
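The medoid voting procedure described above can be illustrated with a minimal sketch. The abstract does not say how distances between answers are measured, so the token-level Jaccard distance used here is an assumption, as are the names `medoid_vote`, `medoid`, and `jaccard_distance` and the `generate(question, documents)` callable standing in for a model call. This is an illustration of the idea, not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance (an assumed stand-in for the real answer distance)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def medoid(
    answers: Sequence[str],
    distance: Callable[[str, str], float] = jaccard_distance,
) -> str:
    """Return the answer with the smallest total distance to all answers."""
    totals = [sum(distance(a, b) for b in answers) for a in answers]
    return answers[totals.index(min(totals))]


def medoid_vote(
    question: str,
    documents: Sequence[str],
    generate: Callable[[str, List[str]], str],  # hypothetical model call
    n_runs: int = 5,
    seed: int = 0,
    distance: Callable[[str, str], float] = jaccard_distance,
) -> str:
    """Query the model n_runs times with shuffled documents; return the medoid answer."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_runs):
        shuffled = list(documents)  # copy so the caller's order is untouched
        rng.shuffle(shuffled)       # a new random document order for each run
        answers.append(generate(question, shuffled))
    return medoid(answers, distance)
```

Because each run places the relevant document at a different position, an answer that fails only when the document lands mid-context tends to be outvoted by the runs in which it does not. An embedding-based distance could be passed in through the `distance` parameter in place of the Jaccard stand-in.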
