NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Abstract

In evaluating the long-context capabilities of large language models (LLMs), identifying content relevant to a user's query in the original long documents is a crucial prerequisite for any LLM to answer questions based on long text. We present NeedleBench, a framework consisting of a series of progressively more challenging tasks for assessing bilingual long-context capabilities. It spans multiple length intervals (4k, 8k, 32k, 128k, 200k, 1000k, and beyond) and different depth ranges, allowing critical data points to be inserted strategically in different text depth zones to rigorously test the retrieval and reasoning capabilities of models in diverse contexts. We use the NeedleBench framework to assess how well leading open-source models can identify key information relevant to the question and apply that information to reasoning in bilingual long texts. Furthermore, we propose the Ancestral Trace Challenge (ATC) to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks, providing a simple method for evaluating LLMs in complex long-context situations. Our results suggest that current LLMs have significant room for improvement in practical long-context applications, as they struggle with this kind of logical reasoning complexity. All code and resources are available at OpenCompass: https://github.com/open-compass/opencompass.
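
To make the idea of inserting a critical data point at a chosen depth more concrete, here is a minimal sketch. It is not the NeedleBench implementation: the function names `insert_needle` and `build_prompt`, the filler text, and the needle sentence are all hypothetical, and whitespace-separated words stand in for tokens (a real evaluation would presumably measure length with the model's tokenizer).

```python
def insert_needle(haystack_words, needle, depth_percent):
    """Insert `needle` into a list of words at a relative depth (0 = start, 100 = end)."""
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    position = round(len(haystack_words) * depth_percent / 100)
    return haystack_words[:position] + needle.split() + haystack_words[position:]


def build_prompt(filler, needle, context_words, depth_percent):
    """Trim the filler so that filler plus needle totals `context_words` words,
    then place the needle at the requested depth."""
    budget = max(context_words - len(needle.split()), 0)
    words = filler.split()[:budget]
    return " ".join(insert_needle(words, needle, depth_percent))


# Hypothetical filler and needle, used only for illustration.
filler = "lorem ipsum dolor sit amet " * 2000  # 10,000 words of filler
needle = "The secret passphrase is NEEDLE-1234."

# One prompt per (context length, depth) cell, mirroring a length-by-depth grid.
prompts = {
    (length, depth): build_prompt(filler, needle, length, depth)
    for length in (4000, 8000)
    for depth in (0, 25, 50, 75, 100)
}
```

Each entry in `prompts` could then be paired with a question about the needle and sent to a model, so that retrieval accuracy can be compared across context lengths and insertion depths.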
