PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Abstract

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task demands stronger global comprehension and deeper reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching for and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the difficulty of the task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans by more than 15%. A further human study reveals that models often reach correct answers through flawed reasoning, leaving a gap of over 30% in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
