Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Abstract

Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors, namely distraction from heterogeneous biased retrievers and cascading errors in agentic workflows, and thereby test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasoning, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates the more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to stop early. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.

