

Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

October 8, 2025
Authors: Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li
cs.AI

Abstract

Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- in order to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasoning, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to stop early. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
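The haystack construction the abstract describes -- ranking candidate passages with a retriever, keeping the most query-similar ones as distractors, and burying the gold "needle" among them -- can be sketched as a toy example. The function names and the term-overlap scorer below are illustrative assumptions (a stand-in for BM25 or a dense retriever), not HaystackCraft's actual code:

```python
# Minimal sketch of "haystack engineering": assemble a noisy NIAH context
# from retriever-ranked distractors. All names here are hypothetical.
from collections import Counter

def lexical_score(query: str, passage: str) -> float:
    """Toy sparse-retrieval proxy: shared-term count (stand-in for BM25)."""
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return float(sum(min(q[t], p[t]) for t in q))

def build_haystack(question: str, needle: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank corpus passages by similarity to the question, keep the top-k as
    distractors, and bury the gold 'needle' passage mid-context. Stronger
    retrievers pick more query-similar, and thus harder, distractors."""
    distractors = sorted(corpus, key=lambda p: lexical_score(question, p), reverse=True)[:k]
    haystack = distractors[:]
    haystack.insert(len(haystack) // 2, needle)
    return haystack

corpus = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Graph neural networks operate on node and edge features.",
    "Paris is the capital of France and hosts many museums.",
    "Dense retrievers embed queries and passages into vectors.",
]
needle = "The winning design for the tower came from Gustave Eiffel's firm."
stack = build_haystack("Who designed the Eiffel Tower in Paris?", needle, corpus, k=2)
```

In this sketch `stack` holds the two most question-similar distractors with the needle inserted between them; swapping `lexical_score` for a different retriever changes which distractors survive, which is exactly the axis the benchmark varies.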