
Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

October 8, 2025
Authors: Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li
cs.AI

Abstract

Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors (distraction from heterogeneous biased retrievers and cascading errors in agentic workflows) in order to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasoning, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to stop early. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
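
The "haystack engineering" step described in the abstract, where a retriever's ranking determines both which distractors appear and where the needle sits, can be made concrete with a short sketch. The following is our illustration under stated assumptions, not the authors' released code: `tokens`, `score_sparse`, `build_haystack`, and the toy corpus are all hypothetical, and the lexical-overlap score merely stands in for the sparse, dense, hybrid, or graph-based retrievers the benchmark actually varies.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase word tokenizer used by the toy scorer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_sparse(query: str, doc: str) -> float:
    """Toy lexical-overlap score standing in for a sparse retriever such as BM25."""
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    overlap = sum(min(q[t], d[t]) for t in q)
    return overlap / math.sqrt(len(tokens(doc)) + 1)  # mild length normalization


def build_haystack(query: str, needle: str, corpus: list[str],
                   k: int = 5, needle_depth: int = 2) -> list[str]:
    """Keep the k highest-scoring non-needle documents as distractors and place
    the needle at a fixed depth, so swapping the scoring function changes both
    distractor composition and haystack ordering."""
    ranked = sorted(corpus, key=lambda doc: score_sparse(query, doc), reverse=True)
    distractors = [doc for doc in ranked if doc != needle][:k]
    return distractors[:needle_depth] + [needle] + distractors[needle_depth:]


corpus = [
    "The Eiffel Tower is in Paris and opened in 1889.",
    "Paris is the capital of France.",
    "The Statue of Liberty was a gift from France to the United States.",
    "Gustave Eiffel's company designed and built the tower's iron lattice.",
]
needle = corpus[-1]
print(build_haystack("Who designed the Eiffel Tower?", needle, corpus, k=3))
```

Swapping `score_sparse` for a dense-embedding or graph-based scorer is what makes the resulting haystacks, and hence the difficulty of their distractors, heterogeneous.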
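
The dynamic, LLM-dependent setting is, at its core, a retrieve-reflect-stop loop. Below is a minimal sketch of such a loop under our own assumptions (this is not HaystackCraft's evaluation harness): `retrieve` and `call_llm` are hypothetical stand-ins for a retriever and an LLM API, and the accumulated reasoning trace is fed back into each prompt, which is precisely the channel through which self-generated distractors can cascade.

```python
from typing import Callable


def agentic_niah(question: str,
                 retrieve: Callable[[str], list[str]],
                 call_llm: Callable[[str], str],
                 max_rounds: int = 4) -> str:
    """One agentic NIAH episode: refine queries, reflect on past reasoning,
    and decide when to stop."""
    query, trace = question, []
    for round_id in range(max_rounds):
        context = "\n".join(retrieve(query))
        prompt = (
            f"Question: {question}\n"
            f"Context:\n{context}\n"
            "Past reasoning:\n" + "\n".join(trace) + "\n"
            "Reply 'ANSWER: <final answer>' to stop, or "
            "'QUERY: <refined query>' to retrieve more evidence."
        )
        reply = call_llm(prompt)
        # Reflections accumulate across rounds; a wrong one becomes a
        # self-generated distractor in every later prompt.
        trace.append(f"[round {round_id}] {reply}")
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        if reply.startswith("QUERY:"):
            query = reply.removeprefix("QUERY:").strip()
    return "NO ANSWER (model failed to stop)"
```

The two failure modes reported in the abstract map directly onto this loop: a wrong reflection appended to `trace` contaminates later rounds (cascading failure), and a model that keeps emitting `QUERY:` even when the answer is already in context fails to stop early.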