Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Abstract

Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors, namely distraction from heterogeneous biased retrievers and cascading errors in agentic workflows, in order to test models' long-context robustness. We instantiate this idea through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, in which models refine queries, reflect on their past reasoning, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; and (2) in agentic tests, even advanced models such as Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to stop early. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
