SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG
June 16, 2026
Authors: Amirhossein Abaskohi, Issam H. Laradji, Peter West, Giuseppe Carenini
cs.AI
Abstract
Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available at https://github.com/AmirAbaskohi/SproutRAG.
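The abstract does not spell out how attention scores become a tree or how the beam is scored, so the following is only an illustrative sketch of the two ideas it names: a binary chunking tree over sentences and a hierarchical beam search over that tree. Everything in it is an assumption rather than the authors' implementation: the tree is built by greedily merging the adjacent pair of spans with the highest mean affinity, the `affinity` matrix stands in for the learned inter-sentence attention, span embeddings are plain mean pools of sentence embeddings, and relevance is cosine similarity. The names `Node`, `build_tree`, and `beam_search` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence

Vector = Sequence[float]


@dataclass
class Node:
    start: int                      # index of the first sentence in the span
    end: int                        # one past the last sentence in the span
    emb: List[float]                # span embedding (assumed: mean pool)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def span_affinity(a: Node, b: Node, affinity: Sequence[Sequence[float]]) -> float:
    # Mean pairwise affinity between the sentences of two adjacent spans.
    pairs = [affinity[i][j]
             for i in range(a.start, a.end)
             for j in range(b.start, b.end)]
    return sum(pairs) / len(pairs)


def build_tree(sent_embs: Sequence[Vector],
               affinity: Sequence[Sequence[float]]) -> Node:
    # Assumed construction: repeatedly merge the most strongly linked
    # adjacent pair of spans until a single root remains.
    if not sent_embs:
        raise ValueError("need at least one sentence")
    nodes = [Node(i, i + 1, list(e)) for i, e in enumerate(sent_embs)]
    while len(nodes) > 1:
        k = max(range(len(nodes) - 1),
                key=lambda i: span_affinity(nodes[i], nodes[i + 1], affinity))
        a, b = nodes[k], nodes[k + 1]
        emb = mean_pool(sent_embs[a.start:b.end])
        nodes[k:k + 2] = [Node(a.start, b.end, emb, a, b)]
    return nodes[0]


def beam_search(root: Node, query: Vector, beam_width: int = 2) -> List[Node]:
    # Descend level by level, keeping the beam_width best nodes per level.
    # Every kept node is a candidate, so results mix several granularities.
    frontier, candidates = [root], []
    while frontier:
        kept = sorted(frontier, key=lambda n: cosine(n.emb, query),
                      reverse=True)[:beam_width]
        candidates.extend(kept)
        frontier = [c for n in kept for c in (n.left, n.right) if c is not None]
    return sorted(candidates, key=lambda n: cosine(n.emb, query), reverse=True)
```

Given sentence embeddings and an affinity matrix, `build_tree` returns a root whose `start` and `end` cover the whole document, and `beam_search(root, query_embedding)` returns candidate spans from every level the beam visited, ordered by similarity. A real system would replace the hand-written affinity and pooling with the learned attention signals and embeddings the abstract describes.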