SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

Abstract

Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available at https://github.com/AmirAbaskohi/SproutRAG.
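
To make the two mechanisms named above more concrete, the following is a minimal sketch of a binary chunking tree built from pairwise sentence affinities, followed by a level-by-level beam search over that tree. It is an illustration under stated assumptions, not the authors' implementation: the `affinity` matrix stands in for the learned inter-sentence attention, the greedy adjacent-pair merge rule and the length-weighted mean embedding for parent spans are assumptions, and the names `Node`, `build_tree`, and `beam_search` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence
import math


@dataclass
class Node:
    start: int                      # index of the first sentence in the span (inclusive)
    end: int                        # index one past the last sentence (exclusive)
    embedding: List[float]          # span embedding (assumed: length-weighted mean of children)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def build_tree(embeddings: Sequence[Sequence[float]],
               affinity: Sequence[Sequence[float]]) -> Node:
    """Greedy bottom-up merge of adjacent spans (an assumed construction rule).

    affinity[i][j] is a hypothetical stand-in for an inter-sentence attention score.
    """
    nodes = [Node(i, i + 1, list(e)) for i, e in enumerate(embeddings)]

    def score(k: int) -> float:
        # Mean affinity between every sentence pair that straddles the boundary.
        a, b = nodes[k], nodes[k + 1]
        total = sum(affinity[i][j]
                    for i in range(a.start, a.end)
                    for j in range(b.start, b.end))
        return total / ((a.end - a.start) * (b.end - b.start))

    while len(nodes) > 1:
        k = max(range(len(nodes) - 1), key=score)
        a, b = nodes[k], nodes[k + 1]
        wa, wb = a.end - a.start, b.end - b.start
        merged = [(wa * x + wb * y) / (wa + wb)
                  for x, y in zip(a.embedding, b.embedding)]
        nodes[k:k + 2] = [Node(a.start, b.end, merged, a, b)]
    return nodes[0]


def beam_search(root: Node, query: Sequence[float], beam_width: int = 2) -> List[Node]:
    """Descend the tree one level at a time, keeping the best `beam_width` children.

    Returns every visited node, at all granularities, ranked by similarity to the query.
    """
    beam, visited = [root], []
    while beam:
        visited.extend(beam)
        children = [c for n in beam for c in (n.left, n.right) if c is not None]
        children.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
        beam = children[:beam_width]
    visited.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
    return visited
```

Called as `beam_search(build_tree(embeddings, affinity), query)`, the sketch returns spans of different sizes in one ranked list, which is the sense in which retrieval here is multi-granular. The learned selection of attention heads and layers and the joint training objective described in the abstract are not modeled.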