Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

January 8, 2026
Authors: Uday Allu, Sonu Kedia, Tanmay Odapally, Biddwan Ahmed
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully agentic chunking, often suffer from high token consumption, redundant text generation, limited scalability, and poor debuggability, especially for large-scale web content ingestion. In this paper, we propose Web Retrieval-Aware Chunking (W-RAC), a novel, cost-efficient chunking framework designed specifically for web-based documents. W-RAC decouples text extraction from semantic chunk planning by representing parsed web content as structured, ID-addressable units and leveraging large language models (LLMs) only for retrieval-aware grouping decisions rather than text generation. This significantly reduces token usage, eliminates hallucination risks, and improves system observability. Experimental analysis and architectural comparison demonstrate that W-RAC achieves comparable or better retrieval performance than traditional chunking approaches while reducing chunking-related LLM costs by an order of magnitude.
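
The decoupling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Unit` type, the function names, the prompt wording, and the preview truncation are all hypothetical, and the planner's answer (normally an LLM response) is hard-coded. The sketch only shows the general idea that the model returns groups of unit IDs, while chunk text is assembled verbatim from the parsed units, so no chunk text is ever generated by the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One parsed piece of a web page (hypothetical structure)."""
    uid: str   # stable ID the planner refers to
    kind: str  # e.g. "heading" or "paragraph"
    text: str  # extracted text, never rewritten by the model


def build_planning_prompt(units: list[Unit], preview_chars: int = 60) -> str:
    """Render IDs plus short previews for the planner.

    Truncating to a preview is an assumption of this sketch, not something
    the abstract specifies.
    """
    lines = [f"{u.uid} [{u.kind}] {u.text[:preview_chars]}" for u in units]
    return "Group these unit IDs into chunks:\n" + "\n".join(lines)


def validate_plan(units: list[Unit], plan: list[list[str]]) -> None:
    """Reject plans with unknown, duplicated, or missing unit IDs."""
    known = {u.uid for u in units}
    seen: set[str] = set()
    for group in plan:
        if not group:
            raise ValueError("empty group in plan")
        for uid in group:
            if uid not in known:
                raise ValueError(f"unknown unit ID: {uid}")
            if uid in seen:
                raise ValueError(f"unit ID used twice: {uid}")
            seen.add(uid)
    if seen != known:
        raise ValueError(f"units not assigned: {sorted(known - seen)}")


def assemble_chunks(units: list[Unit], plan: list[list[str]]) -> list[str]:
    """Build chunk text by copying the original unit text, group by group."""
    validate_plan(units, plan)
    by_id = {u.uid: u for u in units}
    return ["\n".join(by_id[uid].text for uid in group) for group in plan]


if __name__ == "__main__":
    page = [
        Unit("u1", "heading", "Pricing"),
        Unit("u2", "paragraph", "Plans start at $10."),
        Unit("u3", "heading", "Support"),
        Unit("u4", "paragraph", "Email us any time."),
    ]
    print(build_planning_prompt(page))
    # Stand-in for the planner's response: ID groups only, no text.
    plan = [["u1", "u2"], ["u3", "u4"]]
    for chunk in assemble_chunks(page, plan):
        print("---")
        print(chunk)
```

Because the plan contains only IDs, it can be logged and validated before any chunk is written, which is one plausible reading of the observability benefit the abstract mentions.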