MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
March 12, 2025
Authors: Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
cs.AI
Abstract
Retrieval-Augmented Generation (RAG), while serving as a viable complement to
large language models (LLMs), often overlooks the crucial aspect of text
chunking within its pipeline. This paper initially introduces a dual-metric
evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable
the direct quantification of chunking quality. Leveraging this assessment
method, we highlight the inherent limitations of traditional and semantic
chunking in handling complex contextual nuances, thereby substantiating the
necessity of integrating LLMs into the chunking process. To address the inherent
trade-off between computational efficiency and chunking precision in LLM-based
approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC)
framework, which consists of a three-stage processing mechanism. Notably, our
objective is to guide the chunker towards generating a structured list of
chunking regular expressions, which are subsequently employed to extract chunks
from the original text. Extensive experiments demonstrate that both our
proposed metrics and the MoC framework effectively address the challenges of the
chunking task, revealing the chunking kernel while enhancing the performance of
the RAG system.
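The abstract's core mechanism is that the chunker emits a structured list of regular expressions, which are then used to recover chunks from the original text. A minimal sketch of that "regex list to chunks" idea, assuming each pattern matches the start of one chunk in order (the function name, sample document, and patterns are illustrative, not the authors' implementation):

```python
import re

def extract_chunks(text, start_patterns):
    """Split `text` into chunks whose starting points are located by a
    list of regular expressions, applied in order. Illustrative sketch
    of the regex-based extraction described in the abstract."""
    starts = []
    pos = 0
    for pat in start_patterns:
        m = re.compile(pat).search(text, pos)
        if m is None:            # pattern not found: skip it
            continue
        starts.append(m.start())
        pos = m.end()            # keep matches in document order
    if not starts or starts[0] != 0:
        starts.insert(0, 0)      # ensure the head of the text is covered
    starts.append(len(text))
    # Consecutive start positions delimit the chunks.
    return [text[a:b] for a, b in zip(starts, starts[1:]) if text[a:b]]

doc = "Intro text. Section A: alpha details. Section B: beta details."
chunks = extract_chunks(doc, [r"Section A:", r"Section B:"])
```

Because the patterns only mark boundaries rather than capture content, the concatenation of the extracted chunks reproduces the original text exactly, which is what makes this representation attractive for an LLM-driven chunker: the model outputs short patterns instead of re-generating the full document.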