MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
March 12, 2025
Authors: Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
cs.AI
Abstract
Retrieval-Augmented Generation (RAG), while serving as a viable complement to
large language models (LLMs), often overlooks the crucial aspect of text
chunking within its pipeline. This paper initially introduces a dual-metric
evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable
the direct quantification of chunking quality. Leveraging this assessment
method, we highlight the inherent limitations of traditional and semantic
chunking in handling complex contextual nuances, thereby substantiating the
necessity of integrating LLMs into the chunking process. To address the inherent
trade-off between computational efficiency and chunking precision in LLM-based
approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC)
framework, which consists of a three-stage processing mechanism. Notably, our
objective is to guide the chunker towards generating a structured list of
chunking regular expressions, which are subsequently employed to extract chunks
from the original text. Extensive experiments demonstrate that both our
proposed metrics and the MoC framework effectively address the challenges of the
chunking task, revealing the chunking kernel while enhancing the performance of
the RAG system.
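The abstract's core mechanism is that the chunker emits a structured list of regular expressions, which are then used to recover chunks from the original text. A minimal sketch of that "regex list to chunks" idea, assuming each pattern matches the start of one chunk in order (the function name, sample document, and patterns are illustrative, not the authors' implementation):

```python
import re

def extract_chunks(text, start_patterns):
    """Split `text` into chunks whose starting points are located by a
    list of regular expressions, applied in order. Illustrative sketch
    of the regex-based extraction described in the abstract."""
    starts = []
    pos = 0
    for pat in start_patterns:
        m = re.compile(pat).search(text, pos)
        if m is None:            # pattern not found: skip it
            continue
        starts.append(m.start())
        pos = m.end()            # keep matches in document order
    if not starts or starts[0] != 0:
        starts.insert(0, 0)      # ensure the head of the text is covered
    starts.append(len(text))
    # Consecutive start positions delimit the chunks.
    return [text[a:b] for a, b in zip(starts, starts[1:]) if text[a:b]]

doc = "Intro text. Section A: alpha details. Section B: beta details."
chunks = extract_chunks(doc, [r"Section A:", r"Section B:"])
```

Because the patterns only mark boundaries rather than capture content, the concatenation of the extracted chunks reproduces the original text exactly, which is what makes this representation attractive for an LLM-driven chunker: the model outputs short patterns instead of re-generating the full document.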