Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

October 16, 2024
Authors: Jihao Zhao, Zhiyuan Ji, Pengnian Qi, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
cs.AI

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline, which impacts the quality of knowledge-intensive tasks. This paper introduces the concept of Meta-Chunking, which refers to a granularity between sentences and paragraphs, consisting of a collection of sentences within a paragraph that have deep linguistic logical connections. To implement Meta-Chunking, we designed two strategies based on LLMs: Margin Sampling Chunking and Perplexity Chunking. The former employs LLMs to perform binary classification on whether consecutive sentences need to be segmented, making decisions based on the probability difference obtained from margin sampling. The latter precisely identifies text chunk boundaries by analyzing the characteristics of perplexity distribution. Additionally, considering the inherent complexity of different texts, we propose a strategy that combines Meta-Chunking with dynamic merging to achieve a balance between fine-grained and coarse-grained text chunking. Experiments conducted on eleven datasets demonstrate that Meta-Chunking can more efficiently improve the performance of single-hop and multi-hop question answering based on RAG. For instance, on the 2WikiMultihopQA dataset, it outperforms similarity chunking by 1.32 while only consuming 45.8% of the time. Our code is available at https://github.com/IAAR-Shanghai/Meta-Chunking.
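To make the second strategy concrete, below is a minimal sketch of perplexity-based chunking. It is not the authors' implementation (see the linked repository for that): the choice of `gpt2` as the scoring model, the local-minimum boundary rule, and the helper names `sentence_perplexity` and `perplexity_chunk` are all illustrative assumptions layered on what the abstract states.

```python
# Minimal sketch of perplexity-based chunking, NOT the authors' exact method.
# Assumptions: sentences are pre-split, a causal LM scores each sentence
# conditioned on the text before it, and a chunk boundary is placed after a
# sentence whose perplexity is a local minimum of the sequence (a heuristic
# stand-in for the paper's boundary criterion).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works; the paper uses stronger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_perplexity(context: str, sentence: str) -> float:
    """Perplexity of `sentence` tokens, conditioned on preceding `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids if context else None
    sent_ids = tokenizer(sentence, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, sent_ids], dim=1) if ctx_ids is not None else sent_ids
    labels = input_ids.clone()
    if ctx_ids is not None:
        labels[:, : ctx_ids.shape[1]] = -100  # score only the sentence tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over scored tokens
    return math.exp(loss.item())

def perplexity_chunk(sentences: list[str], max_context_chars: int = 1000) -> list[list[str]]:
    """Group sentences into chunks, splitting after local perplexity minima."""
    ppls, context = [], ""
    for sent in sentences:
        ppls.append(sentence_perplexity(context[-max_context_chars:], sent))
        context += " " + sent
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Split when the previous sentence's perplexity dips below both
        # neighbors, i.e. the model found it highly predictable from context.
        prev_is_min = i >= 2 and ppls[i - 1] < ppls[i - 2] and ppls[i - 1] < ppls[i]
        if prev_is_min:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks
```

A Margin Sampling Chunking sketch would differ mainly in the scoring step: instead of perplexity, the LLM is asked a binary question about whether two consecutive sentences should be segmented, and the decision follows the probability margin between the two answers.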
