MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial role of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Using this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
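
To make the regex-based extraction step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each chunking regular expression anchors on the opening and closing words of one chunk and lazily skips the text in between; the abstract does not specify the actual pattern format. The function name `extract_chunks`, the sample text, and the policy of skipping a pattern that fails to match are all hypothetical choices made for illustration.

```python
import re


def extract_chunks(text: str, patterns: list[str]) -> list[str]:
    """Extract chunks from `text` by applying each regex in order.

    Assumption: the patterns are listed in document order, so each
    search resumes where the previous match ended.
    """
    chunks = []
    pos = 0
    for pattern in patterns:
        # DOTALL lets ".*?" span line breaks inside a chunk.
        match = re.compile(pattern, re.DOTALL).search(text, pos)
        if match is None:
            # Hypothetical fallback: ignore a pattern that does not match.
            continue
        chunks.append(match.group(0))
        pos = match.end()
    return chunks


# Illustrative input: the text and patterns are invented for this sketch.
text = (
    "RAG pairs a retriever with a generator. The retriever finds passages. "
    "Chunking decides what a passage is. Poor chunks hurt retrieval."
)
patterns = [
    r"RAG pairs.*?finds passages\.",
    r"Chunking decides.*?hurt retrieval\.",
]
chunks = extract_chunks(text, patterns)
```

One appeal of this style, under the stated assumption about pattern shape, is that the chunk text is copied from the source by the regex match rather than regenerated, so the extracted chunks cannot drift from the original wording.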
