MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Using this evaluation method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the need to integrate LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker toward generating a structured list of chunking regular expressions, which are subsequently used to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
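To make the regular-expression step concrete, the following is a minimal sketch of how a list of chunking patterns could be applied to the original text. The abstract does not specify the format of the generated expressions, so the helper name `extract_chunks`, the sample text, and the patterns are all hypothetical and serve only as an illustration, not as the framework's actual implementation.

```python
import re


def extract_chunks(text, patterns):
    """Apply each regex in order, searching from where the previous match ended.

    Hypothetical helper: the real MoC pattern format is not given in the abstract.
    """
    chunks = []
    pos = 0
    for pattern in patterns:
        match = re.compile(pattern, re.DOTALL).search(text, pos)
        if match is None:
            continue  # skip patterns that do not match the remaining text
        chunks.append(match.group(0))
        pos = match.end()  # the next chunk must start after this one
    return chunks


# Illustrative input; in the described setting a chunker model would emit the patterns.
text = (
    "Retrieval fetches passages before generation. "
    "Chunking decides what a passage is. "
    "Poor boundaries split related sentences. "
    "Clear boundaries keep related sentences together."
)
patterns = [
    r"Retrieval fetches.*?what a passage is\.",
    r"Poor boundaries.*?sentences together\.",
]
chunks = extract_chunks(text, patterns)
for chunk in chunks:
    print(repr(chunk))
```

Because every chunk is taken from the source with `match.group(0)`, the extracted text is always a verbatim span of the original. The patterns only need to identify where each chunk starts and ends, not reproduce its full content.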

