MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Using this evaluation method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the need to integrate LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently used to extract chunks from the original text. Extensive experiments demonstrate that both the proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the core of chunking while improving the performance of the RAG system.
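The abstract says the chunker emits a structured list of regular expressions that are then applied to the original text to recover the chunks. A minimal sketch of that extraction step follows. The abstract does not specify the pattern format, so the "opening words, lazy wildcard, closing words" shape used here is an assumption, and the function name `extract_chunks`, the sample text, and the patterns are hypothetical illustrations rather than the authors' implementation.

```python
import re
from typing import List


def extract_chunks(text: str, patterns: List[str]) -> List[str]:
    """Apply each chunking regex in order, resuming after the previous match.

    Assumption: each pattern describes one chunk by its opening and closing
    words, with a lazy wildcard standing in for the middle of the chunk.
    """
    chunks = []
    pos = 0
    for pattern in patterns:
        # DOTALL lets the wildcard span line breaks inside a chunk.
        match = re.compile(pattern, re.DOTALL).search(text, pos)
        if match is None:
            # A pattern that does not match is skipped rather than raising,
            # since generated patterns may be imperfect.
            continue
        chunks.append(match.group(0))
        pos = match.end()
    return chunks


# Hypothetical input text and hypothetical chunker output.
text = (
    "RAG pairs a retriever with a generator. "
    "The retriever selects relevant passages. "
    "Chunking decides how documents are split. "
    "Poor boundaries hurt retrieval quality."
)
patterns = [
    r"RAG pairs.*?relevant passages\.",
    r"Chunking decides.*?retrieval quality\.",
]

chunks = extract_chunks(text, patterns)
for chunk in chunks:
    print(chunk)
```

One appeal of this design is that the chunks are copied verbatim from the source text, so the model only has to produce short patterns instead of regenerating every chunk in full.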
