Make Your LLM Fully Utilize the Context

Abstract

While many contemporary large language models (LLMs) can process lengthy inputs, they still struggle to fully utilize the information within a long context, a problem known as the lost-in-the-middle challenge. We hypothesize that this stems from insufficient explicit supervision during long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration of, and reasoning over, information from two or more short segments. By applying this information-intensive training to Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B to utilize long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). GitHub link: https://github.com/microsoft/FILM.
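The data construction described in the abstract (a short key segment of about 128 tokens placed somewhere inside a 4K-32K token context) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `build_long_context`, the filler text, the example question, and the use of whitespace-separated words as a stand-in for tokenizer tokens are all assumptions made for illustration only.

```python
import random


def build_long_context(key_segment, filler_segments, target_len, rng):
    """Hypothetical helper: hide key_segment at a random slot among fillers.

    Lengths are counted in whitespace-separated words, a rough proxy
    for tokenizer tokens (an assumption of this sketch).
    Returns the assembled context and the slot index of the key segment.
    """
    budget = target_len - len(key_segment.split())
    pool = list(filler_segments)
    rng.shuffle(pool)

    chosen = []
    for seg in pool:
        n = len(seg.split())
        if n > budget:
            break  # stop once the next filler would exceed the target length
        chosen.append(seg)
        budget -= n

    # Any slot is equally likely, so the answer-bearing segment can sit
    # at the start, the middle, or the end of the context.
    slot = rng.randint(0, len(chosen))  # randint is inclusive on both ends
    chosen.insert(slot, key_segment)
    return "\n".join(chosen), slot


rng = random.Random(0)

# Toy segments of 128 words each (illustrative content only).
key = "The access code for vault 7 is 4912. " + "pad " * 120
fillers = [f"filler segment {i} " + "lorem " * 125 for i in range(300)]

context, slot = build_long_context(key, fillers, target_len=4096, rng=rng)

# One synthetic training example: the answer depends only on the key segment.
example = {
    "context": context,
    "question": "What is the access code for vault 7?",
    "answer": "4912",
}
```

Sampling the slot uniformly is the point of the sketch: over many examples, the supervision signal covers every region of the context rather than favoring its beginning or end. The second requirement in the abstract, reasoning over two or more segments, would need several key segments inserted at independent positions, which this sketch does not cover.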
