Make Your LLM Fully Utilize the Context

Abstract

While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, a problem known as the lost-in-the-middle challenge. We hypothesize that this stems from insufficient explicit supervision during long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration of, and reasoning over, information from two or more short segments. By applying this information-intensive training to Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B to utilize long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves performance on real-world long-context tasks (e.g., 23.5 -> 26.9 F1 score on NarrativeQA), while maintaining comparable performance on short-context tasks (e.g., 59.3 -> 59.2 accuracy on MMLU). GitHub link: https://github.com/microsoft/FILM.
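
The data construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper name `build_in2_sample`, the use of whitespace-separated words as "tokens", and the filler segments are all assumptions made for illustration. It covers only case (1), a single short segment placed at a uniformly random position in a long context; case (2) would place two or more key segments in the same way.

```python
import random

SEGMENT_TOKENS = 128  # approximate short-segment length quoted in the abstract


def build_in2_sample(key_segment, filler_segments, target_tokens, rng):
    """Place one key segment at a random position inside a long context.

    Hypothetical helper: "tokens" are approximated by whitespace-separated
    words, and filler_segments stand in for unrelated text chunks.
    """
    n_slots = max(1, target_tokens // SEGMENT_TOKENS)
    # Fill every slot but one with filler chunks (drawn with replacement).
    context = [rng.choice(filler_segments) for _ in range(n_slots - 1)]
    # Uniform position, so the key information can land anywhere,
    # including the middle of the context.
    position = rng.randrange(n_slots)
    context.insert(position, key_segment)
    return {
        "context": "\n".join(context),
        "key_position": position,
        "n_segments": n_slots,
    }


# Illustrative inputs (assumed, not taken from the paper's dataset).
filler = [" ".join([f"filler{i}"] * SEGMENT_TOKENS) for i in range(10)]
key = "The access code for the archive is 7421."
sample = build_in2_sample(key, filler, target_tokens=4096, rng=random.Random(0))
question = "What is the access code for the archive?"
```

A question-answer pair built on `key` then requires the model to locate that one segment wherever it was placed, which is the kind of position-independent supervision the abstract argues is missing from ordinary long-context training.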
