ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Abstract

In this work, we introduce ChatQA 2, a Llama3-based model designed to bridge the gap between open-access LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are essential for LLMs to process large volumes of information that cannot fit into a single prompt, and they complement each other depending on the downstream task and the computational budget. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model achieves accuracy comparable to GPT-4-Turbo-2024-0409 on many long-context understanding tasks and surpasses it on the RAG benchmark. Interestingly, we find that a state-of-the-art long-context retriever can alleviate the top-k context fragmentation issue in RAG, further improving RAG-based results for long-context understanding tasks. We also provide extensive comparisons between RAG and long-context solutions using state-of-the-art long-context LLMs.
