LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Abstract

Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.
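
To make the idea of sentence-level citations concrete, here is a minimal Python sketch of how a cited response could be represented and checked against the source text. It is an illustration only, not the LongCite format or the CoF pipeline: the `Statement` class, the `split_sentences` and `resolve_citations` helpers, the inclusive `(start, end)` index ranges, and the naive regex sentence splitter are all assumptions introduced here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Statement:
    """One sentence of a model response plus the context sentences it cites.

    Hypothetical structure for illustration; each citation is an inclusive
    (start, end) range of sentence indices into the long context.
    """
    text: str
    citations: List[Tuple[int, int]] = field(default_factory=list)


def split_sentences(context: str) -> List[str]:
    # Naive splitter (an assumption): break on whitespace that follows
    # '.', '!' or '?'. Real text would need a proper sentence segmenter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]


def resolve_citations(statement: Statement, sentences: List[str]) -> List[str]:
    """Return the cited context snippets so a reader can verify the statement."""
    snippets = []
    for start, end in statement.citations:
        if not (0 <= start <= end < len(sentences)):
            raise ValueError(f"citation ({start}, {end}) is out of range")
        snippets.append(" ".join(sentences[start:end + 1]))
    return snippets


context = (
    "The probe launched in 1977. It carries a golden record. "
    "It left the heliosphere in 2012."
)
sentences = split_sentences(context)
answer = Statement("The probe left the heliosphere in 2012.", citations=[(2, 2)])
print(resolve_citations(answer, sentences))
```

Because each statement points at a small span of sentences rather than a whole passage, a reader only has to compare it against the few sentences returned by `resolve_citations`, which is the verifiability benefit the abstract describes.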
