LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Abstract

Though current long-context large language models (LLMs) have demonstrated impressive capabilities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, raising concerns about their trustworthiness given their potential for hallucination. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), which reveals considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that uses off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and we leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B on the LongCite-45k dataset, enabling them to generate accurate responses and fine-grained sentence-level citations in a single output. Evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.
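To make the idea of sentence-level citations concrete, here is a minimal Python sketch of how a cited response could be represented and traced back to its supporting context sentences for verification. It is an illustration under stated assumptions, not the format or code used by LongCite: the `Statement` class, the `split_sentences` and `resolve_citations` helpers, and the naive punctuation-based sentence splitter are all hypothetical names and choices introduced here.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Statement:
    """One sentence of a model response plus the context it cites (hypothetical structure)."""
    text: str
    # Inclusive (start, end) ranges of 0-based context sentence indices.
    cited: List[Tuple[int, int]]


def split_sentences(context: str) -> List[str]:
    # Assumption: a naive splitter that breaks on whitespace following '.', '!' or '?'.
    # A real system would need a more robust, language-aware splitter.
    parts = re.split(r"(?<=[.!?])\s+", context)
    return [p.strip() for p in parts if p.strip()]


def resolve_citations(statement: Statement, sentences: List[str]) -> List[str]:
    # Collect the context sentences a statement points to, so a reader
    # (or an automatic judge) can check the statement against them.
    cited_text: List[str] = []
    for start, end in statement.cited:
        cited_text.extend(sentences[start:end + 1])
    return cited_text


context = (
    "LongCite-45k is an SFT dataset. "
    "It was built with the CoF pipeline. "
    "GPT-4o is a proprietary model."
)
sentences = split_sentences(context)
answer = Statement(
    text="LongCite-45k was constructed using CoF.",
    cited=[(0, 1)],
)
evidence = resolve_citations(answer, sentences)
```

Because each statement carries explicit sentence ranges, verifying it only requires reading the few cited sentences rather than the whole long context, which is the practical benefit of fine-grained citations that the abstract describes.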