LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Abstract

Current long-context large language models (LLMs) have demonstrated impressive capabilities in answering user questions based on extensive text, but the lack of citations in their responses makes user verification difficult, raising concerns about their trustworthiness given their potential for hallucination. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), which reveals considerable room for improvement. To address this, we propose CoF (Coarse to Fine), a novel pipeline that uses off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and we use this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B on the LongCite-45k dataset, enabling them to generate accurate responses and fine-grained sentence-level citations in a single output. Evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.
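To make the idea of sentence-level citations in a single output more concrete, the following is a minimal sketch of how a cited response could be mapped back to the sentences of the source context. The markup (`<statement>`, `<cite>`, and `[start-end]` sentence ranges), the naive sentence splitter, and the helper names are all assumptions made for illustration; the abstract does not specify the actual output format or the evaluation procedure.

```python
import re

# Hypothetical citation markup, assumed for illustration only:
# each statement is followed by zero or more [start-end] sentence ranges.
STATEMENT_RE = re.compile(r"<statement>(.*?)<cite>(.*?)</cite></statement>", re.S)
CITE_RE = re.compile(r"\[(\d+)-(\d+)\]")


def split_sentences(context: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", context.strip())
    return [p for p in parts if p]


def resolve_citations(response: str, sentences: list[str]) -> list[dict]:
    """Pair each statement with the context sentences its citations point to."""
    results = []
    for text, cites in STATEMENT_RE.findall(response):
        spans = [(int(a), int(b)) for a, b in CITE_RE.findall(cites)]
        # Keep only ranges that fall inside the context; indices are 0-based, inclusive.
        evidence = [
            " ".join(sentences[a : b + 1])
            for a, b in spans
            if 0 <= a <= b < len(sentences)
        ]
        results.append({"statement": text.strip(), "spans": spans, "evidence": evidence})
    return results


# Toy context and response (invented for this example).
context = "The library opened in 1990. It holds two million books. Entry is free on Sundays."
response = (
    "<statement>The library opened in 1990.<cite>[0-0]</cite></statement>"
    "<statement>Visitors pay nothing on Sundays.<cite>[2-2]</cite></statement>"
)

sentences = split_sentences(context)
resolved = resolve_citations(response, sentences)
for item in resolved:
    print(item["statement"], "->", item["evidence"])
```

With a mapping like this, a reader (or an automated checker) can compare each statement against only the few sentences it cites rather than the whole document, which is the kind of verifiability the abstract is aiming for.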
