CLIPPER: Compression enables long-context synthetic data generation

February 20, 2025
Authors: Chau Minh Pham, Yapei Chang, Mohit Iyyer
cs.AI

Abstract

LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
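The abstract describes a two-stage pipeline: first compress each book into intermediate representations (chapter outlines and a book-level summary), then generate claims and chain-of-thought reasoning from those representations rather than from the raw text. The sketch below illustrates that flow under loose assumptions; the `call_llm` callable and all prompt wording are placeholders, not the authors' actual prompts or code.

```python
# Minimal sketch of a CLIPPER-style compression-then-generation pipeline.
# Assumptions: `call_llm` is any text-in/text-out LLM wrapper supplied by the
# caller; prompts are illustrative only and do not reproduce the paper's.

from typing import Callable, Dict, List


def clipper_generate(
    chapters: List[str],
    call_llm: Callable[[str], str],
) -> Dict[str, str]:
    """Compress a book into intermediate representations, then produce a
    claim and its chain-of-thought from those representations."""
    # Stage 1: compression -- per-chapter outlines plus a book-level summary.
    chapter_outlines = [
        call_llm(f"Outline the key events and facts of this chapter:\n{ch}")
        for ch in chapters
    ]
    book_summary = call_llm(
        "Summarize the book based on these chapter outlines:\n"
        + "\n".join(chapter_outlines)
    )

    # Stage 2: claim generation grounded only in the compressed representations,
    # avoiding artifacts that arise when claims are copied from raw book text.
    outlines_text = "\n".join(chapter_outlines)
    claim = call_llm(
        "Using the chapter outlines and book summary below, write one complex "
        "claim about the book that requires reasoning over multiple chapters:\n"
        f"Outlines:\n{outlines_text}\nSummary:\n{book_summary}"
    )
    chain_of_thought = call_llm(
        "Explain step by step, citing the outlines, whether this claim is "
        f"supported by the book:\nClaim: {claim}\nSummary:\n{book_summary}"
    )
    return {"claim": claim, "chain_of_thought": chain_of_thought}
```

In the paper, claim/chain-of-thought pairs produced this way (19K of them) are paired with their source texts and used to fine-tune open-weight models for narrative claim verification.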
