CLIPPER: Compression enables long-context synthetic data generation

February 20, 2025
Authors: Chau Minh Pham, Yapei Chang, Mohit Iyyer
cs.AI

Abstract

LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
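The abstract describes a two-stage pipeline: first compress each book into intermediate representations (chapter outlines and a book-level summary), then generate claims and chain-of-thought reasoning from those representations rather than from the raw text. The sketch below illustrates that flow under loose assumptions; the `call_llm` callable and all prompt wording are placeholders, not the authors' actual prompts or code.

```python
# Minimal sketch of a CLIPPER-style compression-then-generation pipeline.
# Assumptions: `call_llm` is any text-in/text-out LLM wrapper supplied by the
# caller; prompts are illustrative only and do not reproduce the paper's.

from typing import Callable, Dict, List


def clipper_generate(
    chapters: List[str],
    call_llm: Callable[[str], str],
) -> Dict[str, str]:
    """Compress a book into intermediate representations, then produce a
    claim and its chain-of-thought from those representations."""
    # Stage 1: compression -- per-chapter outlines plus a book-level summary.
    chapter_outlines = [
        call_llm(f"Outline the key events and facts of this chapter:\n{ch}")
        for ch in chapters
    ]
    book_summary = call_llm(
        "Summarize the book based on these chapter outlines:\n"
        + "\n".join(chapter_outlines)
    )

    # Stage 2: claim generation grounded only in the compressed representations,
    # avoiding artifacts that arise when claims are copied from raw book text.
    outlines_text = "\n".join(chapter_outlines)
    claim = call_llm(
        "Using the chapter outlines and book summary below, write one complex "
        "claim about the book that requires reasoning over multiple chapters:\n"
        f"Outlines:\n{outlines_text}\nSummary:\n{book_summary}"
    )
    chain_of_thought = call_llm(
        "Explain step by step, citing the outlines, whether this claim is "
        f"supported by the book:\nClaim: {claim}\nSummary:\n{book_summary}"
    )
    return {"claim": claim, "chain_of_thought": chain_of_thought}
```

In the paper, claim/chain-of-thought pairs produced this way (19K of them) are paired with their source texts and used to fine-tune open-weight models for narrative claim verification.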
