CLIPPER: Compression enables long-context synthetic data generation

Abstract

LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification, a task that requires reasoning over an entire book to verify a given claim. Generating claims directly from the raw text of the book yields artifact-riddled claims. CLIPPER instead first compresses the book into chapter outlines and book summaries, then uses these intermediate representations to generate complex claims and their corresponding chain-of-thought reasoning. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune three open-weight models. Our best model achieves breakthrough results on narrative claim verification (accuracy on our test set rises from 28% to 76%) and sets a new state of the art for sub-10B models on the NoCha leaderboard. Further analysis shows that our models generate more detailed and grounded chain-of-thought reasoning while also improving performance on other narrative understanding tasks (e.g., NarrativeQA).
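The two-stage idea described in the abstract (compress first, then generate claims from the compressed representations) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the prompt wording, the record layout, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


@dataclass
class SyntheticClaim:
    claim: str
    chain_of_thought: str


def compress_book(chapters: List[str], llm: LLM) -> Dict[str, object]:
    """Stage 1 (assumed shape): compress the book into chapter outlines and a summary."""
    outlines = [llm(f"Outline the key events of this chapter:\n{ch}") for ch in chapters]
    summary = llm("Summarize the book from these chapter outlines:\n" + "\n".join(outlines))
    return {"outlines": outlines, "summary": summary}


def generate_claim(compressed: Dict[str, object], llm: LLM) -> SyntheticClaim:
    """Stage 2 (assumed shape): generate a claim and its reasoning from the
    compressed representations only, never from the raw book text."""
    context = str(compressed["summary"]) + "\n" + "\n".join(compressed["outlines"])
    claim = llm(f"Write one claim that requires reasoning over several events:\n{context}")
    cot = llm(f"Explain step by step whether the claim holds.\nClaim: {claim}\n{context}")
    return SyntheticClaim(claim=claim, chain_of_thought=cot)


def build_example(chapters: List[str], llm: LLM) -> Dict[str, str]:
    """Pair the generated claim and reasoning with the full source text."""
    compressed = compress_book(chapters, llm)
    result = generate_claim(compressed, llm)
    return {
        "source_text": "\n\n".join(chapters),
        "claim": result.claim,
        "chain_of_thought": result.chain_of_thought,
    }
```

In this sketch the raw chapters are read only during compression; the claim and chain-of-thought prompts see just the outlines and summary, while the final training record still carries the full source text alongside them.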
