

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

January 21, 2025
作者: Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen
cs.AI

Abstract

The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to its counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling of synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
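
The abstract describes a two-stage pipeline: prompts are synthesized by expanding topics drawn from a World Knowledge Tree, and model responses are then improved via Self-Reflection Refinement before being used as SFT data. The sketch below illustrates that flow only in outline and is not the paper's implementation; the `llm` callable, the prompt wording, and the toy tree are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical World Knowledge Tree: top-level domains expanded into leaf topics.
KNOWLEDGE_TREE: Dict[str, List[str]] = {
    "science": ["plate tectonics", "CRISPR gene editing"],
    "history": ["the Silk Road", "the printing press"],
}

def synthesize_prompts(llm: Callable[[str], str], tree: Dict[str, List[str]]) -> List[str]:
    """Stage 1 (sketch): turn each leaf topic into a user-style question."""
    prompts = []
    for domain, topics in tree.items():
        for topic in topics:
            prompts.append(
                llm(f"Write one challenging user question about {topic} ({domain}).")
            )
    return prompts

def refine_response(llm: Callable[[str], str], prompt: str, draft: str) -> str:
    """Stage 2 (sketch): self-reflection refinement as critique-then-rewrite."""
    critique = llm(f"Question: {prompt}\nAnswer: {draft}\nList the answer's weaknesses.")
    return llm(
        f"Question: {prompt}\nAnswer: {draft}\nCritique: {critique}\n"
        "Rewrite the answer so it addresses the critique."
    )

def build_sft_dataset(llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Combine both stages into (prompt, response) pairs for SFT."""
    dataset = []
    for prompt in synthesize_prompts(llm, KNOWLEDGE_TREE):
        draft = llm(prompt)
        dataset.append({"prompt": prompt, "response": refine_response(llm, prompt, draft)})
    return dataset

if __name__ == "__main__":
    # Stub LLM so the sketch runs without an API key; swap in a real model call.
    echo_llm = lambda text: f"[model output for: {text[:40]}...]"
    for pair in build_sft_dataset(echo_llm)[:2]:
        print(pair)
```

In this reading, the knowledge tree supplies topical coverage at scale, while the refinement loop is what the paper credits for the iterative self-improvement observed at model scales up to 72B.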
