

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

January 21, 2025
作者: Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen
cs.AI

Abstract

The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to its counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling of synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
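
The abstract describes a two-stage pipeline: prompts are synthesized by expanding topics drawn from a World Knowledge Tree, and model responses are then improved via Self-Reflection Refinement before being used as SFT data. The sketch below illustrates that flow only in outline and is not the paper's implementation; the `llm` callable, the prompt wording, and the toy tree are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical World Knowledge Tree: top-level domains expanded into leaf topics.
KNOWLEDGE_TREE: Dict[str, List[str]] = {
    "science": ["plate tectonics", "CRISPR gene editing"],
    "history": ["the Silk Road", "the printing press"],
}

def synthesize_prompts(llm: Callable[[str], str], tree: Dict[str, List[str]]) -> List[str]:
    """Stage 1 (sketch): turn each leaf topic into a user-style question."""
    prompts = []
    for domain, topics in tree.items():
        for topic in topics:
            prompts.append(
                llm(f"Write one challenging user question about {topic} ({domain}).")
            )
    return prompts

def refine_response(llm: Callable[[str], str], prompt: str, draft: str) -> str:
    """Stage 2 (sketch): self-reflection refinement as critique-then-rewrite."""
    critique = llm(f"Question: {prompt}\nAnswer: {draft}\nList the answer's weaknesses.")
    return llm(
        f"Question: {prompt}\nAnswer: {draft}\nCritique: {critique}\n"
        "Rewrite the answer so it addresses the critique."
    )

def build_sft_dataset(llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Combine both stages into (prompt, response) pairs for SFT."""
    dataset = []
    for prompt in synthesize_prompts(llm, KNOWLEDGE_TREE):
        draft = llm(prompt)
        dataset.append({"prompt": prompt, "response": refine_response(llm, prompt, draft)})
    return dataset

if __name__ == "__main__":
    # Stub LLM so the sketch runs without an API key; swap in a real model call.
    echo_llm = lambda text: f"[model output for: {text[:40]}...]"
    for pair in build_sft_dataset(echo_llm)[:2]:
        print(pair)
```

In this reading, the knowledge tree supplies topical coverage at scale, while the refinement loop is what the paper credits for the iterative self-improvement observed at model scales up to 72B.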
