Improving large language models with concept-aware fine-tuning
June 9, 2025
Authors: Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao
cs.AI
Abstract
Large language models (LLMs) have become the cornerstone of modern AI.
However, the existing paradigm of next-token prediction fundamentally limits
their ability to form coherent, high-level concepts, making it a critical
barrier to human-like understanding and reasoning. Take the phrase "ribonucleic
acid" as an example: an LLM will first decompose it into tokens, i.e.,
artificial text fragments ("rib", "on", ...), then learn each token
sequentially, rather than grasping the phrase as a unified, coherent semantic
entity. This fragmented representation hinders deeper conceptual understanding
and, ultimately, the development of truly intelligent systems. In response, we
introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method
that redefines how LLMs are fine-tuned. By enabling the learning of sequences
that span multiple tokens, this method fosters stronger concept-aware learning.
Our experiments demonstrate significant improvements compared to conventional
next-token fine-tuning methods across diverse tasks, including traditional
applications like text summarization and domain-specific ones like de novo
protein design. Multi-token prediction was previously only possible in the
prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first
to bring the multi-token setting to the post-training phase, thus effectively
democratizing its benefits for the broader community of practitioners and
researchers. Finally, the unexpected effectiveness of our proposed method
suggests wider implications for the machine learning research community. All
code and data are available at https://github.com/michaelchen-lab/caft-llm
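The tokenizer fragmentation described in the abstract is easy to reproduce. The snippet below is a minimal illustration using the Hugging Face transformers library with the GPT-2 tokenizer; any BPE-based tokenizer shows a similar effect, and the exact fragments depend on the tokenizer's vocabulary.

```python
from transformers import AutoTokenizer

# Load a standard BPE tokenizer; the exact subword split depends on its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "ribonucleic acid" is broken into artificial text fragments rather than kept
# as a single semantic unit, e.g. ['rib', 'on', 'uc', 'le', 'ic', ' acid'].
print(tokenizer.tokenize("ribonucleic acid"))
```

For intuition only, the sketch below shows the generic multi-token prediction objective, in which each position is supervised with the next k tokens instead of only the immediately following one. This is not the CAFT implementation (see the linked repository for that); the per-head logits layout and the averaging over heads are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def multi_token_loss(logits_per_head: list[torch.Tensor],
                     next_tokens: torch.Tensor) -> torch.Tensor:
    """Generic multi-token prediction loss (illustrative sketch, not CAFT itself).

    logits_per_head[i]: (batch, seq_len, vocab); head i predicts the token
        located i + 1 positions ahead of the current position.
    next_tokens: (batch, seq_len); next_tokens[:, t] is the token at t + 1.
    """
    total = 0.0
    for i, logits in enumerate(logits_per_head):
        valid = next_tokens.size(1) - i  # positions that still have a target i+1 steps ahead
        total = total + F.cross_entropy(
            logits[:, :valid].reshape(-1, logits.size(-1)),
            next_tokens[:, i:].reshape(-1),
        )
    # Average over heads; standard next-token fine-tuning is the special case of one head.
    return total / len(logits_per_head)

# Toy usage: batch=2, seq_len=8, vocab=50, k=4 prediction heads.
heads = [torch.randn(2, 8, 50) for _ in range(4)]
targets = torch.randint(0, 50, (2, 8))
print(multi_token_loss(heads, targets))
```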