
Improving large language models with concept-aware fine-tuning

June 9, 2025
Authors: Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao
cs.AI

Abstract

Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments ("rib", "on", ...), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token fine-tuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm.
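
To make the tokenization argument concrete, the short sketch below shows how an off-the-shelf subword tokenizer splits the abstract's example phrase. It assumes the Hugging Face `transformers` library and the GPT-2 BPE tokenizer, which are illustrative choices rather than the setup used in the paper; the exact fragments depend on the vocabulary.

```python
# Illustrative only: subword fragmentation with an off-the-shelf BPE tokenizer.
# The exact pieces vary by vocabulary; the point is that the phrase is never
# represented as a single semantic unit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("ribonucleic acid"))
# -> several fragments such as "rib", "on", ... rather than one token for the concept
```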
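
For readers unfamiliar with multi-token prediction, the sketch below illustrates the generic idea of supervising a model on several future tokens at once during fine-tuning, rather than only the immediate next token. It is a minimal, hypothetical example and not the CAFT implementation: the auxiliary-head layout, uniform loss averaging, and the `n_future` horizon are assumptions for illustration; the authors' actual code is in the repository linked above.

```python
# Minimal sketch of a generic multi-token prediction loss (not the CAFT code).
# `hidden` comes from an LLM trunk; each auxiliary head predicts the token
# k positions ahead, so training signals cover multi-token spans.
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_token_loss(hidden: torch.Tensor,
                     heads: nn.ModuleList,
                     targets: torch.Tensor) -> torch.Tensor:
    """hidden:  (batch, seq_len, d_model) trunk activations
       heads:   one nn.Linear(d_model, vocab_size) per future offset
       targets: (batch, seq_len) ground-truth token ids"""
    loss = hidden.new_zeros(())
    for k, head in enumerate(heads, start=1):
        logits = head(hidden[:, :-k, :])   # predict the token k steps ahead
        labels = targets[:, k:]            # aligned future tokens
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )
    return loss / len(heads)

# Toy usage with random tensors, just to show the shapes involved.
d_model, vocab_size, n_future = 64, 100, 4
heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))
hidden = torch.randn(2, 16, d_model)
targets = torch.randint(0, vocab_size, (2, 16))
print(multi_token_loss(hidden, heads, targets))
```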