Improving large language models with concept-aware fine-tuning

Abstract

Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, which makes it a critical barrier to human-like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM first decomposes it into tokens, i.e., artificial text fragments ("rib", "on", ...), and then learns each token sequentially rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, CAFT fosters stronger concept-aware learning. Our experiments demonstrate significant improvements over conventional next-token fine-tuning methods across diverse tasks, including traditional applications such as text summarization and domain-specific ones such as de novo protein design. Multi-token prediction was previously possible only in the prohibitively expensive pretraining phase; CAFT is, to our knowledge, the first method to bring the multi-token setting to the post-training phase, making its benefits accessible to the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm
