SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

Abstract

The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings, and it is trained with either a mean-squared-error or a diffusion objective. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, and benchmark results, and we release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
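The hybrid objective described in the abstract can be illustrated with a toy sketch. It is not the released training code: the frozen SONAR decoder is replaced by a hypothetical fixed linear map (`FROZEN_W`), the transformer's output is reduced to a single trainable vector, and the names `token_cross_entropy` and `fit_embedding` are invented for illustration. The real decoder is autoregressive and conditions on previous tokens, which this stand-in ignores. What the sketch keeps is the core idea: the loss is token-level cross-entropy, the decoder weights are never updated, and the gradient reaches only the predicted embedding.

```python
import math

# Hypothetical frozen "decoder": one weight row per vocabulary token.
# Assumption: a fixed linear map stands in for the frozen SONAR decoder
# so that the sketch stays self-contained. It is never updated.
FROZEN_W = (
    (1.0, 0.0),
    (0.0, 1.0),
    (-1.0, -1.0),
)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cross_entropy(emb, target_tokens, weights=FROZEN_W):
    """Return (mean token-level cross-entropy, gradient w.r.t. emb)."""
    logits = [sum(w * e for w, e in zip(row, emb)) for row in weights]
    probs = softmax(logits)
    n = len(target_tokens)
    loss = -sum(math.log(probs[t]) for t in target_tokens) / n

    # d(loss)/d(logits) = predicted probs - empirical target distribution.
    target_dist = [0.0] * len(weights)
    for t in target_tokens:
        target_dist[t] += 1.0 / n
    dlogits = [p - q for p, q in zip(probs, target_dist)]

    # Chain rule through the frozen linear map: only emb receives a gradient.
    grad = [
        sum(dlogits[v] * weights[v][d] for v in range(len(weights)))
        for d in range(len(emb))
    ]
    return loss, grad


def fit_embedding(target_tokens, steps=200, lr=0.5):
    """Gradient descent on the predicted embedding; the decoder stays frozen."""
    emb = [0.0] * len(FROZEN_W[0])
    initial_loss, _ = token_cross_entropy(emb, target_tokens)
    for _ in range(steps):
        _, grad = token_cross_entropy(emb, target_tokens)
        emb = [e - lr * g for e, g in zip(emb, grad)]
    final_loss, _ = token_cross_entropy(emb, target_tokens)
    return emb, initial_loss, final_loss


if __name__ == "__main__":
    emb, before, after = fit_embedding([0, 0, 1])
    print(f"loss before: {before:.4f}, after: {after:.4f}")
```

In an actual setup, the trainable vector would presumably be the transformer's output for the next sentence, and the cross-entropy would be computed over that sentence's tokens with teacher forcing through the frozen decoder.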

