

SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

August 7, 2025
Authors: Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev
cs.AI

Abstract

The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
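
To make the hybrid objective concrete, below is a minimal PyTorch sketch of the training signal described in the abstract: a decoder-only transformer autoregresses over SONAR sentence embeddings, its predicted embedding is passed through a frozen SONAR decoder to obtain token logits, and token-level cross-entropy is backpropagated through the frozen decoder into the transformer. All names, shapes, and the decoder interface here (ConceptTransformer, frozen_sonar_decoder, EMB_DIM, VOCAB) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only: assumed shapes and a hypothetical frozen SONAR decoder interface.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 1024      # assumed SONAR sentence-embedding width
VOCAB = 256_000     # assumed SONAR decoder vocabulary size


class ConceptTransformer(nn.Module):
    """Decoder-only transformer that autoregresses over sentence embeddings."""

    def __init__(self, dim=EMB_DIM, layers=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, dim)  # predicts the next sentence embedding

    def forward(self, sent_embs):  # sent_embs: (B, S, EMB_DIM)
        S = sent_embs.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(S).to(sent_embs.device)
        h = self.backbone(sent_embs, mask=causal)
        return self.head(h)  # predicted embedding of the *next* sentence at each position


def sonar_llm_loss(model, frozen_sonar_decoder, sent_embs, gold_tokens):
    """Token-level cross-entropy propagated through a frozen SONAR decoder.

    sent_embs:   (B, S, EMB_DIM) gold SONAR embeddings of S consecutive sentences.
    gold_tokens: (B, S-1, T) token ids of sentences 2..S (teacher-forced targets).
    frozen_sonar_decoder: hypothetical callable mapping (embeddings, target tokens)
        to logits of shape (B, S-1, T, VOCAB); its parameters are frozen
        (requires_grad=False), but gradients still flow through it to the model.
    """
    pred_embs = model(sent_embs)[:, :-1]                    # predict sentence t+1 from prefix
    logits = frozen_sonar_decoder(pred_embs, gold_tokens)   # (B, S-1, T, VOCAB), assumed
    return F.cross_entropy(logits.reshape(-1, VOCAB), gold_tokens.reshape(-1))
```

In this reading, the mean-squared-error or diffusion objective of LCM is replaced by a likelihood-based loss in token space, while generation still proceeds one sentence embedding at a time; at inference the predicted embedding would be handed to the SONAR decoder to emit the actual tokens.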