

SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

August 7, 2025
Authors: Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev
cs.AI

Abstract

The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
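
To make the hybrid objective concrete, below is a minimal PyTorch sketch of the training signal described in the abstract: a decoder-only transformer autoregresses over SONAR sentence embeddings, its predicted embedding is passed through a frozen SONAR decoder to obtain token logits, and token-level cross-entropy is backpropagated through the frozen decoder into the transformer. All names, shapes, and the decoder interface here (ConceptTransformer, frozen_sonar_decoder, EMB_DIM, VOCAB) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only: assumed shapes and a hypothetical frozen SONAR decoder interface.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 1024      # assumed SONAR sentence-embedding width
VOCAB = 256_000     # assumed SONAR decoder vocabulary size


class ConceptTransformer(nn.Module):
    """Decoder-only transformer that autoregresses over sentence embeddings."""

    def __init__(self, dim=EMB_DIM, layers=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, dim)  # predicts the next sentence embedding

    def forward(self, sent_embs):  # sent_embs: (B, S, EMB_DIM)
        S = sent_embs.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(S).to(sent_embs.device)
        h = self.backbone(sent_embs, mask=causal)
        return self.head(h)  # predicted embedding of the *next* sentence at each position


def sonar_llm_loss(model, frozen_sonar_decoder, sent_embs, gold_tokens):
    """Token-level cross-entropy propagated through a frozen SONAR decoder.

    sent_embs:   (B, S, EMB_DIM) gold SONAR embeddings of S consecutive sentences.
    gold_tokens: (B, S-1, T) token ids of sentences 2..S (teacher-forced targets).
    frozen_sonar_decoder: hypothetical callable mapping (embeddings, target tokens)
        to logits of shape (B, S-1, T, VOCAB); its parameters are frozen
        (requires_grad=False), but gradients still flow through it to the model.
    """
    pred_embs = model(sent_embs)[:, :-1]                    # predict sentence t+1 from prefix
    logits = frozen_sonar_decoder(pred_embs, gold_tokens)   # (B, S-1, T, VOCAB), assumed
    return F.cross_entropy(logits.reshape(-1, VOCAB), gold_tokens.reshape(-1))
```

In this reading, the mean-squared-error or diffusion objective of LCM is replaced by a likelihood-based loss in token space, while generation still proceeds one sentence embedding at a time; at inference the predicted embedding would be handed to the SONAR decoder to emit the actual tokens.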