SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
August 7, 2025
Authors: Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev
cs.AI
Abstract
The recently proposed Large Concept Model (LCM) generates text by predicting
a sequence of sentence-level embeddings and training with either mean-squared
error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer
that "thinks" in the same continuous SONAR embedding space, yet is supervised
through token-level cross-entropy propagated via the frozen SONAR decoder. This
hybrid objective retains the semantic abstraction of LCM while eliminating its
diffusion sampler and restoring a likelihood-based training signal. Across
model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive
generation quality. We report scaling trends, ablations, benchmark results, and
release the complete training code and all pretrained checkpoints to foster
reproducibility and future research.
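To make the hybrid objective concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a decoder-only backbone predicts the next sentence embedding, a frozen decoder maps that embedding to token logits, and the token-level cross-entropy is backpropagated through the frozen decoder into the backbone. The `ToyFrozenDecoder`, tensor shapes, and hyperparameters here are illustrative stand-ins, not the released SONAR components or the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 1024   # SONAR sentence embeddings are 1024-dimensional
VOCAB = 32000    # placeholder vocabulary size for the toy decoder

class ToyFrozenDecoder(nn.Module):
    """Stand-in for the frozen SONAR decoder: maps a sentence embedding plus
    previous tokens to next-token logits. Its weights receive no updates,
    but gradients still flow through it into the predicted embedding."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, EMB_DIM)
        self.out = nn.Linear(EMB_DIM, VOCAB)
        for p in self.parameters():
            p.requires_grad_(False)  # frozen decoder

    def forward(self, sent_emb, prev_tokens):
        # Condition every token position on the sentence embedding (toy scheme).
        h = self.tok(prev_tokens) + sent_emb.unsqueeze(-2)
        return self.out(h)  # (B, S, T, VOCAB)

class SonarLLMSketch(nn.Module):
    """Decoder-only transformer over sequences of sentence embeddings."""
    def __init__(self, layers=4, heads=8):
        super().__init__()
        block = nn.TransformerEncoderLayer(EMB_DIM, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(EMB_DIM, EMB_DIM)

    def forward(self, sent_embs):  # (B, S, EMB_DIM)
        S = sent_embs.size(1)
        causal = torch.full((S, S), float("-inf"), device=sent_embs.device).triu(1)
        return self.head(self.backbone(sent_embs, mask=causal))

def hybrid_loss(model, frozen_decoder, sent_embs, tokens):
    """Token-level cross-entropy for next-sentence prediction.
    sent_embs: (B, S, EMB_DIM) gold sentence embeddings (teacher forcing)
    tokens:    (B, S, T) token ids of each sentence"""
    pred = model(sent_embs[:, :-1])      # predicted embeddings of sentences 1..S-1
    prev = tokens[:, 1:, :-1]            # decoder inputs: shifted target tokens
    gold = tokens[:, 1:, 1:]             # decoder targets: next tokens
    logits = frozen_decoder(pred, prev)
    return F.cross_entropy(logits.reshape(-1, VOCAB), gold.reshape(-1))

# Toy usage: 2 documents, 5 sentences each, 12 tokens per sentence.
model, dec = SonarLLMSketch(), ToyFrozenDecoder()
embs = torch.randn(2, 5, EMB_DIM)
toks = torch.randint(0, VOCAB, (2, 5, 12))
hybrid_loss(model, dec, embs, toks).backward()  # gradients reach the backbone only
```

The point of the sketch is the gradient path: because the decoder is frozen but differentiable, the likelihood-based token loss supervises the continuous embedding predicted by the backbone, which is what lets the model keep LCM's sentence-level abstraction without a diffusion sampler.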