SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

Abstract

The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings, and is trained with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, and benchmark results, and we release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
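
The hybrid objective described in the abstract can be written out as a short formula. What follows is a sketch under assumed notation, not the paper's own formulation. The symbols are all introduced here for illustration: s_1, ..., s_n are the sentences of a document, e_t is the SONAR embedding of s_t, f_\theta is the decoder-only transformer, \hat{e}_{t+1} is its predicted next-sentence embedding, w_{t+1,j} is the j-th token of sentence s_{t+1}, and p_dec is the token distribution of the frozen SONAR decoder. The choice of a plain sum rather than a per-token mean is also an assumption.

```latex
% Assumed notation (not from the abstract): f_\theta, \hat{e}_{t+1}, p_{\mathrm{dec}}, w_{t+1,j}
% Next-sentence embedding predicted from the preceding sentence embeddings
\hat{e}_{t+1} = f_\theta(e_1, \dots, e_t)

% Token-level cross-entropy computed through the frozen decoder; only \theta is updated
\mathcal{L}_{\mathrm{CE}}(\theta)
  = -\sum_{t=1}^{n-1} \sum_{j=1}^{|s_{t+1}|}
    \log p_{\mathrm{dec}}\!\left(w_{t+1,j} \mid w_{t+1,<j},\, \hat{e}_{t+1}\right)

% For contrast: a mean-squared-error objective in embedding space, as used by LCM
\mathcal{L}_{\mathrm{MSE}}(\theta)
  = \sum_{t=1}^{n-1} \left\lVert \hat{e}_{t+1} - e_{t+1} \right\rVert_2^2
```

Read this way, the token loss is differentiated through the decoder's computation back to \hat{e}_{t+1} and then to \theta, while the decoder's own weights stay fixed. The supervision is therefore a likelihood over tokens, even though the model's output is a continuous sentence embedding.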
