LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Abstract

While LLaDA2.0 demonstrated the scaling potential of 100B-level block-diffusion models and their inherent parallelism, balancing decoding speed against generation quality has remained an open problem. We present LLaDA2.1, which is designed to move past this trade-off. By integrating Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint threshold-decoding scheme in which both thresholds are configurable. This yields two operating modes: Speedy Mode (S Mode), which aggressively lowers the M2T threshold and relies on T2T editing to refine the output; and Quality Mode (Q Mode), which uses conservative thresholds to secure better benchmark performance at a manageable cost in efficiency. Building on a large context window, we also implement the first large-scale Reinforcement Learning (RL) framework specifically tailored to diffusion large language models (dLLMs), supported by specialized techniques for stable gradient estimation. This alignment sharpens reasoning precision and improves instruction-following fidelity, narrowing the gap between diffusion dynamics and complex human intent. We release LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and fast decoding. Despite its 100B scale, on coding tasks it reaches 892 tokens per second (TPS) on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.
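The abstract does not spell out the decoding rule, so the following is only a minimal sketch of how a joint M2T/T2T threshold scheme could work, not the authors' implementation. The names `tau_m2t`, `tau_t2t`, `decode_step`, and `toy_predict` are hypothetical, the fallback that always unmasks the single most confident position is an assumption, and the toy predictor stands in for a real dLLM forward pass.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a still-masked position
Prediction = Tuple[str, float]  # (top token, confidence in [0, 1])
Block = List[Optional[str]]


def decode_step(block: Block,
                predict: Callable[[Block], List[Prediction]],
                tau_m2t: float,
                tau_t2t: float) -> Block:
    """One denoising pass over a block (hypothetical rule, see lead-in)."""
    preds = predict(block)
    new_block = list(block)
    masked = [i for i, tok in enumerate(block) if tok is MASK]

    # M2T: unmask every masked position whose confidence clears tau_m2t.
    accepted = [i for i in masked if preds[i][1] >= tau_m2t]
    if masked and not accepted:
        # Assumed fallback: always commit the most confident masked
        # position so that each pass makes progress.
        accepted = [max(masked, key=lambda i: preds[i][1])]
    for i in accepted:
        new_block[i] = preds[i][0]

    # T2T: overwrite an already committed token when the model now
    # prefers a different token with confidence >= tau_t2t.
    for i, tok in enumerate(block):
        if tok is not MASK:
            cand, conf = preds[i]
            if cand != tok and conf >= tau_t2t:
                new_block[i] = cand
    return new_block


def decode(block: Block,
           predict: Callable[[Block], List[Prediction]],
           tau_m2t: float,
           tau_t2t: float,
           max_steps: int = 64) -> Tuple[Block, int]:
    """Repeat passes until a pass changes nothing; return (block, passes)."""
    steps = 0
    while steps < max_steps:
        nxt = decode_step(block, predict, tau_m2t, tau_t2t)
        steps += 1
        if nxt == block:
            break
        block = nxt
    return block, steps


TARGET = ["print", "(", "x", ")"]


def toy_predict(block: Block) -> List[Prediction]:
    """Toy stand-in for a model: confident and correct only when the
    left neighbour is already filled, otherwise a low-confidence guess."""
    preds = []
    for i in range(len(block)):
        has_context = i == 0 or block[i - 1] is not MASK
        preds.append((TARGET[i], 0.95) if has_context else ("<?>", 0.6))
    return preds


if __name__ == "__main__":
    for name, tau in (("Q-like", 0.9), ("S-like", 0.5)):
        out, steps = decode([MASK] * len(TARGET), toy_predict, tau, 0.9)
        print(name, out, "forward passes:", steps)
```

In this toy setting, the lower M2T threshold commits low-confidence guesses early and lets the T2T pass correct them, so it needs fewer forward passes than the conservative setting while ending at the same output. Real models give no such guarantee, which is the speed/quality trade-off the two modes expose.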
