Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding

Abstract

Large language models (LLMs) suffer from low hardware efficiency during decoding, especially in long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter vision-language model (VLM) built with hardware-aware model-system co-design to minimize decoding cost. Step-3 innovates along two dimensions: (1) a novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention layers and feed-forward network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding cost compared with models such as DeepSeek-V3 and Qwen3 MoE 235B, and the gains widen as context length grows. Step-3 achieves this low cost while activating 38B parameters per token (more than either DeepSeek-V3 or Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in the scenarios that favor DeepSeek-V3. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms time-per-output-token (TPOT) SLA (4K context, FP8, no MTP), compared with 2,324 tokens per second per GPU for DeepSeek-V3 in the same setup, setting a new Pareto frontier for LLM decoding.
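To make the reported throughput figures concrete, the following is a minimal back-of-the-envelope sketch, not part of the authors' system. It uses only the numbers stated in the abstract (4,039 and 2,324 tokens per second per GPU, 50 ms TPOT). The function names are hypothetical, and the concurrency estimate assumes every sequence decodes exactly at the SLA limit, so it is an upper bound on in-flight sequences per GPU rather than a measured batch size.

```python
# Figures quoted in the abstract (4K context, FP8, no MTP).
TPOT_SLA_S = 0.050  # 50 ms time per output token
THROUGHPUT_TOK_PER_S_PER_GPU = {
    "Step-3": 4039.0,
    "DeepSeek-V3": 2324.0,
}


def speedup(candidate: float, baseline: float) -> float:
    """Ratio of per-GPU decoding throughput, candidate over baseline."""
    return candidate / baseline


def per_sequence_rate(tpot_s: float) -> float:
    """Tokens per second seen by one sequence decoding at the given TPOT."""
    return 1.0 / tpot_s


def implied_concurrency(throughput: float, tpot_s: float) -> float:
    """In-flight sequences per GPU, ASSUMING each one runs exactly at tpot_s.

    Since throughput = concurrency / TPOT, faster-than-SLA decoding would
    mean fewer concurrent sequences, so this is an upper bound.
    """
    return throughput * tpot_s


def gpu_seconds_per_million_tokens(throughput: float) -> float:
    """GPU time needed to decode one million tokens at the given throughput."""
    return 1_000_000 / throughput


if __name__ == "__main__":
    step3 = THROUGHPUT_TOK_PER_S_PER_GPU["Step-3"]
    dsv3 = THROUGHPUT_TOK_PER_S_PER_GPU["DeepSeek-V3"]
    print(f"throughput ratio: {speedup(step3, dsv3):.2f}x")
    print(f"per-sequence rate at SLA: {per_sequence_rate(TPOT_SLA_S):.0f} tok/s")
    for name, tput in THROUGHPUT_TOK_PER_S_PER_GPU.items():
        print(
            f"{name}: <= {implied_concurrency(tput, TPOT_SLA_S):.0f} sequences/GPU, "
            f"{gpu_seconds_per_million_tokens(tput):.0f} GPU-s per 1M tokens"
        )
```

Under these assumptions the throughput ratio is roughly 1.74, and GPU time per million decoded tokens falls in the same proportion. Converting that into a monetary cost would require a GPU price, which the abstract does not state.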