Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
July 25, 2025
Authors: StepFun, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang
cs.AI
Abstract
Large language models (LLMs) face low hardware efficiency during decoding,
especially for long-context reasoning tasks. This paper introduces Step-3, a
321B-parameter vision-language model (VLM) with hardware-aware model-system co-design optimized for
minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel
Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces
both KV cache size and computation while maintaining high attention
expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed
inference system that decouples attention and Feed-Forward Network (FFN) layers
into specialized subsystems. This co-design achieves unprecedented cost
efficiency: Step-3 significantly reduces theoretical decoding costs compared
with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at
longer context lengths. Step-3 achieves low cost while activating 38B parameters per
token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that
hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are
critical to cost-effectiveness. We perform a head-to-head comparison with
DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs
achieves a decoding throughput of up to 4,039 tokens per second per GPU under
a 50ms TPOT SLA (4K context, FP8, no MTP). This exceeds DeepSeek-V3's 2,324
tokens per second in the same setup and sets a new Pareto frontier for LLM decoding.
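To make the "hardware-aligned attention arithmetic intensity" claim concrete, the sketch below estimates the FLOPs-per-byte ratio of the attention step during decoding, where each generated token must stream the full KV cache from memory. This is a back-of-the-envelope illustration only: the head counts, dimensions, and context length are hypothetical placeholders, not the actual Step-3, DeepSeek-V3, or Qwen3 configurations, and it does not implement MFA or AFD themselves.

```python
# Illustrative sketch (not the paper's code): during decoding, attention cost
# is dominated by reading the KV cache, so a useful figure of merit is the
# arithmetic intensity -- FLOPs performed per byte of KV cache streamed.
# All numbers below are hypothetical, chosen only to show the trend.

def attention_decode_intensity(n_q_heads: int, n_kv_heads: int, head_dim: int,
                               context_len: int, bytes_per_elem: int = 1) -> float:
    """FLOPs per KV-cache byte for decoding ONE token at a given context length.

    bytes_per_elem=1 corresponds to an FP8 KV cache.
    """
    # FLOPs: each query head computes a (1 x d) @ (d x L) score matmul and a
    # (1 x L) @ (L x d) value matmul, roughly 4 * L * d FLOPs per head.
    flops = 4 * n_q_heads * context_len * head_dim
    # Bytes: both K and V for the whole context are read from HBM.
    kv_bytes = 2 * n_kv_heads * context_len * head_dim * bytes_per_elem
    return flops / kv_bytes

# Sharing KV heads across query heads (as MQA/GQA-style designs do) raises
# the reuse of each cached byte, and with it the arithmetic intensity.
for n_kv in (64, 8, 1):  # MHA-like, GQA-like, MQA-like sharing (hypothetical)
    ai = attention_decode_intensity(n_q_heads=64, n_kv_heads=n_kv,
                                    head_dim=128, context_len=8192)
    print(f"kv_heads={n_kv:2d} -> ~{ai:.0f} FLOPs per KV-cache byte")
```

The point of the exercise: an attention design whose FLOPs-per-byte ratio sits near the GPU's own compute-to-bandwidth ratio keeps the hardware busy during memory-bound decoding, which is the kind of alignment the abstract credits MFA with, while AFD addresses the complementary problem of attention and FFN layers having very different hardware profiles by serving them from separate subsystems.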