Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding

Abstract

Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with a hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) a novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA (4K context, FP8, no MTP). This is higher than DeepSeek-V3's 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.
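To put the quoted throughput figures in perspective, the following back-of-the-envelope sketch computes the throughput ratio between the two systems and the number of concurrently decoded sequences per GPU that the figures would imply. The function names are hypothetical, and the concurrency estimate rests on an assumption that is not stated in the abstract: that every sequence emits exactly one token per 50 ms TPOT interval, averaged over all GPUs in the deployment.

```python
# Figures quoted in the abstract: tokens per second per GPU
# (4K context, FP8, no MTP, 50 ms TPOT SLA).
STEP3_TOKENS_PER_SEC = 4039
DEEPSEEK_V3_TOKENS_PER_SEC = 2324
TPOT_SLA_SECONDS = 0.050  # 50 ms time per output token


def speedup(new: float, baseline: float) -> float:
    """Throughput ratio of `new` over `baseline`."""
    return new / baseline


def implied_concurrent_sequences(tokens_per_sec: float, tpot_seconds: float) -> float:
    """Sequences in flight per GPU, ASSUMING each sequence produces exactly
    one token per TPOT interval (an illustrative assumption, not a claim
    made in the abstract)."""
    return tokens_per_sec * tpot_seconds


if __name__ == "__main__":
    ratio = speedup(STEP3_TOKENS_PER_SEC, DEEPSEEK_V3_TOKENS_PER_SEC)
    print(f"Step-3 vs DeepSeek-V3 throughput ratio: {ratio:.2f}x")
    for name, tps in [("Step-3", STEP3_TOKENS_PER_SEC),
                      ("DeepSeek-V3", DEEPSEEK_V3_TOKENS_PER_SEC)]:
        seqs = implied_concurrent_sequences(tps, TPOT_SLA_SECONDS)
        print(f"{name}: about {seqs:.0f} sequences per GPU at the SLA bound")
```

Under that assumption, the ratio works out to roughly 1.74, and the per-GPU throughput corresponds to about 200 sequences in flight for Step-3 versus about 116 for DeepSeek-V3.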
