Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Abstract

Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture YOCO have shown promising performance gains over Transformers, prior work has not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss than a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
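
The abstract names the Gated Memory Unit but does not define it. As a rough illustration only, the sketch below assumes one plausible form: the current layer's hidden state is projected into a gate, the gate is applied element-wise to a memory readout state shared from an earlier SSM layer, and the result is projected back to the model dimension. The function names, the weight shapes, and the choice of SiLU as the gate activation are all assumptions made for this example, not details taken from the text above.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def silu(x: float) -> float:
    """SiLU activation, x * sigmoid(x). The activation choice is an assumption."""
    return x / (1.0 + math.exp(-x))


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def gated_memory_unit(x: Vector, memory: Vector,
                      w_gate: Matrix, w_out: Matrix) -> Vector:
    """Hypothetical GMU applied to a single token.

    x      : current-layer hidden state (length d)
    memory : memory readout state shared from an earlier SSM layer (length p)
    w_gate : (p x d) projection that turns x into a gate over the memory
    w_out  : (d x p) projection back to the model dimension
    """
    gate = [silu(g) for g in matvec(w_gate, x)]
    gated = [m * g for m, g in zip(memory, gate)]  # element-wise gating
    return matvec(w_out, gated)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # With identity projections the output is simply memory * silu(x).
    print(gated_memory_unit([1.0, 0.0], [2.0, 3.0], identity, identity))
```

Under these assumptions, the unit reuses a state that was already computed in the self-decoder and adds only per-token projections and an element-wise product, which is one way to read the abstract's claim of improved decoding efficiency.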
