Taylor-Calibrate: A Principled Initialization for Hybrid Linear Attention Distillation

Abstract

Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.
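To make the idea of setting a memory timescale from teacher attention statistics more concrete, the following is a minimal illustrative sketch and not the procedure used by Taylor-Calibrate, which the abstract does not specify. It assumes one simple statistic, the attention-weighted mean lookback distance of a causal attention head, and maps it to a per-step decay by matching the mean lag of a geometric kernel. The function names (causal_softmax_rows, mean_lookback, decay_from_lookback) are hypothetical, and the mapping ignores truncation at the start of the sequence.

```python
import math


def causal_softmax_rows(scores):
    """Turn a square score matrix into causal attention rows (each sums to 1)."""
    rows = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                     # query t sees keys 0..t only
        m = max(visible)                           # subtract max for stability
        exps = [math.exp(x - m) for x in visible]
        z = sum(exps)
        rows.append([e / z for e in exps] + [0.0] * (len(row) - t - 1))
    return rows


def mean_lookback(attn):
    """Attention-weighted mean distance t - s, averaged over all queries t.

    Assumption: attn[t][s] is the weight of query t on key s, zero for s > t.
    """
    total = 0.0
    for t, row in enumerate(attn):
        total += sum(w * (t - s) for s, w in enumerate(row[: t + 1]))
    return total / len(attn)


def decay_from_lookback(tau):
    """Per-step decay alpha whose kernel alpha**d (d = 0, 1, ...) has mean lag tau.

    For weights proportional to alpha**d the mean lag is alpha / (1 - alpha),
    so alpha = tau / (1 + tau). This is an illustrative choice, not the
    paper's rule.
    """
    return tau / (1.0 + tau)


if __name__ == "__main__":
    # Hypothetical teacher head: all-zero scores give uniform causal attention.
    scores = [[0.0] * 4 for _ in range(4)]
    attn = causal_softmax_rows(scores)
    tau = mean_lookback(attn)
    print("mean lookback:", tau)
    print("matched decay:", decay_from_lookback(tau))
```

Under this assumed mapping, a head that attends mostly to recent tokens yields a small lookback and a decay near 0 (short memory), while a head that spreads attention over distant tokens yields a decay closer to 1.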