Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Abstract

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
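
The abstract does not spell out how complementary mask splitting works, so the following is only a minimal sketch of one plausible reading of CompDemo: the masked positions are divided into two disjoint halves, and the teacher is queried twice, each time with one half hidden and the other half revealed from the training sequence. The names `MASK`, `complementary_split`, and `teacher_views` are hypothetical and do not come from the paper, and the even random split is an assumption.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def complementary_split(masked_positions, seed=0):
    """Split masked positions into two disjoint halves covering the full set."""
    rng = random.Random(seed)
    shuffled = list(masked_positions)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


def teacher_views(tokens, masked_positions, seed=0):
    """Build two teacher inputs; each hides one half and reveals the other.

    Assumes the clean training sequence `tokens` is available, so the
    revealed half can be filled in with ground-truth tokens.
    """
    part_a, part_b = complementary_split(masked_positions, seed)
    views = []
    for hidden in (part_a, part_b):
        hidden_set = set(hidden)
        view = [MASK if i in hidden_set else tok for i, tok in enumerate(tokens)]
        views.append((view, hidden))
    return views


if __name__ == "__main__":
    tokens = "def add ( a , b ) : return a + b".split()
    for view, hidden in teacher_views(tokens, [1, 3, 5, 8, 9, 10]):
        print(hidden, " ".join(view))
```

Under this reading, each teacher pass sees roughly half as many masked tokens as the student does, which is one way the teacher's context could be "enriched" under heavy masking. The teacher predictions for the two hidden halves would then be combined to supervise every position the student has masked.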
