Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Abstract

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which opens an architectural gap when full attention is replaced by causal attention. Existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity: each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, so the student cannot recover the teacher's flow map and instead converges to a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an AR teacher for ODE initialization and thereby bridges the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the state-of-the-art Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and code: https://thu-ml.github.io/CausalForcing.github.io/
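The link between a violated injectivity condition and a conditional-expectation solution can be illustrated with a toy regression. The sketch below is not the paper's method or code. It only demonstrates the general fact that when one input is paired with several different targets, the least-squares optimum is their mean rather than any single target. The scalar "frames", the two targets, the search grid, and the helper names `mse` and `best_constant` are all hypothetical choices made for illustration.

```python
# Toy illustration (assumption: scalar values stand in for video frames).
# When one input is paired with several targets, the least-squares
# prediction is the conditional mean, not any individual target.

def mse(prediction, targets):
    """Mean squared error of a single prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

def best_constant(targets, grid):
    """Return the grid value with the lowest mean squared error."""
    return min(grid, key=lambda c: mse(c, targets))

# Candidate predictions from -2.00 to 2.00 in steps of 0.01.
grid = [i / 100 for i in range(-200, 201)]

# Non-injective case: the same noisy input is paired with two clean targets.
ambiguous_targets = [-1.0, 1.0]
averaged = best_constant(ambiguous_targets, grid)
conditional_mean = sum(ambiguous_targets) / len(ambiguous_targets)

# Injective case: the input has exactly one clean target.
unique_targets = [1.0]
recovered = best_constant(unique_targets, grid)

print(averaged, conditional_mean, recovered)
```

In the ambiguous case the best prediction sits between the two targets and matches neither, which is a scalar analogue of the degraded solution the abstract describes. In the injective case the regression recovers the target exactly.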
