Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Abstract

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive because of growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to a 2.59x speedup without model retraining while maintaining competitive visual quality.