Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Abstract

Video diffusion techniques have advanced significantly in recent years, but they still struggle to generate realistic car crash footage because accident events are scarce in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To address this gap, we propose Ctrl-Crash, a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation, in which minor variations in the input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with an independently tunable scale for each conditioning signal. Compared with prior diffusion-based methods, Ctrl-Crash achieves state-of-the-art performance on quantitative video quality metrics (e.g., FVD and JEDi) and on qualitative measures based on a human evaluation of physical realism and video quality.
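To illustrate classifier-free guidance with a separate scale per conditioning signal, the following minimal Python sketch combines three noise predictions from a denoiser. The nested decomposition (unconditional, bounding boxes only, bounding boxes plus crash type), the function name, and the argument names are assumptions made for illustration. The abstract does not specify the exact formulation used by Ctrl-Crash.

```python
def multi_condition_cfg(eps_uncond, eps_bbox, eps_full, scale_bbox, scale_crash):
    """Combine noise predictions using one guidance scale per signal.

    Hypothetical inputs (floats or arrays of matching shape):
      eps_uncond: prediction with both conditions dropped
      eps_bbox:   prediction conditioned on bounding boxes only
      eps_full:   prediction conditioned on bounding boxes and crash type
    """
    # Each difference isolates the direction contributed by one signal,
    # so each scale strengthens or weakens that signal independently.
    bbox_direction = eps_bbox - eps_uncond
    crash_direction = eps_full - eps_bbox
    return eps_uncond + scale_bbox * bbox_direction + scale_crash * crash_direction
```

With both scales set to 1 the result equals the fully conditioned prediction, and with both set to 0 it falls back to the unconditional one. Raising only one scale pushes the sample harder toward that signal while leaving the other unchanged.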