Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Abstract

Video diffusion techniques have advanced significantly in recent years; however, they struggle to generate realistic imagery of car crashes because accident events are scarce in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To address this, we propose Ctrl-Crash, a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation, where minor variations in the input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with independently tunable scales for each conditioning signal. Compared to prior diffusion-based methods, Ctrl-Crash achieves state-of-the-art performance on quantitative video quality metrics (e.g., FVD and JEDi) and on qualitative measurements based on a human evaluation of physical realism and video quality.
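
To make the idea of per-signal guidance scales concrete, the following is a minimal sketch of one common way to combine classifier-free guidance terms when there are several conditioning signals. It is an illustration under stated assumptions, not the formulation used by Ctrl-Crash: the additive rule, the function name `combine_guidance`, and the signal names `bbox` and `crash_type` are all hypothetical, and noise predictions are represented as flat lists of floats rather than video tensors.

```python
from typing import Dict, List, Sequence


def combine_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Dict[str, Sequence[float]],
    scales: Dict[str, float],
) -> List[float]:
    """Combine noise predictions with one guidance scale per signal.

    Assumed (hypothetical) rule, applied element-wise:
        eps = eps_uncond + sum_i scale_i * (eps_cond_i - eps_uncond)
    where eps_cond_i is the prediction with only signal i enabled.
    """
    missing = set(eps_cond) - set(scales)
    if missing:
        raise KeyError(f"no guidance scale given for: {sorted(missing)}")

    guided = list(eps_uncond)
    for name, eps in eps_cond.items():
        if len(eps) != len(eps_uncond):
            raise ValueError(f"shape mismatch for signal '{name}'")
        scale = scales[name]
        # Push the prediction along the direction this signal suggests.
        for i, (e_c, e_u) in enumerate(zip(eps, eps_uncond)):
            guided[i] += scale * (e_c - e_u)
    return guided


# Toy usage: two signals, each with its own scale.
eps_uncond = [0.0, 1.0]
eps_cond = {"bbox": [1.0, 1.0], "crash_type": [0.0, 3.0]}
guided = combine_guidance(eps_uncond, eps_cond, {"bbox": 2.0, "crash_type": 0.5})
```

Under this assumed rule, setting a signal's scale to 0 removes its influence entirely, while raising one scale strengthens that signal without changing the contribution of the others, which is the kind of independent control the abstract describes.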