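TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Abstract

Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.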


