VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. Because the model must produce runnable code, the inferred world representation is directly inspectable, editable, and falsifiable, which separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates, together with a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of cases on the benchmark. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to infer physical parameters accurately and to simulate consistent physical dynamics.
