PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

Abstract

Recent advances in video-based large language models (video LLMs) have produced diverse capabilities for reasoning about and interpreting dynamic visual content. Gameplay videos stand out as a distinctive data source because they often contain glitches that defy physical commonsense. This makes them an effective benchmark for assessing an under-explored capability of video LLMs: physical commonsense understanding. In this paper, we propose PhysGame, a pioneering benchmark for evaluating physical commonsense violations in gameplay videos. PhysGame comprises 880 videos containing glitches that span four fundamental domains (mechanics, kinematics, optics, and material properties) and 12 distinct categories of physical commonsense. Through extensive evaluation of state-of-the-art video LLMs, we find that the performance of current open-source video LLMs lags significantly behind that of their proprietary counterparts. To bridge this gap, we curate PhysInstruct, an instruction-tuning dataset with 140,057 question-answering pairs, to facilitate physical commonsense learning. In addition, we propose PhysDPO, a preference-optimization dataset with 34,358 training pairs, in which the dispreferred responses are generated conditioned on misleading titles (meta-information hacking), fewer frames (temporal hacking), and lower spatial resolutions (spatial hacking). Based on this suite of datasets, we propose PhysVLM, a video LLM enhanced with physical knowledge. Extensive experiments on both the physics-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.
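The three degradation strategies named for PhysDPO (meta-information, temporal, and spatial hacking) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `VideoSample` type, the function names, the toy nested-list frames, and the `answer_fn` callable standing in for a video LLM are all hypothetical and exist only to show how a degraded input could yield a dispreferred response.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Sequence

# Toy grayscale frame: a list of rows of pixel values (assumption for illustration).
Frame = List[List[int]]


@dataclass(frozen=True)
class VideoSample:
    """Hypothetical container for a gameplay clip and its title."""
    title: str
    frames: Sequence[Frame]


def meta_hack(sample: VideoSample, misleading_title: str) -> VideoSample:
    """Meta-information hacking: swap in a misleading title."""
    return replace(sample, title=misleading_title)


def temporal_hack(sample: VideoSample, keep: int) -> VideoSample:
    """Temporal hacking: keep only `keep` roughly uniformly spaced frames."""
    n = len(sample.frames)
    keep = max(1, min(keep, n))
    step = n / keep
    indices = [int(i * step) for i in range(keep)]
    return replace(sample, frames=[sample.frames[i] for i in indices])


def spatial_hack(sample: VideoSample, factor: int) -> VideoSample:
    """Spatial hacking: keep every `factor`-th row and column of each frame."""
    small = [[row[::factor] for row in frame[::factor]] for frame in sample.frames]
    return replace(sample, frames=small)


def build_pair(
    question: str,
    sample: VideoSample,
    degraded: VideoSample,
    answer_fn: Callable[[str, VideoSample], str],
) -> Dict[str, str]:
    """Pair the answer on the clean input (chosen) with the answer on the
    degraded input (rejected). `answer_fn` is a placeholder for a model call."""
    return {
        "prompt": question,
        "chosen": answer_fn(question, sample),
        "rejected": answer_fn(question, degraded),
    }
```

In this sketch, each strategy returns a new sample rather than mutating the original, so one clean clip can be paired with several differently degraded versions.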
