From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

Abstract

Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state of the art among open-source methods while remaining competitive with leading proprietary models.
