
From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

February 25, 2026
Authors: Liangbing Zhao, Le Zhuo, Sayak Paul, Hongsheng Li, Mohamed Elhoseiny
cs.AI

Abstract

Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.
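Below is a minimal sketch of how "learnable transition queries" might supply timestep-adaptive visual guidance to a diffusion backbone, as described above. This is not the authors' implementation: the module names, dimensions, sinusoidal timestep embedding, and cross-attention layout are all assumptions, and a generic feature tensor stands in for the frozen Qwen2.5-VL reasoner.

```python
# Hedged sketch: learnable transition queries cross-attend into frozen VLM features
# and are modulated by the diffusion timestep, yielding per-step guidance tokens.
# All design choices here (dims, embedding, attention) are illustrative assumptions.
import math
import torch
import torch.nn as nn


def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of diffusion timesteps (assumed choice)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class TransitionQueries(nn.Module):
    """Learnable queries producing timestep-adaptive guidance tokens (illustrative)."""

    def __init__(self, num_queries: int = 32, dim: int = 768, vlm_dim: int = 1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.vlm_proj = nn.Linear(vlm_dim, dim)  # project frozen VLM features into query space
        self.time_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, vlm_feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # vlm_feats: (B, N, vlm_dim) frozen reasoning features; t: (B,) diffusion timesteps
        b = vlm_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Timestep-adaptive shift of the queries (assumed modulation scheme).
        q = q + self.time_mlp(timestep_embedding(t, q.size(-1)).to(q.dtype)).unsqueeze(1)
        kv = self.vlm_proj(vlm_feats)
        guidance, _ = self.cross_attn(q, kv, kv)
        return self.out(guidance)  # (B, num_queries, dim) guidance tokens


# Usage: the resulting tokens could be fed to the diffusion backbone's
# cross-attention layers alongside the text conditioning.
queries = TransitionQueries()
vlm_feats = torch.randn(2, 77, 1024)   # stand-in for frozen Qwen2.5-VL outputs
t = torch.randint(0, 1000, (2,))
tokens = queries(vlm_feats, t)         # -> torch.Size([2, 32, 768])
```

Under these assumptions, only the query, projection, and attention parameters would be trained, keeping the VLM frozen as the abstract describes.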