ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions
June 3, 2025
作者: Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
cs.AI
Abstract
Editing images with instructions to reflect non-rigid motions, camera
viewpoint shifts, object deformations, human articulations, and complex
interactions poses a challenging yet underexplored problem in computer vision.
Existing approaches and datasets predominantly focus on static scenes or rigid
transformations, limiting their capacity to handle expressive edits involving
dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive
framework for instruction-based image editing with an emphasis on non-rigid
motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong
baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher.
ByteMorph-6M includes over 6 million high-resolution image editing pairs for
training, along with a carefully curated evaluation benchmark, ByteMorph-Bench.
Both capture a wide variety of non-rigid motion types across diverse
environments, human figures, and object categories. The dataset is constructed
using motion-guided data generation, layered compositing techniques, and
automated captioning to ensure diversity, realism, and semantic coherence. We
further conduct a comprehensive evaluation of recent instruction-based image
editing methods from both academic and commercial domains.