RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
April 8, 2026
Authors: Dewei Zhou, You Li, Zongxin Yang, Yi Yang
cs.AI
Abstract
We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
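The abstract's core mechanism, the Focus-and-Refine strategy, can be illustrated with a minimal sketch: crop a bounding box around the user's mask, resize the crop up to the model's fixed input resolution (so the resolution budget is spent on the target region), run the refiner, resize back, and paste only the masked pixels back with a feathered blend. The `refine_fn` below is a placeholder standing in for the actual diffusion refiner, and the function names and blending details are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def focus_and_refine(image, mask, refine_fn, model_res=512, blend_px=8):
    """Sketch of a Focus-and-Refine style pipeline (hypothetical API).

    image: (H, W, C) float array; mask: (H, W) bool array marking the region.
    refine_fn: placeholder for the diffusion refiner, mapping a
    fixed-resolution crop to its refined version.
    """
    # Bounding box of the user-specified region.
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1]

    # Crop-and-resize: upsample the crop to the model's input resolution
    # (nearest-neighbor for brevity), reallocating the fixed resolution
    # budget to the target region.
    ch, cw = crop.shape[:2]
    ry = np.arange(model_res) * ch // model_res
    rx = np.arange(model_res) * cw // model_res
    zoomed = crop[ry][:, rx]

    refined = refine_fn(zoomed)  # stand-in for the diffusion model

    # Downsample the refined result back to the original crop size.
    by = np.arange(ch) * model_res // ch
    bx = np.arange(cw) * model_res // cw
    refined_crop = refined[by][:, bx]

    # Blended-mask paste-back: feather the mask for a soft seam, then clamp
    # it to the original mask so unmasked pixels stay strictly unchanged.
    region = mask[y0:y1, x0:x1].astype(float)
    soft = region
    for _ in range(blend_px):
        soft = 0.25 * (np.roll(soft, 1, 0) + np.roll(soft, -1, 0)
                       + np.roll(soft, 1, 1) + np.roll(soft, -1, 1))
    soft = np.minimum(soft, region)  # zero outside the mask: exact background

    out = image.copy()
    out[y0:y1, x0:x1] = (soft[..., None] * refined_crop
                         + (1 - soft[..., None]) * crop)
    return out
```

Clamping the feathered weights to the binary mask is what guarantees the "strict background preservation" the abstract claims: blending happens only inside the region, while every pixel outside it is copied through bit-exactly.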