RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Abstract

We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details in that region while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on the counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region. A blended-mask paste-back then guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project page: https://limuloo.github.io/RefineAnything/
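The crop, refine, and paste-back flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `resize_nearest`, `focus_and_refine`, `refine_fn`, and `work_size` are hypothetical, nearest-neighbour resizing stands in for whatever resampling the model uses, and `refine_fn` is a placeholder for the diffusion model. The sketch only shows how the fixed working resolution is spent entirely on the target region, and why pixels where the blend mask is zero come back unchanged.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows][:, cols]


def focus_and_refine(image, box, mask, refine_fn, work_size=64):
    """Toy crop -> resize -> refine -> paste-back pipeline (illustrative only).

    image:     float array, (H, W) or (H, W, C)
    box:       (y0, x0, y1, x1) crop around the user-specified region
    mask:      float array (H, W) in [0, 1]; 1 = take refined pixel, 0 = keep original
    refine_fn: placeholder for the refinement model (assumption, not the real model)
    work_size: stand-in for the fixed model input resolution (assumption)
    """
    y0, x0, y1, x1 = box
    crop = image[y0:y1, x0:x1]

    # 1) Focus: give the whole fixed working resolution to the cropped region.
    focused = resize_nearest(crop, work_size, work_size)

    # 2) Refine: the model only ever sees the focused crop.
    refined = refine_fn(focused)

    # 3) Bring the refined crop back to its original size.
    restored = resize_nearest(refined, y1 - y0, x1 - x0)

    # 4) Paste back with a blend mask; pixels outside the box are never written.
    m = mask[y0:y1, x0:x1]
    if image.ndim == 3:
        m = m[..., None]  # broadcast the mask over colour channels
    out = image.copy()
    out[y0:y1, x0:x1] = m * restored + (1.0 - m) * crop
    return out


# Example: a dummy "refiner" that brightens the region it is given.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
mask = np.zeros((32, 32))
mask[10:22, 10:22] = 1.0
out = focus_and_refine(image, (8, 8, 24, 24), mask, lambda x: x + 1.0)
```

In this sketch every pixel with a mask value of zero is returned bit-for-bit identical to the input, which is the property the blended-mask paste-back is meant to provide. A soft (feathered) mask with values between 0 and 1 near the region border would blend the refined crop into its surroundings instead of producing a hard edge.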
