RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
April 8, 2026
Authors: Dewei Zhou, You Li, Zongxin Yang, Yi Yang
cs.AI
Abstract
We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on the counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
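To make the Focus-and-Refine idea concrete, the following is a minimal NumPy sketch of the crop-and-resize, refine, and blended-mask paste-back loop described in the abstract. The `refine_fn` callable is a hypothetical stand-in for the refinement model, `_resize` uses plain nearest-neighbor interpolation, and the feathering scheme is an illustrative assumption; only the overall structure (spend the fixed input resolution on the target region, then paste back with an alpha mask that is exactly zero outside the user region) follows the paper's description.

```python
import numpy as np

def _resize(img, out_h, out_w):
    # Nearest-neighbor resize; a real pipeline would use the model's
    # native interpolation. This just keeps the sketch dependency-free.
    h, w = img.shape[:2]
    return img[np.arange(out_h) * h // out_h][:, np.arange(out_w) * w // out_w]

def focus_and_refine(image, mask, refine_fn, target_size=512, feather=4):
    """Region-focused refine-and-paste-back (illustrative sketch).

    image:     (H, W, 3) float array in [0, 1]
    mask:      (H, W) {0, 1} array marking the user-specified region
    refine_fn: hypothetical stand-in for the refinement model; maps a
               (target_size, target_size, 3) crop to a refined crop
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1

    # Focus: crop the region and upscale it so the fixed input
    # resolution is spent entirely on the target area.
    crop = image[y0:y1, x0:x1]
    refined = _resize(refine_fn(_resize(crop, target_size, target_size)),
                      y1 - y0, x1 - x0)

    # Paste-back: place the refined crop on a full-size canvas.
    canvas = image.copy()
    canvas[y0:y1, x0:x1] = refined

    # Blended mask: feather the alpha inside the region to soften the
    # seam, then zero it outside the mask so every background pixel is
    # copied through unchanged (strict background preservation).
    alpha = mask.astype(float)
    for _ in range(feather):
        alpha = (alpha + np.roll(alpha, 1, 0) + np.roll(alpha, -1, 0)
                 + np.roll(alpha, 1, 1) + np.roll(alpha, -1, 1)) / 5.0
    alpha = (alpha * mask)[..., None]
    return alpha * canvas + (1.0 - alpha) * image
```

Because the feathered alpha is multiplied by the binary mask before blending, the blend weight is exactly zero on every background pixel, so the output reproduces the input background bit-for-bit while the seam is smoothed only on the inside of the user region.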