RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Abstract

We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on the counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region. A blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
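The crop, refine, and paste-back flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the `refine_fn` callback (a stand-in for the diffusion model), and the `work_size`, `margin`, and `feather` parameters are all hypothetical. Nearest-neighbour resizing and a box-blur feather are simple substitutes chosen to keep the example dependency-free. Clipping the blended alpha to the original mask is an assumption about one way strict background preservation could be obtained.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]


def box_blur(mask, radius):
    """Separable box blur of a 2-D float array, using edge padding."""
    if radius <= 0:
        return mask
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(mask, radius, mode="edge")
    tmp = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, tmp)


def focus_and_refine(image, mask, refine_fn, work_size=64, margin=4, feather=2):
    """Crop around the mask, refine at a fixed working size, paste back.

    image:     (H, W, C) float array
    mask:      (H, W) bool array, True marks the region to refine
    refine_fn: hypothetical callable mapping a (work_size, work_size, C)
               array to a refined array of the same shape
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return image.copy()  # nothing to refine

    # Bounding box of the mask plus a context margin, clamped to the image.
    h, w = mask.shape
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + 1 + margin, h)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + 1 + margin, w)

    # Focus: spend the whole working resolution on the cropped region.
    crop = image[y0:y1, x0:x1]
    focused = resize_nearest(crop, work_size, work_size)
    refined = refine_fn(focused)
    restored = resize_nearest(refined, y1 - y0, x1 - x0)

    # Blended mask: feather the edge, then zero it outside the user mask
    # so that every pixel outside the mask is reproduced exactly.
    crop_mask = mask[y0:y1, x0:x1].astype(float)
    alpha = (box_blur(crop_mask, feather) * crop_mask)[..., None]

    out = image.copy()
    out[y0:y1, x0:x1] = alpha * restored + (1.0 - alpha) * crop
    return out
```

With this construction, pixels outside the mask are bit-identical to the input, and the blend weight ramps up only inside the mask boundary. A real system would replace `refine_fn` with the refinement model and use a higher-quality resampler.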
