From Editor to Dense Geometry Estimator

Abstract

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, which suggests that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneers the adaptation of an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into a "consistent velocity" training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demands of our tasks. Additionally, we leverage the DiT's global attention for cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data. The project page is available at https://amap-ml.github.io/FE2E/.
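The precision conflict between BFloat16 and depth values can be illustrated with a small sketch. The abstract does not spell out the quantization scheme, so the depth range, the normalization to [-1, 1], and all function names below are assumptions made for illustration only, not the authors' implementation. The sketch simulates BFloat16 rounding in pure Python and compares the round-trip error of a linear depth encoding with that of a logarithmic one.

```python
import math
import struct

# Hypothetical depth range in meters (an assumption, not taken from the paper).
D_MIN, D_MAX = 0.1, 80.0


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; round the discarded half.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) & 0xFFFFFFFF) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def encode_linear(d: float) -> float:
    """Map depth linearly from [D_MIN, D_MAX] to [-1, 1]."""
    return 2.0 * (d - D_MIN) / (D_MAX - D_MIN) - 1.0


def decode_linear(v: float) -> float:
    return (v + 1.0) / 2.0 * (D_MAX - D_MIN) + D_MIN


def encode_log(d: float) -> float:
    """Map log-depth linearly from [log D_MIN, log D_MAX] to [-1, 1]."""
    lo, hi = math.log(D_MIN), math.log(D_MAX)
    return 2.0 * (math.log(d) - lo) / (hi - lo) - 1.0


def decode_log(v: float) -> float:
    lo, hi = math.log(D_MIN), math.log(D_MAX)
    return math.exp((v + 1.0) / 2.0 * (hi - lo) + lo)


def roundtrip_rel_error(d: float, encode, decode) -> float:
    """Relative depth error after storing the encoded value in bfloat16."""
    return abs(decode(to_bf16(encode(d))) - d) / d


if __name__ == "__main__":
    for depth in (0.3, 0.5, 1.0, 5.0, 20.0, 70.0):
        lin = roundtrip_rel_error(depth, encode_linear, decode_linear)
        log = roundtrip_rel_error(depth, encode_log, decode_log)
        print(f"depth={depth:6.2f} m  linear={lin:.4%}  log={log:.4%}")
```

Under these assumptions, the linear encoding misplaces a point at 0.5 m by more than 10 percent of its depth, because BFloat16 values near -1 are spaced 2^-8 apart and that spacing is stretched over the full 80 m range. The logarithmic encoding keeps the relative error below 1 percent across the whole range, since a fixed error in log-depth corresponds to a fixed relative error in depth.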
