From Editor to Dense Geometry Estimator

Abstract

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has proven successful in dense prediction. However, dense prediction is inherently an image-to-image task, which suggests that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behavior of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features and ultimately to achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneers the adaptation of an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor to this deterministic task, we reformulate the editor's original flow matching loss into a "consistent velocity" training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demands of our tasks. Additionally, we leverage the DiT's global attention for cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to reinforce each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves performance gains of over 35% on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100 times more data.
The project page is available at https://amap-ml.github.io/FE2E/.
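
To make the precision conflict mentioned in the abstract concrete, the following is a minimal sketch, not the FE2E implementation. It assumes a hypothetical depth range (`D_MIN`, `D_MAX`), hypothetical linear and logarithmic encodings into [-1, 1], and simulates BFloat16 storage by truncating float32 values to their upper 16 bits. Under these assumptions, a linear code loses far more relative precision at close range than a logarithmic one.

```python
import numpy as np

# Assumed depth range in metres (illustrative, not taken from the paper).
D_MIN, D_MAX = 0.1, 80.0


def to_bf16(x):
    """Simulate BFloat16 storage by keeping the top 16 bits of float32 (truncation)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)


def encode_linear(d):
    # Map depth linearly from [D_MIN, D_MAX] to [-1, 1].
    return 2.0 * (d - D_MIN) / (D_MAX - D_MIN) - 1.0


def decode_linear(u):
    return (u + 1.0) / 2.0 * (D_MAX - D_MIN) + D_MIN


def encode_log(d):
    # Map log-depth linearly from [log D_MIN, log D_MAX] to [-1, 1].
    return 2.0 * np.log(d / D_MIN) / np.log(D_MAX / D_MIN) - 1.0


def decode_log(u):
    return D_MIN * np.exp((u + 1.0) / 2.0 * np.log(D_MAX / D_MIN))


def max_rel_error(encode, decode, depths):
    """Largest relative depth error after a round trip through simulated BFloat16."""
    stored = to_bf16(encode(depths)).astype(np.float64)
    restored = decode(stored)
    return float(np.max(np.abs(restored - depths) / depths))


# Close-range depths, where a linear code is coarsest in relative terms.
depths = np.linspace(0.5, 2.0, 1000)
lin_err = max_rel_error(encode_linear, decode_linear, depths)
log_err = max_rel_error(encode_log, decode_log, depths)

print(f"linear encoding, max relative error: {lin_err:.4f}")
print(f"log encoding,    max relative error: {log_err:.4f}")
```

The intuition is that BFloat16 keeps only 8 significant bits, so a fixed step in the encoded value becomes a fixed absolute depth error under the linear code but a roughly fixed relative depth error under the logarithmic code.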
