From Editor to Dense Geometry Estimator
September 4, 2025
Authors: JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao
cs.AI
Abstract
Leveraging visual priors from pre-trained text-to-image (T2I) generative
models has shown success in dense prediction. However, dense prediction is
inherently an image-to-image task, suggesting that image editing models, rather
than T2I generative models, may be a more suitable foundation for fine-tuning.
Motivated by this, we conduct a systematic analysis of the fine-tuning
behaviors of both editors and generators for dense geometry estimation. Our
findings show that editing models possess inherent structural priors, which
enable them to converge more stably by "refining" their innate features, and
ultimately achieve higher performance than their generative counterparts.
Based on these findings, we introduce FE2E, a framework that
is the first to adapt an advanced editing model built on the Diffusion
Transformer (DiT) architecture for dense geometry prediction. Specifically, to
tailor the editor for this deterministic task, we reformulate the editor's
original flow matching loss into the "consistent velocity" training objective.
We also use logarithmic quantization to resolve the precision conflict between
the editor's native BFloat16 format and the high-precision demands of our tasks.
Additionally, we leverage the DiT's global attention for a cost-free joint
estimation of depth and normals in a single forward pass, enabling their
supervisory signals to reinforce each other.
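The abstract does not spell out the quantization scheme, so the following is a
minimal sketch of the general idea behind logarithmic quantization of depth; the
function names, depth range, and number of levels are illustrative assumptions,
not the paper's actual settings. Mapping depth to log space before discretizing
spreads quantization error evenly across near and far ranges, whereas a
limited-precision format such as BFloat16 would otherwise lose most resolution
on distant values.

```python
import numpy as np

# Illustrative parameters, not taken from the paper.
D_MIN, D_MAX, LEVELS = 1e-3, 80.0, 65536

def log_quantize(depth):
    """Map metric depth to integer codes on a logarithmic scale, so that
    relative quantization error is roughly uniform over the depth range."""
    depth = np.clip(depth, D_MIN, D_MAX)
    # Normalized log-depth in [0, 1].
    t = (np.log(depth) - np.log(D_MIN)) / (np.log(D_MAX) - np.log(D_MIN))
    return np.round(t * (LEVELS - 1)).astype(np.int64)

def log_dequantize(codes):
    """Invert log_quantize back to metric depth."""
    t = codes.astype(np.float64) / (LEVELS - 1)
    return np.exp(t * (np.log(D_MAX) - np.log(D_MIN)) + np.log(D_MIN))

# Round-trip check: relative error stays small and roughly constant with depth.
depth = np.random.uniform(0.5, 60.0, size=(480, 640))
restored = log_dequantize(log_quantize(depth))
print(np.max(np.abs(restored - depth) / depth))
```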
Without scaling up the training data, FE2E achieves impressive performance
improvements in zero-shot monocular depth and normal estimation across multiple
datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset
and outperforms the DepthAnything series, which is trained on 100× the data.
The project page is available at https://amap-ml.github.io/FE2E/.