From Editor to Dense Geometry Estimator
September 4, 2025
Authors: JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao
cs.AI
Abstract
Leveraging visual priors from pre-trained text-to-image (T2I) generative
models has shown success in dense prediction. However, dense prediction is
inherently an image-to-image task, suggesting that image editing models, rather
than T2I generative models, may be a more suitable foundation for fine-tuning.
Motivated by this, we conduct a systematic analysis of the fine-tuning
behaviors of both editors and generators for dense geometry estimation. Our
findings show that editing models possess inherent structural priors, which
enable them to converge more stably by "refining" their innate features, and
ultimately achieve higher performance than their generative counterparts.
Based on these findings, we introduce FE2E, a framework that
pioneeringly adapts an advanced editing model based on Diffusion Transformer
(DiT) architecture for dense geometry prediction. Specifically, to tailor the
editor for this deterministic task, we reformulate the editor's original flow
matching loss into a "consistent velocity" training objective, and we use
logarithmic quantization to resolve the precision conflict between the editor's
native BFloat16 format and the high-precision demands of our tasks.
Additionally, we leverage the DiT's global attention to jointly estimate depth
and normals in a single forward pass at no extra cost, allowing their
supervisory signals to enhance each other.
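
As a rough illustration of the logarithmic quantization idea mentioned above (the abstract does not spell out the exact encoding, so the function names, depth bounds, and normalization below are assumptions, not the paper's implementation), a minimal sketch might map metric depth to a log-spaced code before casting to BFloat16, so that the quantization error grows with depth and the relative error stays roughly constant:

```python
import math
import torch

def log_quantize_depth(depth: torch.Tensor,
                       d_min: float = 0.1,
                       d_max: float = 80.0) -> torch.Tensor:
    """Map metric depth to a [0, 1] code on a logarithmic scale, then cast to bfloat16.

    On a log scale the quantization step grows with depth, so the relative
    depth error stays roughly constant after the low-precision cast, instead
    of near-field detail being lost as with a linear encoding.
    d_min / d_max are assumed scene bounds, not values from the paper.
    """
    depth = depth.clamp(d_min, d_max)
    code = (depth.log() - math.log(d_min)) / (math.log(d_max) - math.log(d_min))
    return code.to(torch.bfloat16)

def log_dequantize_depth(code: torch.Tensor,
                         d_min: float = 0.1,
                         d_max: float = 80.0) -> torch.Tensor:
    """Invert the logarithmic encoding back to metric depth."""
    code = code.float().clamp(0.0, 1.0)
    return torch.exp(code * (math.log(d_max) - math.log(d_min)) + math.log(d_min))
```

Round-tripping a depth map through these two functions yields a BFloat16-safe representation whose error scales with depth, which is the property a logarithmic encoding is typically chosen for.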
Without scaling up the training data, FE2E achieves impressive performance
improvements in zero-shot monocular depth and normal estimation across multiple
datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset
and outperforms the DepthAnything series, which is trained on 100× the data.
The project page is available at https://amap-ml.github.io/FE2E/.