From Editor to Dense Geometry Estimator
September 4, 2025
Authors: JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao
cs.AI
Abstract
Leveraging visual priors from pre-trained text-to-image (T2I) generative
models has shown success in dense prediction. However, dense prediction is
inherently an image-to-image task, suggesting that image editing models, rather
than T2I generative models, may be a more suitable foundation for fine-tuning.
Motivated by this, we conduct a systematic analysis of the fine-tuning
behaviors of both editors and generators for dense geometry estimation. Our
findings show that editing models possess inherent structural priors, which
enable them to converge more stably by "refining" their innate features, and
ultimately achieve higher performance than their generative counterparts.
Based on these findings, we introduce FE2E, a framework that
pioneeringly adapts an advanced editing model based on Diffusion Transformer
(DiT) architecture for dense geometry prediction. Specifically, to tailor the
editor for this deterministic task, we reformulate the editor's original flow
matching loss into a "consistent velocity" training objective, and we use
logarithmic quantization to resolve the precision conflict between the editor's
native BFloat16 format and the high-precision demands of our tasks.
Additionally, we leverage the DiT's global attention to jointly estimate depth
and normals in a single forward pass at no extra cost, allowing their
supervisory signals to enhance each other.
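
As a rough illustration of the logarithmic quantization idea mentioned above (the abstract does not spell out the exact encoding, so the function names, depth bounds, and normalization below are assumptions, not the paper's implementation), a minimal sketch might map metric depth to a log-spaced code before casting to BFloat16, so that the quantization error grows with depth and the relative error stays roughly constant:

```python
import math
import torch

def log_quantize_depth(depth: torch.Tensor,
                       d_min: float = 0.1,
                       d_max: float = 80.0) -> torch.Tensor:
    """Map metric depth to a [0, 1] code on a logarithmic scale, then cast to bfloat16.

    On a log scale the quantization step grows with depth, so the relative
    depth error stays roughly constant after the low-precision cast, instead
    of near-field detail being lost as with a linear encoding.
    d_min / d_max are assumed scene bounds, not values from the paper.
    """
    depth = depth.clamp(d_min, d_max)
    code = (depth.log() - math.log(d_min)) / (math.log(d_max) - math.log(d_min))
    return code.to(torch.bfloat16)

def log_dequantize_depth(code: torch.Tensor,
                         d_min: float = 0.1,
                         d_max: float = 80.0) -> torch.Tensor:
    """Invert the logarithmic encoding back to metric depth."""
    code = code.float().clamp(0.0, 1.0)
    return torch.exp(code * (math.log(d_max) - math.log(d_min)) + math.log(d_min))
```

Round-tripping a depth map through these two functions yields a BFloat16-safe representation whose error scales with depth, which is the property a logarithmic encoding is typically chosen for.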
Without scaling up the training data, FE2E achieves impressive performance
improvements in zero-shot monocular depth and normal estimation across multiple
datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset
and outperforms the DepthAnything series, which is trained on 100× the data.
The project page is available at https://amap-ml.github.io/FE2E/.