From Editor to Dense Geometry Estimator

Abstract

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, which suggests that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features and ultimately to achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that is the first to adapt an advanced editing model based on the Diffusion Transformer (DiT) architecture to dense geometry prediction. Specifically, to tailor the editor to this deterministic task, we reformulate the editor's original flow matching loss into a "consistent velocity" training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision our tasks demand. Additionally, we leverage the DiT's global attention to estimate depth and normals jointly in a single forward pass at no extra cost, so that their supervisory signals reinforce each other. Without scaling up the training data, FE2E achieves substantial performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves performance gains of over 35% on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100 times more data. The project page is available at https://amap-ml.github.io/FE2E/.
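The precision conflict mentioned above can be illustrated with a small numerical experiment. The sketch below is not the FE2E implementation: the depth range (`D_MIN`, `D_MAX`), the target interval [-1, 1], and all function names are assumptions chosen for illustration, and BFloat16 is emulated in pure Python by rounding a float32 to its upper 16 bits. It compares the worst-case relative depth error after a BFloat16 round trip when depth is encoded linearly versus logarithmically.

```python
import math
import struct

# Hypothetical depth range in metres (an assumption, not taken from the paper).
D_MIN, D_MAX = 0.1, 80.0


def to_bf16(x: float) -> float:
    """Round a float to the nearest BFloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the discarded 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def encode_linear(d: float) -> float:
    """Map depth linearly from [D_MIN, D_MAX] to [-1, 1]."""
    return 2.0 * (d - D_MIN) / (D_MAX - D_MIN) - 1.0


def decode_linear(u: float) -> float:
    return (u + 1.0) / 2.0 * (D_MAX - D_MIN) + D_MIN


def encode_log(d: float) -> float:
    """Map log-depth linearly from [log D_MIN, log D_MAX] to [-1, 1]."""
    return 2.0 * math.log(d / D_MIN) / math.log(D_MAX / D_MIN) - 1.0


def decode_log(u: float) -> float:
    return D_MIN * math.exp((u + 1.0) / 2.0 * math.log(D_MAX / D_MIN))


def max_relative_error(encode, decode, depths) -> float:
    """Worst relative depth error after storing the code in BFloat16."""
    return max(abs(decode(to_bf16(encode(d))) - d) / d for d in depths)


# Log-spaced test depths covering the whole range.
depths = [D_MIN * (D_MAX / D_MIN) ** (i / 199) for i in range(200)]
err_linear = max_relative_error(encode_linear, decode_linear, depths)
err_log = max_relative_error(encode_log, decode_log, depths)

print(f"max relative error, linear encoding: {err_linear:.4f}")
print(f"max relative error, log encoding:    {err_log:.4f}")
```

Under these assumptions, the linear encoding loses most of its relative accuracy at close range, because BFloat16 values near -1 are spaced coarsely compared with the small depth differences they must represent. The logarithmic encoding keeps the relative error roughly uniform across the range.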
