DepthFM: Fast Monocular Depth Estimation with Flow Matching

Abstract

Monocular depth estimation is crucial for numerous downstream vision tasks and applications. Current discriminative approaches to this problem are limited by blurry artifacts, while state-of-the-art generative methods suffer from slow sampling due to their SDE nature. Rather than starting from noise, we seek a direct mapping from input image to depth map. We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality. Our study demonstrates that a pre-trained image diffusion model can serve as an adequate prior for a flow matching depth model, allowing efficient training on only synthetic data while still generalizing to real images. We find that an auxiliary surface normals loss further improves the depth estimates. Due to the generative nature of our approach, our model reliably predicts the confidence of its depth estimates. On standard benchmarks of complex natural scenes, our lightweight approach exhibits state-of-the-art performance at low computational cost, despite being trained on only a small amount of synthetic data.
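The appeal of straight trajectories can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the array shapes, the function names (`interpolate`, `flow_matching_loss`, `euler_sample`), and the `oracle` velocity field are all hypothetical stand-ins. The sketch assumes the generic straight-path flow matching formulation, in which a point on the path is x_t = (1 - t) x_0 + t x_1 and the regression target is the constant velocity x_1 - x_0, with x_0 standing in for an image representation and x_1 for a depth representation.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return x1 - x0

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target_velocity(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, num_steps=2):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Hypothetical stand-ins for an image representation and its depth target.
rng = np.random.default_rng(0)
image_latent = rng.normal(size=(4, 8, 8))
depth_latent = rng.normal(size=(4, 8, 8))

def oracle(x, t):
    """Idealized velocity field; a trained network would approximate this."""
    return depth_latent - image_latent

loss = flow_matching_loss(oracle, image_latent, depth_latent, t=0.3)
one_step = euler_sample(oracle, image_latent, num_steps=1)
```

With an idealized velocity field the path is exactly straight, so a single Euler step already lands on the target. A learned field is only approximately straight, which is why a small number of steps may still be used in practice.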
