Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

Abstract

Leveraging the visual priors of pre-trained text-to-image diffusion models offers a promising way to improve zero-shot generalization in dense prediction tasks. However, existing methods often adopt the original diffusion formulation uncritically, and it may not be optimal given the fundamental differences between dense prediction and image generation. In this paper, we provide a systematic analysis of the diffusion formulation for dense prediction, focusing on both quality and efficiency. We find that the original parameterization used for image generation, which learns to predict noise, is harmful for dense prediction, and that the multi-step noising/denoising diffusion process is unnecessary and difficult to optimize. Based on these insights, we introduce Lotus, a diffusion-based visual foundation model with a simple yet effective adaptation protocol for dense prediction. Specifically, Lotus is trained to predict annotations directly instead of noise, thereby avoiding harmful variance. We also reformulate the diffusion process as a single-step procedure, which simplifies optimization and significantly increases inference speed. Additionally, we introduce a novel tuning strategy called detail preserver, which yields more accurate and fine-grained predictions. Without scaling up the training data or model capacity, Lotus achieves state-of-the-art performance in zero-shot depth and normal estimation across various datasets. It is also far more efficient, running hundreds of times faster than most existing diffusion-based methods.
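
The difference between predicting noise and predicting the annotation directly can be illustrated with the standard DDPM forward process. The sketch below is not taken from the Lotus code base: the function names, the scalar inputs, and the `alpha_bar` symbol (the cumulative noise-schedule product) are assumptions chosen for illustration. It shows the two possible regression targets and the algebraic fact that, when a noise prediction is converted back to a clean estimate, any error in the predicted noise is scaled by sqrt((1 - alpha_bar) / alpha_bar), which grows as the noise level increases.

```python
import math


def add_noise(z0, eps, alpha_bar):
    """Standard DDPM forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    return math.sqrt(alpha_bar) * z0 + math.sqrt(1.0 - alpha_bar) * eps


def training_target(z0, eps, parameterization):
    """Return the quantity the network is asked to regress.

    "epsilon" is the usual image-generation target (the added noise);
    "x0" is the clean signal itself, e.g. an encoded annotation.
    """
    if parameterization == "epsilon":
        return eps
    if parameterization == "x0":
        return z0
    raise ValueError(f"unknown parameterization: {parameterization}")


def x0_from_epsilon(z_t, eps_pred, alpha_bar):
    """Recover a clean estimate from a noise prediction by inverting add_noise."""
    return (z_t - math.sqrt(1.0 - alpha_bar) * eps_pred) / math.sqrt(alpha_bar)


def x0_error_from_epsilon_error(eps_error, alpha_bar):
    """Error induced in the x0 estimate by an error in the noise prediction."""
    return -eps_error * math.sqrt((1.0 - alpha_bar) / alpha_bar)


if __name__ == "__main__":
    z0, eps = 0.5, -1.2          # hypothetical clean value and noise sample
    for alpha_bar in (0.9, 0.5, 0.01):
        z_t = add_noise(z0, eps, alpha_bar)
        noisy_estimate = x0_from_epsilon(z_t, eps + 0.1, alpha_bar)
        print(alpha_bar, noisy_estimate - z0)
```

With an "x0" target the network output is the prediction itself, so no such rescaling step is involved. In a single-step setting, the timestep would simply be held fixed rather than sampled; how Lotus chooses that timestep is not specified in this abstract.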
