Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model
November 30, 2025
Authors: Jing He, Haodong Li, Mingzhi Sheng, Ying-Cong Chen
cs.AI
Abstract
Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and the non-injective mapping between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality, and diversity of available data and by their limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate, and fine-grained geometric dense prediction, which aims to provide an optimal adaptation protocol that fully exploits pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of the size of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and achieves highly competitive results in surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.
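To make the second-stage idea more concrete, the following is a minimal toy sketch of noise-free, deterministic rectified-flow refinement. It is not the authors' implementation: the function names (`flow_matching_pair`, `euler_refine`), the straight-line path between a coarse and a refined prediction, the Euler integrator, and the step count are all assumptions chosen for illustration, and an oracle velocity stands in for the trained network.

```python
import numpy as np


def flow_matching_pair(coarse, target, t):
    """Return a point on the straight path coarse -> target and its velocity.

    Hypothetical helper: in rectified flow the path is linear, so the
    regression target for the velocity is simply (target - coarse).
    No noise is sampled anywhere, so the pair is fully deterministic.
    """
    x_t = (1.0 - t) * coarse + t * target
    v_target = target - coarse
    return x_t, v_target


def euler_refine(coarse, velocity_fn, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps.

    Starts from the coarse prediction itself (assumed here to come from a
    single-step first stage) instead of from Gaussian noise.
    """
    x = np.array(coarse, dtype=np.float64, copy=True)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    # Toy "depth maps": a smooth coarse map and a target with extra detail.
    rng = np.random.default_rng(0)
    coarse = np.linspace(0.0, 1.0, 16).reshape(4, 4)
    target = coarse + 0.05 * rng.standard_normal((4, 4))

    # Oracle velocity standing in for a trained network (assumption).
    def oracle_velocity(x, t):
        return target - coarse

    refined = euler_refine(coarse, oracle_velocity, num_steps=4)
    print("max abs error:", np.abs(refined - target).max())
```

With a constant velocity along a straight path, a few Euler steps already reach the endpoint. In practice a learned velocity field would only approximate this, which is presumably why a multi-step refinement is used.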