

From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios

June 25, 2025
Authors: Changliang Xia, Chengyou Jia, Zhuohang Dang, Minnan Luo
cs.AI

Abstract

Dense prediction tasks hold significant importance in computer vision, aiming to learn pixel-wise annotated labels for an input image. Despite advances in this field, existing methods primarily focus on idealized conditions, generalize poorly to real-world scenarios, and face a challenging scarcity of real-world data. To study this problem systematically, we first introduce DenseWorld, a benchmark spanning 25 dense prediction tasks that correspond to urgent real-world applications, with unified evaluation across tasks. We then propose DenseDiT, which maximally exploits generative models' visual priors to perform diverse real-world dense prediction tasks through a unified strategy. DenseDiT combines a parameter-reuse mechanism with two lightweight branches that adaptively integrate multi-scale context, adding fewer than 0.1% extra parameters. Evaluations on DenseWorld reveal significant performance drops in existing general and specialized baselines, highlighting their limited real-world generalization. In contrast, DenseDiT achieves superior results using less than 0.01% of the baselines' training data, underscoring its practical value for real-world deployment. Our data, checkpoints, and code are available at https://xcltql666.github.io/DenseDiTProj
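The abstract describes DenseDiT's design only at a high level. As a rough illustration of the general idea it names (a large generative backbone whose parameters are reused frozen, plus two lightweight trainable branches that integrate multi-scale context), the sketch below is a hypothetical PyTorch mock-up, not the authors' implementation; the module names, channel widths, pooling scales, and backbone stand-in are all assumptions. The official code is linked above.

```python
# Hypothetical sketch of a "frozen backbone + lightweight multi-scale context branch" setup.
# All sizes and names are illustrative assumptions, not the DenseDiT release.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightContextBranch(nn.Module):
    """Pools features at several scales and fuses them back via a 1x1 projection."""
    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        pooled = [
            F.interpolate(
                F.adaptive_avg_pool2d(x, (max(h // s, 1), max(w // s, 1))),
                size=(h, w), mode="bilinear", align_corners=False,
            )
            for s in self.scales
        ]
        # Residual injection of the aggregated multi-scale context.
        return x + self.proj(torch.cat(pooled, dim=1))

# Stand-in for a large frozen generative backbone (parameter reuse: nothing here is trained).
backbone = nn.Sequential(*[nn.Conv2d(256, 256, 3, padding=1) for _ in range(50)])
for p in backbone.parameters():
    p.requires_grad = False

# Two trainable lightweight branches, as the abstract describes.
branches = nn.ModuleList([LightweightContextBranch(256), LightweightContextBranch(256)])

trainable = sum(p.numel() for p in branches.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
# With a real DiT-scale backbone (billions of parameters) this fraction would be far smaller.
print(f"trainable branch parameters: {100 * trainable / total:.3f}% of the model")
```

The printed fraction depends entirely on the toy backbone used here; the point of the sketch is only the structural pattern of freezing the backbone and training small context branches on top.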