

PolyMaX: General Dense Prediction with Mask Transformer

November 9, 2023
作者: Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen
cs.AI

Abstract

Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be readily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular owing to the prevalence of fully convolutional networks. Recently, however, the segmentation community has witnessed a paradigm shift from per-pixel prediction to cluster prediction, driven by the emergence of transformer architectures, particularly mask transformers, which directly predict a label for a mask rather than for each pixel. Despite this shift, methods based on the per-pixel paradigm still dominate the benchmarks for dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose generalizing the cluster-prediction approach to general dense prediction tasks. This allows us to unify dense prediction tasks within the mask transformer framework. Remarkably, the resulting model, PolyMaX, demonstrates state-of-the-art performance on three benchmarks of the NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for additional dense prediction tasks. Code and models will be made available.