PolyMaX: General Dense Prediction with Mask Transformer
November 9, 2023
Authors: Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen
cs.AI
Abstract
Dense prediction tasks, such as semantic segmentation, depth estimation, and
surface normal prediction, can be easily formulated as per-pixel classification
(discrete outputs) or regression (continuous outputs). This per-pixel
prediction paradigm has remained popular due to the prevalence of fully
convolutional networks. However, on the recent frontier of segmentation tasks,
the community has been witnessing a paradigm shift from per-pixel prediction
to cluster prediction with the emergence of transformer architectures,
particularly mask transformers, which directly predict a label for a mask
instead of a pixel. Despite this shift, methods based on the per-pixel
prediction paradigm still dominate the benchmarks on the other dense prediction
tasks that require continuous outputs, such as depth estimation and surface
normal prediction. Motivated by the success of DORN and AdaBins in depth
estimation, achieved by discretizing the continuous output space, we propose to
generalize cluster-prediction-based methods to general dense prediction
tasks. This allows us to unify dense prediction tasks with the mask transformer
framework. Remarkably, the resulting model PolyMaX demonstrates
state-of-the-art performance on three benchmarks of NYUD-v2 dataset. We hope
our simple yet effective design can inspire more research on exploiting mask
transformers for more dense prediction tasks. Code and model will be made
available.
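The core idea of turning a continuous regression task into cluster prediction can be illustrated with a small sketch. The snippet below is not the authors' implementation; it is a hedged, self-contained NumPy illustration of the AdaBins-style recipe the abstract alludes to: each of K learned clusters carries a representative depth value, a mask transformer produces per-cluster logits over the image, and the dense depth map is recovered as a softmax-weighted sum of the cluster values (the function name, shapes, and toy values are all hypothetical).

```python
import numpy as np

def cluster_prediction_to_depth(mask_logits, cluster_values):
    """Combine per-cluster mask logits of shape (K, H, W) with
    per-cluster depth values of shape (K,) into a dense (H, W) map.

    Illustrative sketch of discretize-then-aggregate: each cluster
    "votes" for its depth value, and every pixel blends the votes
    with softmax weights over the cluster axis.
    """
    # Numerically stable softmax over the cluster (first) axis.
    logits = mask_logits - mask_logits.max(axis=0, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    # Soft-binning readout: weighted sum of cluster depth values.
    return np.tensordot(cluster_values, probs, axes=1)

# Toy example: 3 clusters over a 2x2 image (values are made up).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2, 2))
values = np.array([1.0, 2.0, 4.0])  # hypothetical depths in meters
depth = cluster_prediction_to_depth(logits, values)
print(depth.shape)  # (2, 2)
```

Because the output is a convex combination of the cluster values, every predicted depth lies between the smallest and largest cluster value; the same readout works for any continuous dense output, which is what lets the paper fold such tasks into the mask transformer framework.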