PolyMaX: General Dense Prediction with Mask Transformer
November 9, 2023
Authors: Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen
cs.AI
Abstract
Dense prediction tasks, such as semantic segmentation, depth estimation, and
surface normal prediction, can be easily formulated as per-pixel classification
(discrete outputs) or regression (continuous outputs). This per-pixel
prediction paradigm has remained popular due to the prevalence of fully
convolutional networks. However, on the recent frontier of segmentation tasks,
the community has been witnessing a paradigm shift from per-pixel prediction
to cluster prediction with the emergence of transformer architectures,
particularly mask transformers, which directly predict a label for a mask
instead of a pixel. Despite this shift, methods based on the per-pixel
prediction paradigm still dominate the benchmarks on the other dense prediction
tasks that require continuous outputs, such as depth estimation and surface
normal prediction. Motivated by the success of DORN and AdaBins in depth
estimation, achieved by discretizing the continuous output space, we propose to
generalize cluster-prediction-based methods to general dense prediction
tasks. This allows us to unify dense prediction tasks with the mask transformer
framework. Remarkably, the resulting model PolyMaX demonstrates
state-of-the-art performance on three benchmarks of NYUD-v2 dataset. We hope
our simple yet effective design can inspire more research on exploiting mask
transformers for more dense prediction tasks. Code and model will be made
available.
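The core idea of turning a continuous regression task into cluster prediction can be illustrated with a small sketch. The snippet below is not the authors' implementation; it is a hedged, self-contained NumPy illustration of the AdaBins-style recipe the abstract alludes to: each of K learned clusters carries a representative depth value, a mask transformer produces per-cluster logits over the image, and the dense depth map is recovered as a softmax-weighted sum of the cluster values (the function name, shapes, and toy values are all hypothetical).

```python
import numpy as np

def cluster_prediction_to_depth(mask_logits, cluster_values):
    """Combine per-cluster mask logits of shape (K, H, W) with
    per-cluster depth values of shape (K,) into a dense (H, W) map.

    Illustrative sketch of discretize-then-aggregate: each cluster
    "votes" for its depth value, and every pixel blends the votes
    with softmax weights over the cluster axis.
    """
    # Numerically stable softmax over the cluster (first) axis.
    logits = mask_logits - mask_logits.max(axis=0, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    # Soft-binning readout: weighted sum of cluster depth values.
    return np.tensordot(cluster_values, probs, axes=1)

# Toy example: 3 clusters over a 2x2 image (values are made up).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2, 2))
values = np.array([1.0, 2.0, 4.0])  # hypothetical depths in meters
depth = cluster_prediction_to_depth(logits, values)
print(depth.shape)  # (2, 2)
```

Because the output is a convex combination of the cluster values, every predicted depth lies between the smallest and largest cluster value; the same readout works for any continuous dense output, which is what lets the paper fold such tasks into the mask transformer framework.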