Phase Marginalization for Patch-Grid Instability in Vision Transformers

Abstract

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability in dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc method that evaluates a structured set of patch-grid phases, inverse-aligns the resulting dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across the measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA): +0.31 mean Intersection-over-Union (mIoU) over the strongest generic TTA configuration tested. A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 leaves accuracy essentially unchanged, and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
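
The evaluate, inverse-align, and aggregate steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`uniform_phases`, `phase_marginalize`), the square lattice of phase offsets, the reflect padding used to shift the patch grid, and plain averaging as the aggregation rule are all assumptions made for illustration. `model` stands for any callable that maps an (H, W, C) array to a dense output of the same spatial size. The sketch handles only perfect-square K (such as 4 or 16), and it does not attempt to guess how phases are laid out for K = 8.

```python
import numpy as np


def uniform_phases(patch, k=4):
    """Hypothetical phase set: k offsets on a sqrt(k) x sqrt(k) lattice within one patch."""
    n = int(round(np.sqrt(k)))
    if n * n != k:
        raise ValueError("this sketch only supports perfect-square k")
    step = patch // n
    return [(i * step, j * step) for i in range(n) for j in range(n)]


def phase_marginalize(image, model, patch=16, k=4):
    """Average dense outputs over patch-grid phases in original image coordinates.

    image: (H, W, C) array.
    model: callable mapping (H', W', C) -> (H', W', D), with H', W' multiples of patch.
    """
    h, w = image.shape[:2]
    acc = None
    for dy, dx in uniform_phases(patch, k):
        # Assumed shift mechanism: pad top/left by the phase offset so the patch
        # grid moves relative to the content, and pad bottom/right so the padded
        # size stays a multiple of the patch size.
        pad_b = (-(h + dy)) % patch
        pad_r = (-(w + dx)) % patch
        padded = np.pad(image, ((dy, pad_b), (dx, pad_r), (0, 0)), mode="reflect")
        out = model(padded)
        # Inverse alignment: crop back to the original pixel coordinates.
        aligned = out[dy:dy + h, dx:dx + w]
        acc = aligned if acc is None else acc + aligned
    # Aggregation by simple averaging (an assumption of this sketch).
    return acc / k
```

With `k=1` the only phase is (0, 0), so the function reduces to a single forward pass on the (padded) canonical grid, which corresponds to the K = 1 baseline in this sketch. For classification-style outputs, one would presumably average logits or probabilities before taking the argmax, though the aggregation rule used in the paper is not stated in the abstract.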