
Phase Marginalization for Patch-Grid Instability in Vision Transformers

June 6, 2026
Author: Oğuzhan Ercan
cs.AI

Abstract

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability in dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc method that evaluates structured patch-grid phases, inverse-aligns the dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across the measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based test-time augmentation (TTA) with four forward passes: +0.31 mean Intersection-over-Union (mIoU) over the strongest generic TTA row tested. A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 leaves results essentially unchanged, and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
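
The evaluate, inverse-align, and aggregate steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not specify the phase set, the border handling, or the aggregation rule, so the sketch assumes offsets on a regular sub-grid of one patch, circular shifts via `np.roll`, and a plain mean. The names `phase_offsets`, `phase_marginalize`, and `patch_mean_predict` are hypothetical, and the toy predictor merely stands in for a dense ViT whose output depends on patch alignment.

```python
import numpy as np


def phase_offsets(patch_size, k=4):
    """Hypothetical phase set: k offsets on a sqrt(k) x sqrt(k) sub-grid of one patch."""
    side = int(round(k ** 0.5))
    if side * side != k:
        raise ValueError("this sketch only handles square k (1, 4, 16, ...)")
    step = patch_size // side
    return [(i * step, j * step) for i in range(side) for j in range(side)]


def phase_marginalize(image, predict, patch_size=16, k=4):
    """Average dense predictions over k patch-grid phases.

    image:   (H, W, C) array.
    predict: callable mapping an (H, W, C) array to an (H, W, D) dense output.
    """
    total = None
    for dy, dx in phase_offsets(patch_size, k):
        # Assumption: a circular shift changes which pixels share a patch.
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        out = predict(shifted)
        # Inverse-align the output back to the original image coordinates.
        aligned = np.roll(out, shift=(-dy, -dx), axis=(0, 1))
        total = aligned if total is None else total + aligned
    return total / k


def patch_mean_predict(image, patch_size=16):
    """Toy stand-in for a ViT: every pixel receives the mean of its patch."""
    h, w, c = image.shape
    blocks = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    means = blocks.mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(means, blocks.shape).reshape(h, w, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32, 3))
    baseline = patch_mean_predict(img)                      # K = 1
    averaged = phase_marginalize(img, patch_mean_predict)   # K = 4
    print(baseline.shape, averaged.shape)
```

With K = 1 the sketch reduces to the unshifted baseline, and a predictor that is already shift-equivariant is left unchanged by the averaging. Cost grows linearly with K, since each phase needs its own forward pass.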