Phase Marginalization for Patch-Grid Instability in Vision Transformers
June 6, 2026
Author: Oğuzhan Ercan
cs.AI
Abstract
Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability in dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc method that evaluates structured patch-grid phases, inverse-aligns the resulting dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline in every segmentation, depth, and local matching setting we measured. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA): +0.31 mean Intersection-over-Union (mIoU) over the strongest generic TTA row tested. A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 leaves accuracy essentially unchanged, and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
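The three steps named in the abstract (evaluate several patch-grid phases, inverse-align the dense outputs, aggregate in original image coordinates) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names `uniform_phases`, `phase_marginalize`, and `toy_patch_model` are hypothetical, the phase layout (a uniform sub-patch lattice) and the circular shift via `np.roll` are assumptions made for illustration, and plain averaging is assumed as the aggregation rule.

```python
import numpy as np


def uniform_phases(patch, k):
    """Return k (dy, dx) offsets on a uniform sub-patch lattice.

    Assumed layout: a sqrt(k) x sqrt(k) lattice with spacing patch / sqrt(k).
    The abstract does not specify the phase set, so this is illustrative only.
    """
    side = int(round(k ** 0.5))
    if side * side != k:
        raise ValueError("this sketch only handles perfect-square k")
    step = patch // side
    return [(i * step, j * step) for i in range(side) for j in range(side)]


def phase_marginalize(model, image, phases):
    """Average dense predictions over patch-grid phases.

    model  : callable mapping an (H, W, C) array to an (H, W, D) dense output
    image  : (H, W, C) array
    phases : list of (dy, dx) integer offsets
    """
    acc = None
    for dy, dx in phases:
        # Shift the content relative to the model's fixed patch grid.
        # Circular wrap is an assumption; padding and cropping is another option.
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        out = model(shifted)
        # Inverse-align: undo the shift so outputs live in original coordinates.
        aligned = np.roll(out, shift=(-dy, -dx), axis=(0, 1))
        acc = aligned if acc is None else acc + aligned
    return acc / len(phases)


def toy_patch_model(image, patch=4):
    """Toy stand-in for a patch-based model: each pixel gets its patch mean."""
    h, w = image.shape[:2]
    blocks = image.reshape(h // patch, patch, w // patch, patch, -1)
    means = blocks.mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(means, blocks.shape).reshape(h, w, -1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((8, 8, 1))
    single = toy_patch_model(img, patch=4)  # K = 1 baseline
    marginal = phase_marginalize(
        lambda x: toy_patch_model(x, patch=4), img, uniform_phases(4, 4)
    )
    print(single.shape, marginal.shape)
```

With K = 1 and the single phase (0, 0), `phase_marginalize` reduces to one ordinary forward pass, which matches the role of the canonical baseline. Each additional phase costs one more forward pass, which is why the four-phase variant is naturally compared against four-forward TTA.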