Phase Marginalization for Patch-Grid Instability in Vision Transformers

Abstract

Vision Transformers operate on a fixed patch grid, which can introduce phase-dependent instability in dense prediction: changing the patch partition changes the token evidence available to a pixel, especially near patch boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc method that evaluates a structured set of patch-grid phases, inverse-aligns the resulting dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline in all measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA): +0.31 mean Intersection-over-Union (mIoU) over the strongest generic TTA configuration tested. A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 leaves accuracy essentially unchanged, and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
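
The evaluate, inverse-align, and aggregate steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`uniform_phases`, `phase_marginalize`, `patch_mean_model`), the choice of a square lattice of offsets within one patch, the use of circular shifts via `np.roll`, and plain averaging as the aggregation rule are all assumptions made for illustration. The sketch only handles square values of K (1, 4, 16), so it does not cover the K = 8 setting.

```python
import numpy as np


def uniform_phases(patch_size, k=4):
    """Hypothetical phase set: a sqrt(k) x sqrt(k) lattice of (dy, dx) offsets
    inside one patch. Assumption: the abstract does not specify the phases."""
    side = int(round(k ** 0.5))
    if side * side != k:
        raise ValueError("this sketch only handles square k (1, 4, 16, ...)")
    step = patch_size // side
    return [(i * step, j * step) for i in range(side) for j in range(side)]


def phase_marginalize(model, image, patch_size, k=4):
    """Average dense predictions over patch-grid phases.

    model: callable mapping an (H, W, C) array to an (H, W, D) dense output.
    Shifts are circular (np.roll), an assumption that keeps the sketch short;
    a real pipeline might pad and crop instead.
    """
    aligned = []
    for dy, dx in uniform_phases(patch_size, k):
        # Shifting the image moves the patch grid relative to the content.
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        dense = model(shifted)
        # Inverse-align: undo the shift so outputs share image coordinates.
        aligned.append(np.roll(dense, shift=(-dy, -dx), axis=(0, 1)))
    # Aggregate in the original coordinate system (simple mean here).
    return np.mean(aligned, axis=0)


def patch_mean_model(image, patch_size=4):
    """Toy stand-in for a ViT: each pixel receives the mean of its patch,
    so the output depends on where the patch grid falls."""
    h, w, c = image.shape
    p = patch_size
    blocks = image.reshape(h // p, p, w // p, p, c).mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(blocks, (h // p, p, w // p, p, c)).reshape(h, w, c)


if __name__ == "__main__":
    img = np.random.default_rng(0).random((8, 8, 3))
    single = patch_mean_model(img)                        # K = 1 baseline
    merged = phase_marginalize(patch_mean_model, img, 4)  # K = 4
    print(single.shape, merged.shape)
```

With K = 1 the function reduces to a single forward pass, and K = 4 costs four forward passes, which matches the compute-matched comparison against four-forward TTA mentioned in the abstract.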