Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
May 12, 2026
Authors: Xuanyu Zhu, Yan Bai, Yang Shi, Yihang Lou, Yuanxing Zhang, Jing Jin, Yuan Zhou
cs.AI
Abstract
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law (R² = 0.86) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers, analogous to vocabulary size in NLP.
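
To make the fusion idea concrete, below is a minimal PyTorch sketch of a module that routes over all encoder layers and applies an incremental correction on top of the last-layer feature. This is an illustrative reading of the abstract, not the authors' implementation: the class name MultiLayerFusion, the softmax gate standing in for the paper's energy-constrained routing, and the residual form of the correction are all assumptions.

import torch
import torch.nn as nn


class MultiLayerFusion(nn.Module):
    """Fuses features from all encoder layers into one enriched latent."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # Gate that scores each layer's contribution per token.
        self.gate = nn.Linear(dim, num_layers)
        # Lightweight correction head applied to the fused residual.
        self.correction = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: L tensors of shape [B, N, D], one per encoder layer.
        feats = torch.stack(layer_feats, dim=-2)          # [B, N, L, D]
        last = layer_feats[-1]                            # [B, N, D]
        # Route: score the layers from the last-layer feature, then
        # normalize. (A plain softmax stands in here for the paper's
        # energy-constrained routing.)
        weights = self.gate(last).softmax(dim=-1)         # [B, N, L]
        fused = (weights.unsqueeze(-1) * feats).sum(-2)   # [B, N, D]
        # Incremental correction: only the delta to the last-layer
        # feature is learned, so the output stays close to the
        # distribution the frozen decoder was pretrained on.
        return last + self.correction(fused - last)


# Usage: fuse 12 layers of 768-d tokens from a frozen ViT-style encoder.
if __name__ == "__main__":
    L, B, N, D = 12, 2, 196, 768
    fusion = MultiLayerFusion(num_layers=L, dim=D)
    feats = [torch.randn(B, N, D) for _ in range(L)]
    latent = fusion(feats)
    print(latent.shape)  # torch.Size([2, 196, 768])

Keeping the output as last-layer feature plus a learned delta is one way to satisfy the compatibility constraint the abstract describes: with the correction initialized near zero, the frozen pretrained decoder initially sees the latents it was trained on.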
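The abstract names a log-linear scaling law (R² = 0.86) between fusion capacity and reconstruction quality but not its functional form. One plausible parameterization, written here as an assumption since the exact fit is not given, is:

\[
\mathrm{rFID}(C) \;\approx\; \alpha \;-\; \beta \log C, \qquad \beta > 0,
\]

where $C$ is the fusion capacity (e.g., the parameter count of the fusion module) and $\alpha, \beta$ are fitted constants. Under this form, each doubling of capacity buys a fixed decrement in rFID, which is what makes representation richness a predictably scalable dimension in the sense the abstract claims.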