Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
May 12, 2026
Authors: Xuanyu Zhu, Yan Bai, Yang Shi, Yihang Lou, Yuanxing Zhang, Jing Jin, Yuan Zhou
cs.AI
Abstract
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law (R² = 0.86) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
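
The abstract does not spell out the fusion architecture, but a minimal sketch of the general idea (learned per-layer routing over all encoder hidden states, plus a residual correction on the last-layer feature so the output stays compatible with a frozen pretrained decoder) might look as follows. All names here (`DepthRoutedFusion`, `route_logits`, `correction`) are hypothetical, and the plain softmax routing with a norm-matching step is only a crude stand-in for the paper's energy-constrained routing:

```python
import torch
import torch.nn as nn


class DepthRoutedFusion(nn.Module):
    """Hypothetical sketch of multi-layer fusion for a visual tokenizer:
    softmax-routed aggregation of all encoder layers plus a residual
    correction applied on top of the last-layer feature."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One learnable routing logit per encoder layer.
        self.route_logits = nn.Parameter(torch.zeros(num_layers))
        # Lightweight head producing the incremental correction.
        self.correction = nn.Linear(dim, dim)

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: num_layers tensors, each of shape (B, N, D).
        stacked = torch.stack(hidden_states, dim=0)           # (L, B, N, D)
        weights = torch.softmax(self.route_logits, dim=0)     # (L,)
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, N, D)
        last = hidden_states[-1]
        # Crude stand-in for an energy constraint: rescale the fused
        # feature so its overall norm matches the last-layer feature's,
        # keeping the output close to the distribution the frozen
        # pretrained decoder was trained on.
        scale = last.norm() / fused.norm().clamp_min(1e-6)
        return last + self.correction(fused * scale)


# Usage with dummy features from a 12-layer ViT-style encoder.
feats = [torch.randn(2, 196, 768) for _ in range(12)]
latent = DepthRoutedFusion(num_layers=12, dim=768)(feats)
print(latent.shape)  # torch.Size([2, 196, 768])
```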
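
The reported log-linear scaling law (reconstruction quality varying linearly in the log of fusion capacity, with R² = 0.86) can be checked with an ordinary least-squares fit. The data points below are purely illustrative placeholders, not the paper's measurements:

```python
import numpy as np

# Illustrative (fusion capacity, rFID) points; NOT the paper's data.
capacity = np.array([64, 128, 256, 512, 1024], dtype=float)
rfid = np.array([0.55, 0.47, 0.40, 0.34, 0.29])

# Fit rFID = a * log(capacity) + b.
x = np.log(capacity)
a, b = np.polyfit(x, rfid, deg=1)

# Coefficient of determination of the log-linear fit.
pred = a * x + b
r2 = 1.0 - np.sum((rfid - pred) ** 2) / np.sum((rfid - rfid.mean()) ** 2)
print(f"slope={a:.4f}, intercept={b:.4f}, R^2={r2:.3f}")
```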