DecQ: Detail-Condensing Queries for Improving Reconstruction and Generation in Representation Autoencoders

Abstract

Representation Autoencoders (RAEs) use frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that enable fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, which limits fine-grained generation and image editing. Conversely, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are fed into the decoder to support reconstruction and are generated jointly with the patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ converges 3.3× faster than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
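
To put the reported PSNR gain in more familiar terms, the short sketch below converts the two PSNR figures quoted in the abstract into an implied ratio of mean squared errors. It relies on the standard definition PSNR = 10 log10(MAX^2 / MSE) and assumes both values use the same peak value and are computed from an aggregate MSE; if the reported numbers are averages of per-image PSNR, the conversion is only approximate. The function name `mse_ratio_from_psnr_gain` is illustrative and does not come from the paper.

```python
import math


def mse_ratio_from_psnr_gain(psnr_before_db: float, psnr_after_db: float) -> float:
    """Return MSE_before / MSE_after implied by a change in PSNR.

    Assumes both PSNR values share the same peak signal value, so that
    psnr_after - psnr_before = 10 * log10(MSE_before / MSE_after).
    """
    gain_db = psnr_after_db - psnr_before_db
    return 10.0 ** (gain_db / 10.0)


# PSNR values quoted in the abstract (frozen DINOv2-based RAE vs. DecQ).
ratio = mse_ratio_from_psnr_gain(19.13, 22.76)

# Fraction of the original reconstruction error that is removed.
reduction = 1.0 - 1.0 / ratio

# Sanity check: converting the ratio back recovers the 3.63 dB gain.
recovered_gain_db = 10.0 * math.log10(ratio)

print(f"PSNR gain: {recovered_gain_db:.2f} dB")
print(f"Implied MSE ratio: {ratio:.2f}x")
print(f"Implied MSE reduction: {reduction:.1%}")
```

Under these assumptions, the 3.63 dB gain corresponds to a reconstruction MSE roughly 2.3 times lower than that of the frozen baseline, that is, a bit more than half of the squared error is removed.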