DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Abstract

Representation Autoencoders (RAEs) use frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that enable fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, which limits fine-grained generation and image editing. Conversely, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are generated jointly with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ converges 3.3× faster than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
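To put the reported PSNR improvement in perspective, the short sketch below converts a PSNR gain into the implied ratio of mean squared errors, using the standard definition PSNR = 10 · log10(MAX² / MSE). The function name `mse_ratio_from_psnr_gain` is hypothetical, and the calculation assumes both PSNR values share the same peak value and can be treated as if each came from a single MSE figure; if PSNR is averaged per image, the mapping is only approximate.

```python
import math


def mse_ratio_from_psnr_gain(psnr_before_db: float, psnr_after_db: float) -> float:
    """Return MSE_after / MSE_before implied by a change in PSNR.

    With PSNR = 10 * log10(MAX^2 / MSE) and a fixed peak value MAX,
    a gain of d dB corresponds to MSE_after / MSE_before = 10 ** (-d / 10).
    """
    gain_db = psnr_after_db - psnr_before_db
    return 10.0 ** (-gain_db / 10.0)


# PSNR values reported in the abstract (frozen DINOv2-based RAE vs. DecQ).
psnr_rae_db = 19.13
psnr_decq_db = 22.76

gain_db = psnr_decq_db - psnr_rae_db
ratio = mse_ratio_from_psnr_gain(psnr_rae_db, psnr_decq_db)

print(f"PSNR gain: {gain_db:.2f} dB")
print(f"Implied MSE ratio (after / before): {ratio:.2f}")
```

Under these assumptions, the 3.63 dB gain corresponds to an MSE ratio of roughly 0.43, that is, less than half the original squared reconstruction error.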