"Principal Components" Enable A New Language of Images
March 11, 2025
Authors: Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, Xiaojuan Qi
cs.AI
Abstract
We introduce a novel visual tokenization framework that embeds a provable
PCA-like structure into the latent token space. While existing visual
tokenizers primarily optimize for reconstruction fidelity, they often neglect
the structural properties of the latent space -- a critical factor for both
interpretability and downstream tasks. Our method generates a 1D causal token
sequence for images, where each successive token contributes non-overlapping
information with mathematically guaranteed decreasing explained variance,
analogous to principal component analysis. This structural constraint ensures
the tokenizer extracts the most salient visual features first, with each
subsequent token adding diminishing yet complementary information.
Additionally, we identify a semantic-spectrum coupling effect that entangles
high-level semantic content with low-level spectral details in the tokens, and
we resolve it by leveraging a diffusion decoder.
Experiments demonstrate that our approach achieves state-of-the-art
reconstruction performance and enables better interpretability, aligning with
the human visual system. Moreover, auto-regressive models trained on our token
sequences achieve performance comparable to current state-of-the-art methods
while requiring fewer tokens for training and inference.