The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

December 22, 2025
Authors: Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu
cs.AI

Abstract

Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We call this the Prism Hypothesis: each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much as a prism refracts light. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on the ImageNet and MS-COCO benchmarks validate that UAE effectively unifies semantic abstraction and pixel-level fidelity in a single latent space with state-of-the-art performance.
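
To make the low-frequency versus high-frequency distinction concrete, the sketch below splits a 2D array into two bands with a radial mask in the Fourier domain. It is only an illustration of the general idea of frequency-band decomposition and is not the paper's frequency-band modulator; the function name `split_frequency_bands`, the `cutoff` parameter, and the random test input are all assumptions introduced here.

```python
import numpy as np


def split_frequency_bands(x, cutoff=0.25):
    """Split a 2D array into low- and high-frequency parts.

    Hypothetical helper for illustration only. `cutoff` is a radius in
    cycles per sample (the Nyquist frequency is 0.5).
    """
    h, w = x.shape
    # Per-axis frequencies in cycles per sample, laid out to broadcast.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy**2 + fx**2)

    # Boolean mask selecting frequencies inside the cutoff radius.
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(x)
    # The mask is symmetric under frequency negation, so each band is
    # real up to rounding error and the imaginary part can be dropped.
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


# Stand-in for an image or a feature map (assumed input, not real data).
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
low, high = split_frequency_bands(image)

# The two bands partition the spectrum, so they sum back to the input.
print(np.allclose(low + high, image))
```

Because the two masks are complementary, the bands add back to the original array and their energies sum to the energy of the input. Under the hypothesis stated in the abstract, a semantic encoder would mostly depend on something like `low`, while a pixel encoder would also need the detail carried in `high`.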