The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

Abstract

Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of a range of semantic and pixel encoders. Our study uncovers a rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We formalize it as the Prism Hypothesis: each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much as a prism disperses light into its component bands. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel detail through a frequency-band modulator, allowing the two to coexist in one representation. Extensive experiments on the ImageNet and MS-COCO benchmarks show that UAE unifies semantic abstraction and pixel-level fidelity in a single latent space while achieving state-of-the-art performance.
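To make the low-frequency versus high-frequency distinction concrete, the following is a minimal sketch of splitting a 2D map into two frequency bands with a radial FFT mask and measuring how much energy falls in the low band. It is not the UAE frequency-band modulator, whose design the abstract does not specify. The function names, the `cutoff` parameter, and the synthetic arrays (toy stand-ins for encoder features, not real ones) are all assumptions made for illustration.

```python
import numpy as np


def split_frequency_bands(feature_map, cutoff):
    """Split a 2D array into low- and high-frequency parts.

    `cutoff` (hypothetical parameter) is a radius in cycles per sample;
    frequencies at or below it count as the low band.
    """
    h, w = feature_map.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(feature_map)
    # The mask is symmetric in frequency, so both inverses are real
    # up to floating-point rounding.
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


def low_band_energy_fraction(feature_map, cutoff):
    """Fraction of total signal energy carried by the low band."""
    low, high = split_frequency_bands(feature_map, cutoff)
    e_low = np.sum(low ** 2)
    e_high = np.sum(high ** 2)
    return e_low / (e_low + e_high)


# Toy stand-ins (assumed for illustration only):
# `smooth` holds only coarse structure; `detailed` adds fine texture.
n = 64
y, x = np.mgrid[0:n, 0:n]
smooth = np.sin(2 * np.pi * 2 * x / n) + np.cos(2 * np.pi * 3 * y / n)
texture = 0.5 * np.sin(2 * np.pi * 20 * x / n) * np.sin(2 * np.pi * 24 * y / n)
detailed = smooth + texture

cutoff = 0.1
frac_smooth = low_band_energy_fraction(smooth, cutoff)
frac_detailed = low_band_energy_fraction(detailed, cutoff)
low, high = split_frequency_bands(detailed, cutoff)

print(f"low-band energy fraction, smooth map:   {frac_smooth:.3f}")
print(f"low-band energy fraction, detailed map: {frac_detailed:.3f}")
```

In this toy setting the smooth map keeps essentially all of its energy below the cutoff, while the detailed map keeps the same low band and adds energy above it. That loosely mirrors the abstract's claim that pixel encoders retain high-frequency information in addition to the low-frequency content that semantic encoders capture.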