Rethinking the shape convention of an MLP
October 2, 2025
Authors: Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu
cs.AI
Abstract
Multi-layer perceptrons (MLPs) conventionally follow a narrow-wide-narrow
design where skip connections operate at the input/output dimensions while
processing occurs in expanded hidden spaces. We challenge this convention by
proposing wide-narrow-wide (Hourglass) MLP blocks where skip connections
operate at expanded dimensions while residual computation flows through narrow
bottlenecks. This inversion leverages higher-dimensional spaces for incremental
refinement while maintaining computational efficiency through parameter-matched
designs. Implementing Hourglass MLPs requires an initial projection to lift
input signals to expanded dimensions. We propose that this projection can
remain fixed at random initialization throughout training, enabling efficient
training and inference implementations. We evaluate both architectures on
generative tasks over popular image datasets, characterizing
performance-parameter Pareto frontiers through systematic architectural search.
Results show that Hourglass architectures consistently achieve superior Pareto
frontiers compared to conventional designs. As parameter budgets increase,
optimal Hourglass configurations favor deeper networks with wider skip
connections and narrower bottlenecks, a scaling pattern distinct from
conventional MLPs. Our findings suggest reconsidering skip connection placement
in modern architectures, with potential applications extending to Transformers
and other residual networks.
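
To make the contrast between the two block shapes concrete, below is a minimal PyTorch sketch of a conventional narrow-wide-narrow block and a wide-narrow-wide (Hourglass) block, plus an Hourglass stack whose initial lift projection is frozen at its random initialization, as the abstract proposes. The class names, dimension names (`d`, `h`, `b`, `depth`), GELU activation, and output head are illustrative assumptions not specified by the paper, and the sketch makes no attempt at the paper's parameter-matched comparison.

```python
import torch
import torch.nn as nn


class ConventionalBlock(nn.Module):
    """Narrow-wide-narrow block: skip connection at the narrow width d,
    processing in an expanded hidden width h > d."""
    def __init__(self, d: int, h: int):
        super().__init__()
        self.up = nn.Linear(d, h)
        self.down = nn.Linear(h, d)
        self.act = nn.GELU()  # activation choice is an assumption

    def forward(self, x):  # x: (..., d)
        return x + self.down(self.act(self.up(x)))  # residual added at width d


class HourglassBlock(nn.Module):
    """Wide-narrow-wide (Hourglass) block: skip connection at the expanded
    width h, residual computation through a narrow bottleneck b < h."""
    def __init__(self, h: int, b: int):
        super().__init__()
        self.down = nn.Linear(h, b)
        self.up = nn.Linear(b, h)
        self.act = nn.GELU()

    def forward(self, z):  # z: (..., h)
        return z + self.up(self.act(self.down(z)))  # residual added at width h


class HourglassMLP(nn.Module):
    """Stack of Hourglass blocks. The lift from the data dimension d to the
    expanded width h stays fixed at random initialization during training,
    per the abstract; the output head mapping back to d is an assumption."""
    def __init__(self, d: int, h: int, b: int, depth: int):
        super().__init__()
        self.lift = nn.Linear(d, h)
        for p in self.lift.parameters():  # freeze the random lift projection
            p.requires_grad_(False)
        self.blocks = nn.ModuleList([HourglassBlock(h, b) for _ in range(depth)])
        self.head = nn.Linear(h, d)

    def forward(self, x):  # x: (..., d)
        z = self.lift(x)
        for blk in self.blocks:
            z = blk(z)
        return self.head(z)


if __name__ == "__main__":
    x = torch.randn(8, 64)
    print(HourglassMLP(d=64, h=256, b=16, depth=6)(x).shape)  # torch.Size([8, 64])
```

Under this reading, the conventional block keeps its residual stream at the input/output width and expands inside the block, whereas the Hourglass block keeps its residual stream wide and compresses inside the block; the abstract's scaling observation corresponds to growing `depth` and `h` while shrinking `b` as the parameter budget increases.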