Rethinking the shape convention of an MLP
October 2, 2025
Authors: Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu
cs.AI
Abstract
Multi-layer perceptrons (MLPs) conventionally follow a narrow-wide-narrow
design where skip connections operate at the input/output dimensions while
processing occurs in expanded hidden spaces. We challenge this convention by
proposing wide-narrow-wide (Hourglass) MLP blocks where skip connections
operate at expanded dimensions while residual computation flows through narrow
bottlenecks. This inversion leverages higher-dimensional spaces for incremental
refinement while maintaining computational efficiency through parameter-matched
designs. Implementing Hourglass MLPs requires an initial projection to lift
input signals to expanded dimensions. We propose that this projection can
remain fixed at random initialization throughout training, enabling efficient
training and inference implementations. We evaluate both architectures on
generative tasks over popular image datasets, characterizing
performance-parameter Pareto frontiers through systematic architectural search.
Results show that Hourglass architectures consistently achieve superior Pareto
frontiers compared to conventional designs. As parameter budgets increase,
optimal Hourglass configurations favor deeper networks with wider skip
connections and narrower bottlenecks, a scaling pattern distinct from
conventional MLPs. Our findings suggest reconsidering skip connection placement
in modern architectures, with potential applications extending to Transformers
and other residual networks.
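
To make the contrast between the two block shapes concrete, below is a minimal PyTorch sketch of a conventional narrow-wide-narrow block and a wide-narrow-wide (Hourglass) block, plus an Hourglass stack whose initial lift projection is frozen at its random initialization, as the abstract proposes. The class names, dimension names (`d`, `h`, `b`, `depth`), GELU activation, and output head are illustrative assumptions not specified by the paper, and the sketch makes no attempt at the paper's parameter-matched comparison.

```python
import torch
import torch.nn as nn


class ConventionalBlock(nn.Module):
    """Narrow-wide-narrow block: skip connection at the narrow width d,
    processing in an expanded hidden width h > d."""
    def __init__(self, d: int, h: int):
        super().__init__()
        self.up = nn.Linear(d, h)
        self.down = nn.Linear(h, d)
        self.act = nn.GELU()  # activation choice is an assumption

    def forward(self, x):  # x: (..., d)
        return x + self.down(self.act(self.up(x)))  # residual added at width d


class HourglassBlock(nn.Module):
    """Wide-narrow-wide (Hourglass) block: skip connection at the expanded
    width h, residual computation through a narrow bottleneck b < h."""
    def __init__(self, h: int, b: int):
        super().__init__()
        self.down = nn.Linear(h, b)
        self.up = nn.Linear(b, h)
        self.act = nn.GELU()

    def forward(self, z):  # z: (..., h)
        return z + self.up(self.act(self.down(z)))  # residual added at width h


class HourglassMLP(nn.Module):
    """Stack of Hourglass blocks. The lift from the data dimension d to the
    expanded width h stays fixed at random initialization during training,
    per the abstract; the output head mapping back to d is an assumption."""
    def __init__(self, d: int, h: int, b: int, depth: int):
        super().__init__()
        self.lift = nn.Linear(d, h)
        for p in self.lift.parameters():  # freeze the random lift projection
            p.requires_grad_(False)
        self.blocks = nn.ModuleList([HourglassBlock(h, b) for _ in range(depth)])
        self.head = nn.Linear(h, d)

    def forward(self, x):  # x: (..., d)
        z = self.lift(x)
        for blk in self.blocks:
            z = blk(z)
        return self.head(z)


if __name__ == "__main__":
    x = torch.randn(8, 64)
    print(HourglassMLP(d=64, h=256, b=16, depth=6)(x).shape)  # torch.Size([8, 64])
```

Under this reading, the conventional block keeps its residual stream at the input/output width and expands inside the block, whereas the Hourglass block keeps its residual stream wide and compresses inside the block; the abstract's scaling observation corresponds to growing `depth` and `h` while shrinking `b` as the parameter budget increases.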