Rethinking the shape convention of an MLP

October 2, 2025
Authors: Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu
cs.AI

Abstract

Multi-layer perceptrons (MLPs) conventionally follow a narrow-wide-narrow design where skip connections operate at the input/output dimensions while processing occurs in expanded hidden spaces. We challenge this convention by proposing wide-narrow-wide (Hourglass) MLP blocks where skip connections operate at expanded dimensions while residual computation flows through narrow bottlenecks. This inversion leverages higher-dimensional spaces for incremental refinement while maintaining computational efficiency through parameter-matched designs. Implementing Hourglass MLPs requires an initial projection to lift input signals to expanded dimensions. We propose that this projection can remain fixed at random initialization throughout training, enabling efficient training and inference implementations. We evaluate both architectures on generative tasks over popular image datasets, characterizing performance-parameter Pareto frontiers through systematic architectural search. Results show that Hourglass architectures consistently achieve superior Pareto frontiers compared to conventional designs. As parameter budgets increase, optimal Hourglass configurations favor deeper networks with wider skip connections and narrower bottlenecks, a scaling pattern distinct from conventional MLPs. Our findings suggest reconsidering skip connection placement in modern architectures, with potential applications extending to Transformers and other residual networks.
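
The abstract describes the block structure but not the implementation. Below is a minimal PyTorch sketch of the idea, not the authors' code: it contrasts a conventional narrow-wide-narrow residual block with the proposed wide-narrow-wide Hourglass block, and freezes the lifting projection at its random initialization as the abstract suggests. The names `ConventionalBlock`, `HourglassBlock`, `HourglassMLP`, `d_skip`, and `d_bottleneck`, as well as the GELU activation, are illustrative assumptions.

```python
# Sketch only: illustrates the shape convention described in the abstract,
# not the paper's actual architecture or hyperparameters.
import torch
import torch.nn as nn


class ConventionalBlock(nn.Module):
    """Narrow-wide-narrow: skip connection at the narrow input/output width,
    processing in a wider hidden space."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_in, d_hidden)
        self.down = nn.Linear(d_hidden, d_in)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual added at the narrow dimension d_in.
        return x + self.down(self.act(self.up(x)))


class HourglassBlock(nn.Module):
    """Wide-narrow-wide: skip connection at the expanded width,
    residual computation through a narrow bottleneck."""
    def __init__(self, d_skip: int, d_bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_skip, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_skip)
        self.act = nn.GELU()

    def forward(self, h):
        # Residual added at the wide dimension d_skip.
        return h + self.up(self.act(self.down(h)))


class HourglassMLP(nn.Module):
    """Stack of Hourglass blocks preceded by a lifting projection that maps
    inputs to the wide skip dimension; per the abstract, this projection can
    stay fixed at its random initialization, so it is frozen here."""
    def __init__(self, d_in: int, d_skip: int, d_bottleneck: int, depth: int, d_out: int):
        super().__init__()
        self.lift = nn.Linear(d_in, d_skip, bias=False)
        self.lift.weight.requires_grad_(False)  # fixed random projection
        self.blocks = nn.ModuleList(
            [HourglassBlock(d_skip, d_bottleneck) for _ in range(depth)]
        )
        self.head = nn.Linear(d_skip, d_out)

    def forward(self, x):
        h = self.lift(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)


if __name__ == "__main__":
    # Toy sizes chosen for illustration only.
    model = HourglassMLP(d_in=64, d_skip=512, d_bottleneck=32, depth=4, d_out=64)
    y = model(torch.randn(8, 64))
    print(y.shape)  # torch.Size([8, 64])
```

Under this reading, "wider skip connections and narrower bottlenecks" at larger parameter budgets corresponds to growing `d_skip` and `depth` while shrinking `d_bottleneck`.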