Virtual Width Networks

Abstract

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8× expansion accelerates optimization by more than 2× for next-token prediction and by 3× for next-2-token prediction. The advantage amplifies over training, as both the loss gap and the convergence-speedup ratio increase, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
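As a rough illustration of what an approximately log-linear relation between virtual width and loss reduction could look like in practice, the sketch below fits a line of the form loss_reduction ≈ a + b · log2(r), where r is the virtual-width expansion factor. The functional form (base-2 logarithm plus intercept), the helper name `fit_log_linear`, and the data points are all assumptions made for illustration; the numbers are synthetic placeholders, not results from this work.

```python
import numpy as np


def fit_log_linear(expansion_factors, loss_reductions):
    """Least-squares fit of loss_reduction ~= a + b * log2(expansion_factor).

    Hypothetical helper for illustration only; returns (a, b).
    """
    x = np.log2(np.asarray(expansion_factors, dtype=float))
    y = np.asarray(loss_reductions, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest: [slope, intercept].
    b, a = np.polyfit(x, y, deg=1)
    return a, b


# Synthetic placeholder data (NOT measurements from the paper):
# generated from an exact log-linear rule so the fit can be checked.
factors = [2, 4, 8]
assumed_a, assumed_b = 0.0, 0.01
reductions = [assumed_a + assumed_b * np.log2(r) for r in factors]

a, b = fit_log_linear(factors, reductions)
print(f"intercept a = {a:.4f}, slope b = {b:.4f} per doubling of virtual width")
```

Under this assumed form, the slope b can be read as the loss reduction gained per doubling of the virtual width, which is one simple way to summarize such a scaling trend.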