Virtual Width Networks

Abstract

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8× expansion accelerates optimization by over 2× for next-token prediction and by 3× for next-2-token prediction. The advantage amplifies over training: the loss gap widens and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
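
To make the "quadratic cost" concrete, the following back-of-the-envelope sketch compares weight counts when the hidden size itself is widened against the case where only the embedding width grows. It is not the VWN architecture: the 12·d² per-block estimate is the usual approximation for a standard Transformer block with a 4× MLP, and the values of `d_model`, `vocab_size`, and the helper names are hypothetical choices for illustration. The sketch also ignores whatever extra parameters are needed to map between the wide embedding and the backbone, which is presumably why the abstract says "nearly" constant.

```python
def backbone_params_per_layer(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one standard Transformer block.

    Biases and normalization parameters are ignored (assumption).
    """
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection
    return attention + mlp


def embedding_params(vocab_size: int, width: int) -> int:
    """Weight count of a token embedding table of the given width."""
    return vocab_size * width


# Hypothetical sizes, chosen only for illustration.
d_model = 1024
vocab_size = 32_000
expansion = 8  # the 8x width expansion mentioned in the abstract

base_block = backbone_params_per_layer(d_model)
widened_block = backbone_params_per_layer(expansion * d_model)

base_embedding = embedding_params(vocab_size, d_model)
wide_embedding = embedding_params(vocab_size, expansion * d_model)

# Widening the hidden size multiplies block weights by expansion ** 2,
# while widening only the embedding grows the table linearly.
block_growth = widened_block // base_block
embedding_growth = wide_embedding // base_embedding

print(f"block weights grow {block_growth}x, embedding weights grow {embedding_growth}x")
```

Under these assumptions, an 8× wider hidden size costs 64× the block weights, whereas an 8× wider embedding table costs only 8× the embedding weights and leaves each backbone block unchanged.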