Virtual Width Networks

November 14, 2025
Authors: Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chenyin Xu, Chi Zhang, Chong Hu, Daoguang Zan, Defa Zhu, Dongyu Xu, Du Li, Faming Wu, Fan Xia, Ge Zhang, Guang Shi, Haobin Chen, Hongyu Zhu, Hongzhi Huang, Huan Zhou, Huanzhang Dou, Jianhui Duan, Jianqiao Lu, Jianyu Jiang, Jiayi Xu, Jiecao Chen, Jin Chen, Jin Ma, Jing Su, Jingji Chen, Jun Wang, Jun Yuan, Juncai Liu, Jundong Zhou, Kai Hua, Kai Shen, Kai Xiang, Kaiyuan Chen, Kang Liu, Ke Shen, Liang Xiang, Lin Yan, Lishu Luo, Mengyao Zhang, Ming Ding, Mofan Zhang, Nianning Liang, Peng Li, Penghao Huang, Pengpeng Mu, Qi Huang, Qianli Ma, Qiyang Min, Qiying Yu, Renming Pang, Ru Zhang, Shen Yan, Shen Yan, Shixiong Zhao, Shuaishuai Cao, Shuang Wu, Siyan Chen, Siyu Li, Siyuan Qiao, Tao Sun, Tian Xin, Tiantian Fan, Ting Huang, Ting-Han Fan, Wei Jia, Wenqiang Zhang, Wenxuan Liu, Xiangzhong Wu, Xiaochen Zuo, Xiaoying Jia, Ximing Yang, Xin Liu, Xin Yu, Xingyan Bin, Xintong Hao, Xiongcai Luo, Xujing Li, Xun Zhou, Yanghua Peng, Yangrui Chen, Yi Lin, Yichong Leng, Yinghao Li, Yingshuan Song, Yiyuan Ma, Yong Shan, Yongan Xiang, Yonghui Wu, Yongtao Zhang, Yongzhen Yao, Yu Bao, Yuehang Yang, Yufeng Yuan, Yunshui Li, Yuqiao Xian, Yutao Zeng, Yuxuan Wang, Zehua Hong, Zehua Wang, Zengzhi Wang, Zeyu Yang, Zhengqiang Yin, Zhenyi Lu, Zhexi Zhang, Zhi Chen, Zhi Zhang, Zhiqi Lin, Zihao Huang, Zilin Xu, Ziyun Wei, Zuo Wang
cs.AI

Abstract

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token prediction and over 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
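The abstract does not spell out the mechanism, but a minimal sketch of the decoupling it describes might look like the following: token embeddings live in an expanded "virtual" space while a learned projection maps them into the narrower backbone width, so the transformer blocks themselves are unchanged. The class name `VirtualWidthEmbedding`, the single linear down-projection, and the `expansion` parameter below are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VirtualWidthEmbedding(nn.Module):
    """Hypothetical sketch of the virtual-width idea: embeddings are kept
    in an expanded space (expansion * d_backbone), then projected down to
    the backbone width so the attention/MLP blocks' compute is unchanged."""

    def __init__(self, vocab_size: int, d_backbone: int, expansion: int = 8):
        super().__init__()
        d_virtual = d_backbone * expansion  # wider representational space
        self.embed = nn.Embedding(vocab_size, d_virtual)
        # Assumed bridge into the backbone; the paper may use a different scheme.
        self.down = nn.Linear(d_virtual, d_backbone, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, d_virtual) -> (batch, seq, d_backbone)
        return self.down(self.embed(token_ids))

# Example: an 8x virtual width over a 1024-wide backbone (illustrative sizes).
emb = VirtualWidthEmbedding(vocab_size=32000, d_backbone=1024, expansion=8)
hidden = emb(torch.randint(0, 32000, (2, 16)))  # shape: (2, 16, 1024)
```

Under these assumptions, only the embedding table and the projection grow with the expansion factor; the backbone blocks, whose cost scales roughly quadratically with hidden size, are untouched, matching the abstract's claim of nearly constant backbone compute. The reported log-linear relation would then suggest roughly constant additional loss reduction per doubling of the expansion factor.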