RWKV-7 "Goose" with Expressive Dynamic State Evolution

Abstract

We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state of the art in downstream performance at the 3 billion parameter scale on multilingual tasks and match current state-of-the-art English language performance despite being trained on dramatically fewer tokens than other top 3B models. At the same time, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages while retaining parallelizability of training. This exceeds the capabilities of Transformers, which are limited to TC^0 under standard complexity conjectures. To demonstrate RWKV-7's language modeling capability, we also present an extended open-source 3.1 trillion token multilingual corpus and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV and our training and inference code at https://github.com/RWKV/RWKV-LM, all under the Apache 2.0 License.
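
As a rough illustration of what a delta-rule state update with vector-valued gating and a per-channel in-context learning rate can look like, the sketch below keeps a fixed-size matrix state and updates it once per token. It is a simplified toy under stated assumptions, not the exact RWKV-7 formulation: the update form, the symbols `w` (per-channel decay) and `a` (per-channel learning rate), and the function names `delta_step` and `read` are all hypothetical choices made for this example.

```python
import math

def normalize(x):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(c * c for c in x))
    return [c / n for c in x]

def matvec(S, x):
    """Multiply matrix S (list of rows) by vector x."""
    return [sum(s * c for s, c in zip(row, x)) for row in S]

def delta_step(S, k, v, w, a):
    """One recurrent update (assumed form, for illustration only):
    S <- S diag(w) - (S k)(a * k)^T + v k^T, with k normalized.
    w is a per-channel decay and a is a per-channel learning rate."""
    k = normalize(k)
    old = matvec(S, k)  # value currently associated with key k
    return [
        [S[i][j] * w[j] - old[i] * a[j] * k[j] + v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]

def read(S, q):
    """Query the state with a (normalized) key."""
    return matvec(S, normalize(q))

# Toy usage: a 2x3 state that never grows, whatever the sequence length.
S = [[0.0] * 3 for _ in range(2)]
w = [1.0, 1.0, 1.0]  # no decay in this toy run
a = [1.0, 1.0, 1.0]  # full replacement of the old value
S = delta_step(S, [1, 0, 0], [1, 2], w, a)  # store (k1 -> v1)
S = delta_step(S, [1, 0, 0], [5, 6], w, a)  # overwrite k1 with a new value
S = delta_step(S, [0, 1, 0], [7, 8], w, a)  # store an orthogonal key
print(read(S, [1, 0, 0]))  # value stored under the first key
print(read(S, [0, 1, 0]))  # value stored under the second key
```

With `w` and `a` set to all ones, writing a key a second time replaces its old value outright, while values of `a` below one would only partially replace it. The state stays a 2x3 matrix throughout, which is the sense in which memory per token is constant in a recurrent model of this kind.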
