ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
January 26, 2025
Authors: Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao
cs.AI
Abstract
As is known, hybrid quadratic and subquadratic attention models in multi-head
architectures have surpassed both Transformer and Linear RNN models, with
these works primarily focusing on reducing KV complexity and improving
efficiency. For further research on expressiveness, we introduce our series of
models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which
aims to make RNNs more expressive and demonstrates state-tracking ability beyond
Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, an
alternative approach that reduces the entire knowledge-processing time to just
8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the
distillation process can utilize any LLM, not just Qwen, and enables knowledge
transfer from larger LLMs to smaller ones with fewer tokens. We will
explain the detailed process and share our insights on building more powerful
foundation models. Please note that this is an ongoing work that will be
updated continuously. The model checkpoints and source code are available at
https://github.com/yynil/RWKVInside and
https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.
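The abstract describes replacing transformer self-attention with RWKV-7 attention and distilling knowledge from Qwen 2.5; the exact procedure lives in the paper and the linked repository. As a rough illustration only, the sketch below shows one plausible form of such attention-level distillation: a frozen teacher attention block supervises a recurrent token-mixing replacement through a hidden-state MSE loss. The `SimpleRecurrentMix` module is a hypothetical toy stand-in, not the real RWKV-7 kernel, and the training setup is an assumption rather than the authors' code.

```python
# Toy sketch (assumption, not the authors' method): align a recurrent
# token mixer to a frozen transformer attention block via hidden-state MSE.
import torch
import torch.nn as nn

class SimpleRecurrentMix(nn.Module):
    """Toy linear-recurrent token mixer used in place of self-attention."""
    def __init__(self, d_model: int):
        super().__init__()
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the state is a running decayed sum
        # of key-gated values, so memory per step is constant in seq_len.
        b, t, d = x.shape
        state = x.new_zeros(b, d)
        w = torch.sigmoid(self.decay)  # decay in (0, 1)
        outputs = []
        for i in range(t):
            kv = torch.sigmoid(self.key(x[:, i])) * self.value(x[:, i])
            state = w * state + (1.0 - w) * kv
            outputs.append(self.out(state))
        return torch.stack(outputs, dim=1)

d_model, n_heads = 64, 4
teacher_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
student_mix = SimpleRecurrentMix(d_model)

for p in teacher_attn.parameters():  # the teacher stays frozen
    p.requires_grad_(False)

opt = torch.optim.AdamW(student_mix.parameters(), lr=1e-3)
x = torch.randn(2, 16, d_model)  # dummy hidden states for illustration

# Alignment step: make the student's mixer output match the teacher's
# attention output on the same inputs.
teacher_out, _ = teacher_attn(x, x, x, need_weights=False)
student_out = student_mix(x)
opt.zero_grad()
loss = nn.functional.mse_loss(student_out, teacher_out.detach())
loss.backward()
opt.step()
print(f"alignment loss: {loss.item():.4f}")
```

In the released models the student mixer would be an actual RWKV-7 time-mixing block inside each Qwen 2.5 layer, with the remaining weights inherited from the teacher; the toy recurrence above only conveys the general shape of replacing quadratic attention with a constant-state recurrent update.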