Blockwise Parallel Transformer for Long Context Large Models

Abstract

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showing exceptional performance across a wide range of AI applications. However, the memory demands of the self-attention mechanism and the large feedforward network limit their ability to handle long sequences, which creates challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, the Blockwise Parallel Transformer (BPT), which uses blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while remaining memory efficient, BPT enables training on sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
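
The abstract names two ingredients: blockwise computation of self-attention and fusion of the feedforward network into that blockwise loop. The following is a minimal NumPy sketch of one way to read that idea, not the authors' implementation. It assumes single-head, non-causal attention, a ReLU feedforward network, residual connections, and no layer normalization; the function names `reference_layer` and `blockwise_layer` and the `block` parameter are hypothetical. The blockwise version never builds the full n-by-n score matrix: it keeps a running softmax maximum, denominator, and accumulator per query block, then applies the feedforward network to that block before moving on.

```python
import numpy as np


def reference_layer(x, wq, wk, wv, w1, w2):
    """Standard attention + feedforward; materializes the full (n, n) score matrix."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(wq.shape[1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    h = x + probs @ v                        # residual around attention
    return h + np.maximum(h @ w1, 0.0) @ w2  # residual around ReLU feedforward


def blockwise_layer(x, wq, wk, wv, w1, w2, block=4):
    """Same output, computed one query block at a time (hypothetical sketch)."""
    n = x.shape[0]
    k, v = x @ wk, x @ wv
    out = np.empty_like(x)
    for qs in range(0, n, block):
        xq = x[qs:qs + block]
        q = (xq @ wq) / np.sqrt(wq.shape[1])
        rows = q.shape[0]
        m = np.full((rows, 1), -np.inf)      # running row-wise max of scores
        l = np.zeros((rows, 1))              # running softmax denominator
        acc = np.zeros((rows, v.shape[1]))   # running unnormalized output
        for ks in range(0, n, block):
            s = q @ k[ks:ks + block].T       # only a (block, block) tile of scores
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
            scale = np.exp(m - m_new)        # rescale earlier partial sums
            p = np.exp(s - m_new)
            l = l * scale + p.sum(axis=-1, keepdims=True)
            acc = acc * scale + p @ v[ks:ks + block]
            m = m_new
        h = xq + acc / l                     # attention output for this query block
        # Feedforward applied per block, so its hidden activations are
        # only ever held for `block` rows at a time.
        out[qs:qs + block] = h + np.maximum(h @ w1, 0.0) @ w2
    return out
```

Calling both functions on the same random inputs (for example `x` of shape `(10, 8)`, square projection matrices of shape `(8, 8)`, and feedforward weights of shapes `(8, 32)` and `(32, 8)`) should give outputs that agree to floating-point tolerance. The difference lies in peak memory: the blockwise version holds a `block`-by-`block` score tile and a `block`-row feedforward activation instead of their full-sequence counterparts.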
