Cached Transformers: Improving Transformers with Differentiable Memory Cache

Abstract

This work introduces Cached Transformer, a new Transformer model that uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention attends to both past and current tokens, which enlarges the receptive field of attention and lets the model explore long-range dependencies. By using a recurrent gating unit to continuously update the cache, our model achieves significant improvements on six language and vision tasks: language modeling, machine translation, ListOPs, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques on tasks such as language modeling and can be applied to a broader range of settings.
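
The abstract does not give the exact GRC formulation, so the following is only a minimal sketch of the general idea: a cache that is blended with the current tokens through a gate, and attention that covers both the cache and the current tokens. Everything specific here is an assumption made for illustration: the function names (`gated_cache_update`, `cached_attention`), the fixed scalar `gate` (which would be learned in a real model), the cache having the same length as the token sequence, and the use of simple concatenation of cache and tokens as keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Plain scaled dot-product attention on lists of vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
        )
    return outputs


def gated_cache_update(cache, tokens, gate):
    """Blend the old cache with the current tokens.

    Assumption: a GRU-style convex combination with a scalar gate in [0, 1];
    the paper's actual gating unit may differ.
    """
    return [
        [(1.0 - gate) * c + gate * x for c, x in zip(cache_row, token_row)]
        for cache_row, token_row in zip(cache, tokens)
    ]


def cached_attention(tokens, cache):
    """Let current tokens attend to both cached (past) and current tokens.

    Assumption: cache and tokens are simply concatenated as keys and values.
    """
    memory = cache + tokens
    return attend(tokens, memory, memory)


# Hypothetical toy input: two 2-dimensional tokens and an empty (zero) cache.
tokens = [[1.0, 0.0], [0.0, 1.0]]
cache = [[0.0, 0.0], [0.0, 0.0]]

output = cached_attention(tokens, cache)
cache = gated_cache_update(cache, tokens, gate=0.5)
```

Because every step is ordinary arithmetic, the cache update stays differentiable with respect to both the tokens and the gate, which is the property the abstract highlights.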
