Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas

Abstract

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as images, point clouds, videos, or event camera streams), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but account only partially for vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities and find that either PaPE or PaPE-RI achieves the top performance on 7 of the 8. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving by up to 10.5% in absolute terms over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
