Object Recognition as Next Token Prediction

Abstract

We present an approach that poses object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels as independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method, one-shot sampling, which samples tokens of multiple labels in parallel and ranks the generated labels by their probabilities during inference. To further improve efficiency, we propose a simple strategy for constructing a compact decoder: discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
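The two mask properties named in the abstract (image tokens as a prefix, labels independent of one another) can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_label_mask`, the token layout (image tokens first, then each label's tokens in order), and the choice to let image tokens attend to each other bidirectionally are all assumptions made for illustration.

```python
def build_label_mask(num_image_tokens, label_lengths):
    """Hypothetical helper: build a boolean attention mask.

    mask[q][k] is True when query position q may attend to key position k.
    Assumed layout: [image tokens][label 1 tokens][label 2 tokens]...
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Assumption: image tokens act as a prefix and see each other
    # in both directions (the non-causal part of the mask).
    for q in range(num_image_tokens):
        for k in range(num_image_tokens):
            mask[q][k] = True

    start = num_image_tokens
    for length in label_lengths:
        for q in range(start, start + length):
            # Every label token can see the full image prefix.
            for k in range(num_image_tokens):
                mask[q][k] = True
            # Within a label, attention is causal: a token sees itself
            # and earlier tokens of the same label only.
            for k in range(start, q + 1):
                mask[q][k] = True
        # Tokens of other labels stay masked, so labels are independent.
        start += length
    return mask


def render(mask):
    """Render the mask as rows of '1' (visible) and '0' (masked)."""
    return ["".join("1" if cell else "0" for cell in row) for row in mask]


if __name__ == "__main__":
    # Two image tokens, then a two-token label and a one-token label.
    for row in render(build_label_mask(2, [2, 1])):
        print(row)
```

Because no label token attends to another label's tokens, each label's next-token distribution depends only on the image prefix and that label's own earlier tokens. Under these assumptions, that is what would allow tokens of several labels to be sampled in one parallel pass.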
