Object Recognition as Next Token Prediction

December 4, 2023
Authors: Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim
cs.AI

Abstract

We present an approach that poses object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder with two key features: tokens from different labels are modeled as independent, and image tokens are treated as a prefix. This masking mechanism inspires an efficient method - one-shot sampling - that samples the tokens of multiple labels in parallel and ranks the generated labels by their probabilities during inference. To further improve efficiency, we propose a simple strategy for constructing a compact decoder: discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
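The attention mask described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the function name `build_prefix_label_mask` and its parameters are hypothetical. It also assumes that image tokens attend to one another bidirectionally and never attend to label tokens, which is one plausible reading of "treating image tokens as a prefix". Each label token sees the whole image prefix and only the earlier tokens of its own label, so tokens from different labels stay independent.

```python
def build_prefix_label_mask(num_image_tokens, label_lengths):
    """Return a square boolean mask as a list of lists.

    mask[q][k] is True when the query at position q may attend to the key
    at position k. Positions are laid out as: image tokens first, then the
    tokens of each label in order. (Hypothetical helper for illustration.)
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Assumption: image tokens form a prefix and attend to each other
    # in both directions, but not to any label token.
    for q in range(num_image_tokens):
        for k in range(num_image_tokens):
            mask[q][k] = True

    start = num_image_tokens
    for length in label_lengths:
        for i in range(length):
            q = start + i
            # Every label token sees the full image prefix.
            for k in range(num_image_tokens):
                mask[q][k] = True
            # Causal attention inside the label's own span only, so
            # tokens of other labels remain invisible (independent labels).
            for j in range(i + 1):
                mask[q][start + j] = True
        start += length
    return mask


if __name__ == "__main__":
    # Two image tokens, then a two-token label and a one-token label.
    for row in build_prefix_label_mask(2, [2, 1]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because no label can see another label, the next token of every label can be predicted in the same forward pass, which is what makes the parallel one-shot sampling described above possible.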