Object Recognition as Next Token Prediction

Abstract

We present an approach that poses object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder with two key features: tokens from different labels are modeled as independent, and image tokens are treated as a prefix. This masking mechanism inspires an efficient method, one-shot sampling, which samples the tokens of multiple labels in parallel and ranks the generated labels by their probabilities during inference. To further improve efficiency, we propose a simple strategy for constructing a compact decoder: discard the intermediate blocks of a pretrained language model. This yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
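
The attention mask described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_label_mask`, the flat layout (image tokens first, then the tokens of each label back to back), and the choice to let image tokens attend to each other bidirectionally are assumptions made for illustration. The sketch only shows the two stated properties: image tokens act as a prefix visible to every label token, and tokens of one label never attend to tokens of another label.

```python
from typing import List


def build_label_mask(num_image_tokens: int, label_lengths: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Hypothetical layout: [image tokens | label 1 tokens | label 2 tokens | ...].
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Assumption: the image prefix is fully visible to itself (non-causal).
    for q in range(num_image_tokens):
        for k in range(num_image_tokens):
            mask[q][k] = True

    start = num_image_tokens
    for length in label_lengths:
        for q in range(start, start + length):
            # Every label token sees the whole image prefix.
            for k in range(num_image_tokens):
                mask[q][k] = True
            # Causal attention inside the token's own label only,
            # so different labels stay independent of each other.
            for k in range(start, q + 1):
                mask[q][k] = True
        start += length
    return mask


if __name__ == "__main__":
    # Two image tokens, one label of two tokens, one label of one token.
    for row in build_label_mask(2, [2, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no label token depends on another label, the next token of every label can be predicted in the same forward pass, which is the property that one-shot sampling relies on.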
