Object Recognition as Next Token Prediction

December 4, 2023
Authors: Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim
cs.AI

Abstract

We present an approach that poses object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder with two key features: tokens from different labels are modeled as independent, and image tokens are treated as a prefix. This masking mechanism inspires an efficient method, one-shot sampling, that samples the tokens of multiple labels in parallel and ranks the generated labels by their probabilities during inference. To further improve efficiency, we propose a simple strategy for constructing a compact decoder: discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
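
The masking and ranking ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes a sequence laid out as image tokens followed by the tokens of each label, assumes image tokens attend to each other bidirectionally, and assumes a label's score is the product of its token probabilities. The function names `build_prefix_label_mask` and `rank_labels` are hypothetical.

```python
import math
import numpy as np


def build_prefix_label_mask(num_image_tokens, label_lengths):
    """Build a boolean attention mask; mask[q, k] is True if query q may attend to key k.

    Assumed layout: [image tokens] + [label 1 tokens] + [label 2 tokens] + ...
    """
    total = num_image_tokens + sum(label_lengths)
    mask = np.zeros((total, total), dtype=bool)

    # Image tokens form a prefix and (by assumption) see each other bidirectionally.
    mask[:num_image_tokens, :num_image_tokens] = True

    start = num_image_tokens
    for length in label_lengths:
        end = start + length
        # Every label token can attend to the whole image prefix.
        mask[start:end, :num_image_tokens] = True
        # Within one label, attention is causal (lower triangular).
        mask[start:end, start:end] = np.tril(np.ones((length, length), dtype=bool))
        # No entries are set between different labels, so labels stay independent.
        start = end
    return mask


def rank_labels(token_probs_by_label):
    """Rank labels by the product of their token probabilities (an assumed scoring rule)."""
    scored = [(label, math.prod(probs)) for label, probs in token_probs_by_label.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Two image tokens, then a two-token label and a one-token label.
    print(build_prefix_label_mask(2, [2, 1]).astype(int))
    # Made-up token probabilities, for illustration only.
    print(rank_labels({"dog": [0.9, 0.8], "cat": [0.5], "sofa": [0.6, 0.5]}))
```

Because no label token attends to tokens of another label, the next token of every label can be computed in the same forward pass, which is what makes parallel sampling of several labels possible under these assumptions.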