Analyzing The Language of Visual Tokens
November 7, 2024
Authors: David M. Chan, Rodolfo Corona, Joonyong Park, Cheol Jun Cho, Yutong Bai, Trevor Darrell
cs.AI
Abstract
With the introduction of transformer-based models for vision and language
tasks, such as LLaVA and Chameleon, there has been renewed interest in the
discrete tokenized representation of images. These models often treat image
patches as discrete tokens, analogous to words in natural language, learning
joint alignments between visual and human languages. However, little is known
about the statistical behavior of these visual languages: whether they follow
similar frequency distributions, grammatical structures, or topologies as
natural languages. In this paper, we take a natural-language-centric approach
to analyzing discrete visual languages and uncover striking similarities and
fundamental differences. We demonstrate that, although visual languages adhere
to Zipfian distributions, higher token innovation drives greater entropy and
lower compression, with tokens predominantly representing object parts,
indicating intermediate granularity. We also show that visual languages lack
cohesive grammatical structures, leading to higher perplexity and weaker
hierarchical organization compared to natural languages. Finally, we
demonstrate that, while vision models align more closely with natural languages
than other models, this alignment remains significantly weaker than the
cohesion found within natural languages. Through these experiments, we
demonstrate how understanding the statistical properties of discrete visual
languages can inform the design of more effective computer vision models.
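The claim that visual token streams follow Zipfian distributions can be illustrated with a short sketch: in a Zipfian language, token frequency falls off roughly as the inverse of frequency rank, so a log-log plot of frequency against rank is close to a line of slope -1. The snippet below is a minimal, self-contained illustration (not the paper's actual analysis pipeline): it draws synthetic "visual tokens" from a toy Zipfian vocabulary and fits the log-log slope by least squares. The vocabulary size, sample count, and `zipf_slope` helper are all illustrative assumptions.

```python
import collections
import math
import random

def zipf_slope(tokens):
    """Fit the slope of log(frequency) vs. log(rank) by least squares.

    A Zipfian token stream yields a slope near -1; the fit is an
    illustrative diagnostic, not the paper's methodology.
    """
    counts = sorted(collections.Counter(tokens).values(), reverse=True)
    xs = [math.log(rank + 1) for rank in range(len(counts))]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy example: sample token ids with probability proportional to 1/(rank+1),
# i.e. an idealized Zipfian "visual language" over a 1000-token codebook.
random.seed(0)
vocab_size = 1000
weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
tokens = random.choices(range(vocab_size), weights=weights, k=50_000)

print(f"fitted log-log slope: {zipf_slope(tokens):.2f}")
```

Running the same fit on real visual token streams (e.g. VQ codebook indices from a tokenized image dataset) in place of the synthetic sample is one way to probe how closely a given tokenizer's output tracks the Zipfian behavior described above.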