Analyzing The Language of Visual Tokens
November 7, 2024
Authors: David M. Chan, Rodolfo Corona, Joonyong Park, Cheol Jun Cho, Yutong Bai, Trevor Darrell
cs.AI
Abstract
With the introduction of transformer-based models for vision and language
tasks, such as LLaVA and Chameleon, there has been renewed interest in the
discrete tokenized representation of images. These models often treat image
patches as discrete tokens, analogous to words in natural language, learning
joint alignments between visual and human languages. However, little is known
about the statistical behavior of these visual languages - whether they follow
similar frequency distributions, grammatical structures, or topologies as
natural languages. In this paper, we take a natural-language-centric approach
to analyzing discrete visual languages and uncover striking similarities and
fundamental differences. We demonstrate that, although visual languages adhere
to Zipfian distributions, higher token innovation drives greater entropy and
lower compression, with tokens predominantly representing object parts,
indicating intermediate granularity. We also show that visual languages lack
cohesive grammatical structures, leading to higher perplexity and weaker
hierarchical organization compared to natural languages. Finally, we
demonstrate that, while vision models align more closely with natural languages
than other models, this alignment remains significantly weaker than the
cohesion found within natural languages. Through these experiments, we
demonstrate how understanding the statistical properties of discrete visual
languages can inform the design of more effective computer vision models.
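As a rough illustration of the kind of frequency analysis the abstract describes, the sketch below checks whether a sequence of discrete visual token IDs follows a Zipfian rank-frequency curve and computes its empirical entropy. This is a minimal sketch, not the authors' code: it assumes you already have integer codebook indices from some image tokenizer (e.g., a VQ-VAE/VQGAN), and the synthetic data and function name are purely illustrative.

```python
# Illustrative sketch (not the authors' code): test whether discrete visual
# token frequencies roughly follow a Zipfian (power-law) rank-frequency curve,
# and compute the empirical per-token entropy mentioned in the abstract.
from collections import Counter

import numpy as np


def zipf_slope_and_entropy(token_ids):
    """Return (zipf_slope, entropy_bits) for a sequence of discrete tokens.

    zipf_slope is the least-squares slope of log(frequency) vs. log(rank);
    a value near -1 corresponds to the classic Zipfian regime.
    entropy_bits is the empirical Shannon entropy of the token distribution.
    """
    counts = np.array(sorted(Counter(token_ids).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1, dtype=float)

    # Fit log(frequency) = slope * log(rank) + intercept in log-log space.
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), deg=1)

    probs = counts / counts.sum()
    entropy_bits = float(-(probs * np.log2(probs)).sum())
    return float(slope), entropy_bits


if __name__ == "__main__":
    # Synthetic stand-in for real tokenizer output: sample IDs from a
    # Zipf-like distribution over a hypothetical 1024-entry codebook.
    rng = np.random.default_rng(0)
    vocab = 1024
    weights = 1.0 / np.arange(1, vocab + 1)
    weights /= weights.sum()
    token_ids = rng.choice(vocab, size=100_000, p=weights)

    slope, h = zipf_slope_and_entropy(token_ids)
    print(f"Zipf slope ~ {slope:.2f}, entropy ~ {h:.2f} bits/token")
```

In this framing, a shallower slope or higher entropy would indicate a token vocabulary that is used more uniformly than natural-language words, which connects to the abstract's observation that higher token innovation drives greater entropy and lower compression.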