Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
May 29, 2025
Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan
cs.AI
Abstract
Rebus puzzles, visual riddles that encode language through imagery, spatial
arrangement, and symbolic substitution, pose a unique challenge to current
vision-language models (VLMs). Unlike traditional image captioning or question
answering tasks, rebus solving requires multi-modal abstraction, symbolic
reasoning, and a grasp of cultural, phonetic, and linguistic puns. In this
paper, we investigate the capacity of contemporary VLMs to interpret and solve
rebus puzzles by constructing a hand-generated and annotated benchmark of
diverse English-language rebus puzzles, ranging from simple pictographic
substitutions to spatially-dependent cues ("head" over "heels"). We analyze how
different VLMs perform, and our findings reveal that while VLMs exhibit some
surprising capabilities in decoding simple visual clues, they struggle
significantly with tasks requiring abstract reasoning, lateral thinking, and
understanding visual metaphors.