Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Abstract

Rebus puzzles are visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, and they pose a distinctive challenge for current vision-language models (VLMs). Unlike traditional image captioning or question answering, solving a rebus requires multimodal abstraction, symbolic reasoning, and a grasp of cultural, phonetic, and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated, annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially dependent cues (e.g., "head" over "heels"). We analyze how different VLMs perform on this benchmark. Our findings reveal that while VLMs show some surprising capability in decoding simple visual clues, they struggle significantly on tasks that require abstract reasoning, lateral thinking, and an understanding of visual metaphors.
