$\left|\,\circlearrowright\,\text{BUS}\,\right|$: A Large and Diverse Multimodal Benchmark for Evaluating the Ability of Vision-Language Models to Understand Rebus Puzzles
November 3, 2025
Authors: Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S
cs.AI
Abstract
Understanding Rebus Puzzles (puzzles that use pictures, symbols, and letters to represent words or phrases creatively) requires a range of abilities, including image recognition, cognitive and commonsense reasoning, multi-step reasoning, and image-based wordplay, making this a challenging task even for current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\text{BUS}\,\right|$, a large and diverse benchmark of 1,333 English Rebus Puzzles spanning different artistic styles and difficulty levels, spread across 18 categories such as food, idioms, sports, finance, and entertainment. We also propose RebusDescProgICE, a model-agnostic framework that combines an unstructured description with code-based, structured reasoning and improved, reasoning-based in-context example selection. Compared to Chain-of-Thought reasoning, it improves the performance of Vision-Language Models on $\left|\,\circlearrowright\,\text{BUS}\,\right|$ by 2.1-4.1% with closed-source models and by 20-30% with open-source models.
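
The abstract describes the framework only at a high level, so the following is a minimal, hypothetical sketch of that idea: an unstructured image description is paired with code-style structured reasoning, and in-context examples are chosen by comparing the query against stored reasoning traces. All names here (`Example`, `select_in_context_examples`, `build_prompt`, `solve_rebus`, the mock `vlm_call`) are placeholders, not the authors' implementation or API.

```python
# Illustrative sketch (not the paper's code) of a description + code-based
# reasoning prompt with reasoning-based in-context example selection.
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Example:
    description: str   # unstructured description of a solved puzzle
    reasoning: str     # code-like, structured reasoning trace
    answer: str        # gold phrase the rebus encodes


def select_in_context_examples(query_desc: str, pool: List[Example], k: int = 3) -> List[Example]:
    """Pick the k examples whose reasoning traces best match the query description.

    A real system would likely use an embedding model; string similarity is a stand-in.
    """
    scored = sorted(
        pool,
        key=lambda ex: SequenceMatcher(None, query_desc, ex.reasoning).ratio(),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query_desc: str, examples: List[Example]) -> str:
    """Combine the unstructured description with code-style worked examples."""
    parts = ["Solve the rebus puzzle. Reason step by step as pseudo-code."]
    for ex in examples:
        parts.append(
            f"Description: {ex.description}\nReasoning:\n{ex.reasoning}\nAnswer: {ex.answer}"
        )
    parts.append(f"Description: {query_desc}\nReasoning:")
    return "\n\n".join(parts)


def solve_rebus(image_description: str, pool: List[Example], vlm_call: Callable[[str], str]) -> str:
    """End-to-end sketch: select examples, build the prompt, query a (mock) model."""
    examples = select_in_context_examples(image_description, pool)
    return vlm_call(build_prompt(image_description, examples))


if __name__ == "__main__":
    pool = [
        Example(
            description="The letters 'STAND' printed underneath the letter 'I'.",
            reasoning="parts = ['I', 'under', 'STAND']; answer = 'I understand'",
            answer="I understand",
        ),
    ]
    mock_vlm = lambda prompt: "I understand"  # stand-in for a real VLM call
    print(solve_rebus("The word 'STAND' drawn below a large letter I.", pool, mock_vlm))
```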