$\left|\,\circlearrowright\,\text{BUS}\,\right|$: A Large and Diverse Multimodal Benchmark for Evaluating the Ability of Vision-Language Models to Understand Rebus Puzzles
November 3, 2025
Authors: Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S
cs.AI
Abstract
Understanding Rebus Puzzles (puzzles that use pictures, symbols, and letters to represent words or phrases creatively) requires a range of abilities, including image recognition, cognitive and commonsense reasoning, multi-step reasoning, and image-based wordplay, making this a challenging task even for current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\text{BUS}\,\right|$, a large and diverse benchmark of 1,333 English Rebus Puzzles spanning different artistic styles and difficulty levels, spread across 18 categories such as food, idioms, sports, finance, and entertainment. We also propose RebusDescProgICE, a model-agnostic framework that combines an unstructured description with code-based, structured reasoning and improved, reasoning-based in-context example selection. Compared to Chain-of-Thought reasoning, it improves the performance of Vision-Language Models on $\left|\,\circlearrowright\,\text{BUS}\,\right|$ by 2.1-4.1% with closed-source models and by 20-30% with open-source models.
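
The abstract describes the framework only at a high level, so the following is a minimal, hypothetical sketch of that idea: an unstructured image description is paired with code-style structured reasoning, and in-context examples are chosen by comparing the query against stored reasoning traces. All names here (`Example`, `select_in_context_examples`, `build_prompt`, `solve_rebus`, the mock `vlm_call`) are placeholders, not the authors' implementation or API.

```python
# Illustrative sketch (not the paper's code) of a description + code-based
# reasoning prompt with reasoning-based in-context example selection.
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Example:
    description: str   # unstructured description of a solved puzzle
    reasoning: str     # code-like, structured reasoning trace
    answer: str        # gold phrase the rebus encodes


def select_in_context_examples(query_desc: str, pool: List[Example], k: int = 3) -> List[Example]:
    """Pick the k examples whose reasoning traces best match the query description.

    A real system would likely use an embedding model; string similarity is a stand-in.
    """
    scored = sorted(
        pool,
        key=lambda ex: SequenceMatcher(None, query_desc, ex.reasoning).ratio(),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query_desc: str, examples: List[Example]) -> str:
    """Combine the unstructured description with code-style worked examples."""
    parts = ["Solve the rebus puzzle. Reason step by step as pseudo-code."]
    for ex in examples:
        parts.append(
            f"Description: {ex.description}\nReasoning:\n{ex.reasoning}\nAnswer: {ex.answer}"
        )
    parts.append(f"Description: {query_desc}\nReasoning:")
    return "\n\n".join(parts)


def solve_rebus(image_description: str, pool: List[Example], vlm_call: Callable[[str], str]) -> str:
    """End-to-end sketch: select examples, build the prompt, query a (mock) model."""
    examples = select_in_context_examples(image_description, pool)
    return vlm_call(build_prompt(image_description, examples))


if __name__ == "__main__":
    pool = [
        Example(
            description="The letters 'STAND' printed underneath the letter 'I'.",
            reasoning="parts = ['I', 'under', 'STAND']; answer = 'I understand'",
            answer="I understand",
        ),
    ]
    mock_vlm = lambda prompt: "I understand"  # stand-in for a real VLM call
    print(solve_rebus("The word 'STAND' drawn below a large letter I.", pool, mock_vlm))
```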