Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Abstract

When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet they struggle to extend this capability to text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting gives multimodal large language models a metaphorical "whiteboard" on which to draw out reasoning steps as images, then feeds these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves 0% accuracy, while whiteboard-of-thought enables up to 92% accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
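
The loop described in the abstract (the model writes drawing code, the code is rendered to an image, and the image goes back to the model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `whiteboard_of_thought`, the two callables standing in for a multimodal model, the prompt wording, and the `whiteboard.png` file name are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def whiteboard_of_thought(
    query: str,
    write_drawing_code: Callable[[str], str],   # hypothetical: prompt -> Python source
    read_image: Callable[[str, bytes], str],    # hypothetical: (query, image bytes) -> answer
    image_name: str = "whiteboard.png",         # assumed output file name
    timeout_s: float = 30.0,
) -> str:
    """Draw a reasoning step as an image, then answer the query from that image."""
    # Step 1: ask the model for code that visualizes the query.
    # The prompt wording here is an assumption, not taken from the paper.
    prompt = (
        "Write Python code (for example with Matplotlib or Turtle) that draws "
        f"a visualization for this query and saves it to '{image_name}'.\n"
        f"Query: {query}"
    )
    code = write_drawing_code(prompt)

    # Step 2: run the generated code in a scratch directory to render the image.
    # Model-written code is untrusted; a real system should sandbox this step.
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "draw.py"
        script.write_text(code, encoding="utf-8")
        subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            check=True,
            timeout=timeout_s,
            capture_output=True,
        )
        image = (Path(workdir) / image_name).read_bytes()

    # Step 3: hand the rendered image back to the model for the final answer.
    return read_image(query, image)
```

In practice both callables would wrap calls to the same multimodal model, once with a text-only prompt and once with the rendered image attached. Keeping them as plain parameters leaves the sketch independent of any particular model API.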
