Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Abstract

When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet they struggle to extend this capability to text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting gives a multimodal large language model a metaphorical "whiteboard" on which to draw out reasoning steps as images, then returns these images to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves 0% accuracy, while whiteboard-of-thought enables up to 92% accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
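
The loop described in the abstract (the model writes drawing code, the code is rendered to an image, and the image is returned to the model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask_model` is a hypothetical callable standing in for any multimodal model API, the prompt wording is invented for the example, and `strip_fences`, `render_matplotlib`, and `whiteboard_of_thought` are illustrative names.

```python
import io
from typing import Callable, Optional

# Hypothetical signature: (text prompt, optional PNG bytes) -> model's text reply.
ModelFn = Callable[[str, Optional[bytes]], str]
# Hypothetical signature: drawing code -> rendered PNG bytes.
RenderFn = Callable[[str], bytes]

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap code in


def strip_fences(reply: str) -> str:
    """Drop a leading and trailing Markdown fence line from a reply, if present."""
    lines = reply.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].startswith(FENCE):
        lines = lines[:-1]
    return "\n".join(lines)


def render_matplotlib(code: str) -> bytes:
    """Run model-written Matplotlib code headlessly and return the figure as PNG bytes."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, so no display is needed
    import matplotlib.pyplot as plt

    plt.close("all")
    # NOTE: exec runs untrusted code; a real system should sandbox this step.
    exec(code, {"plt": plt})
    buffer = io.BytesIO()
    plt.savefig(buffer, format="png")
    plt.close("all")
    return buffer.getvalue()


def whiteboard_of_thought(question: str, ask_model: ModelFn,
                          render: RenderFn = render_matplotlib) -> str:
    # Step 1: ask only for visualization code, not for a final answer.
    draw_prompt = (
        "Write Python code using Matplotlib that draws a visualization "
        "helpful for answering the query below. Reply with code only.\n\n"
        f"Query: {question}"
    )
    code = strip_fences(ask_model(draw_prompt, None))
    # Step 2: execute the code to produce the "whiteboard" image.
    image = render(code)
    # Step 3: hand the image back to the model together with the query.
    answer_prompt = f"Using the attached image, answer the query: {question}"
    return ask_model(answer_prompt, image)
```

Passing the renderer in as a parameter keeps the loop itself free of any plotting dependency, so a Turtle-based renderer (or a sandboxed executor) could be swapped in without changing the two model calls.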
