Thyme: Think Beyond Images

Abstract

Following OpenAI's introduction of the "thinking with images" concept, recent work has explored stimulating the use of visual information during reasoning to improve model performance on perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as that of proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm that enables multimodal large language models (MLLMs) to go beyond existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only supports a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: initial supervised fine-tuning (SFT) on a curated dataset of 500K samples to teach code generation, followed by a reinforcement learning (RL) phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly on challenging high-resolution perception and complex reasoning tasks.
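The idea of applying distinct temperatures to text and code generation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `in_code_block` flag, and the default temperatures (1.0 for text, 0.0 for code) are assumptions chosen only to show how a sampler might explore during reasoning while decoding code near-deterministically.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Return token probabilities; a temperature of 0 degenerates to greedy argmax."""
    if temperature <= 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, in_code_block, text_temp=1.0, code_temp=0.0, rng=random):
    """Sample one token index, switching temperature by generation mode.

    text_temp and code_temp are hypothetical defaults, not values from the paper.
    """
    temperature = code_temp if in_code_block else text_temp
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

In a full decoding loop, `in_code_block` would be toggled when the model opens or closes a code segment, so that free-form reasoning tokens are sampled at the higher temperature and code tokens at the lower one.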

