Thyme: Think Beyond Images
August 15, 2025
Authors: Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou
cs.AI
Abstract
Following OpenAI's introduction of the "thinking with images" concept,
recent efforts have explored stimulating the use of visual information in the
reasoning process to enhance model performance in perception and reasoning
tasks. However, to the best of our knowledge, no open-source work currently
offers a feature set as rich as that of proprietary models such as O3, which can perform
diverse image manipulations and simultaneously enhance logical reasoning
capabilities through code. In this paper, we make a preliminary attempt in this
direction by introducing Thyme (Think Beyond Images), a novel paradigm that
enables multimodal large language models (MLLMs) to transcend existing
"think with images" approaches by
autonomously generating and executing diverse image processing and
computational operations via executable code. This approach not only
facilitates a rich, on-the-fly set of image manipulations (e.g., cropping,
rotation, contrast enhancement) but also allows for mathematical computations,
all while maintaining high autonomy in deciding when and how to apply these
operations. We activate this capability through a two-stage training strategy:
an initial supervised fine-tuning (SFT) stage on a curated dataset of 500K
samples to teach code generation, followed by a reinforcement learning (RL)
phase to refine decision-making. For the RL stage, we manually
collect and design high-resolution question-answer pairs to increase the
learning difficulty, and we propose GRPO-ATS (Group Relative Policy
Optimization with Adaptive Temperature Sampling), an algorithm that applies
distinct temperatures to text and code generation to balance reasoning
exploration with code execution precision. We conduct extensive experimental
analysis and ablation studies. Comprehensive evaluations on nearly 20
benchmarks show that Thyme yields significant and consistent performance gains,
particularly in challenging high-resolution perception and complex reasoning
tasks.
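
The idea of applying distinct temperatures to text and code generation can be illustrated with a minimal sketch. It is not the authors' GRPO-ATS implementation: the abstract does not say how code spans are detected or which temperatures are used, so the `<code>` and `</code>` markers, the two temperature constants, and all function names below are hypothetical choices made only for illustration.

```python
import math
import random

# Assumed values, not taken from the paper: a higher temperature keeps
# natural-language reasoning exploratory, while temperature 0 (greedy
# decoding) keeps generated code deterministic.
TEXT_TEMPERATURE = 1.0
CODE_TEMPERATURE = 0.0


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Sample a token index; a temperature of 0 or less means greedy argmax."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def temperature_for(tokens_so_far, code_open="<code>", code_close="</code>"):
    """Pick the temperature for the next token from the generated prefix.

    The marker tokens are hypothetical; a real system would use whatever
    delimiters its model emits around executable code.
    """
    inside_code = False
    for tok in tokens_so_far:
        if tok == code_open:
            inside_code = True
        elif tok == code_close:
            inside_code = False
    return CODE_TEMPERATURE if inside_code else TEXT_TEMPERATURE
```

In a decoding loop, one would call `temperature_for` on the tokens generated so far and pass the result to `sample_token`, so that exploration happens in the reasoning text while code tokens are chosen greedily.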