Thyme: Think Beyond Images
August 15, 2025
作者: Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou
cs.AI
Abstract
Following OpenAI's introduction of the "thinking with images" concept,
recent efforts have explored stimulating the use of visual information in the
reasoning process to enhance model performance in perception and reasoning
tasks. However, to the best of our knowledge, no open-source work currently
offers a feature set as rich as that of proprietary models such as O3, which can perform
diverse image manipulations and simultaneously enhance logical reasoning
capabilities through code. In this paper, we make a preliminary attempt in this
direction by introducing Thyme (Think Beyond Images), a novel paradigm for
enabling multimodal large language models (MLLMs) to transcend existing "think with images" approaches by
autonomously generating and executing diverse image processing and
computational operations via executable code. This approach not only
facilitates a rich, on-the-fly set of image manipulations (e.g., cropping,
rotation, contrast enhancement) but also allows for mathematical computations,
all while maintaining high autonomy in deciding when and how to apply these
operations. We activate this capability through a two-stage training strategy:
an initial supervised fine-tuning (SFT) stage on a curated dataset of 500K samples to teach
code generation, followed by a reinforcement learning (RL) phase to refine
decision-making. For the RL stage, we manually
collect and design high-resolution question-answer pairs to increase the
learning difficulty, and we propose GRPO-ATS (Group Relative Policy
Optimization with Adaptive Temperature Sampling), an algorithm that applies
distinct temperatures to text and code generation to balance reasoning
exploration with code execution precision. We conduct extensive experimental
analysis and ablation studies. Comprehensive evaluations on nearly 20
benchmarks show that Thyme yields significant and consistent performance gains,
particularly in challenging high-resolution perception and complex reasoning
tasks.
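To make the paradigm concrete, the following is a minimal sketch of the execution side of such a pipeline, assuming the model wraps its snippet in `<code>...</code>` delimiters and reads/writes the `img` and `result` variables; these conventions, and the sandbox itself, are illustrative assumptions rather than Thyme's actual interface.

```python
# Hypothetical sketch of the execution side of a Thyme-style pipeline:
# the MLLM interleaves reasoning with an executable snippet, the snippet
# is extracted and run against the current image, and the processed image
# (or computed value) is fed back to the model as a new observation.

import re
from PIL import Image, ImageEnhance

# What the model might emit when a detail in a high-resolution image is
# too small to read: crop the region of interest and raise its contrast.
model_output = (
    "The label is in the top-left quadrant; let me zoom in.\n"
    "<code>\n"
    "w, h = img.size\n"
    "result = img.crop((0, 0, w // 2, h // 2))\n"
    "result = ImageEnhance.Contrast(result).enhance(1.5)\n"
    "</code>"
)

def extract_code(text: str) -> str | None:
    """Return the last delimited snippet, if the model chose to emit one."""
    blocks = re.findall(r"<code>\n(.*?)</code>", text, flags=re.DOTALL)
    return blocks[-1] if blocks else None

def execute(snippet: str, img: Image.Image):
    """Run the snippet with the current image bound to `img`; the snippet
    stores its output in `result` (a new image or a computed value)."""
    scope = {"img": img, "ImageEnhance": ImageEnhance}
    exec(snippet, scope)  # a real system would isolate this in a sandbox
    return scope.get("result")

image = Image.new("RGB", (2048, 2048))   # stand-in for a high-resolution input
snippet = extract_code(model_output)
if snippet is not None:
    new_view = execute(snippet, image)   # appended to the context as a new image
```

In practice the processed image or computed value would be appended to the conversation as a new observation, letting the model decide autonomously whether further operations are needed.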
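The adaptive temperature idea behind GRPO-ATS can likewise be sketched as a decoding-time switch: reasoning tokens are sampled at a higher temperature to preserve exploration, while tokens inside a code block are sampled at a much lower temperature so the generated program stays precise. The delimiter tokens and temperature values below are assumptions for illustration, not the paper's reported settings.

```python
# Minimal sketch of temperature switching between text and code spans,
# in the spirit of GRPO-ATS. The <code>/<\/code> delimiters and the
# temperature values are illustrative assumptions.

import numpy as np

TEXT_TEMPERATURE = 1.0   # exploratory reasoning (assumed value)
CODE_TEMPERATURE = 0.1   # near-greedy code generation (assumed value)
CODE_OPEN, CODE_CLOSE = "<code>", "</code>"

def sample(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Temperature-scaled softmax sampling over a next-token logit vector."""
    if temperature <= 0:
        return int(np.argmax(logits))
    z = logits / temperature
    z -= z.max()                          # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(logits), p=p))

def generate(step_logits, vocab, max_tokens: int = 256, seed: int = 0) -> list[str]:
    """Decode with `step_logits(prefix)` returning next-token logits;
    switch temperature whenever a code delimiter token is produced."""
    rng = np.random.default_rng(seed)
    prefix, in_code = [], False
    for _ in range(max_tokens):
        temp = CODE_TEMPERATURE if in_code else TEXT_TEMPERATURE
        tok = vocab[sample(step_logits(prefix), temp, rng)]
        prefix.append(tok)
        if tok == CODE_OPEN:
            in_code = True                # precise decoding inside code
        elif tok == CODE_CLOSE:
            in_code = False               # back to exploratory decoding
    return prefix
```

The caller supplies `step_logits` (the policy model's next-token logits) and `vocab`; during RL rollouts, sampling the code span near-greedily reduces execution failures while the higher text temperature keeps the reasoning trajectories diverse.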