Thyme: Think Beyond Images

Abstract

Following OpenAI's introduction of the "thinking with images" concept, recent work has explored using visual information in the reasoning process to improve model performance on perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as that of proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm that enables multimodal large language models (MLLMs) to go beyond existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only supports a rich set of on-the-fly image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial supervised fine-tuning (SFT) stage on a curated dataset of 500K samples to teach code generation, followed by a reinforcement learning (RL) phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
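The idea of applying distinct temperatures to text and code generation can be illustrated with a minimal sketch. This is not the authors' implementation: the temperature values, the `<code>`/`</code>` delimiters, and the helper names `temperature_for` and `sample_token` are all hypothetical, chosen only to show how a sampler might explore more freely in prose while decoding code near-deterministically.

```python
import math
import random

# Hypothetical values: the abstract does not state the temperatures used.
TEXT_TEMPERATURE = 1.0   # higher, to encourage exploration in reasoning text
CODE_TEMPERATURE = 0.0   # lower, to keep generated code precise

# Hypothetical delimiters marking a code segment in the output stream.
CODE_OPEN, CODE_CLOSE = "<code>", "</code>"


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits (temperature > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick the argmax at temperature 0, otherwise sample from the softmax."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def temperature_for(generated_tokens):
    """Return the code temperature while inside an unclosed code segment."""
    inside_code = False
    for tok in generated_tokens:
        if tok == CODE_OPEN:
            inside_code = True
        elif tok == CODE_CLOSE:
            inside_code = False
    return CODE_TEMPERATURE if inside_code else TEXT_TEMPERATURE
```

In a decoding loop, one would call `temperature_for` on the tokens generated so far and pass the result to `sample_token` at each step, so the switch between exploratory and precise sampling follows the model's own decision to open or close a code segment.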
