Thyme: Think Beyond Images

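Abstract

Since OpenAI introduced the concept of "thinking with images", recent work has explored how to use visual information during reasoning to improve model performance on perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as that of proprietary models such as O3, which can perform diverse image manipulations while also strengthening logical reasoning through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm that enables Multimodal Large Language Models (MLLMs) to go beyond existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach supports a rich set of on-the-fly image manipulations (e.g., cropping, rotation, contrast enhancement) as well as mathematical computations, while the model retains high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial Supervised Fine-Tuning (SFT) stage on a curated dataset of 500K samples to teach code generation, followed by a Reinforcement Learning (RL) stage to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly on challenging high-resolution perception and complex reasoning tasks.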
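The idea of applying distinct temperatures to text and code generation can be illustrated with a small sketch. This is not the authors' implementation of GRPO-ATS: the temperature values, the function names, and the per-token flag marking whether a token lies inside a code block are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how a sampler might switch between exploratory sampling for reasoning text and near-deterministic decoding for code.

```python
import math
import random

# Hypothetical values: the abstract only states that text and code
# generation use distinct temperatures, not what those temperatures are.
TEXT_TEMPERATURE = 1.0  # assumed: keep reasoning text exploratory
CODE_TEMPERATURE = 0.0  # assumed: decode code greedily for precision


def pick_temperature(in_code_block: bool) -> float:
    """Return the sampling temperature for the current token position."""
    return CODE_TEMPERATURE if in_code_block else TEXT_TEMPERATURE


def sample_token(logits, temperature, rng):
    """Sample a token index from raw logits at the given temperature.

    A temperature of 0 (or below) falls back to greedy argmax decoding.
    """
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_sequence(step_logits, code_flags, seed=0):
    """Sample one token per step, switching temperature by segment type.

    step_logits: list of per-step logit lists (stand-in for model outputs).
    code_flags:  list of booleans, True where the step is inside code.
    """
    rng = random.Random(seed)
    return [
        sample_token(logits, pick_temperature(flag), rng)
        for logits, flag in zip(step_logits, code_flags)
    ]
```

In this sketch, calling `sample_sequence` with every flag set to `True` reduces to greedy decoding, while text positions are drawn from the full softmax distribution. In a real system the flag would have to come from tracking code delimiters in the generated output, which is not shown here.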

