SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Abstract

In this paper, we present an effective method for enhancing visual reasoning with significantly fewer training samples, relying purely on self-improvement with no knowledge distillation. Our key insight is that the difficulty of the training data used in reinforcement fine-tuning (RFT) is critical: appropriately challenging samples can substantially boost reasoning capabilities even when the dataset is small. Although this is intuitive, the main challenge lies in quantifying sample difficulty accurately enough to enable effective data filtering. To this end, we propose a novel way of repurposing Monte Carlo Tree Search (MCTS) to measure difficulty. Starting from our curated set of 70k open-source training samples, we introduce an MCTS-based selection method that quantifies the difficulty of each sample by the number of iterations the VLM requires to solve it. The explicit step-by-step reasoning in MCTS forces the model to think longer and better identifies samples that are genuinely challenging. We filter and retain 11k samples and perform RFT on Qwen2.5-VL-7B-Instruct, resulting in our final model, ThinkLite-VL. Evaluation results on eight benchmarks show that ThinkLite-VL improves the average performance of Qwen2.5-VL-7B-Instruct by 7%, using only 11k training samples and no knowledge distillation. This significantly outperforms all existing 7B-level reasoning VLMs, as well as our comparable baselines that use classic selection methods such as accuracy-based filtering. Notably, on MathVista, ThinkLite-VL-7B achieves a SoTA accuracy of 75.1, surpassing Qwen2.5-VL-72B, GPT-4o, and O1. Our code, data, and model are available at https://github.com/si0wang/ThinkLite-VL.
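The selection idea in the abstract (rank samples by how many MCTS iterations the model needs, then keep the hard ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Sample` record, the `select_hard_samples` helper, the `min_iterations` threshold, and the choice to keep samples the search never solved are all assumptions made for illustration, since the abstract does not specify them. The iteration counts are assumed to come from a separate MCTS run that is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """Hypothetical record for one training sample."""
    sample_id: str
    # MCTS iterations the model needed to reach a correct answer,
    # or None if it never solved the problem within the search budget.
    iterations_to_solve: Optional[int]


def select_hard_samples(
    samples: List[Sample],
    min_iterations: int,
    keep_unsolved: bool = True,
) -> List[Sample]:
    """Keep samples whose iteration count marks them as difficult.

    min_iterations and keep_unsolved are illustrative knobs (assumptions),
    not values taken from the paper.
    """
    selected = []
    for sample in samples:
        if sample.iterations_to_solve is None:
            # Unsolved within the budget: treated here as the hardest case.
            if keep_unsolved:
                selected.append(sample)
        elif sample.iterations_to_solve >= min_iterations:
            selected.append(sample)
    return selected


if __name__ == "__main__":
    pool = [
        Sample("a", 1),     # solved almost immediately: likely too easy
        Sample("b", 7),     # needed several iterations
        Sample("c", None),  # never solved within the budget
        Sample("d", 3),
    ]
    hard = select_hard_samples(pool, min_iterations=5)
    print([s.sample_id for s in hard])
```

Under these assumptions, the filter discards samples the model solves almost immediately and retains those that demand a longer search, which is the kind of subset the abstract describes using for RFT.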

