From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Abstract

As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision-making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or to provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop in which diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating that DPE is a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
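
The abstract does not say how the data mixture is adjusted after each diagnosis round. As an illustration only, the following minimal sketch assumes that each weakness category is sampled in proportion to its diagnosed failure rate, with a small floor so that no category vanishes from training. The function name `update_mixture`, the `floor` parameter, and the category names are hypothetical and are not taken from the DPE code release.

```python
def update_mixture(failures, attempts, floor=0.05):
    """Return normalized sampling weights per weakness category.

    failures / attempts: dicts mapping a category name to counts from the
    latest diagnosis round. Higher failure rates receive more weight.
    `floor` is an assumed lower bound on the unnormalized weight so that
    categories the model already handles well are still sampled.
    """
    # Failure rate per category; skip categories that were never tested.
    rates = {
        cat: failures.get(cat, 0) / n
        for cat, n in attempts.items()
        if n > 0
    }
    if not rates:
        raise ValueError("no diagnosed categories with attempts > 0")

    # Apply the floor, then normalize so the weights sum to 1.
    raw = {cat: max(rate, floor) for cat, rate in rates.items()}
    total = sum(raw.values())
    return {cat: value / total for cat, value in raw.items()}


# One hypothetical diagnosis round over three illustrative categories.
attempts = {"ocr": 100, "chart": 100, "math": 100}
failures = {"ocr": 10, "chart": 30, "math": 60}
weights = update_mixture(failures, attempts)
```

In a spiral loop of the kind described above, the returned weights would set how many new samples the agents generate per category before the next reinforcement step, after which the updated model is diagnosed again and the weights are recomputed.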
