From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models
February 26, 2026
Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
cs.AI
Abstract
As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision-making. Yet existing training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral-loop framework in which diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating that DPE is a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
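The spiral loop described in the abstract can be sketched as a minimal control flow: diagnose the model, attribute failures to weakness categories, reweight the data mixture, generate weakness-focused samples, and reinforce. This is only an illustrative outline under assumed interfaces; the function names, the weakness taxonomy, and the smoothing scheme are all hypothetical placeholders, not the authors' implementation.

```python
import random
from collections import Counter

# Hypothetical weakness taxonomy; the paper's actual categories may differ.
WEAKNESSES = ["ocr", "spatial_reasoning", "counting", "chart_qa"]

def diagnose(model, probes):
    """Run diagnostic probes and tally failures per weakness category."""
    failures = Counter()
    for category, question, answer in probes:
        if model(question) != answer:
            failures[category] += 1
    return failures

def reweight_mixture(failures, smoothing=1.0):
    """Turn failure counts into a sampling mixture over weakness categories
    (additive smoothing keeps every category represented)."""
    weights = {c: failures.get(c, 0) + smoothing for c in WEAKNESSES}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def generate_targeted_data(mixture, n_samples, make_sample):
    """Draw weakness-focused training samples according to the mixture;
    make_sample stands in for the paper's multi-agent data generation."""
    cats = list(mixture)
    probs = [mixture[c] for c in cats]
    return [make_sample(random.choices(cats, probs)[0]) for _ in range(n_samples)]

def dpe_loop(model, probes, train_step, make_sample, rounds=3, n_samples=100):
    """One spiral: diagnose -> reweight -> generate -> reinforce -> re-diagnose."""
    for _ in range(rounds):
        failures = diagnose(model, probes)
        mixture = reweight_mixture(failures)
        batch = generate_targeted_data(mixture, n_samples, make_sample)
        model = train_step(model, batch)  # stand-in for the RL/SFT update
    return model
```

The key design point the sketch captures is that the mixture is recomputed from a fresh diagnosis every round, so reinforcement follows the model's current blind spots rather than a fixed curriculum.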