From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models
February 26, 2026
Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
cs.AI
Abstract
As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision-making. Yet current training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral-loop framework in which diagnosis steers data generation and model reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive amounts of unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes model failures to specific weaknesses, dynamically adjusts the data mixture, and guides the agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, establishing DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
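The diagnose-then-reinforce spiral described in the abstract can be sketched as a simple control loop. This is a minimal illustration, not the authors' released implementation: every function name here (`diagnose`, `attribute`, `generate`, `reinforce`) is a hypothetical placeholder for the corresponding stage of DPE.

```python
# Hypothetical sketch of the DPE spiral loop: each round re-diagnoses the
# *updated* model, so reinforcement targets current blind spots rather than
# a static data mixture. All names are illustrative, not the actual API.
from typing import Callable

def dpe_loop(model: dict,
             diagnose: Callable,    # run benchmarks, return failure cases
             attribute: Callable,   # map failures to named weaknesses
             generate: Callable,    # agents produce weakness-focused data
             reinforce: Callable,   # RL/SFT update on the generated data
             rounds: int = 3) -> dict:
    for _ in range(rounds):
        failures = diagnose(model)
        weaknesses = attribute(failures)
        data = generate(weaknesses)
        model = reinforce(model, data)
    return model

# Toy usage: a model is a dict of per-skill scores; skills diagnosed as
# weak (score < 0.8) receive targeted reinforcement each round.
model = {"ocr": 0.4, "counting": 0.9}
diagnose = lambda m: [s for s, v in m.items() if v < 0.8]
attribute = lambda fails: fails
generate = lambda ws: ws
reinforce = lambda m, data: {s: min(1.0, v + (0.2 if s in data else 0.0))
                             for s, v in m.items()}
trained = dpe_loop(model, diagnose, attribute, generate, reinforce, rounds=2)
```

In the toy run, only the weak skill ("ocr") is reinforced; "counting" is left untouched once it clears the threshold, mirroring how diagnosis concentrates training signal on blind spots.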