Improved Baselines with Visual Instruction Tuning
October 5, 2023
Authors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
cs.AI
Abstract
Large multimodal models (LMM) have recently shown encouraging progress with
visual instruction tuning. In this note, we show that the fully-connected
vision-language cross-modal connector in LLaVA is surprisingly powerful and
data-efficient. With simple modifications to LLaVA, namely, using
CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA
data with simple response formatting prompts, we establish stronger baselines
that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint
uses merely 1.2M publicly available data samples and finishes full training in ~1 day
on a single 8-A100 node. We hope this can make state-of-the-art LMM research
more accessible. Code and model will be publicly available.
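
The cross-modal connector mentioned in the abstract (an MLP projection mapping CLIP-ViT-L-336px features into the language model's embedding space) can be pictured with a minimal PyTorch sketch. This is not the official LLaVA implementation; the two-layer structure, GELU activation, and the feature dimensions below are illustrative assumptions.

```python
# Minimal sketch (assumed, not the official LLaVA code) of a vision-language
# projector: ViT patch features are mapped to "soft tokens" in the LLM space.
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # Assumed two-layer MLP with GELU; dimensions chosen to match a
        # CLIP-ViT-L feature width (1024) and a 13B LLM hidden size (5120).
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the ViT encoder.
        # Returns (batch, num_patches, llm_dim), consumed as image tokens by the LLM.
        return self.proj(vision_features)

# Example: a 336px image with 14px patches yields a 24 x 24 grid = 576 patches.
features = torch.randn(1, 576, 1024)
tokens = VisionLanguageProjector()(features)
print(tokens.shape)  # torch.Size([1, 576, 5120])
```

The point of the sketch is how small the connector is relative to the frozen vision encoder and the LLM: it is a lightweight mapping trained on the image-text data, which is consistent with the abstract's claim that the fully-connected connector is powerful and data-efficient.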