Improved Baselines with Visual Instruction Tuning
October 5, 2023
Authors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
cs.AI
Abstract
Large multimodal models (LMM) have recently shown encouraging progress with
visual instruction tuning. In this note, we show that the fully-connected
vision-language cross-modal connector in LLaVA is surprisingly powerful and
data-efficient. With simple modifications to LLaVA, namely, using
CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA
data with simple response formatting prompts, we establish stronger baselines
that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint
uses merely 1.2M publicly available data samples and finishes full training in ~1 day
on a single 8-A100 node. We hope this can make state-of-the-art LMM research
more accessible. Code and model will be publicly available.
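
The cross-modal connector mentioned in the abstract (an MLP projection mapping CLIP-ViT-L-336px features into the language model's embedding space) can be pictured with a minimal PyTorch sketch. This is not the official LLaVA implementation; the two-layer structure, GELU activation, and the feature dimensions below are illustrative assumptions.

```python
# Minimal sketch (assumed, not the official LLaVA code) of a vision-language
# projector: ViT patch features are mapped to "soft tokens" in the LLM space.
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # Assumed two-layer MLP with GELU; dimensions chosen to match a
        # CLIP-ViT-L feature width (1024) and a 13B LLM hidden size (5120).
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the ViT encoder.
        # Returns (batch, num_patches, llm_dim), consumed as image tokens by the LLM.
        return self.proj(vision_features)

# Example: a 336px image with 14px patches yields a 24 x 24 grid = 576 patches.
features = torch.randn(1, 576, 1024)
tokens = VisionLanguageProjector()(features)
print(tokens.shape)  # torch.Size([1, 576, 5120])
```

The point of the sketch is how small the connector is relative to the frozen vision encoder and the LLM: it is a lightweight mapping trained on the image-text data, which is consistent with the abstract's claim that the fully-connected connector is powerful and data-efficient.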