
Improved Baselines with Visual Instruction Tuning

October 5, 2023
Authors: Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
cs.AI

Abstract

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
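The modification highlighted in the abstract, swapping the single linear projection for an MLP as the vision-language connector, can be illustrated with a minimal PyTorch sketch. This is not the authors' released code; the two-layer GELU MLP and the layer sizes (1024-d CLIP-ViT-L/14-336px patch features, 5120-d hidden size for a 13B LLM) are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Illustrative cross-modal connector: maps frozen CLIP vision features
    into the LLM embedding space with a small MLP instead of a single
    linear layer. Dimensions below are assumed, not taken from the paper text.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # lift vision features to LLM width
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),     # second layer of the assumed 2-layer MLP
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder.
        # The projected tokens would be concatenated with text embeddings
        # before being fed to the language model.
        return self.proj(patch_features)


if __name__ == "__main__":
    # A 336px image with 14px patches yields a 24x24 = 576-token grid.
    connector = MLPProjector()
    dummy_patches = torch.randn(1, 576, 1024)
    visual_tokens = connector(dummy_patches)
    print(visual_tokens.shape)  # torch.Size([1, 576, 5120])
```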