VILA^2: VILA Augmented VILA
July 24, 2024
Authors: Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin
cs.AI
Abstract
Visual language models (VLMs) have progressed rapidly, driven by the success
of large language models (LLMs). While model architectures and training
infrastructures advance quickly, data curation remains under-explored. When
data quantity and quality become a bottleneck, existing work either crawls
more raw data from the Internet with no guarantee of data quality, or
distills from black-box commercial models (e.g., GPT-4V / Gemini), capping
performance at the level of that model. In this work, we introduce a novel
approach that includes a self-augment step and a specialist-augment step to
iteratively improve data quality and model performance. In the self-augment
step, a VLM recaptions its own pretraining data to enhance data quality, and
is then retrained from scratch on this refined dataset to improve model
performance. This process can iterate for several rounds. Once
self-augmentation saturates, we employ several specialist VLMs with
domain-specific expertise, finetuned from the self-augmented VLM, to further
infuse specialist knowledge into the generalist VLM through task-oriented
recaptioning and retraining. With the combined self-augmented and
specialist-augmented training, we introduce VILA^2 (VILA-augmented-VILA), a
VLM family that consistently improves accuracy on a wide range of tasks over
prior art and achieves new state-of-the-art results on the MMMU leaderboard
among open-source models.
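The training recipe the abstract describes can be summarized as a toy loop. The sketch below is purely illustrative: the function names (`train`, `recaption`, `finetune_specialist`) and the scalar "caption quality" dynamics are hypothetical placeholders standing in for full VLM pretraining, not the authors' implementation.

```python
# Toy sketch of the VILA^2 recipe: self-augment until saturation, then
# infuse specialist knowledge via task-oriented recaptioning and retraining.
# All names and the quality-score dynamics are illustrative assumptions.

def train(data):
    """Stand-in for pretraining a VLM from scratch; the returned 'model'
    is just the average quality of its training captions."""
    return sum(s["quality"] for s in data) / len(data)

def recaption(model_quality, sample, boost=0.1):
    """A stronger model rewrites a caption slightly better (capped at 1.0)."""
    improved = min(1.0, max(sample["quality"], model_quality) + boost)
    return {**sample, "quality": improved}

def finetune_specialist(model_quality, domain, gain=0.05):
    """Domain finetuning yields an expert a bit stronger than the generalist."""
    return min(1.0, model_quality + gain)

def vila2_recipe(raw_data, self_rounds=3,
                 domains=("spatial", "grounding", "ocr")):
    data = list(raw_data)
    model = train(data)                        # bootstrap generalist VLM
    for _ in range(self_rounds):               # self-augment loop
        data = [recaption(model, s) for s in data]
        model = train(data)                    # retrain from scratch
    for d in domains:                          # specialist-augment step
        expert = finetune_specialist(model, d)
        data = [recaption(expert, s) for s in data]
    return train(data)                         # final generalist retraining

raw = [{"quality": q} for q in (0.2, 0.4, 0.6)]
print(vila2_recipe(raw))  # higher average caption quality than train(raw)
```

In this simplified model, each self-augment round lifts the dataset toward the current model's level plus a small boost, which naturally saturates; the specialist pass then adds a further bump, mirroring the two-stage structure of the method.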