LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
August 23, 2024
Authors: Jainaveen Sundaram, Ravishankar Iyer
cs.AI
Abstract
Multimodal Large Language Models (MM-LLMs) have seen significant advancements
in the last year, demonstrating impressive performance across tasks. However,
to truly democratize AI, models must exhibit strong capabilities and be able to
run efficiently on small compute footprints accessible to most. As part of this
quest, we introduce LLaVaOLMoBitnet1B - the first Ternary Multimodal LLM
capable of accepting Image(s)+Text inputs to produce coherent textual
responses. The model is fully open-sourced along with training scripts to
encourage further research in this space. This accompanying technical report
highlights the training process, evaluation details, challenges associated with
ternary models, and future opportunities. Link to the model:
https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B
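The abstract does not spell out what "ternary" means in practice. The sketch below illustrates absmean ternary weight quantization in the style of BitNet b1.58, which the model's Bitnet-based LLM backbone presumably follows: full-precision weights are mapped to {-1, 0, +1} plus a single per-tensor scale. The function name and epsilon value are illustrative, not taken from the report.

```python
# Minimal sketch of absmean ternary quantization (BitNet b1.58-style).
# Assumption: this mirrors the scheme behind the model's ternary weights;
# it is not code from the LLaVaOLMoBitnet1B training scripts.
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Map full-precision weights to {-1, 0, +1} plus a per-tensor scale."""
    gamma = w.abs().mean()                       # per-tensor absmean scale
    w_ternary = (w / (gamma + eps)).round().clamp_(-1, 1)
    return w_ternary, gamma

# Example: quantize a random weight matrix and check the reconstruction error.
w = torch.randn(4, 8)
w_q, gamma = ternary_quantize(w)
print(w_q.unique())                              # values drawn from {-1, 0, 1}
print((w - w_q * gamma).abs().mean())            # mean approximation error
```

Because each weight needs only ~1.58 bits rather than 16, and matrix products against {-1, 0, +1} reduce to additions and subtractions, this is what lets such a model target the small compute footprints the abstract describes.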