
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

August 23, 2024
Authors: Jainaveen Sundaram, Ravishankar Iyer
cs.AI

Abstract

Multimodal Large Language Models (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and run efficiently on small compute footprints accessible to most. As part of this quest, we introduce LLaVaOLMoBitnet1B - the first Ternary Multimodal LLM capable of accepting Image(s)+Text inputs to produce coherent textual responses. The model is fully open-sourced along with training scripts to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models, and future opportunities. Link to the model: https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B
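
"Ternary" here means the LLM's weights are constrained to the three values {-1, 0, +1}, following the BitNet line of work. As a rough illustration of that representation only (not the model's actual training code; the function name and tensor shapes below are hypothetical), here is a minimal PyTorch sketch of absmean ternary quantization in the style of BitNet b1.58:

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-6):
    """Quantize a weight tensor to ternary values {-1, 0, +1}.

    Illustrative absmean scheme (BitNet b1.58 style): normalize by the
    mean absolute weight, round to the nearest integer, and clip to
    [-1, 1]. Returns the ternary tensor plus the per-tensor scale used
    to approximately reconstruct the original weights.
    """
    scale = w.abs().mean().clamp(min=eps)          # per-tensor scaling factor
    w_ternary = (w / scale).round().clamp_(-1, 1)  # values in {-1, 0, +1}
    return w_ternary, scale

# Example: quantize a random linear layer's weight matrix
w = torch.randn(256, 256)
w_q, s = absmean_ternary_quantize(w)
w_approx = w_q * s  # dequantized approximation used in a forward pass
print(f"unique values: {w_q.unique().tolist()}, scale: {s.item():.4f}")
```

In BitNet-style training, a quantizer like this is applied to the linear layers during the forward pass (with a straight-through estimator for gradients); the sketch above applies it post hoc purely to show what a ternary weight representation looks like.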
