Ternary LLM Goes Multimodal!

Abstract

Multimodal Large Language Models (MM-LLMs) have advanced significantly over the past year, demonstrating impressive performance across a range of tasks. However, to truly democratize AI, models must both exhibit strong capabilities and run efficiently on the small compute footprints accessible to most people. As part of this quest, we introduce LLaVaOLMoBitnet1B, the first ternary multimodal LLM capable of accepting image(s) + text inputs and producing coherent textual responses. The model is fully open-sourced, along with its training scripts, to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models, and future opportunities. Link to the model: https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B