LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

Abstract

Multimodal Large Language Models (MM-LLMs) have advanced significantly over the past year, demonstrating impressive performance across a range of tasks. However, to truly democratize AI, models must exhibit strong capabilities and also run efficiently on the small compute footprints accessible to most people. As part of this quest, we introduce LLaVaOLMoBitnet1B, the first Ternary Multimodal LLM capable of accepting Image(s)+Text inputs and producing coherent textual responses. The model is fully open-sourced, along with its training scripts, to encourage further research in this space. This accompanying technical report covers the training process, evaluation details, challenges associated with ternary models, and future opportunities. Link to the model: https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B
