LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

Abstract

Multimodal Large Language Models (MM-LLMs) have advanced significantly over the past year, demonstrating impressive performance across a range of tasks. However, to truly democratize AI, models must both exhibit strong capabilities and run efficiently on the small compute footprints accessible to most users. As part of this quest, we introduce LLaVaOLMoBitnet1B, the first ternary multimodal LLM capable of accepting image(s)+text inputs and producing coherent textual responses. The model is fully open-sourced along with its training scripts to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models, and future opportunities. Link to the model: https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B
