
Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

August 22, 2024
作者: Khang T. Doan, Bao G. Huynh, Dung T. Hoang, Thuc D. Pham, Nhat H. Pham, Quan T. M. Nguyen, Bang Q. Vo, Suong N. Hoang
cs.AI

Abstract

In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in a Vietnamese context. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into a variety of on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.

