
Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

August 22, 2024
作者: Khang T. Doan, Bao G. Huynh, Dung T. Hoang, Thuc D. Pham, Nhat H. Pham, Quan T. M. Nguyen, Bang Q. Vo, Suong N. Hoang
cs.AI

Abstract

In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in a Vietnamese context. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.
