MoAI: Mixture of All Intelligence for Large Language and Vision Models

Abstract

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After the outputs of the external CV models are verbalized, MoAI-Compressor aligns and condenses them so that the relevant auxiliary visual information can be used efficiently for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, using the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
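
To make the Mixture of Experts idea mentioned in the abstract more concrete, the following is a minimal illustrative sketch and not the MoAI-Mixer implementation. It assumes each of the three feature types has already been reduced to a small vector, and that a gate produces one score per feature type. The names `softmax` and `mix_experts`, the toy vectors, and the gate scores are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(
    expert_outputs: Sequence[Sequence[float]],
    gate_scores: Sequence[float],
) -> List[float]:
    """Weighted sum of expert outputs, with weights from a softmax gate.

    Hypothetical helper: in a real model the gate scores would be
    produced by a learned layer, not passed in by hand.
    """
    if len(expert_outputs) != len(gate_scores):
        raise ValueError("need exactly one gate score per expert")
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for weight, output in zip(weights, expert_outputs):
        if len(output) != dim:
            raise ValueError("all expert outputs must share one dimension")
        for i, value in enumerate(output):
            mixed[i] += weight * value
    return mixed


# Toy stand-ins for the three feature types named in the abstract.
visual = [1.0, 0.0]
auxiliary = [0.0, 1.0]
language = [1.0, 1.0]

# Equal gate scores give each feature type the same weight (1/3).
mixed = mix_experts([visual, auxiliary, language], [0.0, 0.0, 0.0])
print(mixed)
```

With equal gate scores the result is the plain average of the three vectors. Raising one score shifts the mixture toward that feature type, which is the basic behavior a gating mechanism relies on.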
