Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI

Abstract

We introduce Shakti VLM, a family of vision-language models (VLMs) at the 1B and 4B parameter scales, designed to address data-efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
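
To make the idea of QK-Normalization concrete, the following is a minimal sketch of how normalizing queries and keys before the dot product bounds the attention logits. It is not the Shakti implementation: the abstract does not say which normalization is used, so RMS normalization without a learned gain is assumed here, and the names `rms_normalize` and `attention_weights` are hypothetical.

```python
import math


def rms_normalize(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (assumption: no learned gain)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, qk_norm=True):
    """Attention weights of one query over a list of keys.

    With qk_norm=True, the query and every key are normalized first, so each
    scaled logit is bounded in magnitude by sqrt(d) regardless of how large
    the raw activations are.
    """
    d = len(query)
    if qk_norm:
        query = rms_normalize(query)
        keys = [rms_normalize(k) for k in keys]
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    return softmax(logits)


# Illustrative large-magnitude activations (made-up values, not model data).
query = [40.0, -35.0, 50.0, 45.0]
keys = [
    [30.0, -20.0, 45.0, 50.0],
    [28.0, -25.0, 40.0, 42.0],
    [-10.0, 5.0, 2.0, 1.0],
]

plain = attention_weights(query, keys, qk_norm=False)  # saturates toward one-hot
normed = attention_weights(query, keys, qk_norm=True)  # stays spread over similar keys
print(plain)
print(normed)
```

Without normalization, the raw logits grow with the activation magnitudes and the softmax collapses onto a single key. With normalization, only the direction of each vector matters, which is one plausible reading of the "attention stability" the abstract refers to.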
