PaliGemma 2: A Family of Versatile VLMs for Transfer
December 4, 2024
著者: Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai
cs.AI
Abstract
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM)
based on the Gemma 2 family of language models. We combine the SigLIP-So400m
vision encoder that was also used by PaliGemma with the whole range of Gemma 2
models, from the 2B one all the way up to the 27B model. We train these models
at three resolutions (224px, 448px, and 896px) in multiple stages to equip them
with broad knowledge for transfer via fine-tuning. The resulting family of base
models covering different model sizes and resolutions allows us to investigate
factors impacting transfer performance (such as learning rate) and to analyze
the interplay between the type of task, model size, and resolution. We further
increase the number and breadth of transfer tasks beyond the scope of PaliGemma,
including different OCR-related tasks such as table structure recognition,
molecular structure recognition, music score recognition, as well as long
fine-grained captioning and radiography report generation, on which PaliGemma 2
obtains state-of-the-art results.
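The architecture the abstract describes — a SigLIP-So400m vision encoder whose output tokens are prepended to the text prompt and fed to a Gemma 2 decoder — can be sketched schematically. The following is a toy illustration only, not the actual implementation: the dimensions, the random-projection "encoder", and the embedding table are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real components (SigLIP-So400m + Gemma 2);
# widths and token counts are illustrative, not the real model sizes.
D_MODEL = 64          # shared embedding width after the vision projection
N_IMG_TOKENS = 16     # a fixed number of vision tokens per image

def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for the vision encoder: image -> sequence of embeddings.

    The real model uses a ViT; here a random linear projection of
    flattened patches plays that role.
    """
    patches = image.reshape(N_IMG_TOKENS, -1)
    w = rng.standard_normal((patches.shape[1], D_MODEL)) * 0.02
    return patches @ w  # shape: (N_IMG_TOKENS, D_MODEL)

def embed_text(token_ids: np.ndarray, vocab: int = 256) -> np.ndarray:
    """Stand-in for the language model's token embedding table."""
    table = rng.standard_normal((vocab, D_MODEL)) * 0.02
    return table[token_ids]  # shape: (len(token_ids), D_MODEL)

# The prefix seen by the decoder: image tokens followed by prompt tokens.
# Generation then proceeds autoregressively, conditioned on this prefix.
image = rng.standard_normal((32, 32))   # dummy "image"
prompt_ids = np.array([5, 17, 42])      # dummy prompt token ids
prefix = np.concatenate(
    [vision_encoder(image), embed_text(prompt_ids)], axis=0
)
print(prefix.shape)  # -> (19, 64): 16 image tokens + 3 text tokens
```

Training at three resolutions (224px, 448px, 896px) changes only the number of vision tokens in this prefix; the decoder itself is unchanged, which is what lets one vision encoder pair with the full 2B–27B range of language models.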