Many-Shot In-Context Learning in Multimodal Foundation Models

Abstract

Large language models are well known to be effective at few-shot in-context learning (ICL). Recent advances in multimodal foundation models have enabled unprecedentedly long context windows, creating an opportunity to explore ICL with many more demonstration examples. In this work, we evaluate the performance of multimodal foundation models as ICL scales from few-shot to many-shot. We benchmark GPT-4o and Gemini 1.5 Pro on 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, with up to almost 2,000 multimodal demonstration examples, leads to substantial improvements over few-shot (<100 examples) ICL on all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference cost of the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can improve performance under both zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure the ICL data efficiency of the models, that is, the rate at which they learn from additional demonstration examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL.
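The abstract describes ICL data efficiency as the rate at which performance improves with more demonstration examples, and reports log-linear improvement. One way to make that concrete is to fit a straight line to performance against the logarithm of the number of examples and read off the slope. The sketch below is only an illustration under stated assumptions: the function name `fit_log_linear`, the choice of log10(N + 1) as the x-axis (so that the zero-shot point can be included), and the numbers in the usage example are all hypothetical and are not taken from the paper's results or codebase.

```python
import math


def fit_log_linear(num_shots, scores):
    """Ordinary least-squares fit of score = intercept + slope * log10(N + 1).

    num_shots: number of demonstration examples per run (0 means zero-shot).
    scores:    the metric measured at each setting (e.g. accuracy in [0, 1]).
    Returns (slope, intercept); the slope is the gain per tenfold increase
    in N + 1, used here as a stand-in for "ICL data efficiency".
    """
    if len(num_shots) != len(scores) or len(num_shots) < 2:
        raise ValueError("need at least two (num_shots, score) pairs")

    # Assumption: log10(N + 1) keeps the zero-shot point (N = 0) finite.
    xs = [math.log10(n + 1) for n in num_shots]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)

    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("num_shots must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up values for illustration only, not results from the paper.
    shots = [0, 9, 99, 999]
    accuracy = [0.40, 0.50, 0.60, 0.70]
    slope, intercept = fit_log_linear(shots, accuracy)
    print(f"slope per 10x examples: {slope:.3f}, zero-shot intercept: {intercept:.3f}")
```

Under this reading, two models with similar intercepts (zero-shot performance) can still differ in slope, which is the kind of comparison the abstract draws between GPT-4o and Gemini 1.5 Pro.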
