Many-Shot In-Context Learning in Multimodal Foundation Models
May 16, 2024
作者: Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, Andrew Y. Ng
cs.AI
Abstract
Large language models are well known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstration examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstration examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro's performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure the ICL data efficiency of the models, i.e., the rate at which the models learn from more demonstration examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL.
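To make the prompting setup concrete, below is a minimal sketch of batched many-shot ICL, assuming the google-generativeai Python SDK. The file paths, class names, and the build_prompt helper are illustrative assumptions for this sketch, not code from the ManyICL repository.

```python
# Sketch: many-shot multimodal ICL with batched queries in one API call.
# Assumes the google-generativeai SDK; paths/labels below are hypothetical.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

def build_prompt(demos, queries, class_names):
    """Interleave demonstration images with their labels, then append a
    batch of unlabeled query images to be classified in a single call."""
    parts = [f"Classify each image as one of: {', '.join(class_names)}.\n"]
    for i, (path, label) in enumerate(demos, 1):
        parts.append(Image.open(path))          # demonstration image
        parts.append(f"Example {i} answer: {label}\n")
    for j, path in enumerate(queries, 1):
        parts.append(Image.open(path))          # unlabeled query image
        parts.append(f"Question {j}: what is the class of this image?\n")
    parts.append("Answer each question on its own line as 'Question j: <class>'.")
    return parts

demos = [("demo_0.jpg", "cat"), ("demo_1.jpg", "dog")]  # up to ~2,000 in the paper
queries = [f"query_{k}.jpg" for k in range(50)]          # batches of up to 50 queries
response = model.generate_content(build_prompt(demos, queries, ["cat", "dog"]))
print(response.text)
```

Because the (potentially very long) demonstration block is sent once per call rather than once per query, batching amortizes its token cost across all queries in the batch, which is the per-query cost and latency reduction the abstract reports.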
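The data-efficiency measure can be approximated as the slope of a log-linear fit of performance against the number of demonstration examples, consistent with the log-linear improvement described above. The sketch below shows one plausible computation with made-up accuracy values; it is an assumption for illustration, not the paper's exact procedure.

```python
# Sketch: quantify ICL data efficiency as the slope of accuracy vs. log10(shots).
# The accuracy numbers are fabricated placeholders for illustration only.
import numpy as np

num_demos = np.array([1, 5, 10, 50, 100, 500, 1000, 2000])
accuracy = np.array([0.42, 0.48, 0.52, 0.60, 0.63, 0.70, 0.73, 0.75])

# Linear regression of accuracy on log10(number of demonstrations);
# np.polyfit returns coefficients highest degree first.
slope, intercept = np.polyfit(np.log10(num_demos), accuracy, deg=1)
print(f"ICL data efficiency (accuracy gain per 10x more demos): {slope:.3f}")
```

Under this reading, a model with a steeper slope (such as Gemini 1.5 Pro on most datasets in the abstract) converts additional demonstration examples into accuracy faster than one with a shallower slope.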