Many-Shot In-Context Learning in Multimodal Foundation Models
May 16, 2024
作者: Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, Andrew Y. Ng
cs.AI
Abstract
Large language models are well known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstration examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstration examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro's performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure the ICL data efficiency of the models, i.e., the rate at which the models learn from more demonstration examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL.
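To make the prompting setup concrete, below is a minimal sketch of batched many-shot ICL, assuming the google-generativeai Python SDK. The file paths, class names, and the build_prompt helper are illustrative assumptions for this sketch, not code from the ManyICL repository.

```python
# Sketch: many-shot multimodal ICL with batched queries in one API call.
# Assumes the google-generativeai SDK; paths/labels below are hypothetical.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

def build_prompt(demos, queries, class_names):
    """Interleave demonstration images with their labels, then append a
    batch of unlabeled query images to be classified in a single call."""
    parts = [f"Classify each image as one of: {', '.join(class_names)}.\n"]
    for i, (path, label) in enumerate(demos, 1):
        parts.append(Image.open(path))          # demonstration image
        parts.append(f"Example {i} answer: {label}\n")
    for j, path in enumerate(queries, 1):
        parts.append(Image.open(path))          # unlabeled query image
        parts.append(f"Question {j}: what is the class of this image?\n")
    parts.append("Answer each question on its own line as 'Question j: <class>'.")
    return parts

demos = [("demo_0.jpg", "cat"), ("demo_1.jpg", "dog")]  # up to ~2,000 in the paper
queries = [f"query_{k}.jpg" for k in range(50)]          # batches of up to 50 queries
response = model.generate_content(build_prompt(demos, queries, ["cat", "dog"]))
print(response.text)
```

Because the (potentially very long) demonstration block is sent once per call rather than once per query, batching amortizes its token cost across all queries in the batch, which is the per-query cost and latency reduction the abstract reports.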
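The data-efficiency measure can be approximated as the slope of a log-linear fit of performance against the number of demonstration examples, consistent with the log-linear improvement described above. The sketch below shows one plausible computation with made-up accuracy values; it is an assumption for illustration, not the paper's exact procedure.

```python
# Sketch: quantify ICL data efficiency as the slope of accuracy vs. log10(shots).
# The accuracy numbers are fabricated placeholders for illustration only.
import numpy as np

num_demos = np.array([1, 5, 10, 50, 100, 500, 1000, 2000])
accuracy = np.array([0.42, 0.48, 0.52, 0.60, 0.63, 0.70, 0.73, 0.75])

# Linear regression of accuracy on log10(number of demonstrations);
# np.polyfit returns coefficients highest degree first.
slope, intercept = np.polyfit(np.log10(num_demos), accuracy, deg=1)
print(f"ICL data efficiency (accuracy gain per 10x more demos): {slope:.3f}")
```

Under this reading, a model with a steeper slope (such as Gemini 1.5 Pro on most datasets in the abstract) converts additional demonstration examples into accuracy faster than one with a shallower slope.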