

Many-Shot In-Context Learning in Multimodal Foundation Models

May 16, 2024
作者: Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, Andrew Y. Ng
cs.AI

Abstract

Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstration examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstration examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure ICL data efficiency of the models, or the rate at which the models learn from more demonstration examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL.
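To make the setup concrete, the sketch below shows one way a many-shot multimodal ICL prompt with batched queries could be assembled: a block of demonstration (image, label) pairs followed by several unlabeled query images answered in a single request. This is an illustrative sketch only, not the authors' ManyICL code; the function and placeholder names (build_prompt, the <image:...> markers) are hypothetical stand-ins for however a given API accepts interleaved images and text.

```python
# Hypothetical sketch of many-shot ICL prompt construction with query
# batching. Real multimodal APIs take image payloads, not text markers;
# the <image:...> placeholders here only indicate where images would go.

def build_prompt(demos, queries, class_names):
    """Interleave demonstration (image, label) pairs, then append a
    batch of unlabeled query images to be answered in one call."""
    parts = [f"Classify each image as one of: {', '.join(class_names)}."]
    # Many-shot regime: this list could hold up to ~2,000 demonstrations.
    for i, (image_ref, label) in enumerate(demos, start=1):
        parts.append(f"Example {i}: <image:{image_ref}> Answer: {label}")
    # Batching: amortize the long demo prefix over up to ~50 queries.
    parts.append("Now answer each query on its own line:")
    for j, image_ref in enumerate(queries, start=1):
        parts.append(f"Query {j}: <image:{image_ref}> Answer:")
    return "\n".join(parts)

demos = [("img_001.png", "cat"), ("img_002.png", "dog")]
queries = ["img_101.png", "img_102.png"]
print(build_prompt(demos, queries, ["cat", "dog"]))
```

The batching step is what drives down per-query cost and latency in the paper's experiments: the (long) demonstration prefix is paid for once per API call rather than once per query.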
