Adapting Vision-Language Models for E-commerce Understanding at Scale

February 12, 2026
Authors: Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi
cs.AI

Abstract

E-commerce product understanding by its nature demands strong multimodal comprehension across text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no well-documented strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general-purpose VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
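The abstract does not spell out how dynamic attribute extraction is posed, so the sketch below is only a loose illustration of the general idea: the set of attributes requested varies per query, rather than following a fixed schema. The prompt format and the `query_vlm` stub are assumptions for illustration, not the authors' implementation.

```python
import json

def build_extraction_prompt(product_text: str, attributes: list[str]) -> str:
    """Build a prompt asking a VLM for a per-query set of attributes.

    The attribute list changes from request to request, which is what
    makes the extraction "dynamic" rather than tied to a fixed schema.
    """
    attr_list = ", ".join(attributes)
    return (
        "You are given a product listing (text below, images attached).\n"
        f"Listing: {product_text}\n"
        f"Extract the following attributes: {attr_list}.\n"
        "Answer with a JSON object, using null for attributes not present."
    )

def query_vlm(prompt: str, image_paths: list[str]) -> str:
    """Placeholder for a real VLM call (e.g. a locally served open VLM).

    Returns a canned response here so the sketch stays self-contained.
    """
    return '{"color": "navy blue", "material": "cotton", "sleeve_length": null}'

def extract_attributes(product_text: str, image_paths: list[str],
                       attributes: list[str]) -> dict:
    prompt = build_extraction_prompt(product_text, attributes)
    raw = query_vlm(prompt, image_paths)
    try:
        return json.loads(raw)  # the model output must be valid JSON to count
    except json.JSONDecodeError:
        return {a: None for a in attributes}  # score as a miss, don't crash

if __name__ == "__main__":
    result = extract_attributes(
        "Men's classic crew-neck tee, 100% cotton, navy.",
        ["front.jpg", "back.jpg"],
        ["color", "material", "sleeve_length"],
    )
    print(result)
```

Requiring structured (here, JSON) output also ties the extraction task to the suite's strict-instruction-following dimension: a response that names the right attributes in free-form prose would still fail a format check.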