One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
October 3, 2025
Authors: Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
cs.AI
Abstract
Zero-shot captioners are recently proposed models that utilize common-space
vision-language representations to caption images without relying on paired
image-text data. To caption an image, they proceed by textually decoding a
text-aligned image feature, but they limit their scope to global
representations and whole-image captions. We present a
unified framework for zero-shot captioning that shifts from an image-centric to
a patch-centric paradigm, enabling the captioning of arbitrary regions without
the need for region-level supervision. Instead of relying on global image
representations, we treat individual patches as atomic captioning units and
aggregate them to describe arbitrary regions, from single patches to
non-contiguous areas and entire images. We analyze the key ingredients that
enable current latent captioners to work in our proposed framework.
Experiments demonstrate that backbones producing meaningful, dense visual
features, such as DINO, are key to achieving state-of-the-art performance in
multiple region-based captioning tasks. Compared to other baselines and
state-of-the-art competitors, our models achieve better performance on
zero-shot dense captioning, region-set captioning, and a newly introduced trace captioning task,
highlighting the effectiveness of patch-wise semantic representations for
scalable caption generation. Project page: https://paciosoft.com/Patch-ioner/
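The abstract describes treating individual patches as atomic captioning units and aggregating them to describe arbitrary regions. The sketch below illustrates only the region-to-feature step, under assumptions that are not taken from the paper: the backbone (for example a DINO-style ViT) yields one feature vector per patch on a row-major grid; a region is aggregated by mean pooling its patch features; and a trace is represented as a list of pixel points. The function names `patch_index`, `box_to_patches`, `trace_to_patches` and `region_feature` are hypothetical. In a full pipeline the pooled vector would then be passed to a latent captioner's text decoder, which is not shown here.

```python
from typing import List, Sequence, Set, Tuple

Vector = List[float]


def patch_index(px: float, py: float, grid_w: int, grid_h: int, patch_size: int) -> int:
    """Row-major index of the patch containing pixel (px, py), clamped to the grid."""
    col = min(int(px // patch_size), grid_w - 1)
    row = min(int(py // patch_size), grid_h - 1)
    return row * grid_w + col


def box_to_patches(box: Tuple[int, int, int, int], grid_w: int, grid_h: int,
                   patch_size: int) -> Set[int]:
    """Indices of all patches overlapped by a pixel box (x0, y0, x1, y1), x1/y1 exclusive."""
    x0, y0, x1, y1 = box
    c0, r0 = x0 // patch_size, y0 // patch_size
    c1 = min((x1 - 1) // patch_size, grid_w - 1)
    r1 = min((y1 - 1) // patch_size, grid_h - 1)
    return {r * grid_w + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


def trace_to_patches(points: Sequence[Tuple[float, float]], grid_w: int, grid_h: int,
                     patch_size: int) -> Set[int]:
    """Hypothetical trace handling: the set of patches visited by the trace points."""
    return {patch_index(x, y, grid_w, grid_h, patch_size) for x, y in points}


def region_feature(patch_feats: Sequence[Vector], indices: Set[int]) -> Vector:
    """Mean-pool the selected patch features into one region feature (assumed rule)."""
    if not indices:
        raise ValueError("a region must contain at least one patch")
    dim = len(patch_feats[0])
    pooled = [0.0] * dim
    for i in sorted(indices):
        for d in range(dim):
            pooled[d] += patch_feats[i][d]
    return [v / len(indices) for v in pooled]
```

For example, on a 2×2 grid of 14-pixel patches, `region_feature(feats, {0, 3})` pools two diagonal, non-contiguous patches, a single index reproduces one patch's feature, and passing all four indices gives a whole-image feature. In this sketch mean pooling keeps every aggregated vector in the same feature space as the individual patches, so one decoder could serve every region shape.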