Visual Persuasion: What Influences Decisions of Vision-Language Models?
February 17, 2026
Authors: Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh
cs.AI
Abstract
The web is littered with images once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications (such as in composition, lighting, or background) using an image generation model. We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities (safety concerns that might otherwise be discovered implicitly in the wild), supporting more proactive auditing and governance of image-based AI agents.
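The abstract's propose-edit-evaluate loop can be illustrated with a minimal, hypothetical sketch. The paper's actual method queries a frontier VLM and an image generation model; both are stubbed here with toy stand-ins (an "image" is a dict of visual attributes, and `vlm_choice_prob` is a hand-written scoring stub), so only the revealed-preference control flow is shown. All function names, attributes, and weights below are illustrative assumptions, not the authors' implementation.

```python
import random

# Toy edit space: each edit rewrites one visual attribute of the "image".
EDITS = [
    ("lighting", "studio"), ("lighting", "natural"),
    ("background", "plain"), ("background", "lifestyle"),
    ("composition", "centered"), ("composition", "rule-of-thirds"),
]

def vlm_choice_prob(img_a, img_b):
    """Stub for P(agent picks img_a over img_b). A real system would show
    a VLM both images in a forced-choice prompt; here a fixed scoring rule
    plays that role so the loop is runnable."""
    score = lambda img: (
        1.0 * (img.get("lighting") == "studio")
        + 0.5 * (img.get("background") == "plain")
        + 0.25 * (img.get("composition") == "centered")
    )
    a, b = score(img_a), score(img_b)
    return 0.5 if a == b else (0.9 if a > b else 0.1)

def optimize(baseline, steps=10, seed=0):
    """Greedy visual prompt optimization: propose an edit, 'apply' it, and
    keep it only if it raises the head-to-head choice probability against
    the unedited baseline (the revealed-preference signal)."""
    rng = random.Random(seed)
    current = dict(baseline)
    best_p = vlm_choice_prob(current, baseline)  # starts at 0.5 (a tie)
    for _ in range(steps):
        attr, value = rng.choice(EDITS)          # propose an edit
        candidate = {**current, attr: value}     # apply it (stubbed)
        p = vlm_choice_prob(candidate, baseline)  # evaluate the choice
        if p > best_p:
            current, best_p = candidate, p       # keep improving edits
    return current, best_p

baseline = {"lighting": "natural", "background": "lifestyle",
            "composition": "rule-of-thirds"}
edited, p = optimize(baseline)
```

Because edits are kept only when they strictly improve the head-to-head probability, the loop can never do worse than the 0.5 tie it starts from; swapping the stub for real VLM and image-editor calls changes only the two stubbed functions, not the loop.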