When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making
March 21, 2025
Authors: Zhe Hu, Jing Li, Yu Yin
cs.AI
Abstract
Embodied decision-making is fundamental for AI agents operating in real-world
environments. While Visual Language Models (VLMs) have advanced this
capability, they still struggle with complex decisions, particularly in
human-centered situations that require deep reasoning about human needs and
values. In this study, we systematically evaluate open-sourced VLMs on
multimodal human-centered decision-making tasks. We find that LLMs receiving
only textual descriptions unexpectedly outperform their VLM counterparts of
similar scale that process actual images, suggesting that visual alignment may
hinder VLM abilities. To address this challenge, we propose a novel text-only
training approach with synthesized textual data. This method strengthens VLMs'
language components and transfers the learned abilities to multimodal
inference, eliminating the need for expensive image-text paired data.
Furthermore, we show that VLMs can achieve substantial performance gains
through self-improvement, using training data generated by their LLM
counterparts rather than relying on larger teacher models like GPT-4. Our
findings establish a more efficient and scalable approach to enhancing VLMs'
human-centered decision-making capabilities, opening new avenues for optimizing
VLMs through self-improvement mechanisms.
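
The abstract describes a text-only training recipe: synthetic textual decision-making data is produced by the VLM's LLM counterpart, and only the VLM's language component is updated, so no image-text pairs are needed. Below is a minimal sketch of that idea, assuming a LLaVA-style model served through Hugging Face Transformers; the checkpoint name, the example data, the freeze-by-name strategy, and the single-step training loop are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of text-only training for a VLM's language component.
# Assumptions: LLaVA-style checkpoint on Hugging Face Transformers; the data
# format and training loop are simplified for illustration.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint, not from the paper
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# Freeze the visual pathway so gradients only flow through the language model,
# mirroring the idea of strengthening the VLM's language component.
for name, param in model.named_parameters():
    if "vision_tower" in name or "multi_modal_projector" in name:
        param.requires_grad = False

# Hypothetical text-only sample: a scene description plus a human-centered
# decision rationale, of the kind an LLM counterpart could synthesize.
example = (
    "USER: An elderly person is reaching for a heavy box on a high shelf. "
    "What should a household assistant do, and why? "
    "ASSISTANT: Offer to retrieve the box first, since overhead reaching "
    "poses a fall risk, then place it at waist height for easy access."
)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)

model.train()
inputs = processor.tokenizer(example, return_tensors="pt")
# No pixel_values are passed: the forward pass exercises only the language
# model. Labels over the full sequence are a simplification; in practice the
# prompt tokens would typically be masked out of the loss.
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    labels=inputs["input_ids"],
)
outputs.loss.backward()
optimizer.step()
```

At inference time the same model is used multimodally (images plus text), the premise being that abilities learned from text-only data transfer to multimodal reasoning.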