

When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

March 21, 2025
Authors: Zhe Hu, Jing Li, Yu Yin
cs.AI

Abstract

Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-sourced VLMs on multimodal human-centered decision-making tasks. We find that LLMs receiving only textual descriptions unexpectedly outperform their VLM counterparts of similar scale that process actual images, suggesting that visual alignment may hinder VLM abilities. To address this challenge, we propose a novel text-only training approach with synthesized textual data. This method strengthens VLMs' language components and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models like GPT-4. Our findings establish a more efficient and scalable approach to enhancing VLMs' human-centered decision-making capabilities, opening new avenues for optimizing VLMs through self-improvement mechanisms.
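The core recipe described above, freezing a VLM's visual pathway and fine-tuning only its language component on synthesized text-only data so the gains carry over to multimodal inference, can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes a LLaVA-style checkpoint in Hugging Face transformers whose submodules are exposed as vision_tower, multi_modal_projector, and language_model, and it uses a toy in-memory example in place of the paper's synthesized training data.

```python
# Minimal sketch (not the paper's code): freeze the VLM's vision encoder and
# projector, then fine-tune only its language component on text-only data.
# Module names (vision_tower, multi_modal_projector, language_model) are
# assumptions based on LLaVA-style checkpoints in Hugging Face transformers.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example base VLM, not from the paper
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Freeze the visual pathway; only the language model receives gradients.
for module in (model.vision_tower, model.multi_modal_projector):
    for p in module.parameters():
        p.requires_grad = False

# Toy stand-in for the synthesized text-only training data. In the paper,
# such data is generated by the VLM's own LLM counterpart (self-improvement)
# rather than by a larger teacher model such as GPT-4.
texts = [
    "Situation: An elderly person is struggling to carry groceries up icy stairs.\n"
    "Decision: Offer help, carry the bags, and walk alongside them for safety.",
]

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

model.train()
for text in texts:
    batch = processor.tokenizer(text, return_tensors="pt")
    # Text-only forward pass: no pixel_values are supplied, so only the
    # language component is exercised and updated.
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["input_ids"],
    )
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference time the full multimodal model, including the frozen vision modules, is used; this is how the abilities learned from text-only training are expected to transfer to multimodal decision-making without any image-text paired training data.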
