When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

March 21, 2025
Authors: Zhe Hu, Jing Li, Yu Yin
cs.AI

Abstract

Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-sourced VLMs on multimodal human-centered decision-making tasks. We find that LLMs receiving only textual descriptions unexpectedly outperform their VLM counterparts of similar scale that process actual images, suggesting that visual alignment may hinder VLM abilities. To address this challenge, we propose a novel text-only training approach with synthesized textual data. This method strengthens VLMs' language components and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models like GPT-4. Our findings establish a more efficient and scalable approach to enhancing VLMs' human-centered decision-making capabilities, opening new avenues for optimizing VLMs through self-improvement mechanisms.
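To make the recipe concrete, below is a minimal sketch of the text-only self-improvement loop described in the abstract, assuming a LLaVA-style VLM whose language component is a standalone LLM and using Hugging Face Transformers. The model name (lmsys/vicuna-7b-v1.5), prompt, and training loop are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of text-only self-improvement for a VLM's language tower.
# All names, prompts, and hyperparameters here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LLM_NAME = "lmsys/vicuna-7b-v1.5"  # assumed LLM counterpart of the VLM's language tower

tokenizer = AutoTokenizer.from_pretrained(LLM_NAME)
llm = AutoModelForCausalLM.from_pretrained(
    LLM_NAME, torch_dtype=torch.float16, device_map="auto"
)


def synthesize_text_example(scene_description: str) -> str:
    """Step 1: the LLM counterpart generates a decision and rationale from a
    purely textual scene description (no images, no larger teacher model)."""
    prompt = (
        "You are an embodied agent assisting people.\n"
        f"Scene: {scene_description}\n"
        "Choose the most appropriate action, considering the humans' needs "
        "and values, and briefly justify it.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    output = llm.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    completion = output[0][inputs["input_ids"].shape[1]:]
    return prompt + tokenizer.decode(completion, skip_special_tokens=True)


def text_only_sft_step(language_model, tok, example: str, optimizer) -> float:
    """Step 2: one text-only fine-tuning step on the VLM's language component
    (plain causal-LM loss; in practice LoRA/PEFT and the VLM's chat template
    would likely be used). The updated language tower is then plugged back into
    the VLM for ordinary multimodal inference -- no image-text pairs required."""
    batch = tok(example, return_tensors="pt", truncation=True, max_length=1024).to(
        language_model.device
    )
    loss = language_model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In this setup the same model family both produces and consumes the synthetic data, which is what makes the loop self-improvement rather than distillation from a larger teacher such as GPT-4.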
