When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

Abstract

Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-source VLMs on multimodal human-centered decision-making tasks. We find that large language models (LLMs) receiving only textual descriptions unexpectedly outperform VLMs of similar scale that process the actual images, suggesting that visual alignment may hinder VLM abilities. To address this challenge, we propose a novel text-only training approach using synthesized textual data. This method strengthens the language components of VLMs and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models such as GPT-4. Our findings establish a more efficient and scalable approach to enhancing the human-centered decision-making capabilities of VLMs, opening new avenues for optimizing VLMs through self-improvement mechanisms.
