SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Abstract

Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. Current methods train a dedicated network end-to-end on labeled image data, which limits their generalizability and interpretability. To address these issues, we first present SocialGPT, a simple yet well-crafted modular framework that combines the perception capability of Vision Foundation Models (VFMs) with the reasoning capability of Large Language Models (LLMs), providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then use LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and to bridge the gap between them. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, since LLMs can generate language-based explanations for their decisions. However, manually designing prompts for the LLMs in the reasoning phase is tedious, so an automated prompt optimization method is desirable. Because we essentially convert a visual classification task into a generative task for LLMs, automatic prompt optimization faces a distinctive long-prompt optimization problem. To address it, we further propose Greedy Segment Prompt Optimization (GSPO), which performs a greedy search using gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and that our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.
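
To make the idea of a segment-level greedy search concrete, the following is a minimal toy sketch and not the authors' GSPO implementation. It assumes the long prompt is already split into segments and that a few candidate rewrites exist for each segment. Where the abstract describes gradient information guiding the search, this sketch swaps in a plain black-box score. The names `greedy_segment_search`, `toy_score`, and all example strings are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def greedy_segment_search(
    segments: List[str],
    candidates: Dict[int, Sequence[str]],
    score: Callable[[str], float],
    rounds: int = 2,
) -> List[str]:
    """Greedily improve a long prompt one segment at a time.

    Assumption: `score` is a black-box quality measure (higher is better).
    It stands in for the gradient-guided selection mentioned in the abstract.
    """
    best = list(segments)
    best_score = score(" ".join(best))
    for _ in range(rounds):
        improved = False
        for idx, options in candidates.items():
            for option in options:
                # Swap in one candidate for one segment, keep the rest fixed.
                trial = best[:idx] + [option] + best[idx + 1:]
                trial_score = score(" ".join(trial))
                if trial_score > best_score:
                    best, best_score, improved = trial, trial_score, True
        if not improved:
            break  # No segment swap helped in this round.
    return best


def toy_score(prompt: str) -> float:
    """Hypothetical scorer: counts how many helpful phrases the prompt has."""
    keywords = ("social story", "step by step", "one category")
    return float(sum(k in prompt for k in keywords))


segments = ["You are given a description.", "Answer the question.", "Pick a label."]
candidates = {
    0: ["You are given a social story about an image."],
    1: ["Reason step by step.", "Answer quickly."],
    2: ["Choose exactly one category."],
}
result = greedy_segment_search(segments, candidates, toy_score)
```

In a realistic setting the score would presumably come from validation accuracy or a model loss on labeled examples. Editing whole segments rather than individual tokens keeps the search space manageable for long prompts.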
