Grounding Language in Multi-Perspective Referential Communication

Abstract

We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners. We find that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9% to 69.3% in communicative success and even outperforming the strongest proprietary model.
