Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

Abstract

Understanding and answering questions grounded in a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks because gesture-rich data is scarce and because they have limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Building on it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints, obtained with an off-the-shelf reconstruction model, and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms other models across different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy averaged over 6 tasks, surpassing the previous state of the art, InternVL3-14B, by 6.6%. To facilitate open research, we will release the code, model, and dataset. Project page: https://yuuraa.github.io/papers/choi2026egovqa
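The idea of interleaving hand-derived tokens with the model input can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how HINT encodes keypoints, so the joint indices (`INDEX_BASE`, `INDEX_TIP`), the `Token` type, the feature layout (fingertip position plus a unit pointing direction), and the one-token-per-frame scheme are all assumptions made for illustration. In a real model, these raw features would presumably pass through a learned projection into the embedding space of the language model.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]

# Hypothetical joint indices in a 21-joint hand layout (an assumption,
# not taken from the paper).
INDEX_BASE = 5  # base of the index finger
INDEX_TIP = 8   # tip of the index finger


@dataclass
class Token:
    kind: str             # "frame" or "hand"
    frame_index: int      # which video frame the token belongs to
    features: List[float]


def pointing_features(keypoints: Sequence[Point3D]) -> List[float]:
    """Return the fingertip position followed by the unit base-to-tip direction."""
    base, tip = keypoints[INDEX_BASE], keypoints[INDEX_TIP]
    direction = [t - b for t, b in zip(tip, base)]
    # Fall back to 1.0 so a degenerate (zero-length) finger does not divide by zero.
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    return list(tip) + [d / norm for d in direction]


def interleave(
    frame_features: Sequence[List[float]],
    hands: Sequence[Optional[Sequence[Point3D]]],
) -> List[Token]:
    """Place a hand token directly after the frame token it belongs to."""
    tokens: List[Token] = []
    for i, (frame, hand) in enumerate(zip(frame_features, hands)):
        tokens.append(Token("frame", i, frame))
        if hand is not None:  # frames without a detected hand get no hand token
            tokens.append(Token("hand", i, pointing_features(hand)))
    return tokens
```

Keeping each hand token next to its frame preserves the temporal alignment between the gesture and the visual content, which is one plausible reading of the "explicit spatial and temporal context" described above.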
