Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
March 13, 2026
Authors: Yura Choi, Roy Miles, Rolandos Alexandros Potamias, Ismail Elezi, Jiankang Deng, Stefanos Zafeiriou
cs.AI
Abstract
Understanding and answering questions grounded in a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the scarcity of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering comprising 4,000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Building on it, we further propose Hand Intent Tokens (HINT), which derives gesture tokens from 3D hand keypoints reconstructed by an off-the-shelf model and interleaves them with the model input, providing explicit spatial and temporal context for interpreting pointing intent. We show that our approach outperforms existing models across different backbones and model sizes. In particular, HINT-14B achieves an average accuracy of 68.1% over six tasks, surpassing the state-of-the-art InternVL3-14B by 6.6%. To facilitate open research, we will release the code, model, and dataset. Project page: https://yuuraa.github.io/papers/choi2026egovqa
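The abstract does not specify how HINT tokenizes the reconstructed hand keypoints or where exactly the tokens are inserted. The following is a minimal sketch of the interleaving idea only, assuming per-frame 3D hand keypoints (e.g., 21 joints from an off-the-shelf reconstructor) are linearly projected into the LLM embedding space and appended after each frame's visual tokens; all names, dimensions, and the projection scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class HandIntentTokenizer(nn.Module):
    """Illustrative sketch: project per-frame 3D hand keypoints into a few
    LLM-embedding-space tokens. Names and sizes are assumptions, not the
    paper's actual HINT design."""

    def __init__(self, num_joints: int = 21, embed_dim: int = 4096, tokens_per_frame: int = 2):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        self.embed_dim = embed_dim
        # Flattened (x, y, z) keypoints -> a small number of "hand intent" tokens per frame.
        self.proj = nn.Linear(num_joints * 3, tokens_per_frame * embed_dim)

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        # keypoints: (num_frames, num_joints, 3), assumed to come from an
        # off-the-shelf 3D hand reconstruction model (hypothetical upstream step).
        num_frames = keypoints.shape[0]
        tokens = self.proj(keypoints.flatten(1))  # (F, tokens_per_frame * embed_dim)
        return tokens.view(num_frames, self.tokens_per_frame, self.embed_dim)


def interleave_with_frames(frame_tokens: torch.Tensor, hand_tokens: torch.Tensor) -> torch.Tensor:
    """Interleave hand-intent tokens after each frame's visual tokens,
    producing one flat token sequence for the multimodal LLM."""
    # frame_tokens: (F, V, D) visual tokens per frame; hand_tokens: (F, T, D).
    return torch.cat([frame_tokens, hand_tokens], dim=1).flatten(0, 1)


if __name__ == "__main__":
    frames, joints = 8, 21
    keypoints = torch.randn(frames, joints, 3)     # placeholder for reconstructed hand keypoints
    frame_tokens = torch.randn(frames, 64, 4096)   # placeholder for per-frame visual tokens
    hand_tokens = HandIntentTokenizer()(keypoints)
    sequence = interleave_with_frames(frame_tokens, hand_tokens)
    print(sequence.shape)  # torch.Size([528, 4096]) == 8 * (64 + 2) tokens
```

The key design point the sketch tries to convey is that the gesture signal enters the model as ordinary tokens placed next to the corresponding frame's visual tokens, so the language model receives explicit, temporally aligned pointing context rather than having to infer it from pixels alone.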