Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
March 13, 2026
Authors: Yura Choi, Roy Miles, Rolandos Alexandros Potamias, Ismail Elezi, Jiankang Deng, Stefanos Zafeiriou
cs.AI
Abstract
Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich training data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos spanning multiple deictic reasoning tasks. Building on it, we further propose Hand Intent Tokens (HINT), which derives tokens from 3D hand keypoints reconstructed by an off-the-shelf model and interleaves them with the model input, providing explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms competing models across different backbones and model sizes. In particular, HINT-14B achieves an average accuracy of 68.1% over six tasks, surpassing the state-of-the-art InternVL3-14B by 6.6%. To facilitate open research, we will release the code, model, and dataset. Project page: https://yuuraa.github.io/papers/choi2026egovqa
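To make the token-interleaving idea concrete, the following is a minimal illustrative sketch of how per-frame 3D hand keypoints (e.g., from an off-the-shelf reconstruction model) could be projected into a few "hand intent" tokens and interleaved with each frame's visual tokens before reaching the language model. All module names, tensor shapes, and the exact interleaving order are assumptions made for illustration; this is not the authors' released implementation.

```python
# Minimal sketch of hand-intent-token interleaving, assuming a generic
# vision-token + LLM pipeline. Shapes and names are illustrative only.
import torch
import torch.nn as nn


class HandIntentTokenizer(nn.Module):
    def __init__(self, num_keypoints: int = 21, embed_dim: int = 1024, tokens_per_frame: int = 2):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        self.embed_dim = embed_dim
        # Flattened (x, y, z) keypoints -> a small number of tokens per frame.
        self.proj = nn.Linear(num_keypoints * 3, tokens_per_frame * embed_dim)

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        # keypoints: (T, K, 3) per-frame 3D hand keypoints.
        t = keypoints.shape[0]
        return self.proj(keypoints.flatten(1)).view(t, self.tokens_per_frame, self.embed_dim)


def interleave(frame_tokens: torch.Tensor, hand_tokens: torch.Tensor) -> torch.Tensor:
    # frame_tokens: (T, N, D) visual tokens; hand_tokens: (T, H, D).
    # Append each frame's hand tokens to its visual tokens, then flatten over
    # time, yielding [frame_1 tokens, hand_1 tokens, frame_2 tokens, ...].
    return torch.cat([frame_tokens, hand_tokens], dim=1).flatten(0, 1)


if __name__ == "__main__":
    T, N, K, D = 8, 64, 21, 1024            # frames, visual tokens/frame, keypoints, embed dim
    frame_tokens = torch.randn(T, N, D)      # stand-in for vision-encoder output
    keypoints = torch.randn(T, K, 3)         # stand-in for reconstructed 3D keypoints
    hand_tokens = HandIntentTokenizer(K, D)(keypoints)
    sequence = interleave(frame_tokens, hand_tokens)
    print(sequence.shape)                    # (T * (N + 2), D)
```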