GazeGen: Gaze-Driven User Interaction for Visual Content Generation
November 7, 2024
Authors: He-Yen Hsieh, Ziyun Li, Sai Qian Zhang, Wei-Te Mark Ting, Kao-Den Chang, Barbara De Salvo, Chiao Liu, H. T. Kung
cs.AI
Abstract
We present GazeGen, a user interaction system that generates visual content
(images and videos) for locations indicated by the user's eye gaze. GazeGen
allows intuitive manipulation of visual content by targeting regions of
interest with gaze. Using advanced techniques in object detection and
generative AI, GazeGen performs gaze-controlled addition, deletion, and
repositioning of image objects, changes their surface materials, and converts
static images into videos. Central to GazeGen is the DFT Gaze (Distilled and
Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters
that performs accurate, real-time gaze prediction tailored to individual users'
eyes on small edge devices. GazeGen is the first system to combine visual
content generation with real-time gaze estimation, made possible exclusively by
DFT Gaze. This real-time gaze estimation enables various visual content
generation tasks, all controlled by the user's gaze. The input for DFT Gaze is
the user's eye images, while the inputs for visual content generation are the
user's view and the predicted gaze point from DFT Gaze. To achieve efficient
gaze predictions, we derive the small model from a large model (10x larger) via
novel knowledge distillation and personal adaptation techniques. We integrate
knowledge distillation with a masked autoencoder, developing a compact yet
powerful gaze estimation model. This model is further fine-tuned with Adapters,
enabling highly accurate and personalized gaze predictions with minimal user
input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a
wide range of gaze-driven tasks. We validate DFT Gaze on the AEA and
OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency
on an edge device (Raspberry Pi 4). Furthermore, we describe
applications of GazeGen, illustrating its versatility and effectiveness in
various usage scenarios.
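
Below is a minimal, illustrative PyTorch sketch of the two techniques the abstract names: distilling a compact student gaze network from a larger teacher, and personalizing it with small residual Adapters. This is not the authors' DFT Gaze implementation; the module names, layer sizes, and the feature-matching distillation loss (used here in place of the paper's masked-autoencoder-based distillation) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Residual bottleneck adapter used to personalize a frozen model per user (illustrative)."""

    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual form: the adapter only learns a small per-user correction.
        return x + self.up(F.relu(self.down(x)))


class StudentGazeNet(nn.Module):
    """Compact student (hypothetical): grayscale eye crop -> feature -> 2-D gaze (yaw, pitch)."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.adapter = Adapter(feat_dim)  # fine-tuned with a few user samples
        self.head = nn.Linear(feat_dim, 2)

    def forward(self, eye_img):
        feat = self.adapter(self.backbone(eye_img))
        return self.head(feat), feat


def distillation_loss(student_gaze, student_feat, teacher_feat, gt_gaze, alpha=0.5):
    """Gaze regression plus feature matching against a frozen, larger teacher."""
    task = F.l1_loss(student_gaze, gt_gaze)           # supervised gaze error
    distill = F.mse_loss(student_feat, teacher_feat)  # mimic the teacher's features
    return task + alpha * distill


# Toy usage: teacher features are assumed precomputed and projected to feat_dim.
model = StudentGazeNet()
eye_batch = torch.randn(8, 1, 64, 64)
teacher_feat = torch.randn(8, 64)
gaze_gt = torch.randn(8, 2)
gaze_pred, feat = model(eye_batch)
loss = distillation_loss(gaze_pred, feat, teacher_feat, gaze_gt)
loss.backward()
```

For personalization, the backbone and head would be frozen and only the adapter parameters updated on a handful of calibration samples from the target user, mirroring the minimal-user-input adaptation the abstract describes.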