GazeGen: Gaze-Driven User Interaction for Visual Content Generation

Abstract

We present GazeGen, a user interaction system that generates visual content (images and videos) at locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled addition and deletion, repositioning, and surface-material changes of image objects, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters that performs accurate, real-time gaze prediction tailored to an individual user's eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, which is made possible exclusively by DFT Gaze. This real-time gaze estimation enables a variety of visual content generation tasks, all controlled by the user's gaze. The input to DFT Gaze is the user's eye images, while the inputs to visual content generation are the user's view and the gaze point predicted by DFT Gaze. To achieve efficient gaze prediction, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder to develop a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze provides low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on the AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on an edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.
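The abstract reports gaze accuracy as angular gaze error. As a minimal sketch of how such a metric is commonly computed, the following assumes that predicted and ground-truth gaze are given as 3D direction vectors; the function names are hypothetical, and the exact evaluation protocol used for AEA and OpenEDS2020 is not stated here.

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Assumption: gaze is represented as a 3D direction vector; the
    vectors do not need to be unit length.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp so floating-point rounding cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(preds, targets):
    """Average angular error over paired predictions and ground truths."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Clamping the cosine before calling `math.acos` guards against values marginally outside [-1, 1], which can occur when the two vectors are nearly parallel.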
