Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Abstract

We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning about both the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle
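To make the idea of a person-specific positional prompt concrete, the following is a minimal sketch and not the released implementation. It assumes that the frozen encoder yields a grid of patch tokens, that the person is given as a head bounding box in normalized image coordinates, and that the prompt is a single embedding vector added to the tokens overlapping that box. The abstract does not state these details; the function names head_box_to_patch_mask and apply_head_prompt are hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (xmin, ymin, xmax, ymax), normalized to [0, 1].
Box = Tuple[float, float, float, float]
TokenGrid = List[List[List[float]]]  # grid_h x grid_w x feature_dim


def head_box_to_patch_mask(box: Box, grid_h: int, grid_w: int) -> List[List[int]]:
    """Mark each patch of a grid_h x grid_w token grid that overlaps the head box."""
    xmin, ymin, xmax, ymax = box
    mask = []
    for row in range(grid_h):
        # Vertical extent of this row of patches in normalized coordinates.
        top, bottom = row / grid_h, (row + 1) / grid_h
        mask_row = []
        for col in range(grid_w):
            left, right = col / grid_w, (col + 1) / grid_w
            overlaps = left < xmax and right > xmin and top < ymax and bottom > ymin
            mask_row.append(1 if overlaps else 0)
        mask.append(mask_row)
    return mask


def apply_head_prompt(
    scene_tokens: TokenGrid, mask: List[List[int]], head_embedding: List[float]
) -> TokenGrid:
    """Add head_embedding to tokens where mask is 1; other tokens are copied unchanged."""
    prompted = []
    for token_row, mask_row in zip(scene_tokens, mask):
        prompted.append(
            [
                [t + m * e for t, e in zip(token, head_embedding)]
                for token, m in zip(token_row, mask_row)
            ]
        )
    return prompted


# Toy usage: a 4 x 4 grid of 2-dimensional zero tokens standing in for encoder output.
tokens = [[[0.0, 0.0] for _ in range(4)] for _ in range(4)]
mask = head_box_to_patch_mask((0.25, 0.25, 0.5, 0.5), grid_h=4, grid_w=4)
prompted = apply_head_prompt(tokens, mask, head_embedding=[1.0, 2.0])
```

Under these assumptions the scene features are computed once per image, and only the inexpensive prompting step and the lightweight decoder depend on which person is being queried.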
