Creating General User Models from Computer Use

Abstract

Human-computer interaction has long imagined technology that understands us, from our preferences and habits to the timing and purpose of our everyday actions. Yet current user models remain fragmented, narrowly tailored to specific apps, and incapable of the flexible reasoning required to fulfill these visions. This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture that user's knowledge and preferences. GUMs can infer from messages with a friend that a user is preparing for a wedding they're attending, or recognize that a user is struggling with a collaborator's feedback on a draft by observing multiple stalled edits and a switch to reading related work. GUMs introduce an architecture that infers new propositions about a user from multimodal observations, retrieves related propositions for context, and continuously revises existing propositions. To illustrate the breadth of applications that GUMs enable, we demonstrate how they augment chat-based assistants with context, manage OS notifications to selectively surface important information, and enable interactive agents that adapt to preferences across apps. We also instantiate proactive assistants (GUMBOs) that discover and execute useful suggestions on a user's behalf using their GUM. In our evaluations, we find that GUMs make calibrated and accurate inferences about users, and that assistants built on GUMs proactively identify and perform actions that users wouldn't think to request explicitly. Altogether, GUMs introduce methods that leverage multimodal models to understand unstructured context, enabling long-standing visions of HCI and entirely new interactive systems that anticipate user needs.
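
The abstract names three operations on confidence-weighted propositions: inferring new ones, retrieving related ones, and revising existing ones. The toy sketch below illustrates only that data flow and is not the paper's implementation. The names `Proposition`, `ToyUserModel`, `propose`, `retrieve`, and `revise`, the word-overlap retrieval, and the fixed-rate confidence update are all assumptions made for illustration; in a GUM, a multimodal model would presumably perform these steps.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """A natural-language claim about the user with a confidence in [0, 1]."""
    text: str
    confidence: float
    evidence: list[str] = field(default_factory=list)


def _tokens(text: str) -> set[str]:
    # Crude tokenizer (assumption): lowercase words longer than 3 characters.
    return {w.strip(".,!?").lower() for w in text.split() if len(w) > 3}


class ToyUserModel:
    """Hypothetical stand-in for a GUM: propose, retrieve, revise."""

    def __init__(self) -> None:
        self.propositions: list[Proposition] = []

    def propose(self, text: str, confidence: float, observation: str) -> Proposition:
        # Here the caller supplies the inference directly; no model is involved.
        prop = Proposition(text, confidence, [observation])
        self.propositions.append(prop)
        return prop

    def retrieve(self, query: str, k: int = 3) -> list[Proposition]:
        # Rank stored propositions by word overlap with the query,
        # breaking ties by confidence; drop propositions with no overlap.
        q = _tokens(query)
        scored = [(len(q & _tokens(p.text)), p) for p in self.propositions]
        scored = [(s, p) for s, p in scored if s > 0]
        scored.sort(key=lambda sp: (sp[0], sp[1].confidence), reverse=True)
        return [p for _, p in scored[:k]]

    def revise(self, prop: Proposition, observation: str,
               supports: bool, rate: float = 0.3) -> None:
        # Move confidence toward 1 (support) or 0 (contradiction) by `rate`.
        target = 1.0 if supports else 0.0
        prop.confidence += rate * (target - prop.confidence)
        prop.evidence.append(observation)


gum = ToyUserModel()
wedding = gum.propose("User is preparing for a wedding they are attending",
                      0.6, "messages with a friend")
gum.propose("User is struggling with collaborator feedback on a draft",
            0.5, "multiple stalled edits")
gum.revise(wedding, "later messages about the same event", supports=True)
for p in gum.retrieve("wedding gift ideas"):
    print(f"{p.confidence:.2f}  {p.text}")
```

The update rule is a simple exponential step chosen for brevity; it says nothing about how the paper calibrates confidence.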
