Towards a Personal Health Large Language Model

Abstract

In health, most large language model (LLM) research has focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into such tasks, provide rich, longitudinal data for personal health monitoring. Here we present Personal Health Large Language Model (PH-LLM), fine-tuned from Gemini for understanding and reasoning over numerical time-series personal health data. We created and curated three datasets that test 1) production of personalized insights and recommendations from sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep outcomes. For the first task, we designed 857 case studies in collaboration with domain experts to assess real-world scenarios in sleep and fitness. Through comprehensive evaluation of domain-specific rubrics, we observed that Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness. While experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. We evaluated PH-LLM domain knowledge using multiple-choice sleep medicine and fitness examinations. PH-LLM achieved 79% on sleep and 88% on fitness, exceeding average scores from a sample of human experts. Finally, we trained PH-LLM to predict self-reported sleep quality outcomes from textual and multimodal encoding representations of wearable data, and demonstrated that multimodal encoding is required to match the performance of specialized discriminative models. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications, as done with PH-LLM.
