
Towards a Personal Health Large Language Model

June 10, 2024
作者: Justin Cosentino, Anastasiya Belyaeva, Xin Liu, Nicholas A. Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G. Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K. Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhotra, Leor Stern, Yossi Matias, Greg S. Corrado, Shwetak Patel, Shravya Shetty, Jiening Zhan, Shruthi Prabhakara, Daniel McDuff, Cory Y. McLean
cs.AI

Abstract

In health, most large language model (LLM) research has focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into such tasks, provide rich, longitudinal data for personal health monitoring. Here we present Personal Health Large Language Model (PH-LLM), fine-tuned from Gemini for understanding and reasoning over numerical time-series personal health data. We created and curated three datasets that test 1) production of personalized insights and recommendations from sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep outcomes. For the first task we designed 857 case studies in collaboration with domain experts to assess real-world scenarios in sleep and fitness. Through comprehensive evaluation of domain-specific rubrics, we observed that Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. We evaluated PH-LLM domain knowledge using multiple choice sleep medicine and fitness examinations. PH-LLM achieved 79% on sleep and 88% on fitness, exceeding average scores from a sample of human experts. Finally, we trained PH-LLM to predict self-reported sleep quality outcomes from textual and multimodal encoding representations of wearable data, and demonstrate that multimodal encoding is required to match performance of specialized discriminative models. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM.
