Towards a Personal Health Large Language Model

Abstract

In health, most large language model (LLM) research has focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into such tasks, provide rich, longitudinal data for personal health monitoring. Here we present the Personal Health Large Language Model (PH-LLM), fine-tuned from Gemini for understanding and reasoning over numerical time-series personal health data. We created and curated three datasets that test 1) production of personalized insights and recommendations from sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep outcomes. For the first task, we designed 857 case studies in collaboration with domain experts to assess real-world scenarios in sleep and fitness. Through comprehensive evaluation with domain-specific rubrics, we observed that Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness. Experts remain superior for sleep, but fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. We evaluated PH-LLM's domain knowledge using multiple-choice sleep medicine and fitness examinations. PH-LLM achieved 79% on sleep and 88% on fitness, exceeding average scores from a sample of human experts. Finally, we trained PH-LLM to predict self-reported sleep quality outcomes from textual and multimodal encoding representations of wearable data, and demonstrated that multimodal encoding is required to match the performance of specialized discriminative models. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications, as done with PH-LLM.
