Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation

December 23, 2025
Authors: Nilesh Jain, Seyi Adeyinka, Leor Roseman, Aza Allsop
cs.AI

Abstract

Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield only moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa (κ) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As a proof of concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results show that Gemini achieves the highest reliability (κ = 0.907, cosine = 95.3%), followed by GPT-4o (κ = 0.853, cosine = 92.6%) and Claude (κ = 0.842, cosine = 92.1%). All three models achieve high agreement (κ > 0.80), supporting the multi-run ensemble approach. The framework extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude identifying 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
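To make the two reliability metrics and the consensus step concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes each run has been reduced to a label vector over a shared set of coding units (for Cohen's kappa), that theme embeddings are already available as plain numeric vectors (for cosine similarity), and that per-model scores are the mean over all pairs of runs. The function names `cohen_kappa`, `cosine_similarity`, `mean_pairwise`, and `consensus_themes` are hypothetical, as is the 50% consensus threshold.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: share of units where both runs assign the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each run's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both runs used one identical label throughout; treat as perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def cosine_similarity(u, v):
    """Cosine of the angle between two numeric vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mean_pairwise(runs, metric):
    """Average a two-argument metric over every unordered pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


def consensus_themes(runs, min_share=0.5):
    """Map each theme to the share of runs containing it, keeping only
    themes that reach min_share (an assumed threshold)."""
    counts = Counter(theme for run in runs for theme in set(run))
    total = len(runs)
    return {t: c / total for t, c in counts.items() if c / total >= min_share}
```

With six runs per model, `mean_pairwise(runs, cohen_kappa)` averages over 15 run pairs. Matching themes by exact string, as `consensus_themes` does here, is a simplification; a real pipeline would likely need to merge paraphrased themes first, for example by thresholding their cosine similarity.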