LLM Prompt Evaluation for Educational Applications
January 22, 2026
作者: Langdon Holmes, Adam Coscia, Scott Crossley, Joon Suh Choi, Wesley Morris
cs.AI
Abstract
As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data were sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
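To illustrate the tournament-style evaluation described above, the sketch below runs pairwise judgments through a rating update loop. For brevity it uses the simpler Elo update rather than Glicko2 (which additionally tracks each item's rating deviation and volatility); the function and template names are hypothetical, not from the paper.

```python
# Minimal sketch of a pairwise tournament for prompt templates.
# NOTE: uses an Elo update as a stand-in for Glicko2; Glicko2 also
# maintains a rating deviation and volatility per competitor.

K = 32  # Elo K-factor controlling update size


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, a_wins: bool) -> tuple[float, float]:
    """Update both ratings after one judged pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return (r_a + K * (s_a - e_a),
            r_b + K * ((1 - s_a) - (1 - e_a)))


def run_tournament(templates, judgments):
    """templates: template names; judgments: iterable of (a, b, a_wins)
    tuples, one per judge decision on a question pair."""
    ratings = {t: 1500.0 for t in templates}
    for a, b, a_wins in judgments:
        ratings[a], ratings[b] = update(ratings[a], ratings[b], a_wins)
    return ratings
```

After all judgments are applied, `expected_score` between two final ratings gives the pairwise win probability of one template over another, which is the quantity the study reports (81% to 100% for the winning strategic-reading prompt).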