

From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms

August 14, 2025
作者: Zhaokun Jiang, Ziyin Zhang
cs.AI

Abstract

Recent advancements in machine learning have spurred growing interest in automated interpreting quality assessment. Nevertheless, existing research suffers from insufficient examination of language use quality, unsatisfactory modeling effectiveness due to data scarcity and imbalance, and a lack of effort to explain model predictions. To address these gaps, we propose a multi-dimensional modeling framework that integrates feature engineering, data augmentation, and explainable machine learning. This approach prioritizes explainability over "black box" predictions by utilizing only construct-relevant, transparent features and conducting Shapley value (SHAP) analysis. Our results demonstrate strong predictive performance on a novel English-Chinese consecutive interpreting dataset, identifying BLEURT and CometKiwi scores as the strongest predictive features for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Overall, by placing particular emphasis on explainability, we present a scalable, reliable, and transparent alternative to traditional human evaluation, one that provides detailed diagnostic feedback for learners and supports self-regulated learning in ways that automated scores alone cannot.
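To make the pipeline concrete, below is a minimal sketch of the explainable-assessment idea described in the abstract: a regressor trained on transparent, construct-relevant features, followed by SHAP attribution to turn the score into per-feature diagnostic feedback. The feature names (bleurt, cometkiwi, pause_count, phrase_diversity), the synthetic data, and the choice of GradientBoostingRegressor are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch: explainable interpreting assessment with SHAP.
# Assumes a tabular dataset of interpreted segments with transparent
# features and human fidelity ratings; all names here are hypothetical.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Hypothetical feature table: one row per interpreted segment.
X = pd.DataFrame({
    "bleurt": rng.uniform(-1, 1, 200),           # semantic similarity to a reference translation
    "cometkiwi": rng.uniform(0, 1, 200),         # reference-free quality estimate
    "pause_count": rng.integers(0, 20, 200),     # disfluency signal
    "phrase_diversity": rng.uniform(0, 1, 200),  # language-use richness
})
# Synthetic target standing in for human fidelity scores.
y = 3 + 2 * X["bleurt"] + X["cometkiwi"] + rng.normal(0, 0.1, 200)

model = GradientBoostingRegressor().fit(X, y)

# SHAP decomposes each prediction into per-feature contributions,
# making the model's score interpretable rather than a black box.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Rank features by mean absolute SHAP contribution (global importance).
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))
```

Under these synthetic assumptions, the printed ranking would surface bleurt and cometkiwi as the dominant contributors, mirroring the kind of feature-level finding the paper reports for fidelity.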