
Pearmut: Human Evaluation of Translation Made Trivial

January 6, 2026
Authors: Vilém Zouhar, Tom Kocmi
cs.AI

Abstract

Human evaluation is the gold standard for multilingual NLP, but in practice it is often skipped and substituted with automatic metrics, because existing tools are notoriously complex and slow to set up and carry substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports the evaluation of multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is also extensible to allow prototyping of new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and active-learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
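To make the abstract's feature list concrete, the minimal sketch below illustrates what a DA-style evaluation item with document-level context, an attention check, and a static assignment strategy could look like. This is an illustrative assumption, not Pearmut's actual API: all class names, fields, and the `static_assignment` helper are hypothetical.

```python
# Hypothetical sketch (not Pearmut's actual API): one evaluation item under a
# DA-style protocol, with document-level context and an attention-check flag
# as described in the abstract. All names here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class EvalItem:
    source: str                       # source-language segment
    translation: str                  # system output to be scored
    doc_context: list[str] = field(default_factory=list)  # preceding segments
    is_attention_check: bool = False  # known-bad item used to screen annotators


@dataclass
class DAJudgment:
    item: EvalItem
    score: int                        # 0-100 adequacy score, as in Direct Assessment


def static_assignment(items: list[EvalItem], n_annotators: int) -> list[list[EvalItem]]:
    """Round-robin split of items across annotators (a static strategy).

    An active-learning strategy would instead prioritize the items whose
    quality estimate is currently most uncertain.
    """
    batches: list[list[EvalItem]] = [[] for _ in range(n_annotators)]
    for i, item in enumerate(items):
        batches[i % n_annotators].append(item)
    return batches


if __name__ == "__main__":
    items = [
        EvalItem("Guten Morgen.", "Good morning."),
        EvalItem("Wie geht es dir?", "How do you do?", is_attention_check=True),
    ]
    for batch in static_assignment(items, n_annotators=2):
        print([it.translation for it in batch])
    judgment = DAJudgment(item=items[0], score=92)
    print(judgment.score)
```

A contrastive (side-by-side) protocol would pair two translations per item rather than scoring one absolutely; MQM would replace the single score with a list of span-level error annotations.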