Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks such as Question Answering, Summarization, and Classification. Using LLMs as evaluators that rank or score the output of other models (usually LLMs) has become increasingly popular because current evaluation techniques are limited by a lack of appropriate benchmarks and metrics, by cost, and by limited access to human annotators. While LLMs can handle approximately 100 languages, most languages beyond the top 20 lack systematic evaluation across tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may be biased toward higher scores and should be used with caution, always after calibration against a dataset of native-speaker judgments, particularly in low-resource and non-Latin-script languages.
