Instruction-following Evaluation through Verbalizer Manipulation

Abstract

While instruction-tuned models have shown remarkable success in various natural language processing tasks, accurately evaluating their ability to follow instructions remains challenging. Existing benchmarks primarily focus on common instructions that align well with what the model learned during training. However, proficiency in responding to these instructions does not necessarily imply strong instruction-following ability. In this paper, we propose a novel instruction-following evaluation protocol called verbalizer manipulation. It instructs the model to verbalize the task label with words that align with model priors to different extents, adopting verbalizers that range from highly aligned (e.g., outputting "positive" for positive sentiment) to minimally aligned (e.g., outputting "negative" for positive sentiment). Verbalizer manipulation can be seamlessly integrated with any classification benchmark to examine the model's reliance on priors and its ability to override them to accurately follow the instructions. We conduct a comprehensive evaluation of four major model families across nine datasets, employing twelve sets of verbalizers for each of them. We observe that the instruction-following abilities of models, across different families and scales, are significantly distinguished by their performance on less natural verbalizers. Even the strongest GPT-4 model struggles to perform better than random guessing on the most challenging verbalizer, emphasizing the need for continued advancements to improve their instruction-following abilities.
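The protocol described in the abstract can be illustrated with a minimal sketch for binary sentiment classification. Everything named here is an assumption made for illustration and does not come from the paper: the verbalizer set names ("aligned", "arbitrary", "flipped"), the prompt wording, the helper functions, and the toy `prior_only_model`, which stands in for a model that ignores the instruction and always answers with the natural label word.

```python
from typing import Callable, Dict, List, Tuple

Verbalizer = Dict[str, str]  # maps a gold task label to the word the model must output

# Hypothetical verbalizer sets, ordered from most to least aligned with model priors.
VERBALIZERS: Dict[str, Verbalizer] = {
    "aligned": {"positive": "positive", "negative": "negative"},
    "arbitrary": {"positive": "foo", "negative": "bar"},
    "flipped": {"positive": "negative", "negative": "positive"},
}


def build_prompt(text: str, verbalizer: Verbalizer) -> str:
    """Spell out the label-to-word mapping as an instruction, then append the input."""
    rules = " ".join(
        f'If the sentiment is {label}, answer "{word}".'
        for label, word in verbalizer.items()
    )
    return f"{rules}\nReview: {text}\nAnswer:"


def accuracy(
    model: Callable[[str], str],
    examples: List[Tuple[str, str]],
    verbalizer: Verbalizer,
) -> float:
    """Fraction of examples where the output equals the instructed word for the gold label."""
    correct = 0
    for text, gold in examples:
        output = model(build_prompt(text, verbalizer)).strip().lower()
        correct += output == verbalizer[gold]
    return correct / len(examples)


def prior_only_model(prompt: str) -> str:
    """Toy stand-in: ignores the instruction and answers with the natural label word."""
    review = prompt.split("Review: ", 1)[1]
    return "positive" if "great" in review else "negative"


EXAMPLES = [("A great film.", "positive"), ("Dull and slow.", "negative")]

scores = {
    name: accuracy(prior_only_model, EXAMPLES, verbalizer)
    for name, verbalizer in VERBALIZERS.items()
}
print(scores)
```

In this toy setup the prior-only model scores perfectly on the aligned verbalizer and gets nothing right on the other two, which is the kind of gap the protocol is meant to expose. A real evaluation would replace `prior_only_model` with calls to an actual model and use a full classification dataset.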
