Does GPT-4 Pass the Turing Test?

Abstract

We evaluated GPT-4 in a public online Turing Test. The best-performing GPT-4 prompt passed in 41% of games, outperforming baselines set by ELIZA (27%) and GPT-3.5 (14%), but falling short of chance and the baseline set by human participants (63%). Participants' decisions were based mainly on linguistic style (35%) and socio-emotional traits (27%), supporting the idea that intelligence is not sufficient to pass the Turing Test. Participants' demographics, including education and familiarity with LLMs, did not predict detection rate, suggesting that even those who understand systems deeply and interact with them frequently may be susceptible to deception. Despite known limitations as a test of intelligence, we argue that the Turing Test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.
