"This Is Not My Voice": A Study of Accent Bias and Digital Exclusion in Synthetic AI Voice Services

Abstract

Recent advances in artificial intelligence (AI) speech generation and voice cloning have produced naturalistic speech and accurate voice replication, yet their influence on sociotechnical systems across diverse accents and linguistic traits is not fully understood. This study evaluates two synthetic AI voice services (Speechify and ElevenLabs) through a mixed-methods approach, using surveys and interviews to assess technical performance and to uncover how users' lived experiences influence their perceptions of accent variations in these speech technologies. Our findings reveal technical performance disparities across five regional English-language accents and demonstrate how current speech generation technologies may inadvertently reinforce linguistic privilege and accent-based discrimination, potentially creating new forms of digital exclusion. Overall, our study highlights the need for inclusive design and regulation by providing actionable insights for developers, policymakers, and organizations to ensure equitable and socially responsible AI speech technologies.