Do What I Say: A Spoken Prompt Dataset for Instruction-Following

Abstract

Speech Large Language Models (SLLMs) have proliferated rapidly and now support a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios in which users interact by voice. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken prompts and their written counterparts, designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap, highlighting the need for speech-based prompting in SLLM evaluation.

