DeepSpeak Dataset v1.0

Abstract

We describe DeepSpeak, a large-scale dataset of real and deepfake footage of people talking and gesturing in front of their webcams. The real videos in this first version of the dataset comprise 9 hours of footage from 220 diverse individuals. The fake videos, which make up more than 25 hours of footage, span a range of state-of-the-art face-swap and lip-sync deepfakes with natural and AI-generated voices. We expect to release future versions of this dataset with different and updated deepfake technologies. The dataset is freely available for research and non-commercial use; requests for commercial use will be considered.