African Speech Recognition & Transcription Dataset
3,000+ hours of transcribed audio across 12 African languages, recorded in realistic acoustic environments by demographically diverse speakers. The corpus is a broad African speech resource for ASR model training, voice assistant localisation, and accessibility tool development.
This is a synthetic dataset generated from high-quality expert-labelled seed data. All records are algorithmically derived — statistical distributions, inter-field correlations, and annotation characteristics faithfully replicate real-world patterns from the source data, while ensuring no real individual, organisation, or transaction can be identified or reconstructed.
The African Speech Recognition & Transcription Dataset spans 3,000+ hours of audio recorded across 12 African languages: Hausa, Yoruba, Igbo, Swahili, Zulu, Xhosa, Twi, Amharic, Wolof, Luganda, Kinyarwanda, and Mozambican Portuguese. Recordings were collected in controlled studio sessions, community centres, outdoor markets, and simulated call-centre environments — deliberately capturing the full acoustic range encountered in real-world deployment: background noise, varying microphone quality, speaker proximity variation, and multi-speaker overlap.
Each audio clip is paired with a human-verified orthographic transcript, a phonetic transliteration where applicable, speaker demographic metadata (age group, gender, dialect region), and acoustic environment tags. A per-clip word error rate from a baseline wav2vec 2.0 model (the baseline_asr_wer field) is included as a difficulty signal, enabling curriculum learning approaches that sequence training from low-WER (easy) to high-WER (difficult) utterances. Clip durations range from 2 to 30 seconds; the median is 8 seconds.
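As a rough illustration of the curriculum idea, the sketch below orders clips from easy to hard using the schema's baseline_asr_wer field (lower WER treated as easier). The helper name curriculum_order, the toy clip IDs, and the choice to place clips with a null score last are assumptions made for the example, not something the dataset prescribes.

```python
def curriculum_order(records):
    """Return records sorted from easiest to hardest by baseline WER.

    Assumption: clips whose baseline_asr_wer is null (None) carry no
    difficulty signal, so they are placed at the end of the curriculum.
    """
    def sort_key(rec):
        wer = rec.get("baseline_asr_wer")
        # False sorts before True, so scored clips come first.
        return (wer is None, wer if wer is not None else 0.0)

    return sorted(records, key=sort_key)


# Toy records with illustrative IDs (not real dataset entries).
clips = [
    {"clip_id": "clip-a", "baseline_asr_wer": 0.42},
    {"clip_id": "clip-b", "baseline_asr_wer": None},
    {"clip_id": "clip-c", "baseline_asr_wer": 0.14},
]

ordered_ids = [c["clip_id"] for c in curriculum_order(clips)]
print(ordered_ids)
```

A training loop could then draw early batches from the front of this list and gradually widen the window toward the harder clips.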
The dataset is partitioned into train / validation / test splits that are stratified by language and acoustic environment and disjoint by speaker (no speaker appears in both train and test). A separate out-of-domain evaluation set comprising radio broadcast excerpts and phone-call audio is provided for robustness testing. All audio is stored as 16 kHz mono WAV; transcripts are distributed as JSON sidecar files and as Hugging Face Datasets-compatible JSONL.
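The speaker-disjoint property of the splits can be spot-checked from the JSONL metadata. The following is a minimal sketch, assuming each JSONL line is one record carrying the schema's speaker_id and split fields; the function names and the sample lines are hypothetical, and the actual JSONL filename is not specified here.

```python
import json
from collections import defaultdict


def speakers_by_split(jsonl_lines):
    """Map each split name to the set of speaker_ids that appear in it."""
    seen = defaultdict(set)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        seen[rec["split"]].add(rec["speaker_id"])
    return dict(seen)


def leaked_speakers(seen, split_a="TRAIN", split_b="TEST"):
    """Return speaker_ids present in both splits (empty set means no leakage)."""
    return seen.get(split_a, set()) & seen.get(split_b, set())


# Illustrative lines only; real records carry the full schema.
sample_lines = [
    '{"clip_id": "x1", "speaker_id": "SPK-0001", "split": "TRAIN"}',
    '{"clip_id": "x2", "speaker_id": "SPK-0002", "split": "TEST"}',
    '{"clip_id": "x3", "speaker_id": "SPK-0001", "split": "TRAIN"}',
]

seen = speakers_by_split(sample_lines)
leaks = leaked_speakers(seen)
print(sorted(leaks))
```

With a real metadata file, the same functions accept an open file handle, since iterating over a text file yields one line at a time.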
Key Use Cases
Languages Covered
Dataset Highlights
Geographic Coverage
Dataset Schema
Each record represents one audio clip and its associated transcript and metadata. Audio files are referenced by filename; transcripts and annotations are stored inline.
| Field Name | Type | Description | Nullable | Example |
|---|---|---|---|---|
| clip_id | STRING | Unique clip identifier | No | CLK-HAS-NGA-0041823 |
| language | ENUM | Spoken language: HAUSA, YORUBA, IGBO, SWAHILI, ZULU, XHOSA, TWI, AMHARIC, WOLOF, LUGANDA, KINYARWANDA, PT_MOZ | No | HAUSA |
| country_code | STRING | ISO 3166-1 alpha-2 country of recording | No | NG |
| audio_filename | STRING | WAV file path relative to dataset root | No | audio/ng/hausa/CLK-HAS-NGA-0041823.wav |
| duration_seconds | FLOAT | Clip duration in seconds | No | 7.4 |
| transcript | STRING | Human-verified orthographic transcription | No | Yaya za mu iya taimaka muku yau? |
| phonetic_transcript | STRING | Phonetic transliteration in IPA (null if not available) | Yes | null |
| speaker_id | STRING | Anonymised speaker identifier (consistent within split) | No | SPK-NGA-1847 |
| speaker_gender | ENUM | Speaker gender: MALE, FEMALE | Yes | FEMALE |
| speaker_age_group | ENUM | Age group: YOUTH (15–24), ADULT (25–54), SENIOR (55+) | Yes | ADULT |
| dialect_region | STRING | Speaker dialect or regional variety label | Yes | Northern Nigeria |
| acoustic_environment | ENUM | Recording environment: STUDIO, COMMUNITY, OUTDOOR, CALL_CENTRE | No | COMMUNITY |
| snr_db | FLOAT | Estimated signal-to-noise ratio in decibels | Yes | 18.3 |
| baseline_asr_wer | FLOAT | Word error rate from baseline wav2vec 2.0 model (0–1) | Yes | 0.14 |
| split | ENUM | Dataset partition: TRAIN, VAL, TEST, OOD_EVAL | No | TRAIN |
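
A record can be checked against the schema above before it enters a training pipeline. The sketch below is one possible validator, not an official tool: the function name validate_record is hypothetical, the required fields are those marked non-nullable in the table, and the allowed values are copied from the ENUM descriptions. The example record is assembled from the table's Example column.

```python
# Fields marked "No" in the Nullable column of the schema table.
REQUIRED = (
    "clip_id", "language", "country_code", "audio_filename",
    "duration_seconds", "transcript", "speaker_id",
    "acoustic_environment", "split",
)

# Allowed values, copied from the ENUM field descriptions.
ENUMS = {
    "language": {
        "HAUSA", "YORUBA", "IGBO", "SWAHILI", "ZULU", "XHOSA", "TWI",
        "AMHARIC", "WOLOF", "LUGANDA", "KINYARWANDA", "PT_MOZ",
    },
    "speaker_gender": {"MALE", "FEMALE"},
    "speaker_age_group": {"YOUTH", "ADULT", "SENIOR"},
    "acoustic_environment": {"STUDIO", "COMMUNITY", "OUTDOOR", "CALL_CENTRE"},
    "split": {"TRAIN", "VAL", "TEST", "OOD_EVAL"},
}


def validate_record(rec):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in REQUIRED:
        if rec.get(field) is None:
            problems.append(f"missing required field: {field}")
    for field, allowed in ENUMS.items():
        value = rec.get(field)
        # Null values are handled by the required-field check above.
        if value is not None and value not in allowed:
            problems.append(f"unexpected {field}: {value!r}")
    return problems


record = {
    "clip_id": "CLK-HAS-NGA-0041823",
    "language": "HAUSA",
    "country_code": "NG",
    "audio_filename": "audio/ng/hausa/CLK-HAS-NGA-0041823.wav",
    "duration_seconds": 7.4,
    "transcript": "Yaya za mu iya taimaka muku yau?",
    "phonetic_transcript": None,
    "speaker_id": "SPK-NGA-1847",
    "speaker_gender": "FEMALE",
    "speaker_age_group": "ADULT",
    "dialect_region": "Northern Nigeria",
    "acoustic_environment": "COMMUNITY",
    "snr_db": 18.3,
    "baseline_asr_wer": 0.14,
    "split": "TRAIN",
}

good_problems = validate_record(record)
bad_problems = validate_record(dict(record, language="FRENCH", split=None))
print(good_problems, bad_problems)
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole file in one pass.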
Sample Records
Four representative clip records spanning languages, acoustic environments, and speaker demographics.