High-Quality African Training Datasets
20 licensed datasets across 6 industries and 50+ curated open-access datasets for African AI research. From healthcare to finance, NLP to computer vision — ethically sourced, human-validated, and built for local relevance and global impact.
Open Datasets for AI Research
Finding the right dataset for your AI model shouldn’t take weeks. Discover 50+ curated open-access datasets for building AI that works for Africa — spanning NLP, speech, computer vision, and multimodal tasks across 35+ African languages.
| Dataset | Modality | Tasks | Languages | License | Quality | Links |
|---|---|---|---|---|---|---|
| African Breast Ultrasound Dataset: Annotated breast ultrasound images for cancer screening in Black African women. | Image | Image Classification, Object Detection | — | CC BY-NC 4.0 | ★★★★★ | |
| African Voices Speech Dataset: Rich audio-text datasets in four major African languages. | Audio | ASR, Language Identification | Hausa, Yoruba, Igbo, Nigerian Pidgin | CC BY 4.0 | ★★★★★ | |
| AfriCaption: Multi-lingual image captioning benchmark for 20 African languages. | Multimodal | Image Captioning, Visual Question Answering | Yoruba, Hausa, Igbo, Amharic, +15 more | CC BY 4.0 | ★★★★★ | 🤗 |
| AfriDocMT: Document-level multi-parallel machine translation corpus covering English and 5 African languages across health and IT news domains. | Text | Machine Translation | Amharic, Hausa, Swahili, Yoruba, +2 more | CC BY-NC-SA 3.0 | ★★★★★ | 🤗 |
| AfriQA: Cross-lingual open-retrieval question answering benchmark for 10 African languages. | Text | Question Answering, Information Retrieval | Bemba, Fon, Hausa, Igbo, +6 more | CC BY-SA 4.0 | ★★★★★ | 🤗 |
| AfriQA Gold Passages: Gold Wikipedia passage retrieval corpus for the AfriQA cross-lingual open-retrieval QA benchmark across 10 African languages. | Text | Information Retrieval, Question Answering | Bemba, Fon, Hausa, Igbo, +6 more | CC BY-SA 4.0 | ★★★★★ | 🤗 |
| AfriSenti-SemEval 2023: Multilingual Twitter sentiment analysis dataset for 14 African languages from SemEval 2023 Task 12. | Text | Sentiment Analysis, Text Classification | Amharic, Algerian Arabic, Hausa, Igbo, +10 more | CC BY 4.0 | ★★★★★ | 🤗 |
| AfriSpeech-200: Pan-African accented English speech dataset covering 200 hours from clinical and general domains. | Audio | ASR, Accent Classification | English (African accents) | CC BY 4.0 | ★★★★★ | 🤗 |
| ALFFA Speech Corpus: Read-speech ASR corpus for five African languages collected in-country by academic partners. | Audio | ASR | Amharic, Swahili, Wolof, Hausa, +1 more | CC BY 4.0 | ★★★★★ | |
| Amharic News Text Classification Dataset: Amharic news articles scraped from Ethiopian outlets for multi-class topic classification. | Text | Text Classification, Language Modeling | Amharic | MIT | ★★★★★ | 🤗 |
| Aya Dataset - Cohere Labs (African subset): Massive multilingual human-annotated instruction-following dataset covering 101 languages including 20+ African. | Text | Language Modeling, Question Answering, Summarisation, +2 more | Amharic, Chichewa, Hausa, Igbo, +9 more | Apache 2.0 | ★★★★★ | 🤗 |
| Bean Disease Detection Dataset (iBean): Field-captured images of bean leaves for diagnosing common diseases in East Africa. | Image | Image Classification | — | CC BY 4.0 | ★★★★★ | 🤗 |
| BraTS-Africa: MRI brain imaging dataset with expert segmentation for brain tumors in African patients. | Image | Segmentation, Image Classification | — | CC BY 4.0 | ★★★★★ | |
| Cassava Leaf Disease Classification: Expert-annotated leaf image dataset for classifying five cassava disease conditions in Ugandan farms. | Image | Image Classification | — | CC BY 3.0 IGO | ★★★★★ | 🤗 |
| Common Voice Scripted Speech 25.0 — Amharic: Crowdsourced Amharic read-speech corpus from Mozilla's Common Voice platform. | Audio | ASR | Amharic | CC0 1.0 | ★★★★★ | |
| Common Voice Scripted Speech 25.0 — Hausa: Crowdsourced Hausa read-speech corpus from Mozilla's Common Voice initiative. | Audio | ASR | Hausa | CC0 1.0 | ★★★★★ | |
| Common Voice Scripted Speech 25.0 — Igbo: Crowdsourced Igbo read-speech corpus from Mozilla's Common Voice initiative. | Audio | ASR | Igbo | CC0 1.0 | ★★★★★ | |
| Common Voice Scripted Speech 25.0 — Kinyarwanda: Crowdsourced Kinyarwanda speech corpus, one of the largest African language datasets on Common Voice. | Audio | ASR | Kinyarwanda | CC0 1.0 | ★★★★★ | |
| Common Voice Scripted Speech 25.0 — Luganda: Crowdsourced Luganda read-speech corpus from Mozilla's Common Voice platform. | Audio | ASR | Luganda | CC0 1.0 | ★★★★★ | |
| Common Voice Scripted Speech 25.0 — Swahili: Crowdsourced Swahili read-speech corpus from Mozilla's Common Voice initiative. | Audio | ASR | Swahili | CC0 1.0 | ★★★★★ | |
| Common Voice Scripted Speech 25.0 — Tigrinya: Crowdsourced Tigrinya read-speech corpus from Mozilla's Common Voice initiative. | Audio | ASR | Tigrinya | CC0 1.0 | ★★★★★ | |
| Common Voice Scripted Speech 25.0 — Yoruba: Crowdsourced Yoruba read-speech corpus from Mozilla's Common Voice initiative. | Audio | ASR | Yoruba | CC0 1.0 | ★★★★★ | |
| FLORES+ African Languages Subset: Evaluation benchmark dataset for multilingual machine translation covering 229 languages. | Text | Machine Translation | Amharic, Bemba, Chokwe, Ewe, +23 more | CC BY-SA 4.0 | ★★★★★ | 🤗 |
| GhanaNLP Twi and English Parallel Data: Twi language NLP resources including parallel text and news data from the GhanaNLP initiative. | Text | Machine Translation, Language Modeling, Text Classification | Twi, English | CC BY 4.0 | ★★★★★ | 🤗 |
| Google FLEURS — African Languages Subset: Few-shot speech benchmark with 20+ African languages for ASR and language identification. | Audio | ASR, Language Identification | Afrikaans, Amharic, Fula, Ganda, +16 more | CC BY 4.0 | ★★★★★ | 🤗 |
| Google Open Buildings: ML-derived building footprint dataset covering ~516M structures across Africa and South/Southeast Asia. | Image | Object Detection, Segmentation, Geospatial Analysis | — | CC BY 4.0 | ★★★★★ | |
| GRID3 African Settlement Mapping Dataset: High-resolution settlement boundaries and population estimates for 14+ African nations. | Image | Geospatial Analysis, Segmentation | — | CC BY 4.0 | ★★★★★ | |
| HausaVG: Hausa Visual Genome for multi-modal English-to-Hausa machine translation, created by professional translators at BUK. | Multimodal | Machine Translation, Language Modeling | Hausa, English | CC BY-NC-SA 4.0 | ★★★★★ | 🤗 |
| HausaVQA: Bilingual Hausa–English visual question answering dataset with image–question–answer triplets from HausaNLP. | Multimodal | Visual Question Answering, Image Captioning | Hausa, English | CC BY-SA 4.0 | ★★★★★ | 🤗 |
| iCassava FGVC 2019 Challenge Dataset: Earlier Makerere cassava disease image dataset used in the iCassava 2019 FGVC workshop challenge. | Image | Image Classification, Fine-Grained Recognition | — | MIT | ★★★★★ | |
| IgboNLP Corpus: Gold-standard Igbo NLP resources covering POS tagging, NER, and morphological analysis. | Text | POS Tagging, NER, Morphological Analysis | Igbo | Apache 2.0 | ★★★★★ | 🤗 |
| JW300 African Language Pairs: Large parallel corpus from Jehovah's Witnesses publications covering 100+ African language pairs. | Text | Machine Translation, Language Modeling | Swahili, Lingala, Shona, Zulu, +9 more | Unknown | ★★★★★ | 🤗 |
| KINNEWS and KIRNEWS: Kinyarwanda and Kirundi news classification corpora scraped from regional online news outlets. | Text | Text Classification, Language Modeling | Kinyarwanda, Kirundi | MIT | ★★★★★ | 🤗 |
| MAFAND-MT: News-domain machine translation benchmark for 21 African languages paired with English and French. | Text | Machine Translation | Amharic, Bambara, Ghomala, Ewe, +19 more | CC BY-NC 4.0 | ★★★★★ | 🤗 |
| Makerere Fall Armyworm Crop Dataset: Drone and ground-level imagery for detecting fall armyworm infestation in Ugandan maize fields. | Image | Object Detection, Image Classification | — | CC BY 4.0 | ★★★★★ | |
| Masakhane MT Benchmark 2020: Community-built machine translation benchmark for 38 African language pairs against English. | Text | Machine Translation | Yoruba, Hausa, Swahili, Shona, +2 more | Apache 2.0 | ★★★★★ | |
| MasakhaNER 2.0: Human-annotated named entity recognition benchmark spanning 20 African languages. | Text | NER | Bambara, Ghomala, Ewe, Fon, +16 more | Apache 2.0 | ★★★★★ | 🤗 |
| MasakhaNEWS: News topic classification dataset spanning 16 African languages across 7 categories. | Text | Text Classification | Amharic, English, French, Hausa, +12 more | CC BY-NC 4.0 | ★★★★★ | 🤗 |
| MasakhaPOS: Part-of-speech tagging dataset covering 20 African languages with Universal Dependencies annotations. | Text | POS Tagging | Bambara, Ghomala, Ewe, Fon, +16 more | CC BY-NC 4.0 | ★★★★★ | 🤗 |
| MASSIVE (African Languages Subset): Multilingual Amazon SLU resource for slot-filling, intent classification, and dialogue tasks covering 52 languages. | Text | Text Classification, Dialogue | Afrikaans, Amharic, Swahili | CC BY 4.0 | ★★★★★ | 🤗 |
| MENYO-20k: Multi-domain Yoruba–English parallel corpus of 20,000 human-translated sentence pairs. | Text | Machine Translation | Yoruba, English | CC BY-NC 4.0 | ★★★★★ | 🤗 |
| NaijaSenti: Twitter sentiment analysis dataset for four Nigerian languages annotated by native speakers. | Text | Sentiment Analysis | Yoruba, Hausa, Igbo, Nigerian Pidgin | CC BY 4.0 | ★★★★★ | 🤗 |
| NCHLT Speech Corpus: Read-speech corpus covering all 11 official South African languages with phoneme-level transcriptions. | Audio | ASR, TTS | Zulu, Xhosa, Afrikaans, Northern Sotho (Sepedi), +7 more | CC BY 3.0 | ★★★★★ | 🤗 |
| NIH Malaria Parasite Cell Image Dataset: Expert-annotated blood smear images for automated malaria detection. | Image | Image Classification, Object Detection | — | Public Domain | ★★★★★ | |
| PlantVillage Crop Disease Dataset: Expert-labelled plant leaf images covering 26 disease categories across 14 crop types, widely used in African agricultural AI. | Image | Image Classification | — | CC BY-SA 3.0 | ★★★★★ | 🤗 |
| Radiant MLHub — Rwanda Field Boundary Competition Dataset: Sentinel-2 satellite time-series imagery with expert-delineated agricultural field boundaries across Rwanda. | Image | Segmentation, Object Detection, Geospatial Analysis | — | CC BY 4.0 | ★★★★★ | |
| Snapshot Serengeti: Millions of camera trap images from Serengeti with crowdsourced wildlife labels. | Image | Image Classification, Object Detection | — | CC BY 4.0 | ★★★★★ | |
| Sunbird African Language Technology (SALT) Dataset: SALT is a multi-way parallel text and speech corpus of English and six languages widely spoken in Uganda and East Africa. | Multimodal | Machine Translation, Text Classification, NER | English (Ugandan accent), Luganda, Acholi, Lugbara, +6 more | CC BY 4.0 | ★★★★★ | 🤗 |
| WAXAL: Google's large-scale multilingual speech corpus spanning 24 African languages for ASR and TTS. | Audio | ASR, TTS | Acholi, Akan, Amharic, Dagbani, +20 more | CC BY 4.0 | ★★★★★ | 🤗 |
| XL-Sum African Languages Subset: BBC-sourced cross-lingual abstractive summarisation dataset covering 10 African languages. | Text | Summarisation, Language Modeling | Amharic, Hausa, Igbo, Kirundi, +6 more | CC BY-NC-SA 4.0 | ★★★★★ | 🤗 |
Data infrastructure for Africa's AI moment
Generic global datasets fail to capture the nuance, language diversity, and economic realities of African markets. DataLens Africa was built to close that gap — with data that is contextually correct, ethically sourced, and ready for production.
Build with data that reflects Africa
Request access to our full catalog of licensed, human-validated African datasets, or request a custom dataset tailored to your project.