Licensed & Open Dataset Catalogue

High-Quality African Training Datasets

20 licensed datasets across 6 industries and 50+ curated open-access datasets for African AI research. They span healthcare to finance and NLP to computer vision, and are ethically sourced, human-validated, and built for local relevance and global impact.

African AI Training Data

Open Datasets for AI Research

Finding the right dataset for your AI model shouldn’t take weeks. Discover 50+ curated open-access datasets for building AI that works for Africa — spanning NLP, speech, computer vision, and multimodal tasks across 35+ African languages.

Columns: Dataset, Modality, Tasks, Languages, License, Quality, Links
African Breast Ultrasound Dataset
Annotated breast ultrasound images for cancer screening in Black African women.
Image
Image Classification, Object Detection
CC BY-NC 4.0
African Voices Speech Dataset
Rich audio-text datasets in four major African languages.
Audio
ASR, Language Identification
Hausa, Yoruba, Igbo, Nigerian Pidgin
CC BY 4.0
AfriCaption
Multi-lingual image captioning benchmark for 20 African languages.
Multimodal
Image Captioning, Visual Question Answering
Yoruba, Hausa, Igbo, Amharic, +15 more
CC BY 4.0 🤗
AfriDocMT
Document-level multi-parallel machine translation corpus covering English and 5 African languages across health and IT news domains.
Text
Machine Translation
Amharic, Hausa, Swahili, Yoruba, +2 more
CC BY-NC-SA 3.0 🤗
AfriQA
Cross-lingual open-retrieval question answering benchmark for 10 African languages.
Text
Question Answering, Information Retrieval
Bemba, Fon, Hausa, Igbo, +6 more
CC BY-SA 4.0 🤗
AfriQA Gold Passages
Gold Wikipedia passage retrieval corpus for the AfriQA cross-lingual open-retrieval QA benchmark across 10 African languages.
Text
Information Retrieval, Question Answering
Bemba, Fon, Hausa, Igbo, +6 more
CC BY-SA 4.0 🤗
AfriSenti-SemEval 2023
Multilingual Twitter sentiment analysis dataset for 14 African languages from SemEval 2023 Task 12.
Text
Sentiment Analysis, Text Classification
Amharic, Algerian Arabic, Hausa, Igbo, +10 more
CC BY 4.0 🤗
AfriSpeech-200
Pan-African accented English speech dataset covering 200 hours from clinical and general domains.
Audio
ASR, Accent Classification
English (African accents)
CC BY 4.0 🤗
ALFFA Speech Corpus
Read-speech ASR corpus for five African languages collected in-country by academic partners.
Audio
ASR
Amharic, Swahili, Wolof, Hausa, +1 more
CC BY 4.0
Amharic News Text Classification Dataset
Amharic news articles scraped from Ethiopian outlets for multi-class topic classification.
Text
Text Classification, Language Modeling
Amharic
MIT 🤗
Aya Dataset - Cohere Labs (African subset)
Massive multilingual human-annotated instruction-following dataset covering 101 languages including 20+ African.
Text
Language Modeling, Question Answering, Summarisation, +2 more
Amharic, Chichewa, Hausa, Igbo, +9 more
Apache 2.0 🤗
Bean Disease Detection Dataset (iBean)
Field-captured images of bean leaves for diagnosing common diseases in East Africa.
Image
Image Classification
CC BY 4.0 🤗
BraTS-Africa
MRI brain imaging dataset with expert segmentation for brain tumors in African patients.
Image
Segmentation, Image Classification
CC BY 4.0
Cassava Leaf Disease Classification
Expert-annotated leaf image dataset for classifying five cassava disease conditions in Ugandan farms.
Image
Image Classification
CC BY 3.0 IGO 🤗
Common Voice Scripted Speech 25.0 — Amharic
Crowdsourced Amharic read-speech corpus from Mozilla's Common Voice platform.
Audio
ASR
Amharic
CC0 1.0
Common Voice Scripted Speech 25.0 — Hausa
Crowdsourced Hausa read-speech corpus from Mozilla's Common Voice initiative.
Audio
ASR
Hausa
CC0 1.0
Common Voice Scripted Speech 25.0 — Igbo
Crowdsourced Igbo read-speech corpus from Mozilla's Common Voice initiative.
Audio
ASR
Igbo
CC0 1.0
Common Voice Scripted Speech 25.0 — Kinyarwanda
Crowdsourced Kinyarwanda speech corpus, one of the largest African language datasets on Common Voice.
Audio
ASR
Kinyarwanda
CC0 1.0
Common Voice Scripted Speech 25.0 — Luganda
Crowdsourced Luganda read-speech corpus from Mozilla's Common Voice platform.
Audio
ASR
Luganda
CC0 1.0
Common Voice Scripted Speech 25.0 — Swahili
Crowdsourced Swahili read-speech corpus from Mozilla's Common Voice initiative.
Audio
ASR
Swahili
CC0 1.0
Common Voice Scripted Speech 25.0 — Tigrinya
Crowdsourced Tigrinya read-speech corpus from Mozilla's Common Voice initiative.
Audio
ASR
Tigrinya
CC0 1.0
Common Voice Scripted Speech 25.0 — Yoruba
Crowdsourced Yoruba read-speech corpus from Mozilla's Common Voice initiative.
Audio
ASR
Yoruba
CC0 1.0
FLORES+ African Languages Subset
Evaluation benchmark dataset for multilingual machine translation covering 229 languages.
Text
Machine Translation
Amharic, Bemba, Chokwe, Ewe, +23 more
CC BY-SA 4.0 🤗
GhanaNLP Twi and English Parallel Data
Twi language NLP resources including parallel text and news data from the GhanaNLP initiative.
Text
Machine Translation, Language Modeling, Text Classification
Twi, English
CC BY 4.0 🤗
Google FLEURS — African Languages Subset
Few-shot speech benchmark with 20+ African languages for ASR and language identification.
Audio
ASR, Language Identification
Afrikaans, Amharic, Fula, Ganda, +16 more
CC BY 4.0 🤗
Google Open Buildings
ML-derived building footprint dataset covering ~516M structures across Africa and South/Southeast Asia.
Image
Object Detection, Segmentation, Geospatial Analysis
CC BY 4.0
GRID3 African Settlement Mapping Dataset
High-resolution settlement boundaries and population estimates for 14+ African nations.
Image
Geospatial Analysis, Segmentation
CC BY 4.0
HausaVG
Hausa Visual Genome for multi-modal English-to-Hausa machine translation, created by professional translators at BUK.
Multimodal
Machine Translation, Language Modeling
Hausa, English
CC BY-NC-SA 4.0 🤗
HausaVQA
Bilingual Hausa–English visual question answering dataset with image–question–answer triplets from HausaNLP.
Multimodal
Visual Question Answering, Image Captioning
Hausa, English
CC BY-SA 4.0 🤗
iCassava FGVC 2019 Challenge Dataset
Earlier Makerere cassava disease image dataset used in the iCassava 2019 FGVC workshop challenge.
Image
Image Classification, Fine-Grained Recognition
MIT
IgboNLP Corpus
Gold-standard Igbo NLP resources covering POS tagging, NER, and morphological analysis.
Text
POS Tagging, NER, Morphological Analysis
Igbo
Apache 2.0 🤗
JW300 African Language Pairs
Large parallel corpus from Jehovah's Witnesses publications covering 100+ African language pairs.
Text
Machine Translation, Language Modeling
Swahili, Lingala, Shona, Zulu, +9 more
Unknown 🤗
KINNEWS and KIRNEWS
Kinyarwanda and Kirundi news classification corpora scraped from regional online news outlets.
Text
Text Classification, Language Modeling
Kinyarwanda, Kirundi
MIT 🤗
MAFAND-MT
News-domain machine translation benchmark for 21 African languages paired with English and French.
Text
Machine Translation
Amharic, Bambara, Ghomala, Ewe, +19 more
CC BY-NC 4.0 🤗
Makerere Fall Armyworm Crop Dataset
Drone and ground-level imagery for detecting fall armyworm infestation in Ugandan maize fields.
Image
Object Detection, Image Classification
CC BY 4.0
Masakhane MT Benchmark 2020
Community-built machine translation benchmark for 38 African language pairs against English.
Text
Machine Translation
Yoruba, Hausa, Swahili, Shona, +2 more
Apache 2.0
MasakhaNER 2.0
Human-annotated named entity recognition benchmark spanning 20 African languages.
Text
NER
Bambara, Ghomala, Ewe, Fon, +16 more
Apache 2.0 🤗
MasakhaNEWS
News topic classification dataset spanning 16 African languages across 7 categories.
Text
Text Classification
Amharic, English, French, Hausa, +12 more
CC BY-NC 4.0 🤗
MasakhaPOS
Part-of-speech tagging dataset covering 20 African languages with Universal Dependencies annotations.
Text
POS Tagging
Bambara, Ghomala, Ewe, Fon, +16 more
CC BY-NC 4.0 🤗
MASSIVE (African Languages Subset)
Multilingual Amazon SLU resource for slot-filling, intent classification, and dialogue tasks covering 52 languages.
Text
Text Classification, Dialogue
Afrikaans, Amharic, Swahili
CC BY 4.0 🤗
MENYO-20k
Multi-domain Yoruba–English parallel corpus of 20,000 human-translated sentence pairs.
Text
Machine Translation
Yoruba, English
CC BY-NC 4.0 🤗
NaijaSenti
Twitter sentiment analysis dataset for four Nigerian languages annotated by native speakers.
Text
Sentiment Analysis
Yoruba, Hausa, Igbo, Nigerian Pidgin
CC BY 4.0 🤗
NCHLT Speech Corpus
Read-speech corpus covering all 11 official South African languages with phoneme-level transcriptions.
Audio
ASR, TTS
Zulu, Xhosa, Afrikaans, Northern Sotho (Sepedi), +7 more
CC BY 3.0 🤗
NIH Malaria Parasite Cell Image Dataset
Expert-annotated blood smear images for automated malaria detection.
Image
Image Classification, Object Detection
Public Domain
PlantVillage Crop Disease Dataset
Expert-labelled plant leaf images covering 26 disease categories across 14 crop types, widely used in African agricultural AI.
Image
Image Classification
CC BY-SA 3.0 🤗
Radiant MLHub — Rwanda Field Boundary Competition Dataset
Sentinel-2 satellite time-series imagery with expert-delineated agricultural field boundaries across Rwanda.
Image
Segmentation, Object Detection, Geospatial Analysis
CC BY 4.0
Snapshot Serengeti
Millions of camera trap images from Serengeti with crowdsourced wildlife labels.
Image
Image Classification, Object Detection
CC BY 4.0
Sunbird African Language Technology (SALT) Dataset
SALT is a multi-way parallel text and speech corpus of English and six languages widely spoken in Uganda and East Africa.
Multimodal
Machine Translation, Text Classification, NER
English (Ugandan accent), Luganda, Acholi, Lugbara, +6 more
CC BY 4.0 🤗
WAXAL
Google's large-scale multilingual speech corpus spanning 24 African languages for ASR and TTS.
Audio
ASR, TTS
Acholi, Akan, Amharic, Dagbani, +20 more
CC BY 4.0 🤗
XL-Sum African Languages Subset
BBC-sourced cross-lingual abstractive summarisation dataset covering 10 African languages.
Text
Summarisation, Language Modeling
Amharic, Hausa, Igbo, Kirundi, +6 more
CC BY-NC-SA 4.0 🤗
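
The listing above can be narrowed by modality, task, language, and license. The following is a minimal Python sketch of that kind of filtering, not the catalogue's actual implementation. The `CatalogueEntry` class, the `filter_entries` function, and the `is_non_commercial` helper are hypothetical names introduced for illustration; the five sample rows are copied from the listing, with only the languages that are visible there. The license check is a simple string heuristic that looks for the Creative Commons "NC" element, so always confirm terms against the license text itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CatalogueEntry:
    """One row of the listing (hypothetical structure, for illustration only)."""
    name: str
    modality: str
    tasks: Tuple[str, ...]
    languages: Tuple[str, ...]  # only the languages visible in the listing
    license: str


# Sample rows copied from the listing above.
ENTRIES = [
    CatalogueEntry("AfriQA", "Text",
                   ("Question Answering", "Information Retrieval"),
                   ("Bemba", "Fon", "Hausa", "Igbo"), "CC BY-SA 4.0"),
    CatalogueEntry("AfriSpeech-200", "Audio",
                   ("ASR", "Accent Classification"),
                   ("English (African accents)",), "CC BY 4.0"),
    CatalogueEntry("MENYO-20k", "Text",
                   ("Machine Translation",),
                   ("Yoruba", "English"), "CC BY-NC 4.0"),
    CatalogueEntry("NaijaSenti", "Text",
                   ("Sentiment Analysis",),
                   ("Yoruba", "Hausa", "Igbo", "Nigerian Pidgin"), "CC BY 4.0"),
    CatalogueEntry("JW300 African Language Pairs", "Text",
                   ("Machine Translation", "Language Modeling"),
                   ("Swahili", "Lingala", "Shona", "Zulu"), "Unknown"),
]


def is_non_commercial(license_name: str) -> bool:
    """Return True if the license string contains the CC 'NC' element."""
    parts = license_name.upper().replace("-", " ").split()
    return "NC" in parts


def filter_entries(
    entries: List[CatalogueEntry],
    modality: Optional[str] = None,
    task: Optional[str] = None,
    language: Optional[str] = None,
    allow_non_commercial: bool = True,
    allow_unknown_license: bool = True,
) -> List[CatalogueEntry]:
    """Keep entries that satisfy every filter that was supplied."""
    result = []
    for entry in entries:
        if modality is not None and entry.modality != modality:
            continue
        if task is not None and task not in entry.tasks:
            continue
        if language is not None and language not in entry.languages:
            continue
        if not allow_non_commercial and is_non_commercial(entry.license):
            continue
        if not allow_unknown_license and entry.license == "Unknown":
            continue
        result.append(entry)
    return result


# Text datasets whose license string has no NC element and is not "Unknown".
commercial_text = filter_entries(
    ENTRIES, modality="Text",
    allow_non_commercial=False, allow_unknown_license=False,
)
print([entry.name for entry in commercial_text])
```

With the sample rows above, the final call keeps AfriQA and NaijaSenti, and drops MENYO-20k (non-commercial) and JW300 (license listed as Unknown).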


Data infrastructure for Africa's AI moment

Generic global datasets fail to capture the nuance, language diversity, and economic realities of African markets. DataLens Africa was built to close that gap — with data that is contextually correct, ethically sourced, and ready for production.

Context-Native Data
Every dataset is collected and annotated within African contexts. You get data that reflects how African economies actually work, not how Western datasets assume they do. No more forcing square proxies into round markets.
Production-Scale Volume
With over 10 million records spanning financial services, healthcare, agriculture, logistics, and governance, our datasets are sized for production model training. Gold, Silver, and Bronze tiers ensure the right fit for every stage of your AI development cycle.
Structured for ML Workflows
No raw data dumps. Every dataset ships with consistent schemas, human-validated annotations, standardised label taxonomies, and clear feature documentation. We make it easy to plug our data into your training pipelines and get to model iteration faster.
Continuously Updated
African markets move fast, and so does our data. Datasets are refreshed on rolling cycles through live partner pipelines, ongoing data collection, and regular annotation sprints. You get access to the latest trends, behaviors, and market dynamics.

Build with data that reflects Africa

Request access to our full catalogue of licensed, human-validated African datasets, or request a custom dataset tailored to your project.