DS-09 Language & NLP

West African Multilingual Annotated Text Dataset

5M+ human-validated annotated tokens spanning Hausa, Yoruba, Igbo, Twi, and Nigerian Pidgin — covering sentiment, named-entity, intent, and language-identification labels for LLM fine-tuning and conversational AI targeting West African language speakers.

This is a synthetic dataset generated from high-quality expert-labelled seed data. All records are algorithmically derived — statistical distributions, inter-field correlations, and annotation characteristics faithfully replicate real-world patterns from the source data, while ensuring no real individual, organisation, or transaction can be identified or reconstructed.

The West African Multilingual Annotated Text Dataset is a 5M+ token corpus covering five languages and language varieties spoken across Nigeria and Ghana: Hausa, Yoruba, Igbo, Twi, and Nigerian Pidgin. Text was collected from social media (Twitter/X, Facebook), news portals, radio transcripts, and SMS-style customer service conversations — sources that reflect the actual linguistic register of West African digital communication rather than formal written text.

Each text unit (sentence or short paragraph) is annotated with: a primary language identification label, a sentiment class (positive, negative, neutral, mixed), a named-entity recognition (NER) span set covering persons, locations, organisations, and product names, and — for conversational data — an intent label from a 24-class taxonomy aligned with common customer-service and civic-information use cases. All annotations were produced by native-speaker annotators with a three-reviewer consensus protocol; inter-annotator agreement (Fleiss κ) is reported per language and annotation type.
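
For readers unfamiliar with the agreement statistic named above, the following is a minimal sketch of how Fleiss' κ can be computed from per-item category counts. The function name `fleiss_kappa` and the toy rating table (three raters, four sentiment classes) are illustrative assumptions, not the provider's tooling or data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who put item i in category j.
    Every item must carry the same number of ratings.
    """
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_cats = len(counts[0])

    # Mean observed agreement: share of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Expected chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_e == 1.0:
        # Kappa is undefined when every rating falls in one category;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


# Toy table (hypothetical): columns are POSITIVE, NEGATIVE, NEUTRAL, MIXED.
toy_counts = [
    [3, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 2, 1, 0],
    [1, 0, 2, 0],
    [0, 0, 0, 3],
]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 4))
```

With three reviewers per unit, each row of the table sums to 3; a row such as `[0, 2, 1, 0]` records a two-to-one disagreement.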

The dataset is structured as JSONL with one record per annotated text unit, making it compatible with standard NLP fine-tuning pipelines (HuggingFace Datasets, spaCy, OpenAI fine-tune format). Language-stratified train / validation / test splits are provided. A separate romanisation-normalised variant is available for models that struggle with code-switching and diacritic-heavy orthography.
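
As a rough sketch of how the JSONL layout might be consumed, the snippet below reads records line by line and groups them by the `split` and `language` fields. The file name `west_african_text.jsonl` and the helper names are hypothetical, and only the Python standard library is used.

```python
import io
import json
from collections import Counter
from typing import Iterable, Iterator, TextIO


def read_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_split_and_language(records: Iterable[dict]) -> Counter:
    """Count records per (split, language) pair."""
    return Counter((r["split"], r["language"]) for r in records)


# Two in-memory lines stand in for the real file. With an actual file
# (hypothetical name) this would be:
#   with open("west_african_text.jsonl", encoding="utf-8") as fh:
#       records = list(read_jsonl(fh))
demo = io.StringIO(
    '{"text_id": "TXT-HAS-NGA-0084291", "language": "HAUSA", "split": "TRAIN"}\n'
    '{"text_id": "TXT-TWI-GHA-0053012", "language": "TWI", "split": "VAL"}\n'
)
records = list(read_jsonl(demo))
train = [r for r in records if r["split"] == "TRAIN"]
counts = count_by_split_and_language(records)
print(len(train), dict(counts))
```

With HuggingFace Datasets installed, the same file should also load through `load_dataset("json", data_files=...)`, after which the `split` field can be used to filter records.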

Key Use Cases

LLM instruction fine-tuning for Hausa, Yoruba, Igbo, and Twi
Sentiment analysis for brand monitoring in West African markets
Named-entity recognition for African names, places, and organisations
Intent classification for customer-service chatbot NLU
Language identification and code-switching detection
African language translation and cross-lingual transfer learning
Hate speech and harmful content moderation in local languages
Voice-to-text post-processing and transcription normalisation

Supported Languages & Formats

🇳🇬 Hausa
🇳🇬 Yoruba
🇳🇬 Igbo
🇳🇬 Nigerian Pidgin
🇬🇭 Twi (Akan)
📦 JSONL / HuggingFace Datasets
🤖 HuggingFace Transformers
🐍 spaCy / OpenAI fine-tune format

Corpus Highlights

Total Tokens: 5M+ (human-validated annotations)
Languages: 5 (Hausa, Yoruba, Igbo, Twi, Pidgin)
Annotation Types: 4 (lang-id, sentiment, NER, intent)
Intent Classes: 24 (customer-service & civic taxonomy)

Geographic Coverage

[Coverage map; legend: Primary Coverage, Other Regions]

Dataset Schema

Each record represents one annotated text unit (sentence or short paragraph). Fields cover text content, language metadata, annotation layers, and dataset provenance.

| Field Name | Type | Description | Nullable | Example |
|---|---|---|---|---|
| text_id | STRING | Unique text unit identifier | No | TXT-HAS-NGA-0084291 |
| language | ENUM | Primary language: HAUSA, YORUBA, IGBO, TWI, PIDGIN | No | HAUSA |
| country_code | STRING | Country of origin: NG or GH | No | NG |
| source_type | ENUM | Text origin: SOCIAL_MEDIA, NEWS, RADIO_TRANSCRIPT, CUSTOMER_SERVICE, SMS | No | SOCIAL_MEDIA |
| text_raw | STRING | Original text as collected (may include diacritics and code-switching) | No | Farashin kaya ya yi yawa sosai a kasuwa yau. |
| text_normalised | STRING | Romanisation-normalised variant (diacritics removed, code-switch tokens marked) | Yes | Farashin kaya ya yi yawa sosai a kasuwa yau. |
| sentiment | ENUM | Sentiment class: POSITIVE, NEGATIVE, NEUTRAL, MIXED | No | NEGATIVE |
| sentiment_confidence | FLOAT | Annotator consensus confidence for sentiment label (0–1) | No | 0.91 |
| ner_spans | JSON | Array of NER span objects {start, end, label, text} covering PER, LOC, ORG, PROD | Yes | [] |
| intent | STRING | Intent label from 24-class taxonomy (null for non-conversational text) | Yes | null |
| intent_confidence | FLOAT | Annotator consensus confidence for intent label (0–1) | Yes | null |
| token_count | INTEGER | Number of whitespace-delimited tokens in text_raw | No | 9 |
| has_code_switching | BOOLEAN | True if the text mixes two or more languages | No | false |
| iaa_kappa | FLOAT | Fleiss κ inter-annotator agreement for this text unit | Yes | 0.84 |
| split | ENUM | Dataset partition: TRAIN, VAL, TEST | No | TRAIN |
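
The schema above can be turned into a lightweight record check. The sketch below is one possible way to do that: the enum values, nullable fields, and NER labels are copied from the table, while the function name `validate_record` and the error wording are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Allowed values copied from the schema table above.
ENUMS = {
    "language": {"HAUSA", "YORUBA", "IGBO", "TWI", "PIDGIN"},
    "country_code": {"NG", "GH"},
    "source_type": {"SOCIAL_MEDIA", "NEWS", "RADIO_TRANSCRIPT",
                    "CUSTOMER_SERVICE", "SMS"},
    "sentiment": {"POSITIVE", "NEGATIVE", "NEUTRAL", "MIXED"},
    "split": {"TRAIN", "VAL", "TEST"},
}
NER_LABELS = {"PER", "LOC", "ORG", "PROD"}
NULLABLE = {"text_normalised", "ner_spans", "intent",
            "intent_confidence", "iaa_kappa"}
FIELDS = [
    "text_id", "language", "country_code", "source_type", "text_raw",
    "text_normalised", "sentiment", "sentiment_confidence", "ner_spans",
    "intent", "intent_confidence", "token_count", "has_code_switching",
    "iaa_kappa", "split",
]


def validate_record(rec: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    errors: List[str] = []
    # Presence and nullability.
    for field in FIELDS:
        if field not in rec:
            errors.append(f"missing field: {field}")
        elif rec[field] is None and field not in NULLABLE:
            errors.append(f"null in non-nullable field: {field}")
    # Enumerated values.
    for field, allowed in ENUMS.items():
        value = rec.get(field)
        if value is not None and value not in allowed:
            errors.append(f"unexpected {field}: {value!r}")
    # Confidence scores are documented as lying in the range 0 to 1.
    for field in ("sentiment_confidence", "intent_confidence"):
        value = rec.get(field)
        if value is not None and not 0.0 <= value <= 1.0:
            errors.append(f"{field} outside [0, 1]: {value}")
    # NER labels.
    for span in rec.get("ner_spans") or []:
        if span.get("label") not in NER_LABELS:
            errors.append(f"unexpected NER label: {span.get('label')!r}")
    return errors


sample = {
    "text_id": "TXT-HAS-NGA-0084291", "language": "HAUSA",
    "country_code": "NG", "source_type": "SOCIAL_MEDIA",
    "text_raw": "Farashin kaya ya yi yawa sosai a kasuwa yau.",
    "text_normalised": "Farashin kaya ya yi yawa sosai a kasuwa yau.",
    "sentiment": "NEGATIVE", "sentiment_confidence": 0.91,
    "ner_spans": [], "intent": None, "intent_confidence": None,
    "token_count": 9, "has_code_switching": False,
    "iaa_kappa": 0.84, "split": "TRAIN",
}
# A deliberately broken copy to show what a failure looks like.
bad = dict(sample, language="FRENCH", sentiment_confidence=1.4)

print(validate_record(sample))
print(validate_record(bad))
```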

Sample Records

Four representative text records spanning languages, source types, and annotation layers.

multilingual_text_sample.json
[
  {
    "text_id": "TXT-HAS-NGA-0084291",
    "language": "HAUSA",
    "country_code": "NG",
    "source_type": "SOCIAL_MEDIA",
    "text_raw": "Farashin kaya ya yi yawa sosai a kasuwa yau.",
    "text_normalised": "Farashin kaya ya yi yawa sosai a kasuwa yau.",
    "sentiment": "NEGATIVE",
    "sentiment_confidence": 0.91,
    "ner_spans": [],
    "intent": null,
    "intent_confidence": null,
    "token_count": 9,
    "has_code_switching": false,
    "iaa_kappa": 0.84,
    "split": "TRAIN"
  },
  {
    "text_id": "TXT-YOR-NGA-0021844",
    "language": "YORUBA",
    "country_code": "NG",
    "source_type": "CUSTOMER_SERVICE",
    "text_raw": "Mo fẹ́ mọ ìdí tí owó mi kò tí wọlé sí àkọọ́lẹ̀ mi.",
    "text_normalised": "Mo fe mo idi ti owo mi ko ti wole si akoole mi.",
    "sentiment": "NEGATIVE",
    "sentiment_confidence": 0.87,
    "ner_spans": [],
    "intent": "CHECK_ACCOUNT_BALANCE",
    "intent_confidence": 0.93,
    "token_count": 13,
    "has_code_switching": false,
    "iaa_kappa": 0.79,
    "split": "TRAIN"
  },
  {
    "text_id": "TXT-TWI-GHA-0053012",
    "language": "TWI",
    "country_code": "GH",
    "source_type": "NEWS",
    "text_raw": "Accra ne Kumasi ayɛ nkurow a wɔde wɔn ho hyɛ Ghana ase paa.",
    "text_normalised": "Accra ne Kumasi aye nkurow a wode won ho hye Ghana ase paa.",
    "sentiment": "POSITIVE",
    "sentiment_confidence": 0.88,
    "ner_spans": [
      { "start": 0, "end": 5, "label": "LOC", "text": "Accra" },
      { "start": 9, "end": 15, "label": "LOC", "text": "Kumasi" },
      { "start": 45, "end": 50, "label": "LOC", "text": "Ghana" }
    ],
    "intent": null,
    "intent_confidence": null,
    "token_count": 13,
    "has_code_switching": false,
    "iaa_kappa": 0.91,
    "split": "VAL"
  },
  {
    "text_id": "TXT-PID-NGA-0097731",
    "language": "PIDGIN",
    "country_code": "NG",
    "source_type": "SMS",
    "text_raw": "Abeg send me the money quick quick, I go pay you back next week I swear.",
    "text_normalised": "Abeg send me the money quick quick, I go pay you back next week I swear.",
    "sentiment": "NEUTRAL",
    "sentiment_confidence": 0.76,
    "ner_spans": [],
    "intent": "REQUEST_PAYMENT",
    "intent_confidence": 0.89,
    "token_count": 16,
    "has_code_switching": true,
    "iaa_kappa": 0.72,
    "split": "TRAIN"
  }
]

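
When working with records like these, it can help to confirm that each NER span actually slices its `text` out of `text_raw` and that `token_count` matches a whitespace split. The sketch below assumes span offsets are Unicode code-point indices with an exclusive `end`, which is an inference from the "Accra" span (0 to 5) rather than a documented guarantee; the function name and the short demo record are hypothetical.

```python
from typing import Any, Dict, List


def check_offsets_and_tokens(rec: Dict[str, Any]) -> List[str]:
    """Report spans whose offsets miss their text, and token-count mismatches."""
    problems: List[str] = []
    text = rec["text_raw"]
    for span in rec.get("ner_spans") or []:
        # Assumption: code-point offsets, end-exclusive (Python slicing).
        found = text[span["start"]:span["end"]]
        if found != span["text"]:
            problems.append(
                f"span {span['start']}-{span['end']} covers {found!r}, "
                f"expected {span['text']!r}"
            )
    # token_count is defined as whitespace-delimited tokens in text_raw.
    n_tokens = len(text.split())
    if n_tokens != rec["token_count"]:
        problems.append(
            f"token_count is {rec['token_count']}, whitespace split gives {n_tokens}"
        )
    return problems


# Hypothetical three-token record used only to exercise the check.
demo = {
    "text_raw": "Accra ne Kumasi",
    "ner_spans": [
        {"start": 0, "end": 5, "label": "LOC", "text": "Accra"},
        {"start": 9, "end": 15, "label": "LOC", "text": "Kumasi"},
    ],
    "token_count": 3,
}
# Same record with the second span shifted one character to the left.
shifted = dict(
    demo,
    ner_spans=[{"start": 8, "end": 14, "label": "LOC", "text": "Kumasi"}],
)

print(check_offsets_and_tokens(demo))
print(check_offsets_and_tokens(shifted))
```

Running a check of this kind over a whole split is a cheap way to catch offset drift before spans are converted to token-level tags.
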
Request Dataset Access

All datasets are available under a commercial licence agreement. Our team typically responds within 2 business days.

NDA may be required
