DS-09 Language & NLP

West African Multilingual Annotated Text Dataset

5M+ human-validated annotated tokens spanning Hausa, Yoruba, Igbo, Twi, and Nigerian Pidgin — covering sentiment, named-entity, intent, and language-identification labels for LLM fine-tuning and conversational AI targeting West African language speakers.

This is a synthetic dataset generated from high-quality expert-labelled seed data. All records are algorithmically derived — statistical distributions, inter-field correlations, and annotation characteristics faithfully replicate real-world patterns from the source data, while ensuring no real individual, organisation, or transaction can be identified or reconstructed.

The West African Multilingual Annotated Text Dataset is a 5M+ token corpus covering five languages and language varieties spoken across Nigeria and Ghana: Hausa, Yoruba, Igbo, Twi, and Nigerian Pidgin. Text was collected from social media (Twitter/X, Facebook), news portals, radio transcripts, and SMS-style customer service conversations — sources that reflect the actual linguistic register of West African digital communication rather than formal written text.

Each text unit (sentence or short paragraph) is annotated with: a primary language identification label, a sentiment class (positive, negative, neutral, mixed), a named-entity recognition (NER) span set covering persons, locations, organisations, and product names, and — for conversational data — an intent label from a 24-class taxonomy aligned with common customer-service and civic-information use cases. All annotations were produced by native-speaker annotators with a three-reviewer consensus protocol; inter-annotator agreement (Fleiss κ) is reported per language and annotation type.

The dataset is structured as JSONL with one record per annotated text unit, making it compatible with standard NLP fine-tuning pipelines (HuggingFace Datasets, spaCy, OpenAI fine-tune format). Language-stratified train / validation / test splits are provided. A separate romanisation-normalised variant is available for models that struggle with code-switching and diacritic-heavy orthography.

Key Use Cases

LLM instruction fine-tuning for Hausa, Yoruba, Igbo, and Twi

Sentiment analysis for brand monitoring in West African markets

Named-entity recognition for African names, places, and organisations

Intent classification for customer-service chatbot NLU

Language identification and code-switching detection

African language translation and cross-lingual transfer learning

Hate speech and harmful content moderation in local languages

Voice-to-text post-processing and transcription normalisation

Supported Languages & Formats

🇳🇬 Hausa

🇳🇬 Yoruba

🇳🇬 Igbo

🇳🇬 Nigerian Pidgin

🇬🇭 Twi (Akan)

📦 JSONL / HuggingFace Datasets

🤖 HuggingFace Transformers

🐍 spaCy / OpenAI fine-tune format

Corpus Highlights

Total Tokens

5M+

human-validated annotations

Languages

Hausa, Yoruba, Igbo, Twi, Pidgin

Annotation Types

lang-id, sentiment, NER, intent

Intent Classes

customer-service & civic taxonomy

Geographic Coverage

Primary Coverage

Other Regions

Dataset Schema

Each record represents one annotated text unit (sentence or short paragraph). Fields cover text content, language metadata, annotation layers, and dataset provenance.

Field Name	Type	Description	Nullable	Example
text_id	STRING	Unique text unit identifier	No	TXT-HAS-NGA-0084291
language	ENUM	Primary language: HAUSA, YORUBA, IGBO, TWI, PIDGIN	No	HAUSA
country_code	STRING	Country of origin: NG or GH	No	NG
source_type	ENUM	Text origin: SOCIAL_MEDIA, NEWS, RADIO_TRANSCRIPT, CUSTOMER_SERVICE, SMS	No	SOCIAL_MEDIA
text_raw	STRING	Original text as collected (may include diacritics and code-switching)	No	Farashin kaya ya yi yawa sosai a kasuwa yau.
text_normalised	STRING	Romanisation-normalised variant (diacritics removed, code-switch tokens marked)	Yes	Farashin kaya ya yi yawa sosai a kasuwa yau.
sentiment	ENUM	Sentiment class: POSITIVE, NEGATIVE, NEUTRAL, MIXED	No	NEGATIVE
sentiment_confidence	FLOAT	Annotator consensus confidence for sentiment label (0–1)	No	0.91
ner_spans	JSON	Array of NER span objects {start, end, label, text} covering PER, LOC, ORG, PROD	Yes	[]
intent	STRING	Intent label from 24-class taxonomy (null for non-conversational text)	Yes	null
intent_confidence	FLOAT	Annotator consensus confidence for intent label (0–1)	Yes	null
token_count	INTEGER	Number of whitespace-delimited tokens in text_raw	No	9
has_code_switching	BOOLEAN	True if the text mixes two or more languages	No	false
iaa_kappa	FLOAT	Fleiss κ inter-annotator agreement for this text unit	Yes	0.84
split	ENUM	Dataset partition: TRAIN, VAL, TEST	No	TRAIN

Sample Records

Four representative text records spanning languages, source types, and annotation layers.

multilingual_text_sample.json

[ { "text_id": "TXT-HAS-NGA-0084291", "language": "HAUSA", "country_code": "NG", "source_type": "SOCIAL_MEDIA", "text_raw": "Farashin kaya ya yi yawa sosai a kasuwa yau.", "text_normalised": "Farashin kaya ya yi yawa sosai a kasuwa yau.", "sentiment": "NEGATIVE", "sentiment_confidence": 0.91, "ner_spans": [], "intent": null, "intent_confidence": null, "token_count": 9, "has_code_switching": false, "iaa_kappa": 0.84, "split": "TRAIN" }, { "text_id": "TXT-YOR-NGA-0021844", "language": "YORUBA", "country_code": "NG", "source_type": "CUSTOMER_SERVICE", "text_raw": "Mo fẹ́ mọ ìdí tí owó mi kò tí wọlé sí àkọọ́lẹ̀ mi.", "text_normalised": "Mo fe mo idi ti owo mi ko ti wole si akoole mi.", "sentiment": "NEGATIVE", "sentiment_confidence": 0.87, "ner_spans": [], "intent": "CHECK_ACCOUNT_BALANCE", "intent_confidence": 0.93, "token_count": 13, "has_code_switching": false, "iaa_kappa": 0.79, "split": "TRAIN" }, { "text_id": "TXT-TWI-GHA-0053012", "language": "TWI", "country_code": "GH", "source_type": "NEWS", "text_raw": "Accra ne Kumasi ayɛ nkurow a wɔde wɔn ho hyɛ Ghana ase paa.", "text_normalised": "Accra ne Kumasi aye nkurow a wode won ho hye Ghana ase paa.", "sentiment": "POSITIVE", "sentiment_confidence": 0.88, "ner_spans": [ { "start": 0, "end": 5, "label": "LOC", "text": "Accra" }, { "start": 9, "end": 15, "label": "LOC", "text": "Kumasi" }, { "start": 43, "end": 48, "label": "LOC", "text": "Ghana" } ], "intent": null, "intent_confidence": null, "token_count": 11, "has_code_switching": false, "iaa_kappa": 0.91, "split": "VAL" }, { "text_id": "TXT-PID-NGA-0097731", "language": "PIDGIN", "country_code": "NG", "source_type": "SMS", "text_raw": "Abeg send me the money quick quick, I go pay you back next week I swear.", "text_normalised": "Abeg send me the money quick quick, I go pay you back next week I swear.", "sentiment": "NEUTRAL", "sentiment_confidence": 0.76, "ner_spans": [], "intent": "REQUEST_PAYMENT", "intent_confidence": 0.89, "token_count": 15, "has_code_switching": true, "iaa_kappa": 0.72, "split": "TRAIN" } ]

Request Dataset Access

All datasets are available under a commercial licence agreement. Our team typically responds within 2 business days.

Request Access

NDA may be required

Related Datasets

Build with Data that reflects Africa

Request access to our full catalog of licensed human-validated African dataset or request a custom data tailored to your project.

Request Dataset Access Contact Sales