West African Multilingual Annotated Text Dataset
5M+ human-validated annotated tokens spanning Hausa, Yoruba, Igbo, Twi, and Nigerian Pidgin — covering sentiment, named-entity, intent, and language-identification labels for LLM fine-tuning and conversational AI targeting West African language speakers.
This is a synthetic dataset generated from high-quality expert-labelled seed data. All records are algorithmically derived — statistical distributions, inter-field correlations, and annotation characteristics faithfully replicate real-world patterns from the source data, while ensuring no real individual, organisation, or transaction can be identified or reconstructed.
The West African Multilingual Annotated Text Dataset is a 5M+ token corpus covering five languages and language varieties spoken across Nigeria and Ghana: Hausa, Yoruba, Igbo, Twi, and Nigerian Pidgin. Text was collected from social media (Twitter/X, Facebook), news portals, radio transcripts, and SMS-style customer service conversations — sources that reflect the actual linguistic register of West African digital communication rather than formal written text.
Each text unit (sentence or short paragraph) is annotated with: a primary language identification label, a sentiment class (positive, negative, neutral, mixed), a named-entity recognition (NER) span set covering persons, locations, organisations, and product names, and — for conversational data — an intent label from a 24-class taxonomy aligned with common customer-service and civic-information use cases. All annotations were produced by native-speaker annotators with a three-reviewer consensus protocol; inter-annotator agreement (Fleiss κ) is reported per language and annotation type.
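For readers unfamiliar with the agreement statistic named above, here is a minimal sketch of the standard Fleiss κ formula applied to a three-reviewer setup. The function name `fleiss_kappa` and the toy vote table are illustrative assumptions; how the dataset's own κ values were computed is not described in this document.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of reviewers who assigned item i to
    category j. Every row must sum to the same number of reviewers.
    """
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of reviewers")
    n_items = len(counts)
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing reviewer pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)

    if math.isclose(p_e, 1.0):
        return math.nan  # undefined when every vote falls in one category
    return (p_bar - p_e) / (1 - p_e)


# Toy example (made-up votes): 4 text units, 3 reviewers each.
# Columns: POSITIVE, NEGATIVE, NEUTRAL, MIXED
toy_votes = [
    [3, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 3, 0],
]
print(round(fleiss_kappa(toy_votes), 3))
```

Because κ corrects for chance agreement across the whole set of items, it is normally computed over a group of text units (for example, one language and one annotation type) rather than over a single unit.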
The dataset is distributed as JSONL with one record per annotated text unit, so it loads directly into standard NLP fine-tuning pipelines (Hugging Face Datasets, spaCy) and can be mapped to the OpenAI fine-tuning format. Language-stratified train / validation / test splits are provided. A romanisation-normalised variant of each text (see text_normalised in the schema) is available for models that struggle with code-switching and diacritic-heavy orthography.
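As a rough illustration of the one-record-per-line layout, the sketch below parses JSONL and groups records by split, optionally filtering by language. Field names follow the schema in this document; the function name `load_records`, the inline sample lines, and the placeholder second record are hypothetical.

```python
import io
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def load_records(
    lines: Iterable[str], language: Optional[str] = None
) -> Dict[str, List[dict]]:
    """Parse JSONL lines and group the records by their `split` field."""
    by_split: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if language is not None and record["language"] != language:
            continue  # keep only the requested language
        by_split[record["split"]].append(record)
    return dict(by_split)


# Hypothetical two-line sample standing in for a real file.
# The second record is a placeholder, not actual dataset content.
sample = io.StringIO(
    '{"text_id": "TXT-HAS-NGA-0084291", "language": "HAUSA", '
    '"split": "TRAIN", "text_raw": "Farashin kaya ya yi yawa sosai a kasuwa yau."}\n'
    '{"text_id": "PLACEHOLDER-0002", "language": "TWI", '
    '"split": "TEST", "text_raw": "placeholder text"}\n'
)

hausa = load_records(sample, language="HAUSA")
print({split: len(rows) for split, rows in hausa.items()})
```

With a real file, an open handle can be passed in place of the sample, for example `open(path, encoding="utf-8")`; reading as UTF-8 matters here because the raw text may carry diacritics.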
Dataset Schema
Each record represents one annotated text unit (sentence or short paragraph). Fields cover text content, language metadata, annotation layers, and dataset provenance.
| Field Name | Type | Description | Nullable | Example |
|---|---|---|---|---|
| text_id | STRING | Unique text unit identifier | No | TXT-HAS-NGA-0084291 |
| language | ENUM | Primary language: HAUSA, YORUBA, IGBO, TWI, PIDGIN | No | HAUSA |
| country_code | STRING | Country of origin: NG or GH | No | NG |
| source_type | ENUM | Text origin: SOCIAL_MEDIA, NEWS, RADIO_TRANSCRIPT, CUSTOMER_SERVICE, SMS | No | SOCIAL_MEDIA |
| text_raw | STRING | Original text as collected (may include diacritics and code-switching) | No | Farashin kaya ya yi yawa sosai a kasuwa yau. |
| text_normalised | STRING | Romanisation-normalised variant (diacritics removed, code-switch tokens marked) | Yes | Farashin kaya ya yi yawa sosai a kasuwa yau. |
| sentiment | ENUM | Sentiment class: POSITIVE, NEGATIVE, NEUTRAL, MIXED | No | NEGATIVE |
| sentiment_confidence | FLOAT | Annotator consensus confidence for sentiment label (0–1) | No | 0.91 |
| ner_spans | JSON | Array of NER span objects {start, end, label, text} covering PER, LOC, ORG, PROD | Yes | [] |
| intent | STRING | Intent label from 24-class taxonomy (null for non-conversational text) | Yes | null |
| intent_confidence | FLOAT | Annotator consensus confidence for intent label (0–1) | Yes | null |
| token_count | INTEGER | Number of whitespace-delimited tokens in text_raw | No | 9 |
| has_code_switching | BOOLEAN | True if the text mixes two or more languages | No | false |
| iaa_kappa | FLOAT | Fleiss κ inter-annotator agreement for this text unit | Yes | 0.84 |
| split | ENUM | Dataset partition: TRAIN, VAL, TEST | No | TRAIN |
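The constraints in the table above can be turned into a lightweight sanity check. The following is a sketch only: `check_record` is a hypothetical helper, and the NER offset convention (character offsets into text_raw, end exclusive) is an assumption not stated in the schema. The example record is assembled from the Example column of the table.

```python
from typing import List

LANGUAGES = {"HAUSA", "YORUBA", "IGBO", "TWI", "PIDGIN"}
SENTIMENTS = {"POSITIVE", "NEGATIVE", "NEUTRAL", "MIXED"}
SPLITS = {"TRAIN", "VAL", "TEST"}
NER_LABELS = {"PER", "LOC", "ORG", "PROD"}


def check_record(record: dict) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if record.get("language") not in LANGUAGES:
        problems.append("language: unknown value")
    if record.get("sentiment") not in SENTIMENTS:
        problems.append("sentiment: unknown value")
    if record.get("split") not in SPLITS:
        problems.append("split: unknown value")

    conf = record.get("sentiment_confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        problems.append("sentiment_confidence: expected a number in [0, 1]")

    # token_count is defined as whitespace-delimited tokens in text_raw.
    text = record.get("text_raw") or ""
    if record.get("token_count") != len(text.split()):
        problems.append("token_count: does not match text_raw")

    # ner_spans is nullable, so treat null like an empty list.
    for span in record.get("ner_spans") or []:
        if span.get("label") not in NER_LABELS:
            problems.append("ner_spans: unknown label")
        # ASSUMPTION: start/end are character offsets, end exclusive.
        if text[span.get("start"):span.get("end")] != span.get("text"):
            problems.append("ner_spans: offsets do not match span text")
    return problems


# Record built from the Example column of the schema table.
example = {
    "text_id": "TXT-HAS-NGA-0084291",
    "language": "HAUSA",
    "country_code": "NG",
    "source_type": "SOCIAL_MEDIA",
    "text_raw": "Farashin kaya ya yi yawa sosai a kasuwa yau.",
    "text_normalised": "Farashin kaya ya yi yawa sosai a kasuwa yau.",
    "sentiment": "NEGATIVE",
    "sentiment_confidence": 0.91,
    "ner_spans": [],
    "intent": None,
    "intent_confidence": None,
    "token_count": 9,
    "has_code_switching": False,
    "iaa_kappa": 0.84,
    "split": "TRAIN",
}
print(check_record(example))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to summarise data-quality issues across a whole split before fine-tuning.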
Sample Records
Four representative text records spanning languages, source types, and annotation layers.