African Legal & Regulatory Document NLP Dataset
500K+ annotated legal and regulatory documents spanning legislation, court judgments, contracts, and policy circulars across Nigeria, Kenya, South Africa, Senegal, and Mozambique — with NER, clause-type classification, and citation-graph annotations for powering LegalTech and regulatory compliance AI.
This is a synthetic dataset generated from high-quality expert-labelled seed data. All records are algorithmically derived — statistical distributions, inter-field correlations, and annotation characteristics faithfully replicate real-world patterns from the source data, while ensuring no real individual, organisation, or transaction can be identified or reconstructed.
The African Legal & Regulatory Document NLP Dataset contains 500K+ digitised and annotated legal texts sourced from five African jurisdictions: Nigeria, Kenya, South Africa, Senegal, and Mozambique. Document types span primary legislation (acts, codes, decrees), subsidiary legislation (regulations, statutory instruments), court judgments (supreme, appellate, and high courts), commercial contracts (anonymised), and policy circulars from central banks and securities regulators. Both common-law and civil-law traditions are represented, as are source documents in English, French, and Portuguese.
Annotation layers include: named entity recognition (NER) covering legal persons, organisations, court names, legislation citations, dates, and monetary amounts; clause-type classification using an 18-class taxonomy (definitions, obligations, prohibitions, penalty clauses, jurisdiction, force majeure, etc.); document-level topic labels from a 32-class regulatory taxonomy; and a citation graph linking each document to the statutes, precedents, and regulations it references. All annotations were produced by qualified legal professionals supervised by practising advocates.
The dataset is optimised for transformer-based NLP pipelines. Each document is chunked into 512-token segments with overlap, preserving clause boundaries where possible. Metadata fields enable filtering by jurisdiction, document type, legal tradition, language, and date range. A companion knowledge-graph export (Turtle / JSON-LD) exposes the citation network for graph-neural-network and retrieval-augmented-generation applications.
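The overlapping 512-token chunking described above can be sketched as a sliding window. This is a minimal illustration and not the pipeline used to build the dataset: the function name `chunk_tokens` is hypothetical, the overlap of 64 tokens is an assumption (the description does not state the overlap size), and the sketch neither uses a subword tokenizer nor adjusts window edges to clause boundaries.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens. The overlap value is an
    assumption; the dataset description does not specify it.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens standing in for a tokenised document.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, size=512, overlap=64)
print(len(chunks), [len(c) for c in chunks])
```

Because of the overlap, the tail of each window repeats at the head of the next one, so any downstream step that stitches chunks back together needs to drop the repeated tokens.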
Dataset Schema
Each record represents one 512-token document chunk. Fields cover document provenance, annotation layers, and chunk position metadata.
| Field Name | Type | Description | Nullable | Example |
|---|---|---|---|---|
| chunk_id | STRING | Unique chunk identifier | No | CHK-NGA-LEG-0082341-004 |
| document_id | STRING | Parent document identifier (multiple chunks share this) | No | DOC-NGA-LEG-0082341 |
| country_code | STRING | ISO 3166-1 alpha-2 jurisdiction code | No | NG |
| language | ENUM | Document language: ENGLISH, FRENCH, PORTUGUESE | No | ENGLISH |
| legal_tradition | ENUM | Legal system: COMMON_LAW, CIVIL_LAW, MIXED | No | COMMON_LAW |
| document_type | ENUM | Document category: LEGISLATION, REGULATION, JUDGMENT, CONTRACT, POLICY_CIRCULAR | No | LEGISLATION |
| document_date | DATE | Date of enactment, judgment, or publication (YYYY-MM-DD) | Yes | 2019-06-12 |
| chunk_index | INTEGER | Zero-based position of this chunk within the parent document | No | 3 |
| text | STRING | 512-token text segment (clause-boundary-aware) | No | 42. Any person who contravenes section 38 shall be liable... |
| clause_type | STRING | Primary clause type from the 18-class taxonomy (e.g. PENALTY, OBLIGATION, DEFINITION) | Yes | PENALTY |
| topic_label | STRING | Document-level regulatory topic from the 32-class taxonomy | No | BANKING_REGULATION |
| ner_spans | JSON | Array of NER span objects {start, end, label, text} — legal persons, orgs, statutes, dates, amounts | Yes | [...] |
| cited_documents | JSON | Array of document IDs cited within this chunk | Yes | ["DOC-NGA-LEG-0041200"] |
| split | ENUM | Dataset partition: TRAIN, VAL, TEST | No | TRAIN |
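As a rough sketch of how records with this schema might be consumed, the snippet below orders chunks within their parent document, derives citation edges from `cited_documents`, and checks NER spans against the chunk text. Several things here are assumptions rather than documented behaviour: the records are made-up illustrations, the JSON fields may arrive either as strings or as already-parsed lists, `start` and `end` are treated as character offsets into `text`, and the `STATUTE` label and all helper names are hypothetical.

```python
import json
from collections import defaultdict

text_a = "Any person who contravenes section 38 shall be liable"
start = text_a.index("section 38")

# Hypothetical records following the schema above; values are illustrative only.
records = [
    {"chunk_id": "CHK-A-001", "document_id": "DOC-A", "chunk_index": 1,
     "text": "to a fine.", "ner_spans": None, "cited_documents": None},
    {"chunk_id": "CHK-A-000", "document_id": "DOC-A", "chunk_index": 0,
     "text": text_a,
     "ner_spans": json.dumps([{"start": start, "end": start + 10,
                               "label": "STATUTE", "text": "section 38"}]),
     "cited_documents": '["DOC-B"]'},
]


def parse_json_field(value):
    """Return a list from a nullable JSON field (string, list, or None)."""
    if value is None:
        return []
    return json.loads(value) if isinstance(value, str) else value


def group_chunks(recs):
    """Group chunks by document_id and order them by chunk_index."""
    docs = defaultdict(list)
    for rec in recs:
        docs[rec["document_id"]].append(rec)
    return {doc_id: sorted(chunks, key=lambda r: r["chunk_index"])
            for doc_id, chunks in docs.items()}


def citation_edges(recs):
    """Return the set of (citing document, cited document) pairs."""
    return {(rec["document_id"], cited)
            for rec in recs
            for cited in parse_json_field(rec["cited_documents"])}


def misaligned_spans(rec):
    """List spans whose text differs from text[start:end] (assumes character offsets)."""
    return [s for s in parse_json_field(rec["ner_spans"])
            if rec["text"][s["start"]:s["end"]] != s["text"]]


docs = group_chunks(records)
edges = citation_edges(records)
print([c["chunk_id"] for c in docs["DOC-A"]], edges)
```

The resulting edge set can be loaded into a graph library or joined against the companion knowledge-graph export. If the offsets turn out to be token-based rather than character-based, `misaligned_spans` will flag every span, which makes it a quick way to test that assumption.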
Sample Records
Four representative document chunks spanning jurisdictions, document types, and annotation layers.