DS-12 Language & NLP

African Legal & Regulatory Document NLP Dataset

500K+ annotated legal and regulatory documents spanning legislation, court judgments, contracts, and policy circulars across Nigeria, Kenya, South Africa, Senegal, and Mozambique — with NER, clause-type classification, and citation-graph annotations for powering LegalTech and regulatory compliance AI.

This is a synthetic dataset generated from high-quality expert-labelled seed data. All records are algorithmically derived — statistical distributions, inter-field correlations, and annotation characteristics faithfully replicate real-world patterns from the source data, while ensuring no real individual, organisation, or transaction can be identified or reconstructed.

The African Legal & Regulatory Document NLP Dataset contains 500K+ digitised and annotated legal texts sourced from five African jurisdictions — Nigeria, Kenya, South Africa, Senegal, and Mozambique. Document types span primary legislation (acts, codes, decrees), subsidiary legislation (regulations, statutory instruments), court judgments (supreme, appellate, and high courts), commercial contracts (anonymised), and central bank / securities regulator policy circulars. The English-law and civil-law traditions are both represented, as are English, French, and Portuguese-language source documents.

Annotation layers include: named entity recognition (NER) covering legal persons, organisations, court names, legislation citations, dates, and monetary amounts; clause-type classification using an 18-class taxonomy (definitions, obligations, prohibitions, penalty clauses, jurisdiction, force majeure, etc.); document-level topic labels from a 32-class regulatory taxonomy; and a citation graph linking each document to the statutes, precedents, and regulations it references. All annotations were produced by qualified legal professionals supervised by practising advocates.
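
Each NER span is stored as an object with start, end, label, and text fields (see Dataset Schema). This page does not state the offset convention, so the sketch below assumes 0-based, end-exclusive character offsets and simply reports any span whose offsets do not reproduce its stored text. The helper name find_span_mismatches and the example record are hypothetical, written for illustration only.

```python
from typing import Any


def find_span_mismatches(record: dict[str, Any]) -> list[dict[str, Any]]:
    """Return spans whose offsets do not reproduce their stored surface text.

    Assumption (not confirmed by the source): offsets are 0-based,
    end-exclusive character positions into the chunk's `text` field.
    """
    text = record["text"]
    spans = record.get("ner_spans") or []  # ner_spans is nullable in the schema
    return [s for s in spans if text[s["start"]:s["end"]] != s["text"]]


# Illustrative record written for this example, not taken from the dataset.
# The DATE label is an assumption; the third span is deliberately wrong.
example = {
    "text": "The Central Bank of Kenya issued the circular on 2021-04-01.",
    "ner_spans": [
        {"start": 4, "end": 25, "label": "ORG", "text": "Central Bank of Kenya"},
        {"start": 49, "end": 59, "label": "DATE", "text": "2021-04-01"},
        {"start": 0, "end": 3, "label": "ORG", "text": "Bank"},
    ],
}

mismatches = find_span_mismatches(example)
print([s["text"] for s in mismatches])  # ['Bank']
```

If a check like this flags many spans on real records, the offsets probably follow a different convention (for example token positions or 1-based indexing) and the slice should be adjusted accordingly.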

The dataset is optimised for transformer-based NLP pipelines. Each document is chunked into 512-token segments with overlap, preserving clause boundaries where possible. Metadata fields enable filtering by jurisdiction, document type, legal tradition, language, and date range. A companion knowledge-graph export (Turtle / JSON-LD) exposes the citation network for graph-neural-network and retrieval-augmented-generation applications.
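
As a rough illustration of 512-token chunking with overlap, the sketch below applies a plain sliding window to a list of tokens. The overlap of 64 tokens, the function name chunk_tokens, and the placeholder tokens are assumptions: the page does not state the overlap size or the tokenizer, and this sketch does not attempt the clause-boundary preservation described above.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of at most `size` tokens.

    Each window after the first starts `size - overlap` tokens after the
    previous one, so consecutive windows share `overlap` tokens.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens standing in for a tokenised document.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 304]
```

A clause-aware variant would move each window boundary to the nearest clause start instead of cutting at a fixed stride.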

Key Use Cases

Legal document search and retrieval-augmented generation (RAG)
Clause extraction and contract review automation
Regulatory change monitoring and compliance gap analysis
Court judgment summarisation and precedent retrieval
Named entity recognition for legal persons, statutes, and courts
Citation network analysis and legal research assistants
Multi-jurisdiction regulatory taxonomy alignment
LegalTech chatbot fine-tuning for African law

Jurisdictions & Languages

🇳🇬 Nigeria (English common law)
🇰🇪 Kenya (English common law)
🇿🇦 South Africa (mixed common / civil law)
🇸🇳 Senegal (French civil law)
🇲🇿 Mozambique (Portuguese civil law)
📦 JSONL + Turtle / JSON-LD knowledge graph

Dataset Highlights

Documents: 500K+ annotated legal texts
Clause Classes: 18 (obligations, penalties, jurisdiction…)
Topic Labels: 32 (regulatory taxonomy)
Jurisdictions: 5 (English, French & Portuguese law)

Dataset Schema

Each record represents one 512-token document chunk. Fields cover document provenance, annotation layers, and chunk position metadata.

Field Name | Type | Description | Nullable | Example
--- | --- | --- | --- | ---
chunk_id | STRING | Unique chunk identifier | No | CHK-NGA-LEG-0082341-004
document_id | STRING | Parent document identifier (multiple chunks share this) | No | DOC-NGA-LEG-0082341
country_code | STRING | ISO 3166-1 alpha-2 jurisdiction code | No | NG
language | ENUM | Document language: ENGLISH, FRENCH, PORTUGUESE | No | ENGLISH
legal_tradition | ENUM | Legal system: COMMON_LAW, CIVIL_LAW, MIXED | No | COMMON_LAW
document_type | ENUM | Document category: LEGISLATION, REGULATION, JUDGMENT, CONTRACT, POLICY_CIRCULAR | No | LEGISLATION
document_date | DATE | Date of enactment, judgment, or publication (YYYY-MM-DD) | Yes | 2019-06-12
chunk_index | INTEGER | Zero-based position of this chunk within the parent document | No | 3
text | STRING | 512-token text segment (clause-boundary-aware) | No | 42. Any person who contravenes section 38 shall be liable...
clause_type | STRING | Primary clause type from 18-class taxonomy (e.g. PENALTY, OBLIGATION, DEFINITION) | Yes | PENALTY
topic_label | STRING | Document-level regulatory topic from 32-class taxonomy | No | BANKING_REGULATION
ner_spans | JSON | Array of NER span objects {start, end, label, text} covering legal persons, orgs, statutes, dates, amounts | Yes | [...]
cited_documents | JSON | Array of document IDs cited within this chunk | Yes | ["DOC-NGA-LEG-0041200"]
split | ENUM | Dataset partition: TRAIN, VAL, TEST | No | TRAIN
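
The metadata fields above lend themselves to simple filtering and to reassembling a document from its chunks. The following is a minimal sketch assuming the JSONL export holds one chunk record per line with the field names listed in the schema. The helper names (parse_jsonl, filter_records, group_chunks) and the shortened identifiers in the sample lines are hypothetical.

```python
import json
from collections import defaultdict
from typing import Any, Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_records(records: Iterable[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Keep records whose fields equal every criterion, e.g. country_code="NG"."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


def group_chunks(records: Iterable[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Map each document_id to its chunks, ordered by chunk_index."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for r in records:
        grouped[r["document_id"]].append(r)
    return {doc: sorted(chunks, key=lambda r: r["chunk_index"]) for doc, chunks in grouped.items()}


# Illustrative lines with made-up identifiers; a real file would be read
# with open(path, encoding="utf-8") and passed to parse_jsonl directly.
sample_lines = [
    '{"chunk_id": "CHK-A-002", "document_id": "DOC-A", "country_code": "NG", "chunk_index": 1, "split": "TRAIN"}',
    '{"chunk_id": "CHK-A-001", "document_id": "DOC-A", "country_code": "NG", "chunk_index": 0, "split": "TRAIN"}',
    "",
    '{"chunk_id": "CHK-B-001", "document_id": "DOC-B", "country_code": "KE", "chunk_index": 0, "split": "TEST"}',
]

records = list(parse_jsonl(sample_lines))
nigeria_train = filter_records(records, country_code="NG", split="TRAIN")
by_document = group_chunks(nigeria_train)
print({doc: [c["chunk_index"] for c in chunks] for doc, chunks in by_document.items()})
# {'DOC-A': [0, 1]}
```

Because chunks overlap, joining the text fields of consecutive chunks would repeat the shared tokens; deduplicating that overlap is left out of this sketch.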

Sample Records

Four representative document chunks spanning jurisdictions, document types, and annotation layers.

legal_doc_sample.json
[
  {
    "chunk_id": "CHK-NGA-LEG-0082341-004",
    "document_id": "DOC-NGA-LEG-0082341",
    "country_code": "NG",
    "language": "ENGLISH",
    "legal_tradition": "COMMON_LAW",
    "document_type": "LEGISLATION",
    "document_date": "2019-06-12",
    "chunk_index": 3,
    "text": "42. Any person who contravenes section 38 of this Act shall be liable on conviction to a fine not exceeding five million naira or imprisonment for a term not exceeding three years, or both.",
    "clause_type": "PENALTY",
    "topic_label": "BANKING_REGULATION",
    "ner_spans": [
      { "start": 55, "end": 63, "label": "LEGISLATION_REF", "text": "section 38" },
      { "start": 111, "end": 131, "label": "MONETARY_AMOUNT", "text": "five million naira" }
    ],
    "cited_documents": [ "DOC-NGA-LEG-0082341" ],
    "split": "TRAIN"
  },
  {
    "chunk_id": "CHK-KEN-JDG-0034871-001",
    "document_id": "DOC-KEN-JDG-0034871",
    "country_code": "KE",
    "language": "ENGLISH",
    "legal_tradition": "COMMON_LAW",
    "document_type": "JUDGMENT",
    "document_date": "2023-03-15",
    "chunk_index": 0,
    "text": "IN THE COURT OF APPEAL OF KENYA AT NAIROBI. Civil Appeal No. 187 of 2022. \nBetween Safaricom PLC (Appellant) and Communications Authority of Kenya (Respondent).",
    "clause_type": "JURISDICTION",
    "topic_label": "TELECOMMUNICATIONS_REGULATION",
    "ner_spans": [
      { "start": 36, "end": 43, "label": "LOC", "text": "NAIROBI" },
      { "start": 84, "end": 98, "label": "ORG", "text": "Safaricom PLC" },
      { "start": 112, "end": 142, "label": "ORG", "text": "Communications Authority of Kenya" }
    ],
    "cited_documents": [],
    "split": "TEST"
  },
  {
    "chunk_id": "CHK-SEN-REG-0019204-002",
    "document_id": "DOC-SEN-REG-0019204",
    "country_code": "SN",
    "language": "FRENCH",
    "legal_tradition": "CIVIL_LAW",
    "document_type": "REGULATION",
    "document_date": "2021-09-30",
    "chunk_index": 1,
    "text": "Article 7 — Les établissements de crédit sont tenus de constituer et de maintenir en permanence un ratio de solvabilité minimal de huit pour cent (8%) conformément aux normes BCEAO.",
    "clause_type": "OBLIGATION",
    "topic_label": "BANKING_REGULATION",
    "ner_spans": [
      { "start": 155, "end": 160, "label": "ORG", "text": "BCEAO" }
    ],
    "cited_documents": [],
    "split": "TRAIN"
  },
  {
    "chunk_id": "CHK-ZAF-CTR-0061038-007",
    "document_id": "DOC-ZAF-CTR-0061038",
    "country_code": "ZA",
    "language": "ENGLISH",
    "legal_tradition": "MIXED",
    "document_type": "CONTRACT",
    "document_date": "2022-07-01",
    "chunk_index": 6,
    "text": "14.3 Neither party shall be liable for any failure or delay in performing its obligations under this Agreement to the extent that such failure or delay is caused by a Force Majeure Event.",
    "clause_type": "FORCE_MAJEURE",
    "topic_label": "COMMERCIAL_CONTRACT",
    "ner_spans": [],
    "cited_documents": [],
    "split": "TRAIN"
  }
]
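
The cited_documents field can be aggregated across chunks into a document-level citation graph. The sketch below is one possible way to do that, assuming each record carries document_id and a nullable cited_documents array as in the schema. The function names and the DOC-A, DOC-B, DOC-C identifiers are hypothetical, and self-references (a document citing its own ID, as the first sample record appears to do) are ignored when counting incoming citations.

```python
from collections import defaultdict
from typing import Any, Iterable


def build_citation_graph(records: Iterable[dict[str, Any]]) -> dict[str, set[str]]:
    """Return {citing document_id: set of cited document_ids}, merged over chunks."""
    graph: dict[str, set[str]] = defaultdict(set)
    for r in records:
        targets = graph[r["document_id"]]  # documents with no citations still get a node
        targets.update(r.get("cited_documents") or [])  # field is nullable
    return dict(graph)


def in_degree(graph: dict[str, set[str]]) -> dict[str, int]:
    """Count how many distinct other documents cite each document."""
    counts: dict[str, int] = defaultdict(int)
    for source, targets in graph.items():
        for target in targets:
            if target != source:  # skip self-references
                counts[target] += 1
    return dict(counts)


# Illustrative chunk records with made-up identifiers.
chunk_records = [
    {"document_id": "DOC-A", "cited_documents": ["DOC-B", "DOC-A"]},
    {"document_id": "DOC-A", "cited_documents": ["DOC-C"]},
    {"document_id": "DOC-B", "cited_documents": ["DOC-C"]},
    {"document_id": "DOC-C", "cited_documents": None},
]

graph = build_citation_graph(chunk_records)
print(sorted(in_degree(graph).items()))  # [('DOC-B', 1), ('DOC-C', 2)]
```

The same adjacency mapping could be fed into a graph library or used to rank precedents by how often they are cited.
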
Request Dataset Access

All datasets are available under a commercial licence agreement. Our team typically responds within 2 business days.

NDA may be required
