DS-12 Language & NLP

African Legal & Regulatory Document NLP Dataset

500K+ annotated legal and regulatory documents spanning legislation, court judgments, contracts, and policy circulars across Nigeria, Kenya, South Africa, Senegal, and Mozambique — with NER, clause-type classification, and citation-graph annotations for powering LegalTech and regulatory compliance AI.

This is a synthetic dataset generated from high-quality expert-labelled seed data. All records are algorithmically derived — statistical distributions, inter-field correlations, and annotation characteristics faithfully replicate real-world patterns from the source data, while ensuring no real individual, organisation, or transaction can be identified or reconstructed.

The African Legal & Regulatory Document NLP Dataset contains 500K+ digitised and annotated legal texts sourced from five African jurisdictions — Nigeria, Kenya, South Africa, Senegal, and Mozambique. Document types span primary legislation (acts, codes, decrees), subsidiary legislation (regulations, statutory instruments), court judgments (supreme, appellate, and high courts), commercial contracts (anonymised), and central bank / securities regulator policy circulars. The English-law and civil-law traditions are both represented, as are English, French, and Portuguese-language source documents.

Annotation layers include: named entity recognition (NER) covering legal persons, organisations, court names, legislation citations, dates, and monetary amounts; clause-type classification using an 18-class taxonomy (definitions, obligations, prohibitions, penalty clauses, jurisdiction, force majeure, etc.); document-level topic labels from a 32-class regulatory taxonomy; and a citation graph linking each document to the statutes, precedents, and regulations it references. All annotations were produced by qualified legal professionals supervised by practising advocates.
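As an illustration of how the NER layer might feed a token-classification model, the sketch below converts span annotations into BIO tags. It assumes that `start` and `end` are character offsets into the chunk text with an exclusive `end`, which the description does not state, and it uses plain whitespace tokenisation rather than a model tokenizer. The function name `spans_to_bio` and the example sentence are hypothetical.

```python
import re

def spans_to_bio(text, ner_spans):
    """Return (token, tag) pairs for whitespace tokens.

    Assumption: span offsets are character offsets with an exclusive end.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for span in ner_spans:
            # A token is assigned to the first span it overlaps.
            if match.start() < span["end"] and match.end() > span["start"]:
                # The token covering the span's first character gets B-, the rest I-.
                prefix = "B" if match.start() <= span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tagged.append((match.group(), tag))
    return tagged

# Hypothetical example with offsets computed by hand for this sentence.
text = "Between Safaricom PLC and the Authority."
spans = [{"start": 8, "end": 21, "label": "ORG", "text": "Safaricom PLC"}]
bio = spans_to_bio(text, spans)
```

If the real offsets turn out to be token-based or inclusive, the overlap test would need adjusting before the tags can be trusted.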

The dataset is optimised for transformer-based NLP pipelines. Each document is chunked into 512-token segments with overlap, preserving clause boundaries where possible. Metadata fields enable filtering by jurisdiction, document type, legal tradition, language, and date range. A companion knowledge-graph export (Turtle / JSON-LD) exposes the citation network for graph-neural-network and retrieval-augmented-generation applications.
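The metadata filtering and chunk structure described above can be sketched with the standard library alone. The example assumes one JSON object per line (JSONL) and uses the field names `document_id`, `chunk_index`, `country_code`, `document_type`, and `text` from the dataset schema. The helper names and the in-memory sample lines are hypothetical. Because the overlap size between chunks is not specified, the sketch only orders chunks and does not try to remove overlapping text.

```python
import json
from collections import defaultdict

def iter_chunks(lines, **filters):
    """Yield chunk records whose fields equal every given filter value."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        if all(record.get(key) == value for key, value in filters.items()):
            yield record

def group_by_document(chunks):
    """Map document_id to its chunk texts, ordered by chunk_index."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["document_id"]].append(chunk)
    return {
        doc_id: [c["text"] for c in sorted(parts, key=lambda c: c["chunk_index"])]
        for doc_id, parts in grouped.items()
    }

# Hypothetical stand-in for a JSONL file; a real run would pass an open file object.
sample_lines = [
    '{"document_id": "DOC-A", "chunk_index": 1, "country_code": "KE", "document_type": "JUDGMENT", "text": "second part"}',
    '{"document_id": "DOC-A", "chunk_index": 0, "country_code": "KE", "document_type": "JUDGMENT", "text": "first part"}',
    '{"document_id": "DOC-B", "chunk_index": 0, "country_code": "NG", "document_type": "LEGISLATION", "text": "other"}',
]

kenyan_judgments = group_by_document(
    iter_chunks(sample_lines, country_code="KE", document_type="JUDGMENT")
)
```

Filtering lazily with a generator keeps memory use flat, which matters when a single file holds several chunks for each of 500K+ documents.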

Key Use Cases

Legal document search and retrieval-augmented generation (RAG)

Clause extraction and contract review automation

Regulatory change monitoring and compliance gap analysis

Court judgment summarisation and precedent retrieval

Named entity recognition for legal persons, statutes, and courts

Citation network analysis and legal research assistants

Multi-jurisdiction regulatory taxonomy alignment

LegalTech chatbot fine-tuning for African law

Jurisdictions & Languages

🇳🇬 Nigeria (English common law)

🇰🇪 Kenya (English common law)

🇿🇦 South Africa (mixed common / civil law)

🇸🇳 Senegal (French civil law)

🇲🇿 Mozambique (Portuguese civil law)

📦 JSONL + Turtle / JSON-LD knowledge graph

Dataset Highlights

Documents

500K+

annotated legal texts

Clause Classes

18

obligations, penalties, jurisdiction…

Topic Labels

32

regulatory taxonomy

Jurisdictions

5

English, French & Portuguese law

Dataset Schema

Each record represents one 512-token document chunk. Fields cover document provenance, annotation layers, and chunk position metadata.

Field Name	Type	Description	Nullable	Example
chunk_id	STRING	Unique chunk identifier	No	CHK-NGA-LEG-0082341-004
document_id	STRING	Parent document identifier (multiple chunks share this)	No	DOC-NGA-LEG-0082341
country_code	STRING	ISO 3166-1 alpha-2 jurisdiction code	No	NG
language	ENUM	Document language: ENGLISH, FRENCH, PORTUGUESE	No	ENGLISH
legal_tradition	ENUM	Legal system: COMMON_LAW, CIVIL_LAW, MIXED	No	COMMON_LAW
document_type	ENUM	Document category: LEGISLATION, REGULATION, JUDGMENT, CONTRACT, POLICY_CIRCULAR	No	LEGISLATION
document_date	DATE	Date of enactment, judgment, or publication (YYYY-MM-DD)	Yes	2019-06-12
chunk_index	INTEGER	Zero-based position of this chunk within the parent document	No	3
text	STRING	512-token text segment (clause-boundary-aware)	No	42. Any person who contravenes section 38 shall be liable...
clause_type	STRING	Primary clause type from 18-class taxonomy (e.g. PENALTY, OBLIGATION, DEFINITION)	Yes	PENALTY
topic_label	STRING	Document-level regulatory topic from 32-class taxonomy	No	BANKING_REGULATION
ner_spans	JSON	Array of NER span objects {start, end, label, text} — legal persons, orgs, statutes, dates, amounts	Yes	[...]
cited_documents	JSON	Array of document IDs cited within this chunk	Yes	["DOC-NGA-LEG-0041200"]
split	ENUM	Dataset partition: TRAIN, VAL, TEST	No	TRAIN
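A minimal sketch of a record check based on the table above: required fields must be present, enum fields must hold one of the listed values, `chunk_index` must be a non-negative integer, and `document_date` (when present) must parse as an ISO date. The function name `validate_record` and the wording of the messages are hypothetical. The sketch does not check `clause_type` or `topic_label` values, because the full 18-class and 32-class taxonomies are not listed here.

```python
from datetime import date

# Allowed values copied from the ENUM rows of the schema table.
ENUMS = {
    "language": {"ENGLISH", "FRENCH", "PORTUGUESE"},
    "legal_tradition": {"COMMON_LAW", "CIVIL_LAW", "MIXED"},
    "document_type": {"LEGISLATION", "REGULATION", "JUDGMENT", "CONTRACT", "POLICY_CIRCULAR"},
    "split": {"TRAIN", "VAL", "TEST"},
}

# Fields marked "No" in the Nullable column.
REQUIRED = [
    "chunk_id", "document_id", "country_code", "language", "legal_tradition",
    "document_type", "chunk_index", "text", "topic_label", "split",
]

def validate_record(record):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for field in REQUIRED:
        if record.get(field) is None:
            problems.append(f"missing required field: {field}")
    for field, allowed in ENUMS.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field} has unexpected value {value!r}")
    index = record.get("chunk_index")
    if index is not None and (isinstance(index, bool) or not isinstance(index, int) or index < 0):
        problems.append("chunk_index must be a non-negative integer")
    doc_date = record.get("document_date")
    if doc_date is not None:
        try:
            date.fromisoformat(doc_date)
        except (TypeError, ValueError):
            problems.append("document_date must be an ISO date (YYYY-MM-DD)")
    return problems

# Record assembled from the Example column of the schema table.
good = {
    "chunk_id": "CHK-NGA-LEG-0082341-004",
    "document_id": "DOC-NGA-LEG-0082341",
    "country_code": "NG",
    "language": "ENGLISH",
    "legal_tradition": "COMMON_LAW",
    "document_type": "LEGISLATION",
    "document_date": "2019-06-12",
    "chunk_index": 3,
    "text": "42. Any person who contravenes section 38 shall be liable...",
    "clause_type": "PENALTY",
    "topic_label": "BANKING_REGULATION",
    "ner_spans": [],
    "cited_documents": ["DOC-NGA-LEG-0041200"],
    "split": "TRAIN",
}

# Hypothetical broken variant used to exercise the checks.
bad = dict(good, language="SWAHILI", chunk_index=-1, document_date="12/06/2019")
```

Returning a list of problems rather than raising on the first one makes it easier to summarise data-quality issues across a whole file.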

Sample Records

Four representative document chunks spanning jurisdictions, document types, and annotation layers.

legal_doc_sample.json

[
  {
    "chunk_id": "CHK-NGA-LEG-0082341-004",
    "document_id": "DOC-NGA-LEG-0082341",
    "country_code": "NG",
    "language": "ENGLISH",
    "legal_tradition": "COMMON_LAW",
    "document_type": "LEGISLATION",
    "document_date": "2019-06-12",
    "chunk_index": 3,
    "text": "42. Any person who contravenes section 38 of this Act shall be liable on conviction to a fine not exceeding five million naira or imprisonment for a term not exceeding three years, or both.",
    "clause_type": "PENALTY",
    "topic_label": "BANKING_REGULATION",
    "ner_spans": [
      { "start": 55, "end": 63, "label": "LEGISLATION_REF", "text": "section 38" },
      { "start": 111, "end": 131, "label": "MONETARY_AMOUNT", "text": "five million naira" }
    ],
    "cited_documents": [ "DOC-NGA-LEG-0082341" ],
    "split": "TRAIN"
  },
  {
    "chunk_id": "CHK-KEN-JDG-0034871-001",
    "document_id": "DOC-KEN-JDG-0034871",
    "country_code": "KE",
    "language": "ENGLISH",
    "legal_tradition": "COMMON_LAW",
    "document_type": "JUDGMENT",
    "document_date": "2023-03-15",
    "chunk_index": 0,
    "text": "IN THE COURT OF APPEAL OF KENYA AT NAIROBI. Civil Appeal No. 187 of 2022. Between Safaricom PLC (Appellant) and Communications Authority of Kenya (Respondent).",
    "clause_type": "JURISDICTION",
    "topic_label": "TELECOMMUNICATIONS_REGULATION",
    "ner_spans": [
      { "start": 36, "end": 43, "label": "LOC", "text": "NAIROBI" },
      { "start": 84, "end": 98, "label": "ORG", "text": "Safaricom PLC" },
      { "start": 112, "end": 142, "label": "ORG", "text": "Communications Authority of Kenya" }
    ],
    "cited_documents": [],
    "split": "TEST"
  },
  {
    "chunk_id": "CHK-SEN-REG-0019204-002",
    "document_id": "DOC-SEN-REG-0019204",
    "country_code": "SN",
    "language": "FRENCH",
    "legal_tradition": "CIVIL_LAW",
    "document_type": "REGULATION",
    "document_date": "2021-09-30",
    "chunk_index": 1,
    "text": "Article 7 — Les établissements de crédit sont tenus de constituer et de maintenir en permanence un ratio de solvabilité minimal de huit pour cent (8%) conformément aux normes BCEAO.",
    "clause_type": "OBLIGATION",
    "topic_label": "BANKING_REGULATION",
    "ner_spans": [
      { "start": 155, "end": 160, "label": "ORG", "text": "BCEAO" }
    ],
    "cited_documents": [],
    "split": "TRAIN"
  },
  {
    "chunk_id": "CHK-ZAF-CTR-0061038-007",
    "document_id": "DOC-ZAF-CTR-0061038",
    "country_code": "ZA",
    "language": "ENGLISH",
    "legal_tradition": "MIXED",
    "document_type": "CONTRACT",
    "document_date": "2022-07-01",
    "chunk_index": 6,
    "text": "14.3 Neither party shall be liable for any failure or delay in performing its obligations under this Agreement to the extent that such failure or delay is caused by a Force Majeure Event.",
    "clause_type": "FORCE_MAJEURE",
    "topic_label": "COMMERCIAL_CONTRACT",
    "ner_spans": [],
    "cited_documents": [],
    "split": "TRAIN"
  }
]
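Because `cited_documents` is stored per chunk, a document-level citation graph has to be aggregated from the records. The sketch below does this with plain dictionaries; it is an assumption about how one might use the field, not a description of the companion Turtle / JSON-LD export. The function names are hypothetical, and the second chunk in the example is an invented pairing of two IDs that appear on this page. Self-citations, such as a section of an Act referring to another section of the same Act, are excluded from the counts by default.

```python
def build_citation_graph(chunks):
    """Return {citing document_id: set of cited document_ids}, merged over chunks."""
    graph = {}
    for chunk in chunks:
        # setdefault keeps documents with no citations as nodes in the graph.
        targets = graph.setdefault(chunk["document_id"], set())
        # cited_documents is nullable, so treat None as an empty list.
        targets.update(chunk.get("cited_documents") or [])
    return graph

def citation_counts(graph, include_self=False):
    """Count how many distinct documents cite each target document."""
    counts = {}
    for source, targets in graph.items():
        for target in targets:
            if include_self or target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts

# Trimmed records: only the two fields this sketch needs.
chunks = [
    {"document_id": "DOC-NGA-LEG-0082341", "cited_documents": ["DOC-NGA-LEG-0082341"]},
    # Hypothetical second chunk of the same document citing another statute.
    {"document_id": "DOC-NGA-LEG-0082341", "cited_documents": ["DOC-NGA-LEG-0041200"]},
    {"document_id": "DOC-KEN-JDG-0034871", "cited_documents": []},
]

graph = build_citation_graph(chunks)
counts = citation_counts(graph)
```

The resulting adjacency sets can be handed to a graph library or used directly to rank frequently cited statutes and precedents for retrieval.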

Request Dataset Access

All datasets are available under a commercial licence agreement. Our team typically responds within 2 business days.

NDA may be required

Build with Data that reflects Africa

Request access to our full catalog of licensed, human-validated African datasets, or request a custom dataset tailored to your project.
