African Patient Electronic Health Record Dataset
1M+ de-identified patient episodes from tertiary and secondary hospitals in Nigeria and Ghana — covering demographics, ICD-10 diagnoses, treatment pathways, lab results, and discharge outcomes — purpose-built for clinical decision support, disease surveillance, and health-system AI in African contexts.
This is a synthetic dataset generated from high-quality expert-labelled seed data. All records are algorithmically derived — statistical distributions, inter-field correlations, and annotation characteristics faithfully replicate real-world patterns from the source data, while ensuring no real individual, organisation, or transaction can be identified or reconstructed.
The African Patient Electronic Health Record Dataset aggregates 1M+ de-identified patient episodes from 14 tertiary and secondary hospitals across Nigeria (Lagos, Abuja, Kano, Port Harcourt) and Ghana (Accra, Kumasi, Tamale). Each episode covers a single inpatient admission or outpatient visit and includes structured fields for patient demographics, presenting complaint, ICD-10-CM diagnosis codes (primary + up to 4 secondary), treatment interventions (procedure codes, prescribed medications), selected laboratory results, and discharge outcome.
De-identification was performed using a two-stage pipeline. First, a rule-based redactor replaced direct identifiers (names, medical record numbers) and shifted dates by a random per-patient offset. Second, a statistical disclosure control review suppressed rare combinations of quasi-identifiers. All data processing was conducted under ethics approvals from the respective hospital institutional review boards and the DataLens Africa Research Ethics Committee. The dataset complies with the Nigeria Data Protection Regulation (NDPR) and Ghana's Data Protection Act.
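The per-patient date shift mentioned above can be illustrated with a minimal Python sketch. The source does not describe how offsets were drawn, so the keyed-hash derivation, the `secret` value, the 365-day bound, and the function names `patient_offset_days` and `shift_date` are all assumptions made for illustration, not the pipeline actually used.

```python
import datetime as dt
import hashlib


def patient_offset_days(patient_id: str, secret: str, max_days: int = 365) -> int:
    """Derive a stable offset in [-max_days, max_days] for one patient.

    Assumption: a keyed hash stands in for the "random per-patient offset";
    the real pipeline may draw and store offsets differently.
    """
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode("utf-8")).digest()
    span = 2 * max_days + 1
    return int.from_bytes(digest[:8], "big") % span - max_days


def shift_date(date: dt.date, patient_id: str, secret: str) -> dt.date:
    """Shift a date by the patient's offset; the same patient always gets the same shift."""
    return date + dt.timedelta(days=patient_offset_days(patient_id, secret))


# Hypothetical admission and discharge dates for one patient.
admission = dt.date(2022, 3, 14)
discharge = dt.date(2022, 3, 19)
shifted_admission = shift_date(admission, "PAT-NGA-0049182", "demo-secret")
shifted_discharge = shift_date(discharge, "PAT-NGA-0049182", "demo-secret")

# Intervals within a patient's history are unchanged by the shift.
print((shifted_discharge - shifted_admission).days)
```

Because every date for a given patient moves by the same number of days, length of stay and the ordering of episodes survive the shift, while absolute calendar dates do not.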
The dataset is structured to support a wide range of clinical AI tasks: supervised classification of diagnosis and readmission risk, survival analysis, treatment-effect estimation, and NLP extraction from free-text clinical notes (where available). A separate ICD-10 code co-occurrence graph and a hospital-level metadata file (bed capacity, facility type, urban/rural flag) are provided as companion files to enable multi-level modelling.
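The format of the companion co-occurrence graph is not described here. As a rough sketch of how comparable edge counts could be derived from the `primary_diagnosis` and `secondary_diagnoses` fields, the following Python counts unordered pairs of codes that appear in the same episode. The function name `cooccurrence_counts` and the sample episodes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(episodes):
    """Count unordered ICD-10 code pairs that appear together in one episode.

    `episodes` is an iterable of (primary_code, secondary_codes) tuples,
    where secondary_codes may be None (the field is nullable).
    """
    counts = Counter()
    for primary, secondary in episodes:
        # Deduplicate and sort so each pair has one canonical ordering.
        codes = sorted({primary, *(secondary or [])})
        counts.update(combinations(codes, 2))
    return counts


# Hypothetical episodes, not taken from the dataset.
episodes = [
    ("A01.0", ["E11.9", "I10"]),
    ("E11.9", ["I10"]),
    ("B54", None),
]
edges = cooccurrence_counts(episodes)
for (code_a, code_b), weight in edges.most_common():
    print(code_a, code_b, weight)
```

Each key is an edge between two codes and each value is the number of episodes in which both occur. Episodes with only a primary diagnosis contribute no edges.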
Dataset Schema
Each record represents one patient episode (an inpatient admission, outpatient visit, or emergency visit, as recorded in episode_type). All direct identifiers have been removed; dates are shifted by a per-patient random offset that preserves temporal ordering within a patient's history.
| Field Name | Type | Description | Nullable | Example |
|---|---|---|---|---|
| episode_id | STRING | Unique episode identifier | No | EP-NGA-LG-00841923 |
| patient_id | STRING | Anonymised persistent patient identifier (links episodes for same patient) | No | PAT-NGA-0049182 |
| country_code | STRING | ISO 3166-1 alpha-2 country code | No | NG |
| facility_id | STRING | Anonymised hospital / facility identifier | No | FAC-NGA-007 |
| facility_type | ENUM | Facility level: TERTIARY, SECONDARY | No | TERTIARY |
| episode_type | ENUM | Visit type: INPATIENT, OUTPATIENT, EMERGENCY | No | INPATIENT |
| admission_year | INTEGER | Shifted year of admission (temporal ordering preserved within patient) | No | 2022 |
| age_group | ENUM | Patient age group: INFANT, CHILD, ADOLESCENT, ADULT, ELDERLY | No | ADULT |
| gender | ENUM | Patient gender: MALE, FEMALE | No | FEMALE |
| primary_diagnosis | STRING | Primary ICD-10-CM diagnosis code | No | A01.0 |
| secondary_diagnoses | JSON | Array of up to 4 secondary ICD-10-CM codes | Yes | ["E11.9", "I10"] |
| length_of_stay_days | INTEGER | Inpatient length of stay in days (null for outpatient) | Yes | 5 |
| lab_results_summary | JSON | Key-value pairs of selected lab test results (test name → value + unit) | Yes | {"HB": "9.2 g/dL", "WBC": "11.4 k/uL"} |
| discharge_outcome | ENUM | Episode outcome: DISCHARGED, REFERRED, DECEASED, ABSCONDED | No | DISCHARGED |
| readmitted_30d | BOOLEAN | True if patient was readmitted within 30 days of discharge | Yes | false |
| has_clinical_notes | BOOLEAN | True if free-text clinical notes are available for this episode | No | true |
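The two JSON fields in the schema need a parsing step before modelling. The sketch below assumes they arrive as JSON-encoded strings and that each lab value follows the "number, then unit" pattern shown in the Example column. Neither assumption is confirmed by the source, and the helper names `parse_secondary_diagnoses` and `parse_lab_results` are hypothetical.

```python
import json
import re

# Assumed value format: a number followed by an optional unit, e.g. "9.2 g/dL".
LAB_VALUE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(.*)$")


def parse_secondary_diagnoses(raw):
    """Return the list of secondary ICD-10-CM codes, or [] when the field is null."""
    return [] if raw is None else json.loads(raw)


def parse_lab_results(raw):
    """Map each test name to a (value, unit) tuple; skip values that do not match."""
    if raw is None:
        return {}
    parsed = {}
    for test_name, text in json.loads(raw).items():
        match = LAB_VALUE.match(text)
        if match:
            parsed[test_name] = (float(match.group(1)), match.group(2).strip())
    return parsed


# One row built from the Example column of the schema table.
row = {
    "episode_id": "EP-NGA-LG-00841923",
    "secondary_diagnoses": '["E11.9", "I10"]',
    "lab_results_summary": '{"HB": "9.2 g/dL", "WBC": "11.4 k/uL"}',
}
secondary = parse_secondary_diagnoses(row["secondary_diagnoses"])
labs = parse_lab_results(row["lab_results_summary"])
print(secondary, labs)
```

Returning empty containers for null fields keeps downstream code free of `None` checks. Whether to drop or flag lab values that fail the pattern is a design choice left open here.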
Sample Records
Four representative patient episodes illustrating variation across facility types, diagnoses, and outcomes.