Which AI model scored highest on African benchmarks in May 2026?

Gemini 3.5 Flash by Google achieved the highest overall score of 82.12% on the DataLens Africa LLM Leaderboard in May 2026, setting a new record. It led three of the four benchmarks: AfriMCQA (90.14%), AfriMMLU (80.88%), and AfriMedQA (84.71%), making it the dominant model for factual reasoning and medical question answering in African-language contexts.

How did Claude perform on African language benchmarks in 2026?

Claude Opus 4.6 debuted in May 2026 with an overall score of 77.19%, placing second on the DataLens Africa leaderboard. Its standout result was MasakhaNEWS news classification, where its 79.19% was the highest of any model evaluated, although its highest raw score was 82.45% on AfriMCQA. Claude Sonnet 4.6 (72.12%) and Claude Haiku 4.5 (62.10%) carried over from the February cohort.

What benchmarks does the DataLens Africa LLM Leaderboard use?

The DataLens Africa LLM Leaderboard evaluates models across four African-language benchmarks: AfriMCQA (multiple-choice question answering across African topics), AfriMMLU (massive multitask language understanding adapted for African contexts), MasakhaNEWS (news topic classification across 16 African languages), and AfriMedQA (medical question answering in African language settings).

Did the overall top score improve between February and May 2026?

Yes, significantly. The top overall score jumped from 74.67% (GPT-5.1, February 2026) to 82.12% (Gemini 3.5 Flash, May 2026) — an improvement of 7.45 percentage points in approximately three months. This is the largest single-cycle improvement recorded on the leaderboard to date.

Why do DeepSeek models show missing scores on AfriMCQA?

DeepSeek-V3.2, DeepSeek-V4-Pro, and DeepSeek-R1 all show a dash in place of an AfriMCQA score on the DataLens Africa leaderboard. This indicates that the models were either not evaluated on that benchmark or returned results that did not meet the evaluation validity threshold. All three were still evaluated on the other benchmarks, and their overall scores are computed from those three results.

Benchmarking the Next Frontier of African-Centric AI: May 2026 LLM Leaderboard Analysis

Executive Summary

As Large Language Models scale globally, assessing their efficacy within regional contexts is paramount for enterprise adoption. DataLens Africa has updated its LLM Leaderboard, introducing rigorous evaluation metrics tailored to African languages, cultural nuances, and localized knowledge systems.

A comparative analysis between the February 2026 baseline and the latest May 2026 update reveals significant leaps in localized performance. Most notably, Gemini 3.5 Flash has disrupted the standings — claiming the absolute top position — while new entries from Anthropic (Claude Opus 4.6), OpenAI (GPT-5.4), and DeepSeek (DeepSeek-V4-Pro) signal an intensifying race for dominance in African-centric AI capability.

New overall record: 82.12% (Gemini 3.5 Flash, May 2026)

Score improvement: +7.45 pp (from the February high of 74.67%)

New models evaluated: 5 (between May 17 and 22, 2026)

Highest AfriMedQA: 84.71% (Gemini 3.5 Flash, healthcare AI)

May 2026: Full Leaderboard Rankings

The updated leaderboard now covers 15 models across five provider families. Here are the complete results, sorted by overall score:

#	Model	AfriMCQA	AfriMMLU	MasakhaNEWS	AfriMedQA	Overall	Status
1	Gemini 3.5 Flash	90.14	80.88	72.74	84.71	82.12%	New
2	Claude Opus 4.6	82.45	68.13	79.19	78.99	77.19%	New
3	DeepSeek-V4-Pro	—	76.75	75.80	76.24	76.26%	New
4	GPT-5.4	78.37	72.75	77.71	74.71	75.88%	New
5	GPT-5.1	81.22	62.13	78.61	76.72	74.67%
6	Gemini 3.1 Flash Lite	85.58	69.38	71.44	72.06	74.61%	New
7	Gemini 2.5 Pro	85.58	79.47	51.22	75.95	73.05%
8	DeepSeek-R1	—	70.63	73.87	72.99	72.50%
9	Claude Sonnet 4.6	79.33	62.38	68.75	78.04	72.12%
10	GPT-5.2	75.50	67.38	75.27	69.97	72.03%
11	Gemini 2.5 Flash	81.05	60.50	71.20	74.42	71.79%
12	Grok 4.1 Fast Reasoning	72.60	59.63	64.90	70.13	66.81%
13	DeepSeek-V3.2	—	61.50	64.66	71.06	65.74%
14	Grok 4 Fast Reasoning	74.76	54.13	60.26	71.32	65.12%
15	Claude Haiku 4.5	63.46	54.75	67.72	62.49	62.10%

— indicates model was not evaluated on this benchmark or results were below validity threshold. View the live leaderboard at datalens.africa/llm-leaderboard.

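Figure: Overall score rankings for all 15 models (May 2026), colored by provider (Google, Anthropic, OpenAI, DeepSeek, xAI / Grok).

Overall scores are an average over the benchmarks available for each model (the published figures match an equal-weight mean to within rounding). The three DeepSeek models have no AfriMCQA score, so their averages cover the remaining three benchmarks.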
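The published overall figures appear consistent with a simple unweighted mean of whichever benchmarks a model has scores for. The sketch below checks that for three rows copied from the table. The `overall` helper and the 0.01 tolerance are assumptions made for illustration, not DataLens Africa's documented scoring method.

```python
from statistics import mean

# Rows copied from the May 2026 table; None marks a missing AfriMCQA score.
# Order: AfriMCQA, AfriMMLU, MasakhaNEWS, AfriMedQA, published overall.
ROWS = {
    "Gemini 3.5 Flash": (90.14, 80.88, 72.74, 84.71, 82.12),
    "DeepSeek-V4-Pro": (None, 76.75, 75.80, 76.24, 76.26),
    "GPT-5.1": (81.22, 62.13, 78.61, 76.72, 74.67),
}

def overall(scores):
    """Unweighted mean over the benchmarks that have a score (assumed method)."""
    available = [s for s in scores if s is not None]
    return mean(available)

for model, (*benchmarks, published) in ROWS.items():
    computed = overall(benchmarks)
    # The 0.01 tolerance absorbs rounding in the published two-decimal figures.
    matches = abs(computed - published) < 0.01
    print(f"{model}: computed {computed:.4f}, published {published:.2f}, match={matches}")
```

Under this reading, a model with a missing benchmark is averaged over three scores rather than four, so its overall figure is not strictly like-for-like with the rest of the table.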

Key Insights & Market Dynamics

1. Google Seizes the Throne with Gemini 3.5 Flash

In the February baseline, OpenAI's GPT-5.1 led the pack with an overall score of 74.67%. The May update shows a dramatic shift: Gemini 3.5 Flash debuted with an overall score of 82.12%, finishing 4.93 percentage points ahead of second-place Claude Opus 4.6 (77.19%).

AfriMCQA record high: 90.14%, the highest single-benchmark score ever recorded on the leaderboard.

AfriMedQA (healthcare AI): 84.71%, a new high for clinical AI in African-language contexts.

Google's lighter iteration, Gemini 3.1 Flash Lite, also put up a fierce performance with an overall score of 74.61%, just 0.06 points short of GPT-5.1, the February leader, while optimizing for computational efficiency. This suggests that Google's Flash architecture scales down without collapsing on African benchmarks.

2. Premium Frontier Models Battle for the Top Tier

The mid-2026 releases have pushed enterprise-grade capabilities forward across all major providers:

Claude Opus 4.6 secured the #2 spot overall with 77.19%, exhibiting the highest MasakhaNEWS score of any model at 79.19% — the first time any model has overtaken GPT-5.1 on African news classification across 16 languages.
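DeepSeek-V4-Pro achieved an overall score of 76.26%, securing #3. That average covers only three benchmarks, since the model has no AfriMCQA score, but its results are unusually consistent: AfriMMLU (76.75%), MasakhaNEWS (75.80%), and AfriMedQA (76.24%) all fall within one percentage point of each other.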
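GPT-5.4 reached 75.88% overall, ahead of both GPT-5.2 (72.03%) and GPT-5.1 (74.67%). Its largest gain is on AfriMMLU (72.75%, up from 67.38% and 62.13% respectively). Its MasakhaNEWS score of 77.71% beats GPT-5.2 (75.27%) but sits just below GPT-5.1 (78.61%), and it still trails GPT-5.1 on medical reasoning (74.71% versus 76.72% on AfriMedQA).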
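One simple way to quantify consistency across benchmarks is the gap between a model's best and worst score. The sketch below does this for the three new models discussed above, using scores copied from the May 2026 table. The `spread` helper is hypothetical, and AfriMCQA is deliberately left out so that all three models are compared on the same benchmarks.

```python
# AfriMMLU, MasakhaNEWS, AfriMedQA scores from the May 2026 table.
# AfriMCQA is omitted because DeepSeek-V4-Pro has no score for it.
SHARED = {
    "Claude Opus 4.6": (68.13, 79.19, 78.99),
    "DeepSeek-V4-Pro": (76.75, 75.80, 76.24),
    "GPT-5.4": (72.75, 77.71, 74.71),
}

def spread(scores):
    """Gap in percentage points between the best and worst score."""
    return max(scores) - min(scores)

spreads = {model: spread(scores) for model, scores in SHARED.items()}

# Print models from most to least consistent (smallest gap first).
for model, gap in sorted(spreads.items(), key=lambda kv: kv[1]):
    print(f"{model}: {gap:.2f} pp")
```

A small spread says nothing about the level of the scores, only how evenly they are distributed, so it is best read alongside the overall figure.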

3. Benchmark Performance Breakdown

The most telling story in the May data is not the overall rankings — it is what per-benchmark scores reveal about model specialization. The four benchmarks probe fundamentally different capabilities, and the gap between leaders varies significantly by task:

AfriMCQA & AfriMMLU (Academic & Cultural Nuance): Gemini 3.5 Flash leads both, at 90.14% on AfriMCQA and 80.88% on AfriMMLU. Gemini 2.5 Pro remains a close second on AfriMMLU at 79.47%.
MasakhaNEWS (Regional News & Media): Claude Opus 4.6 leads at 79.19%, closely followed by GPT-5.1 (78.61%) and GPT-5.4 (77.71%), so Anthropic and OpenAI models remain the strongest on African journalistic text. Gemini 2.5 Pro's anomalous 51.22% score here remains the leaderboard's most significant outlier.
AfriMedQA (Localized Healthcare): Gemini 3.5 Flash sets the pace at 84.71%, with Claude Opus 4.6 (78.99%) second and Claude Sonnet 4.6 (78.04%) close behind. These results matter for AI-driven healthcare deployments across the continent.
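The per-benchmark leaders can be read off the table mechanically. A minimal sketch follows, using the top seven rows copied from the May 2026 table. The `leader` helper is hypothetical, and a missing score is represented as None so that it is skipped rather than treated as zero.

```python
BENCHMARKS = ("AfriMCQA", "AfriMMLU", "MasakhaNEWS", "AfriMedQA")

# Top seven rows of the May 2026 table, in BENCHMARKS order.
SCORES = {
    "Gemini 3.5 Flash": (90.14, 80.88, 72.74, 84.71),
    "Claude Opus 4.6": (82.45, 68.13, 79.19, 78.99),
    "DeepSeek-V4-Pro": (None, 76.75, 75.80, 76.24),
    "GPT-5.4": (78.37, 72.75, 77.71, 74.71),
    "GPT-5.1": (81.22, 62.13, 78.61, 76.72),
    "Gemini 3.1 Flash Lite": (85.58, 69.38, 71.44, 72.06),
    "Gemini 2.5 Pro": (85.58, 79.47, 51.22, 75.95),
}

def leader(benchmark):
    """Return (model, score) for the highest available score on a benchmark."""
    idx = BENCHMARKS.index(benchmark)
    scored = {m: s[idx] for m, s in SCORES.items() if s[idx] is not None}
    best = max(scored, key=scored.get)
    return best, scored[best]

for name in BENCHMARKS:
    model, score = leader(name)
    print(f"{name}: {model} ({score:.2f})")
```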

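Figure: Per-benchmark scores for the top 7 models (May 2026), with one bar each for AfriMCQA, AfriMMLU, MasakhaNEWS, and AfriMedQA. DeepSeek-V4-Pro has no AfriMCQA bar because no valid score was reported for it on that benchmark; this reflects an evaluation gap, not a measured capability gap.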

Comparative View: Feb 2026 vs. May 2026

The table below outlines the evolution of the leaderboard, focusing on the top-performing flagships from February alongside all five newly introduced models from May.

Rank	Model	AfriMCQA	AfriMMLU	MasakhaNEWS	AfriMedQA	Overall	Eval Date	Status
1	Gemini 3.5 Flash	90.14	80.88	72.74	84.71	82.12%	5/20/2026	New · Leader
2	Claude Opus 4.6	82.45	68.13	79.19	78.99	77.19%	5/18/2026	New
3	DeepSeek-V4-Pro	—	76.75	75.80	76.24	76.26%	5/17/2026	New
4	GPT-5.4	78.37	72.75	77.71	74.71	75.88%	5/17/2026	New
5	GPT-5.1	81.22	62.13	78.61	76.72	74.67%	2/22/2026	↓ from #1
6	Gemini 3.1 Flash Lite	85.58	69.38	71.44	72.06	74.61%	5/22/2026	New
7	Gemini 2.5 Pro	85.58	79.47	51.22	75.95	73.05%	2/22/2026	↓ from #2
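The rank shifts in the Status column can be reproduced by ranking the February models on their own and then ranking them again with the May entrants included. The sketch below uses the overall scores from the comparison table. The `ranks` helper is hypothetical, and only the seven listed models are considered, so positions below #7 are not represented.

```python
# (model, overall %, is a May 2026 entrant), copied from the comparison table.
MODELS = [
    ("Gemini 3.5 Flash", 82.12, True),
    ("Claude Opus 4.6", 77.19, True),
    ("DeepSeek-V4-Pro", 76.26, True),
    ("GPT-5.4", 75.88, True),
    ("GPT-5.1", 74.67, False),
    ("Gemini 3.1 Flash Lite", 74.61, True),
    ("Gemini 2.5 Pro", 73.05, False),
]

def ranks(rows):
    """Map model name to 1-based rank by descending overall score."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return {name: i for i, (name, _, _) in enumerate(ordered, start=1)}

feb_ranks = ranks([r for r in MODELS if not r[2]])  # February models only
may_ranks = ranks(MODELS)                           # all seven listed models

for name in feb_ranks:
    print(f"{name}: #{feb_ranks[name]} in February, #{may_ranks[name]} in May")

# Gap between the best May score and the best February score.
top_gain = max(s for _, s, _ in MODELS) - max(s for _, s, new in MODELS if not new)
print(f"Top-score gain: {top_gain:.2f} percentage points")
```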

Figure: Overall score, Feb 2026 leaders vs. May 2026 new entrants. Feb 2026 bars show the best-scoring models from the baseline cohort; May 2026 bars show the five new entrants only.

Strategic Takeaways for Enterprises

The May 2026 results carry clear strategic signals for organizations evaluating LLM deployments across African markets.

Takeaway 01

The Efficiency vs. Power Paradox is Shifting

Gemini 3.5 Flash shows that "Flash" or smaller-architecture models are no longer purely budget options. On these localized benchmarks it outscored every legacy heavyweight evaluated, possibly through better-optimized training data, which redefines the cost-performance calculus for African deployments.

Takeaway 02

Open-Weights & Alternate Players Are Maturing

DeepSeek's strong showing with V4-Pro demonstrates that non-Western AI labs are rapidly capturing the nuances of regional benchmarks, offering viable alternatives for cost-conscious enterprise applications. Their AfriMCQA blind spot aside, the consistency of their scores is impressive.

Takeaway 03

Localization Is the New Competitive Moat

Generic LLM capabilities are commoditizing. The real value for organizations operating in Africa lies in choosing models that excel on regional datasets like MasakhaNEWS and AfriMedQA, which should support higher-fidelity interactions and fewer hallucinations in local contexts.

What Comes Next

The May 2026 evaluation cycle raised the bar significantly. Five new models, one new record, and a clearer picture of which families are investing in African language capability. But fifteen models is still a narrow slice of those being deployed across the continent, and four benchmarks cannot yet capture the full range of tasks African language AI must handle in production.

The most valuable near-term extension is language-disaggregated results: knowing a model scores 80% on AfriMCQA overall is less actionable than knowing it scores 91% on Hausa and 67% on Wolof. That granularity is what actually informs model selection for teams building products in specific African markets. Future cycles will add new benchmarks, new models, and eventually those per-language breakdowns.
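As a rough illustration of what language-disaggregated reporting involves, the sketch below groups per-item results by language and reports accuracy for each, alongside the pooled figure. The record layout, the `accuracy_by_language` helper, and the sample data are all invented for illustration; they do not reflect the leaderboard's actual evaluation format or any real model's results.

```python
from collections import defaultdict

def accuracy_by_language(records):
    """Return {language: percent correct} from (language, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {lang: 100.0 * correct[lang] / totals[lang] for lang in totals}

# Made-up sample: 4 Hausa items and 3 Wolof items.
sample = [
    ("Hausa", True), ("Hausa", True), ("Hausa", True), ("Hausa", False),
    ("Wolof", True), ("Wolof", False), ("Wolof", False),
]

per_language = accuracy_by_language(sample)
pooled = 100.0 * sum(ok for _, ok in sample) / len(sample)

for lang, acc in sorted(per_language.items()):
    print(f"{lang}: {acc:.1f}%")
print(f"Pooled: {pooled:.1f}%")
```

The pooled figure sits between the two per-language figures and hides the gap between them, which is the information a team targeting a single market would need.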

"Three months ago, no model had crossed 75% on African benchmarks overall. Today, four have — and the leader is at 82%. The benchmarks are beginning to separate models that treat African languages as an afterthought from those genuinely investing in the capability."

The DataLens Africa LLM Leaderboard is updated continuously as evaluations are completed. Organizations deploying or developing models for African language applications can submit models for evaluation, or access annotation and training data infrastructure through DataLens Studio.