Executive Summary

As Large Language Models scale globally, assessing their efficacy within regional contexts is paramount for enterprise adoption. DataLens Africa has updated its LLM Leaderboard, introducing rigorous evaluation metrics tailored to African languages, cultural nuances, and localized knowledge systems.

A comparative analysis between the February 2026 baseline and the latest May 2026 update reveals significant leaps in localized performance. Most notably, Gemini 3.5 Flash has disrupted the standings — claiming the absolute top position — while new entries from Anthropic (Claude Opus 4.6), OpenAI (GPT-5.4), and DeepSeek (DeepSeek-V4-Pro) signal an intensifying race for dominance in African-centric AI capability.

  • New Overall Record: 82.12% (Gemini 3.5 Flash, May 2026)
  • Score Improvement: +7.45pp (from the February high of 74.67%)
  • New Models Evaluated: 5 (May 17–22, 2026)
  • Highest AfriMedQA: 84.71% (Gemini 3.5 Flash, healthcare AI)

May 2026: Full Leaderboard Rankings

The updated leaderboard now covers 15 models across five provider families. Here are the complete results, sorted by overall score:

#   Model                    AfriMCQA  AfriMMLU  MasakhaNEWS  AfriMedQA  Overall  Status
1   Gemini 3.5 Flash         90.14     80.88     72.74        84.71      82.12%   New
2   Claude Opus 4.6          82.45     68.13     79.19        78.99      77.19%   New
3   DeepSeek-V4-Pro          n/a       76.75     75.80        76.24      76.26%   New
4   GPT-5.4                  78.37     72.75     77.71        74.71      75.88%   New
5   GPT-5.1                  81.22     62.13     78.61        76.72      74.67%
6   Gemini 3.1 Flash Lite    85.58     69.38     71.44        72.06      74.61%   New
7   Gemini 2.5 Pro           85.58     79.47     51.22        75.95      73.05%
8   DeepSeek-R1              n/a       70.63     73.87        72.99      72.50%
9   Claude Sonnet 4.6        79.33     62.38     68.75        78.04      72.12%
10  GPT-5.2                  75.50     67.38     75.27        69.97      72.03%
11  Gemini 2.5 Flash         81.05     60.50     71.20        74.42      71.79%
12  Grok 4.1 Fast Reasoning  72.60     59.63     64.90        70.13      66.81%
13  DeepSeek-V3.2            n/a       61.50     64.66        71.06      65.74%
14  Grok 4 Fast Reasoning    74.76     54.13     60.26        71.32      65.12%
15  Claude Haiku 4.5         63.46     54.75     67.72        62.49      62.10%

n/a indicates the model was not evaluated on this benchmark or its results were below the validity threshold. View the live leaderboard at datalens.africa/llm-leaderboard.

Figure: Overall Score Rankings, All 15 Models (May 2026). Bars are colored by provider: Google, Anthropic, OpenAI, DeepSeek, xAI / Grok.

Scores reflect a weighted average of the available benchmarks. DeepSeek models have no AfriMCQA result, so that benchmark is excluded from their averages.
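
The averaging described above can be checked against the published figures. The sketch below is an assumption rather than the leaderboard's documented method: it gives every available benchmark equal weight and skips missing results. The published overall scores in the table are consistent with that reading to within 0.01 points, but the actual weights are not stated. The names `SCORES`, `PUBLISHED`, and `overall_score` are illustrative.

```python
from statistics import mean
from typing import Optional, Sequence

# Scores copied from the May 2026 table, in the order
# AfriMCQA, AfriMMLU, MasakhaNEWS, AfriMedQA. None marks a missing result.
SCORES: dict[str, tuple[Optional[float], ...]] = {
    "Gemini 3.5 Flash": (90.14, 80.88, 72.74, 84.71),
    "Claude Opus 4.6": (82.45, 68.13, 79.19, 78.99),
    "DeepSeek-V4-Pro": (None, 76.75, 75.80, 76.24),
}

# Overall scores as published in the same table.
PUBLISHED = {
    "Gemini 3.5 Flash": 82.12,
    "Claude Opus 4.6": 77.19,
    "DeepSeek-V4-Pro": 76.26,
}


def overall_score(scores: Sequence[Optional[float]]) -> float:
    """Equal-weight mean over the benchmarks that have a result.

    Equal weighting is an assumption; the source only says "weighted average".
    """
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("no benchmark results available")
    return mean(available)


for model, scores in SCORES.items():
    computed = overall_score(scores)
    print(f"{model}: computed {computed:.2f}, published {PUBLISHED[model]:.2f}")
```

Because a model with a missing benchmark is averaged over three scores instead of four, its overall figure is not strictly comparable with the others, which is worth keeping in mind when reading the DeepSeek rows.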


Key Insights & Market Dynamics

1. Google Seizes the Throne with Gemini 3.5 Flash

In the February baseline, OpenAI's GPT-5.1 led the pack with an overall score of 74.67%. The May update shows a dramatic shift: Gemini 3.5 Flash debuted with an exceptional overall score of 82.12%, outperforming the entire market by a notable margin.

  • AfriMCQA, record high: 90.14%. Highest single-benchmark score ever recorded on the leaderboard.
  • AfriMedQA, healthcare AI: 84.71%. Sets a new benchmark for clinical AI in African language contexts.

Google's lighter model, Gemini 3.1 Flash Lite, also performed strongly with an overall score of 74.61%, within 0.06 points of GPT-5.1 (74.67%), the February leader, while being optimized for computational efficiency. This suggests that Google's Flash architecture scales down without collapsing on African benchmarks.

2. Premium Frontier Models Battle for the Top Tier

The mid-2026 releases have pushed enterprise-grade capabilities forward across all major providers:

  • Claude Opus 4.6 secured the #2 spot overall with 77.19%, exhibiting the highest MasakhaNEWS score of any model at 79.19% — the first time any model has overtaken GPT-5.1 on African news classification across 16 languages.
  • DeepSeek-V4-Pro achieved an overall score of 76.26% on just three benchmarks, securing #3 and demonstrating unusual consistency across AfriMMLU (76.75%), MasakhaNEWS (75.80%), and AfriMedQA (76.24%).
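  • GPT-5.4 reached 75.88% overall, ahead of both GPT-5.1 (74.67%) and GPT-5.2 (72.03%). Its largest gain is on AfriMMLU (72.75%, against 62.13% and 67.38%). Its MasakhaNEWS score of 77.71% improves on GPT-5.2 (75.27%) but sits just below GPT-5.1 (78.61%), and it still trails GPT-5.1 on medical reasoning (74.71% against 76.72% on AfriMedQA).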
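
One way to make the "consistency" claim concrete is to measure how far each model's benchmark scores spread. The sketch below is illustrative only: it uses the range (highest minus lowest) and the population standard deviation over whichever benchmarks each model has, with scores copied from the table. The names `AVAILABLE_SCORES` and `score_range` are hypothetical, and DeepSeek-V4-Pro's spread covers three benchmarks rather than four.

```python
from statistics import pstdev

# Available benchmark scores per model, copied from the May 2026 table.
# DeepSeek-V4-Pro has no AfriMCQA result, so it lists three scores.
AVAILABLE_SCORES: dict[str, list[float]] = {
    "Claude Opus 4.6": [82.45, 68.13, 79.19, 78.99],
    "DeepSeek-V4-Pro": [76.75, 75.80, 76.24],
    "GPT-5.4": [78.37, 72.75, 77.71, 74.71],
}


def score_range(scores: list[float]) -> float:
    """Gap between a model's best and worst benchmark, in percentage points."""
    return max(scores) - min(scores)


# List models from most to least consistent (smallest range first).
for model in sorted(AVAILABLE_SCORES, key=lambda m: score_range(AVAILABLE_SCORES[m])):
    scores = AVAILABLE_SCORES[model]
    print(f"{model}: range {score_range(scores):.2f}pp, std dev {pstdev(scores):.2f}")
```

A small range means a model performs similarly across tasks. It says nothing about how it would have fared on a benchmark it was not scored on.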

3. Benchmark Performance Breakdown

The most telling story in the May data is not the overall rankings — it is what per-benchmark scores reveal about model specialization. The four benchmarks probe fundamentally different capabilities, and the gap between leaders varies significantly by task:

  • AfriMCQA & AfriMMLU (Academic & Cultural Nuance): Gemini 3.5 Flash leads both, at 90.14% on AfriMCQA and 80.88% on AfriMMLU. Gemini 2.5 Pro remains a close second on AfriMMLU at 79.47%.
  • MasakhaNEWS (Regional News & Media): Claude Opus 4.6 leads at 79.19%, closely followed by GPT-5.1 (78.61%) and GPT-5.4 (77.71%). Anthropic and OpenAI models remain the strongest on African journalistic text. Gemini 2.5 Pro's anomalous 51.22% score here remains the leaderboard's most significant outlier.
  • AfriMedQA (Localized Healthcare): Gemini 3.5 Flash sets the benchmark at 84.71%, with Claude Opus 4.6 (78.99%) second and Claude Sonnet 4.6 (78.04%) close behind. These results have profound implications for AI-driven healthcare deployments across the continent.

Figure: Per-Benchmark Scores, Top 7 Models (May 2026). Series: AfriMCQA, AfriMMLU, MasakhaNEWS, AfriMedQA.

DeepSeek-V4-Pro's AfriMCQA bar is absent because of a systematic evaluation incompatibility, not a capability gap.
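
The per-benchmark leaders discussed above can be recomputed directly from the full table. The following is a minimal sketch, assuming missing results are simply skipped; the scores are copied from the May 2026 leaderboard, while `TABLE`, `BENCHMARKS`, and `leaders` are illustrative names.

```python
from typing import Optional

BENCHMARKS = ("AfriMCQA", "AfriMMLU", "MasakhaNEWS", "AfriMedQA")

# Scores from the May 2026 table, in BENCHMARKS order. None marks a missing result.
TABLE: dict[str, tuple[Optional[float], ...]] = {
    "Gemini 3.5 Flash": (90.14, 80.88, 72.74, 84.71),
    "Claude Opus 4.6": (82.45, 68.13, 79.19, 78.99),
    "DeepSeek-V4-Pro": (None, 76.75, 75.80, 76.24),
    "GPT-5.4": (78.37, 72.75, 77.71, 74.71),
    "GPT-5.1": (81.22, 62.13, 78.61, 76.72),
    "Gemini 3.1 Flash Lite": (85.58, 69.38, 71.44, 72.06),
    "Gemini 2.5 Pro": (85.58, 79.47, 51.22, 75.95),
    "DeepSeek-R1": (None, 70.63, 73.87, 72.99),
    "Claude Sonnet 4.6": (79.33, 62.38, 68.75, 78.04),
    "GPT-5.2": (75.50, 67.38, 75.27, 69.97),
    "Gemini 2.5 Flash": (81.05, 60.50, 71.20, 74.42),
    "Grok 4.1 Fast Reasoning": (72.60, 59.63, 64.90, 70.13),
    "DeepSeek-V3.2": (None, 61.50, 64.66, 71.06),
    "Grok 4 Fast Reasoning": (74.76, 54.13, 60.26, 71.32),
    "Claude Haiku 4.5": (63.46, 54.75, 67.72, 62.49),
}


def leaders(table: dict[str, tuple[Optional[float], ...]]) -> dict[str, tuple[str, float, str, float]]:
    """For each benchmark: (leader, leader score, runner-up, margin in points)."""
    result = {}
    for i, bench in enumerate(BENCHMARKS):
        # Rank only the models that have a score on this benchmark.
        ranked = sorted(
            ((scores[i], model) for model, scores in table.items() if scores[i] is not None),
            reverse=True,
        )
        (best, best_model), (second, second_model) = ranked[0], ranked[1]
        result[bench] = (best_model, best, second_model, round(best - second, 2))
    return result


for bench, (model, score, runner_up, margin) in leaders(TABLE).items():
    print(f"{bench}: {model} at {score:.2f}%, {margin:.2f}pp ahead of {runner_up}")
```

Two models tie for second on AfriMCQA at 85.58%, so the runner-up name printed for that benchmark depends on the tie-breaking order of the sort and should not be read as a ranking.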


Comparative View: Feb 2026 vs. May 2026

The table below outlines the evolution of the leaderboard, focusing on the top-performing flagships from February alongside all five newly introduced models from May.

Rank  Model                  AfriMCQA  AfriMMLU  MasakhaNEWS  AfriMedQA  Overall  Eval Date  Status
1     Gemini 3.5 Flash       90.14     80.88     72.74        84.71      82.12%   5/20/2026  New · Leader
2     Claude Opus 4.6        82.45     68.13     79.19        78.99      77.19%   5/18/2026  New
3     DeepSeek-V4-Pro        n/a       76.75     75.80        76.24      76.26%   5/17/2026  New
4     GPT-5.4                78.37     72.75     77.71        74.71      75.88%   5/17/2026  New
5     GPT-5.1                81.22     62.13     78.61        76.72      74.67%   2/22/2026  ↓ from #1
6     Gemini 3.1 Flash Lite  85.58     69.38     71.44        72.06      74.61%   5/22/2026  New
7     Gemini 2.5 Pro         85.58     79.47     51.22        75.95      73.05%   2/22/2026  ↓ from #2

Figure: Overall Score, Feb 2026 Leaders vs. May 2026 New Entrants.

Feb 2026 bars show the best-scoring models from the baseline cohort. May 2026 bars show the five new entrants only.
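
The rank shifts and the headline improvement in the comparison above follow from simple arithmetic on the overall scores. The sketch below assumes that the February ranking can be recovered by ranking only the models not marked as new, using the scores shown in the table. `ROWS` and `rank_by_overall` are illustrative names.

```python
# (model, overall %, newly added in May) for the seven rows of the comparison table.
ROWS: list[tuple[str, float, bool]] = [
    ("Gemini 3.5 Flash", 82.12, True),
    ("Claude Opus 4.6", 77.19, True),
    ("DeepSeek-V4-Pro", 76.26, True),
    ("GPT-5.4", 75.88, True),
    ("GPT-5.1", 74.67, False),
    ("Gemini 3.1 Flash Lite", 74.61, True),
    ("Gemini 2.5 Pro", 73.05, False),
]


def rank_by_overall(rows: list[tuple[str, float, bool]]) -> dict[str, int]:
    """Map each model to its 1-based position when sorted by overall score."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return {model: position for position, (model, _, _) in enumerate(ordered, start=1)}


may_rank = rank_by_overall(ROWS)
# Assumption: the February cohort is exactly the rows not marked as new.
feb_rank = rank_by_overall([row for row in ROWS if not row[2]])

# (February rank, May rank) for each baseline model.
rank_shift = {model: (feb_rank[model], may_rank[model]) for model in feb_rank}

best_feb = max(score for _, score, is_new in ROWS if not is_new)
best_may = max(score for _, score, _ in ROWS)
gain_pp = best_may - best_feb  # difference in percentage points, not percent

for model, (before, after) in rank_shift.items():
    print(f"{model}: #{before} in February, #{after} in May")
print(f"Top overall score moved from {best_feb:.2f}% to {best_may:.2f}% (+{gain_pp:.2f}pp)")
```

The gain is expressed in percentage points because it is a difference between two percentages. The relative increase over the February high would be a different, larger-looking figure.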


Strategic Takeaways for Enterprises

The May 2026 results carry clear strategic signals for organizations evaluating LLM deployments across African markets.

Takeaway 01
The Efficiency vs. Power Paradox is Shifting

Gemini 3.5 Flash proves that "Flash" or smaller-architecture models are no longer purely budget options. Through optimized training data, they can comprehensively beat legacy heavyweights on localized benchmarks — redefining the cost-performance calculus for African deployments.

Takeaway 02
Open-Weights & Alternate Players Are Maturing

DeepSeek's strong showing with V4-Pro demonstrates that non-Western AI labs are rapidly capturing the nuances of regional benchmarks, offering viable alternatives for cost-conscious enterprise applications. Setting aside the missing AfriMCQA results, the consistency of its scores across the other three benchmarks is impressive.

Takeaway 03
Localization Is the New Competitive Moat

Generic LLM capabilities are commoditizing. The real value for organizations operating in Africa lies in choosing models that excel in regional datasets like MasakhaNEWS and AfriMedQA — ensuring high-fidelity interactions and reducing hallucination rates in local contexts.


What Comes Next

The May 2026 evaluation cycle raised the bar significantly, bringing five new models, one new record, and a clearer picture of which families are investing in African language capability. But fifteen models are still a narrow slice of those being deployed across the continent, and four benchmarks cannot yet capture the full range of tasks African language AI must handle in production.

The most valuable near-term extension is language-disaggregated results: knowing a model scores 80% on AfriMCQA overall is less actionable than knowing it scores 91% on Hausa and 67% on Wolof. That granularity is what actually informs model selection for teams building products in specific African markets. Future cycles will add new benchmarks, new models, and eventually those per-language breakdowns.
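
To illustrate why disaggregation matters, the sketch below summarizes a set of per-language scores for a single model. All numbers are hypothetical: Hausa and Wolof echo the illustrative figures in the paragraph above, the third entry is an invented placeholder, and none of them are leaderboard results. The names `PER_LANGUAGE` and `summarize` are likewise assumptions.

```python
from statistics import mean

# Hypothetical per-language accuracies (%) for one model on one benchmark.
# These are NOT leaderboard results; they only mirror the example in the text.
PER_LANGUAGE: dict[str, float] = {
    "Hausa": 91.0,
    "Language X (placeholder)": 82.0,
    "Wolof": 67.0,
}


def summarize(per_language: dict[str, float]) -> dict[str, object]:
    """Report the average alongside the strongest and weakest languages."""
    strongest = max(per_language, key=per_language.get)
    weakest = min(per_language, key=per_language.get)
    return {
        "macro_average": mean(per_language.values()),
        "strongest": strongest,
        "weakest": weakest,
        "spread_pp": per_language[strongest] - per_language[weakest],
    }


summary = summarize(PER_LANGUAGE)
print(
    f"Average {summary['macro_average']:.1f}%, "
    f"but {summary['weakest']} trails {summary['strongest']} "
    f"by {summary['spread_pp']:.1f}pp"
)
```

A team building for a single market would select on the per-language row rather than on the average, which is the point the paragraph above makes.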

"Three months ago, no model had crossed 75% on African benchmarks overall. Today, four have — and the leader is at 82%. The benchmarks are beginning to separate models that treat African languages as an afterthought from those genuinely investing in the capability."

The DataLens Africa LLM Leaderboard is updated continuously as evaluations are completed. Organizations deploying or developing models for African language applications can submit models for evaluation, or access annotation and training data infrastructure through DataLens Studio.