Executive Summary
As Large Language Models scale globally, assessing their efficacy within regional contexts is paramount for enterprise adoption. DataLens Africa has updated its LLM Leaderboard, introducing rigorous evaluation metrics tailored to African languages, cultural nuances, and localized knowledge systems.
A comparative analysis between the February 2026 baseline and the latest May 2026 update reveals significant leaps in localized performance. Most notably, Gemini 3.5 Flash has disrupted the standings — claiming the absolute top position — while new entries from Anthropic (Claude Opus 4.6), OpenAI (GPT-5.4), and DeepSeek (DeepSeek-V4-Pro) signal an intensifying race for dominance in African-centric AI capability.
May 2026: Full Leaderboard Rankings
The updated leaderboard now covers 15 models across five provider families. Here are the complete results, sorted by overall score:
| # | Model | AfriMCQA | AfriMMLU | MasakhaNEWS | AfriMedQA | Overall |
|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash (New) | 90.14 | 80.88 | 72.74 | 84.71 | 82.12% |
| 2 | Claude Opus 4.6 (New) | 82.45 | 68.13 | 79.19 | 78.99 | 77.19% |
| 3 | DeepSeek-V4-Pro (New) | — | 76.75 | 75.80 | 76.24 | 76.26% |
| 4 | GPT-5.4 (New) | 78.37 | 72.75 | 77.71 | 74.71 | 75.88% |
| 5 | GPT-5.1 | 81.22 | 62.13 | 78.61 | 76.72 | 74.67% |
| 6 | Gemini 3.1 Flash Lite (New) | 85.58 | 69.38 | 71.44 | 72.06 | 74.61% |
| 7 | Gemini 2.5 Pro | 85.58 | 79.47 | 51.22 | 75.95 | 73.05% |
| 8 | DeepSeek-R1 | — | 70.63 | 73.87 | 72.99 | 72.50% |
| 9 | Claude Sonnet 4.6 | 79.33 | 62.38 | 68.75 | 78.04 | 72.12% |
| 10 | GPT-5.2 | 75.50 | 67.38 | 75.27 | 69.97 | 72.03% |
| 11 | Gemini 2.5 Flash | 81.05 | 60.50 | 71.20 | 74.42 | 71.79% |
| 12 | Grok 4.1 Fast Reasoning | 72.60 | 59.63 | 64.90 | 70.13 | 66.81% |
| 13 | DeepSeek-V3.2 | — | 61.50 | 64.66 | 71.06 | 65.74% |
| 14 | Grok 4 Fast Reasoning | 74.76 | 54.13 | 60.26 | 71.32 | 65.12% |
| 15 | Claude Haiku 4.5 | 63.46 | 54.75 | 67.72 | 62.49 | 62.10% |
— indicates model was not evaluated on this benchmark or results were below validity threshold. View the live leaderboard at datalens.africa/llm-leaderboard.
Overall scores reflect a weighted average of the benchmarks available for each model. For the DeepSeek models, the missing AfriMCQA entry (—) is excluded from the calculation.
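As a rough illustration of how an overall score can be derived when some benchmarks are missing, the sketch below averages only the available entries. Equal weighting is an assumption here: the leaderboard does not publish its weights, although the published overall figures agree with an equal-weight mean to within about 0.01 points. The names `SCORES` and `overall_score` are hypothetical.

```python
from statistics import mean

# Per-benchmark scores copied from the table above, in column order:
# AfriMCQA, AfriMMLU, MasakhaNEWS, AfriMedQA. None stands for a "—" entry.
SCORES = {
    "Gemini 3.5 Flash": [90.14, 80.88, 72.74, 84.71],
    "Claude Opus 4.6": [82.45, 68.13, 79.19, 78.99],
    "DeepSeek-V4-Pro": [None, 76.75, 75.80, 76.24],
}


def overall_score(scores):
    """Equal-weight mean over the benchmarks that have a score (assumed weighting)."""
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("no benchmark scores available")
    return mean(available)


for model, scores in SCORES.items():
    print(f"{model}: {overall_score(scores):.2f}%")
```

Because missing entries are dropped rather than counted as zero, a model evaluated on three benchmarks is averaged over three, which is worth keeping in mind when comparing it with models averaged over four.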
Key Insights & Market Dynamics
1. Google Seizes the Throne with Gemini 3.5 Flash
In the February baseline, OpenAI's GPT-5.1 led the pack with an overall score of 74.67%. The May update shows a dramatic shift: Gemini 3.5 Flash debuted with an overall score of 82.12%, 4.93 points ahead of second-placed Claude Opus 4.6 (77.19%) and 7.45 points above the February leader.
Google's lighter iteration, Gemini 3.1 Flash Lite, also performed strongly with an overall score of 74.61%, within 0.06 points of GPT-5.1, the February leader, while optimizing for computational efficiency. This suggests that Google's Flash architecture scales down without collapsing on African benchmarks.
2. Premium Frontier Models Battle for the Top Tier
The mid-2026 releases have pushed enterprise-grade capabilities forward across all major providers:
- Claude Opus 4.6 secured the #2 spot overall with 77.19%, exhibiting the highest MasakhaNEWS score of any model at 79.19% — the first time any model has overtaken GPT-5.1 on African news classification across 16 languages.
- DeepSeek-V4-Pro achieved an overall score of 76.26% on just three benchmarks, securing #3 and demonstrating unusual consistency across AfriMMLU (76.75%), MasakhaNEWS (75.80%), and AfriMedQA (76.24%).
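- GPT-5.4 reached 75.88% overall, ahead of both GPT-5.1 (74.67%) and GPT-5.2 (72.03%). It posted the strongest OpenAI score on AfriMMLU (72.75%) and improved on GPT-5.2 on MasakhaNEWS (77.71% vs. 75.27%), though it still trails GPT-5.1 on MasakhaNEWS (78.61%) and on medical reasoning (AfriMedQA: 74.71% vs. 76.72%).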
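One simple way to put a number on the "consistency" mentioned above is the range of a model's benchmark scores (highest minus lowest). The sketch below uses the scores from the May table. The choice of range as the metric and the names `NEW_MODELS` and `spread` are assumptions for illustration, not the leaderboard's own methodology.

```python
# Scores copied from the May 2026 table; "—" entries are simply omitted.
NEW_MODELS = {
    "Gemini 3.5 Flash": {
        "AfriMCQA": 90.14, "AfriMMLU": 80.88, "MasakhaNEWS": 72.74, "AfriMedQA": 84.71,
    },
    "Claude Opus 4.6": {
        "AfriMCQA": 82.45, "AfriMMLU": 68.13, "MasakhaNEWS": 79.19, "AfriMedQA": 78.99,
    },
    "DeepSeek-V4-Pro": {
        "AfriMMLU": 76.75, "MasakhaNEWS": 75.80, "AfriMedQA": 76.24,
    },
    "GPT-5.4": {
        "AfriMCQA": 78.37, "AfriMMLU": 72.75, "MasakhaNEWS": 77.71, "AfriMedQA": 74.71,
    },
}


def spread(scores):
    """Range (max minus min) of a model's benchmark scores, in points."""
    values = list(scores.values())
    return max(values) - min(values)


# Print models from most to least consistent under this metric.
for model in sorted(NEW_MODELS, key=lambda m: spread(NEW_MODELS[m])):
    print(f"{model}: {spread(NEW_MODELS[model]):.2f} points")
```

DeepSeek-V4-Pro's range is computed over three benchmarks rather than four, so it is not strictly comparable with the others.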
3. Benchmark Performance Breakdown
The most telling story in the May data is not the overall rankings — it is what per-benchmark scores reveal about model specialization. The four benchmarks probe fundamentally different capabilities, and the gap between leaders varies significantly by task:
- AfriMCQA & AfriMMLU (Academic & Cultural Nuance): Gemini 3.5 Flash leads both, with 90.14% on AfriMCQA and 80.88% on AfriMMLU. Gemini 2.5 Pro remains a close second on AfriMMLU at 79.47%.
- MasakhaNEWS (Regional News & Media): Claude Opus 4.6 leads at 79.19%, closely followed by GPT-5.1 (78.61%) and GPT-5.4 (77.71%). Anthropic's and OpenAI's models remain the strongest on African journalistic text. Gemini 2.5 Pro's anomalous 51.22% here remains the leaderboard's most significant outlier.
- AfriMedQA (Localized Healthcare): Gemini 3.5 Flash leads at 84.71%, with Claude Opus 4.6 second at 78.99%. These results matter for AI-driven healthcare deployments across the continent.
DeepSeek-V4-Pro has no AfriMCQA score. This reflects a systematic evaluation incompatibility, not a capability gap.
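The per-benchmark picture can be reproduced directly from the May table. The sketch below finds the leader on each benchmark and its margin over the next-best score. The row data is copied from the table, while the helper name `leader_and_margin` and the tuple layout are hypothetical.

```python
BENCHMARKS = ("AfriMCQA", "AfriMMLU", "MasakhaNEWS", "AfriMedQA")

# (model, AfriMCQA, AfriMMLU, MasakhaNEWS, AfriMedQA); None stands for "—".
ROWS = [
    ("Gemini 3.5 Flash", 90.14, 80.88, 72.74, 84.71),
    ("Claude Opus 4.6", 82.45, 68.13, 79.19, 78.99),
    ("DeepSeek-V4-Pro", None, 76.75, 75.80, 76.24),
    ("GPT-5.4", 78.37, 72.75, 77.71, 74.71),
    ("GPT-5.1", 81.22, 62.13, 78.61, 76.72),
    ("Gemini 3.1 Flash Lite", 85.58, 69.38, 71.44, 72.06),
    ("Gemini 2.5 Pro", 85.58, 79.47, 51.22, 75.95),
    ("DeepSeek-R1", None, 70.63, 73.87, 72.99),
    ("Claude Sonnet 4.6", 79.33, 62.38, 68.75, 78.04),
    ("GPT-5.2", 75.50, 67.38, 75.27, 69.97),
    ("Gemini 2.5 Flash", 81.05, 60.50, 71.20, 74.42),
    ("Grok 4.1 Fast Reasoning", 72.60, 59.63, 64.90, 70.13),
    ("DeepSeek-V3.2", None, 61.50, 64.66, 71.06),
    ("Grok 4 Fast Reasoning", 74.76, 54.13, 60.26, 71.32),
    ("Claude Haiku 4.5", 63.46, 54.75, 67.72, 62.49),
]


def leader_and_margin(benchmark):
    """Return (model, score, margin over the next-best score) for one benchmark."""
    col = BENCHMARKS.index(benchmark) + 1
    # Skip models without a score, then sort by score, highest first.
    scored = sorted(
        ((row[col], row[0]) for row in ROWS if row[col] is not None),
        reverse=True,
    )
    (best_score, best_model), (second_score, _) = scored[0], scored[1]
    return best_model, best_score, best_score - second_score


for name in BENCHMARKS:
    model, score, margin = leader_and_margin(name)
    print(f"{name}: {model} at {score:.2f}% (+{margin:.2f} points)")
```

The margins range from well under one point on MasakhaNEWS to more than five points on AfriMedQA, which is one way to see how much the gap between leaders varies by task.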
Comparative View: Feb 2026 vs. May 2026
The table below outlines the evolution of the leaderboard, focusing on the top-performing flagships from February alongside all five newly introduced models from May.
| Rank | Model | AfriMCQA | AfriMMLU | MasakhaNEWS | AfriMedQA | Overall | Eval Date | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 90.14 | 80.88 | 72.74 | 84.71 | 82.12% | 5/20/2026 | New · Leader |
| 2 | Claude Opus 4.6 | 82.45 | 68.13 | 79.19 | 78.99 | 77.19% | 5/18/2026 | New |
| 3 | DeepSeek-V4-Pro | — | 76.75 | 75.80 | 76.24 | 76.26% | 5/17/2026 | New |
| 4 | GPT-5.4 | 78.37 | 72.75 | 77.71 | 74.71 | 75.88% | 5/17/2026 | New |
| 5 | GPT-5.1 | 81.22 | 62.13 | 78.61 | 76.72 | 74.67% | 2/22/2026 | ↓ from #1 |
| 6 | Gemini 3.1 Flash Lite | 85.58 | 69.38 | 71.44 | 72.06 | 74.61% | 5/22/2026 | New |
| 7 | Gemini 2.5 Pro | 85.58 | 79.47 | 51.22 | 75.95 | 73.05% | 2/22/2026 | ↓ from #2 |
Rows with a February evaluation date (2/22/2026) are the best-scoring models from the baseline cohort. The remaining rows are the five new May 2026 entrants.
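To make the comparison concrete, the sketch below computes each May entrant's gain over the February leader's overall score and lists the models above 75%. All figures come from the table above. The names `FEBRUARY_LEADER`, `MAY_ENTRANTS`, and `gain_over_baseline` are hypothetical.

```python
FEBRUARY_LEADER = 74.67  # GPT-5.1 overall score in the February baseline

# Overall scores (%) of the five May 2026 entrants, copied from the table.
MAY_ENTRANTS = {
    "Gemini 3.5 Flash": 82.12,
    "Claude Opus 4.6": 77.19,
    "DeepSeek-V4-Pro": 76.26,
    "GPT-5.4": 75.88,
    "Gemini 3.1 Flash Lite": 74.61,
}


def gain_over_baseline(score, baseline=FEBRUARY_LEADER):
    """Difference in overall score, in percentage points, rounded to 2 places."""
    return round(score - baseline, 2)


for model, score in MAY_ENTRANTS.items():
    print(f"{model}: {gain_over_baseline(score):+.2f} points vs. February leader")

# Models that clear the 75% overall mark.
above_75 = [model for model, score in MAY_ENTRANTS.items() if score > 75.0]
print(f"{len(above_75)} models above 75%: {', '.join(above_75)}")
```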
Strategic Takeaways for Enterprises
The May 2026 results carry clear strategic signals for organizations evaluating LLM deployments across African markets.
Gemini 3.5 Flash shows that "Flash" or smaller-architecture models are no longer purely budget options. With well-optimized training data, they can beat legacy heavyweights on localized benchmarks: Gemini 3.5 Flash has the highest overall score and leads three of the four benchmarks, although its 72.74% on MasakhaNEWS shows the advantage is not uniform. That shifts the cost-performance calculus for African deployments.
DeepSeek's strong showing with V4-Pro demonstrates that non-Western AI labs are rapidly capturing the nuances of regional benchmarks, offering viable alternatives for cost-conscious enterprise applications. The missing AfriMCQA results aside, the consistency of V4-Pro's scores is impressive, with all three within 0.95 points of each other.
Generic LLM capabilities are commoditizing. The real value for organizations operating in Africa lies in choosing models that excel in regional datasets like MasakhaNEWS and AfriMedQA — ensuring high-fidelity interactions and reducing hallucination rates in local contexts.
What Comes Next
The May 2026 evaluation cycle raised the bar significantly. Five new models, one new record, and a clearer picture of which families are investing in African language capability. But fifteen models is still a narrow slice of those being deployed across the continent, and four benchmarks cannot yet capture the full range of tasks African language AI must handle in production.
The most valuable near-term extension is language-disaggregated results: knowing a model scores 80% on AfriMCQA overall is less actionable than knowing it scores 91% on Hausa and 67% on Wolof. That granularity is what actually informs model selection for teams building products in specific African markets. Future cycles will add new benchmarks, new models, and eventually those per-language breakdowns.
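A language-disaggregated view could be computed along the following lines. This is only a sketch: the record format of `(language, is_correct)` pairs, the function name `accuracy_by_language`, and the toy records are all hypothetical and do not come from the leaderboard's data.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Map each language to its accuracy (%) from (language, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {lang: 100.0 * correct[lang] / totals[lang] for lang in totals}


# Hypothetical toy records for illustration only, not real evaluation results.
records = (
    [("Hausa", True)] * 9 + [("Hausa", False)] * 1
    + [("Wolof", True)] * 6 + [("Wolof", False)] * 4
)

pooled = 100.0 * sum(ok for _, ok in records) / len(records)
per_language = accuracy_by_language(records)

print(f"pooled accuracy: {pooled:.1f}%")
for language, accuracy in per_language.items():
    print(f"{language}: {accuracy:.1f}%")
```

In this toy case the pooled figure sits midway between two quite different per-language results, which is the kind of gap a single benchmark-level score hides.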
"Three months ago, no model had crossed 75% on African benchmarks overall. Today, four have — and the leader is at 82%. The benchmarks are beginning to separate models that treat African languages as an afterthought from those genuinely investing in the capability."
The DataLens Africa LLM Leaderboard is updated continuously as evaluations are completed. Organizations deploying or developing models for African language applications can submit models for evaluation, or access annotation and training data infrastructure through DataLens Studio.