The Hidden Cost of African Language Tokenization.
Frontier LLMs cost up to 8.92× more to process African languages than English, and that penalty applies before a model runs a single forward pass. This leaderboard ranks 11 tokenizers across 20 African languages and 3 scripts to show which frontier models impose the lowest tokenization cost, latency, and context penalties.
11 Tokenizers · 20 Languages · 3 Scripts · 616 Data Points
Token Fertility Leaderboard
Lower mean premium = less tokenization overhead = lower API cost. Ranked ascending.
* Mean premium is the tokens-per-word ratio relative to English, averaged across 19 African languages (16 Latin-script, 2 Ethiopic, 1 N'Ko). Lower is better. All values are from the paper (arXiv:2606.24460). Published: 27 June 2026.
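As a rough illustration of how an ascending ranking by mean premium could be produced, the sketch below averages per-language premiums for each tokenizer and sorts the result. It assumes an unweighted mean, and the function names `mean_premium` and `rank_tokenizers` are hypothetical rather than part of any published library. The premiums are copied from the per-language premium table for a four-language subset only, so the resulting means will not match the full 19-language leaderboard.

```python
from statistics import mean

# Premiums copied from the per-language premium table for a 4-language subset.
# Assumption: the leaderboard score is an unweighted mean over languages.
PREMIUMS = {
    "Gemma 4": {"Swahili": 1.70, "Zulu": 2.59, "Amharic": 2.47, "N'Ko": 8.73},
    "BLOOM": {"Swahili": 1.29, "Zulu": 2.31, "Amharic": 6.37, "N'Ko": 8.75},
    "cl100k": {"Swahili": 2.02, "Zulu": 2.90, "Amharic": 9.68, "N'Ko": 8.82},
}


def mean_premium(per_language: dict[str, float]) -> float:
    """Unweighted average of the per-language premiums for one tokenizer."""
    return mean(list(per_language.values()))


def rank_tokenizers(premiums: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (tokenizer, mean premium) pairs, lowest mean premium first."""
    scored = [(name, mean_premium(langs)) for name, langs in premiums.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for rank, (name, score) in enumerate(rank_tokenizers(PREMIUMS), start=1):
        print(f"{rank}. {name}: {score:.2f}x")
```

Because N'Ko and the Ethiopic-script languages carry much larger premiums than the Latin-script languages, a plain mean like this one is strongly influenced by how each tokenizer handles those three languages.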
The African Language Tax by Language
Tokenization premium vs English for all 19 African languages across all 11 tokenizers.
| Language | Script | Gemma 4 | aya-exp. | BLOOM | Qwen3 | o200k-H | Llama 4 | o200k-B | Tekken | DeepSeek | Llama 3.1 | cl100k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Swahili | Latin | 1.70× | 1.88× | 1.29× | 2.00× | 1.54× | 1.73× | 1.54× | 1.88× | 2.01× | 2.00× | 2.02× |
| Yoruba | Latin | 2.07× | 2.26× | 1.36× | 2.34× | 1.85× | 2.15× | 1.85× | 2.39× | 2.45× | 2.35× | 2.55× |
| Igbo | Latin | 1.85× | 2.06× | 1.47× | 2.04× | 1.42× | 1.75× | 1.42× | 2.04× | 2.07× | 2.05× | 2.11× |
| Wolof | Latin | 1.55× | 1.59× | 1.48× | 1.67× | 1.52× | 1.55× | 1.52× | 1.63× | 1.65× | 1.66× | 1.69× |
| Sesotho | Latin | 1.53× | 1.61× | 1.49× | 1.67× | 1.40× | 1.53× | 1.40× | 1.62× | 1.67× | 1.67× | 1.68× |
| Lingala | Latin | 1.58× | 1.64× | 1.52× | 1.70× | 1.48× | 1.63× | 1.48× | 1.64× | 1.71× | 1.71× | 1.72× |
| Akan / Twi | Latin | 1.64× | 1.80× | 1.52× | 1.81× | 1.57× | 1.74× | 1.57× | 2.02× | 1.83× | 2.10× | 2.11× |
| Hausa | Latin | 1.51× | 1.61× | 1.54× | 1.69× | 1.35× | 1.51× | 1.35× | 1.65× | 1.69× | 1.72× | 1.74× |
| Afrikaans | Latin | 1.44× | 1.42× | 1.60× | 1.59× | 1.35× | 1.42× | 1.35× | 1.44× | 1.55× | 1.59× | 1.60× |
| Kinyarwanda | Latin | 2.16× | 2.22× | 1.73× | 2.32× | 1.88× | 2.18× | 1.88× | 2.22× | 2.33× | 2.32× | 2.34× |
| Bambara | Latin | 2.03× | 2.08× | 1.83× | 2.07× | 1.98× | 2.04× | 1.98× | 2.37× | 2.13× | 2.45× | 2.49× |
| Luganda | Latin | 2.27× | 2.34× | 2.08× | 2.42× | 2.07× | 2.28× | 2.07× | 2.33× | 2.42× | 2.41× | 2.44× |
| Xhosa | Latin | 2.56× | 2.63× | 2.25× | 2.75× | 2.22× | 2.50× | 2.22× | 2.67× | 2.77× | 2.76× | 2.78× |
| Shona | Latin | 2.37× | 2.48× | 2.28× | 2.68× | 2.15× | 2.37× | 2.15× | 2.51× | 2.64× | 2.63× | 2.70× |
| Zulu | Latin | 2.59× | 2.70× | 2.31× | 2.87× | 2.26× | 2.58× | 2.26× | 2.76× | 2.87× | 2.87× | 2.90× |
| Oromo | Latin | 2.38× | 2.44× | 2.44× | 2.56× | 2.10× | 2.41× | 2.10× | 2.48× | 2.60× | 2.57× | 2.59× |
| Tigrinya | Ethiopic | 2.82× | 7.38× | 5.84× | 4.73× | 6.79× | 3.42× | 6.79× | 8.66× | 6.97× | 8.85× | 8.86× |
| Amharic | Ethiopic | 2.47× | 8.06× | 6.37× | 5.16× | 7.37× | 3.05× | 7.37× | 9.47× | 7.61× | 9.67× | 9.68× |
| N'Ko | N'Ko | 8.73× | 8.89× | 8.75× | 5.96× | 8.92× | 8.86× | 8.92× | 8.62× | 8.87× | 8.82× | 8.82× |
All premium values from Table B.2 of paper (arXiv:2606.24460).
Fertility — Tokens per Word
Raw fertility F(L,T) from Table B.1 of the paper. English and French included as controls. Lower = more token-efficient.
| Language | Script | Gemma 4 | aya-exp. | BLOOM | Qwen3 | o200k-H | Llama 4 | o200k-B | Tekken | DeepSeek | Llama 3.1 | cl100k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| English | Latin | 1.23 | 1.22 | 1.24 | 1.25 | 1.22 | 1.23 | 1.22 | 1.26 | 1.23 | 1.23 | 1.23 |
| French | Latin | 1.50 | 1.42 | 1.29 | 1.71 | 1.44 | 1.45 | 1.44 | 1.44 | 1.64 | 1.70 | 1.71 |
| Swahili | Latin | 2.08 | 2.29 | 1.60 | 2.51 | 1.87 | 2.12 | 1.87 | 2.37 | 2.46 | 2.46 | 2.49 |
| Yoruba | Latin | 2.55 | 2.76 | 1.69 | 2.93 | 2.26 | 2.64 | 2.26 | 3.01 | 3.00 | 2.89 | 3.14 |
| Igbo | Latin | 2.27 | 2.51 | 1.82 | 2.55 | 1.73 | 2.14 | 1.73 | 2.57 | 2.53 | 2.53 | 2.59 |
| Wolof | Latin | 1.91 | 1.94 | 1.83 | 2.09 | 1.85 | 1.90 | 1.85 | 2.06 | 2.02 | 2.05 | 2.08 |
| Sesotho | Latin | 1.88 | 1.97 | 1.85 | 2.09 | 1.70 | 1.87 | 1.70 | 2.04 | 2.04 | 2.05 | 2.07 |
| Lingala | Latin | 1.94 | 2.00 | 1.88 | 2.13 | 1.81 | 2.00 | 1.81 | 2.07 | 2.10 | 2.10 | 2.11 |
| Akan / Twi | Latin | 2.02 | 2.19 | 1.89 | 2.26 | 1.91 | 2.13 | 1.91 | 2.54 | 2.24 | 2.59 | 2.60 |
| Hausa | Latin | 1.85 | 1.96 | 1.91 | 2.12 | 1.65 | 1.85 | 1.65 | 2.09 | 2.07 | 2.12 | 2.14 |
| Afrikaans | Latin | 1.77 | 1.74 | 1.99 | 1.99 | 1.65 | 1.74 | 1.65 | 1.82 | 1.90 | 1.96 | 1.97 |
| Kinyarwanda | Latin | 2.65 | 2.72 | 2.15 | 2.90 | 2.29 | 2.67 | 2.29 | 2.80 | 2.86 | 2.86 | 2.88 |
| Bambara | Latin | 2.49 | 2.54 | 2.28 | 2.59 | 2.41 | 2.50 | 2.41 | 2.99 | 2.61 | 3.02 | 3.07 |
| Luganda | Latin | 2.79 | 2.86 | 2.59 | 3.03 | 2.52 | 2.80 | 2.52 | 2.94 | 2.97 | 2.97 | 3.01 |
| Xhosa | Latin | 3.14 | 3.21 | 2.79 | 3.45 | 2.70 | 3.07 | 2.70 | 3.37 | 3.39 | 3.39 | 3.43 |
| Shona | Latin | 2.92 | 3.03 | 2.84 | 3.35 | 2.62 | 2.91 | 2.62 | 3.17 | 3.24 | 3.24 | 3.33 |
| Zulu | Latin | 3.18 | 3.30 | 2.86 | 3.59 | 2.75 | 3.16 | 2.75 | 3.48 | 3.52 | 3.53 | 3.58 |
| Oromo | Latin | 2.92 | 2.98 | 3.03 | 3.21 | 2.56 | 2.96 | 2.56 | 3.13 | 3.18 | 3.17 | 3.19 |
| Tigrinya | Ethiopic | 3.46 | 9.02 | 7.25 | 5.92 | 8.27 | 4.19 | 8.27 | 10.92 | 8.54 | 10.90 | 10.91 |
| Amharic | Ethiopic | 3.04 | 9.84 | 7.91 | 6.45 | 8.97 | 3.74 | 8.97 | 11.94 | 9.33 | 11.91 | 11.92 |
| N'Ko | N'Ko | 10.73 | 10.86 | 10.87 | 7.47 | 10.87 | 10.87 | 10.87 | 10.87 | 10.87 | 10.87 | 10.87 |
All fertility values from Table B.1 of paper (arXiv:2606.24460). Values are tokens per word; English ~1.23 is the baseline.
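The premium and fertility tables appear to be linked by a simple ratio: the fertility of a language divided by the English fertility of the same tokenizer. This is an inference from comparing the two tables, not a definition quoted from the paper, and the helper name `premium` is hypothetical. The sketch below recomputes a few premiums from the two-decimal fertility values; because the inputs are rounded, the results agree with the published premiums only to within a few hundredths.

```python
# Fertility values (tokens per word) copied from the fertility table.
FERTILITY = {
    "cl100k": {"English": 1.23, "Zulu": 3.58, "Amharic": 11.92},
    "Gemma 4": {"English": 1.23, "Zulu": 3.18, "Amharic": 3.04},
}

# Published premiums for the same cells, copied from the premium table.
PUBLISHED_PREMIUM = {
    "cl100k": {"Zulu": 2.90, "Amharic": 9.68},
    "Gemma 4": {"Zulu": 2.59, "Amharic": 2.47},
}


def premium(fertility: float, english_fertility: float) -> float:
    """Assumed definition: language fertility relative to English, same tokenizer."""
    if english_fertility <= 0:
        raise ValueError("English fertility must be positive")
    return fertility / english_fertility


if __name__ == "__main__":
    for tokenizer, row in FERTILITY.items():
        for language in ("Zulu", "Amharic"):
            recomputed = premium(row[language], row["English"])
            published = PUBLISHED_PREMIUM[tokenizer][language]
            print(f"{tokenizer} / {language}: "
                  f"recomputed {recomputed:.2f}x, published {published:.2f}x")
```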
Four Metrics of Tokenization Equity
Each metric captures a different dimension of the cost penalty imposed on African language speakers when using frontier LLMs.
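To make the cost and context penalties concrete, here is a minimal sketch of how fertility could translate into spend and usable context. It assumes strictly per-token billing and a fixed token window; the price (`USD_PER_MILLION_TOKENS`) and window size (`CONTEXT_TOKENS`) are hypothetical placeholders, not figures from the paper, and both function names are illustrative only. The fertility values are the cl100k entries for English and Amharic from the fertility table.

```python
# Hypothetical billing and window parameters, chosen only for illustration.
USD_PER_MILLION_TOKENS = 1.00
CONTEXT_TOKENS = 128_000

# cl100k fertility (tokens per word) from the fertility table.
ENGLISH_FERTILITY = 1.23
AMHARIC_FERTILITY = 11.92


def cost_per_million_words(fertility: float, usd_per_million_tokens: float) -> float:
    """Cost of processing one million words when billing is per token."""
    return fertility * usd_per_million_tokens


def effective_context_words(context_tokens: int, fertility: float) -> float:
    """Approximate number of words that fit in a fixed token window."""
    return context_tokens / fertility


if __name__ == "__main__":
    for name, f in (("English", ENGLISH_FERTILITY), ("Amharic", AMHARIC_FERTILITY)):
        cost = cost_per_million_words(f, USD_PER_MILLION_TOKENS)
        words = effective_context_words(CONTEXT_TOKENS, f)
        print(f"{name}: ${cost:.2f} per million words, "
              f"about {words:,.0f} words per context window")
```

Under these assumptions the cost ratio between two languages equals their premium, and the usable context shrinks by the same factor.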
How This Leaderboard Was Built
The leaderboard was built with the afri-fertility Python library (Apache 2.0, open source). 1,012 sentences per pair were tokenized, and tokens-per-word ratios were computed with 95% confidence intervals.
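The procedure above can be sketched in a few lines. This is not the afri-fertility API: every name below (`fertility`, `bootstrap_ci`, `toy_tokenize`) is hypothetical. The sketch assumes words are whitespace-delimited, that fertility is averaged per sentence, and that the interval is a percentile bootstrap; the paper may use different choices. The tokenizer is a toy stand-in and the sentences are sample phrases, not the benchmark corpus.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def fertility(sentence: str, tokenize: Callable[[str], list[str]]) -> float:
    """Tokens per whitespace-delimited word for one sentence."""
    words = sentence.split()
    if not words:
        raise ValueError("sentence has no words")
    return len(tokenize(sentence)) / len(words)


def bootstrap_ci(values: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(values), lower, upper


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: splits each word into chunks of up to 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


# Sample Swahili phrases for illustration only; not the evaluation data.
SENTENCES = [
    "Habari ya asubuhi",
    "Asante sana rafiki yangu",
    "Ninapenda kusoma vitabu",
    "Maji ni uhai",
]

if __name__ == "__main__":
    per_sentence = [fertility(s, toy_tokenize) for s in SENTENCES]
    point, low, high = bootstrap_ci(per_sentence)
    print(f"fertility = {point:.2f} (95% CI {low:.2f} to {high:.2f})")
```

To try this with a real tokenizer, replace `toy_tokenize` with any callable that maps a string to a list of tokens.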






