The Hidden Cost of African Language Tokenization

Frontier LLM tokenizers can make African-language text cost up to 8.92× more than English, before a model runs a single forward pass. This leaderboard ranks 11 tokenizers across 20 African languages and 3 scripts to show which frontier models impose the lowest tokenization cost, latency, and context penalties.

11 Tokenizers · 20 Languages · 3 Scripts · 616 Data Points

Token Fertility Leaderboard

Lower mean premium = less tokenization overhead = lower API cost. Ranked ascending.

Rank  Tokenizer      Developer       Vocab    Fertility (tok/word)  Latin Tax  Ethiopic Tax  N'Ko Tax  Mean Premium
1     Gemma 4        Google          262,144  2.93                  1.95×      2.64×         8.73×     2.38×
2     Llama 4        Meta            200,000  3.01                  1.96×      3.23×         8.86×     2.46×
3     BLOOM          BigScience      250,680  3.21                  1.76×      6.10×         8.75×     2.59×
4     Qwen3          Qwen / Alibaba  151,643  3.30                  2.14×      4.94×         5.96×     2.63×
5     o200k_harmony  OpenAI          201,088  3.28                  1.76×      7.08×         8.92×     2.70×
6     o200k_base     OpenAI          200,019  3.28                  1.76×      7.08×         8.92×     2.70×
7     aya-expanse    Cohere          255,000  3.67                  2.05×      7.72×         8.89×     3.00×
8     DeepSeek V3    DeepSeek        128,000  3.73                  2.15×      7.29×         8.87×     3.04×
9     Tekken         Mistral         131,072  4.01                  2.10×      9.06×         8.62×     3.18×
10    Llama 3.1      Meta            128,000  4.03                  2.18×      9.26×         8.82×     3.27×
11    cl100k_base    OpenAI          100,277  4.07                  2.22×      9.27×         8.82×     3.31×

* Mean premium = average tokens-per-word ratio vs English across 19 African languages (16 Latin-script, 2 Ethiopic, 1 N'Ko). Lower is better. All values from paper (arXiv:2606.24460). Published: 27 June 2026.
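
As a rough illustration of what the Mean Premium column implies, the sketch below converts a premium into a relative cost and an English-equivalent context window. It assumes cost scales linearly with token count; the $100 job, the 128,000-token window, and the function names are hypothetical inputs chosen for illustration, not figures from the paper.

```python
# Mean premiums copied from the leaderboard above (a subset of tokenizers).
MEAN_PREMIUM = {
    "Gemma 4": 2.38,
    "Llama 4": 2.46,
    "o200k_base": 2.70,
    "cl100k_base": 3.31,
}


def relative_cost(english_cost: float, premium: float) -> float:
    """Cost of the same content in an African language.

    Assumes cost is proportional to the number of tokens processed.
    """
    return english_cost * premium


def effective_context(context_tokens: int, premium: float) -> int:
    """English-equivalent amount of content that fits in a context window."""
    return int(context_tokens / premium)


# Hypothetical inputs: a job that costs $100 in English, a 128,000-token window.
for name, premium in MEAN_PREMIUM.items():
    cost = relative_cost(100.0, premium)
    window = effective_context(128_000, premium)
    print(f"{name:12s} cost ${cost:7.2f}   effective window {window:,} tokens")
```

Under these assumptions, a lower mean premium raises the amount of content that fits in a fixed window by the same factor that it lowers cost.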

[Charts: mean premium vs English for the top 5 tokenizers; lower bars = lower overhead. Panels: All African Languages · Latin Script Languages (16 Latin-script African languages) · Ethiopic Script (Amharic & Tigrinya combined) · N'Ko Script (Manding/Bambara in N'Ko script)]

The African Language Tax by Language

Tokenization premium vs English for all 19 African languages across all 11 tokenizers.

Premium scale: ≤ 1.5× · 1.5–2.0× · 2.0–3.0× · 3.0–5.0× · 5.0–7.0× · 7.0×+
Language Script Gemma 4 aya-exp. BLOOM Qwen3 o200k-H Llama 4 o200k-B Tekken DeepSeek Llama 3.1 cl100k
Swahili Latin 1.70× 1.88× 1.29× 2.00× 1.54× 1.73× 1.54× 1.88× 2.01× 2.00× 2.02×
Yoruba Latin 2.07× 2.26× 1.36× 2.34× 1.85× 2.15× 1.85× 2.39× 2.45× 2.35× 2.55×
Igbo Latin 1.85× 2.06× 1.47× 2.04× 1.42× 1.75× 1.42× 2.04× 2.07× 2.05× 2.11×
Wolof Latin 1.55× 1.59× 1.48× 1.67× 1.52× 1.55× 1.52× 1.63× 1.65× 1.66× 1.69×
Sesotho Latin 1.53× 1.61× 1.49× 1.67× 1.40× 1.53× 1.40× 1.62× 1.67× 1.67× 1.68×
Lingala Latin 1.58× 1.64× 1.52× 1.70× 1.48× 1.63× 1.48× 1.64× 1.71× 1.71× 1.72×
Akan / Twi Latin 1.64× 1.80× 1.52× 1.81× 1.57× 1.74× 1.57× 2.02× 1.83× 2.10× 2.11×
Hausa Latin 1.51× 1.61× 1.54× 1.69× 1.35× 1.51× 1.35× 1.65× 1.69× 1.72× 1.74×
Afrikaans Latin 1.44× 1.42× 1.60× 1.59× 1.35× 1.42× 1.35× 1.44× 1.55× 1.59× 1.60×
Kinyarwanda Latin 2.16× 2.22× 1.73× 2.32× 1.88× 2.18× 1.88× 2.22× 2.33× 2.32× 2.34×
Bambara Latin 2.03× 2.08× 1.83× 2.07× 1.98× 2.04× 1.98× 2.37× 2.13× 2.45× 2.49×
Luganda Latin 2.27× 2.34× 2.08× 2.42× 2.07× 2.28× 2.07× 2.33× 2.42× 2.41× 2.44×
Xhosa Latin 2.56× 2.63× 2.25× 2.75× 2.22× 2.50× 2.22× 2.67× 2.77× 2.76× 2.78×
Shona Latin 2.37× 2.48× 2.28× 2.68× 2.15× 2.37× 2.15× 2.51× 2.64× 2.63× 2.70×
Zulu Latin 2.59× 2.70× 2.31× 2.87× 2.26× 2.58× 2.26× 2.76× 2.87× 2.87× 2.90×
Oromo Latin 2.38× 2.44× 2.44× 2.56× 2.10× 2.41× 2.10× 2.48× 2.60× 2.57× 2.59×
Tigrinya Ethiopic 2.82× 7.38× 5.84× 4.73× 6.79× 3.42× 6.79× 8.66× 6.97× 8.85× 8.86×
Amharic Ethiopic 2.47× 8.06× 6.37× 5.16× 7.37× 3.05× 7.37× 9.47× 7.61× 9.67× 9.68×
N'Ko N'Ko 8.73× 8.89× 8.75× 5.96× 8.92× 8.86× 8.92× 8.62× 8.87× 8.82× 8.82×

All premium values from Table B.2 of paper (arXiv:2606.24460).
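
The summary columns of the leaderboard can be checked against the per-language table. The sketch below recomputes the Gemma 4 row from the Gemma 4 column above; the variable names are illustrative, and the values are the rounded figures shown on this page, so small rounding differences against the paper are possible.

```python
# Gemma 4 premiums copied from the per-language table above.
GEMMA4_PREMIUM = {
    "Latin": [1.70, 2.07, 1.85, 1.55, 1.53, 1.58, 1.64, 1.51,
              1.44, 2.16, 2.03, 2.27, 2.56, 2.37, 2.59, 2.38],
    "Ethiopic": [2.82, 2.47],  # Tigrinya, Amharic
    "N'Ko": [8.73],
}


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


# Per-script "tax": the mean premium over the languages written in that script.
script_tax = {script: mean(vals) for script, vals in GEMMA4_PREMIUM.items()}

# Mean premium: an unweighted mean over all 19 languages,
# not a mean of the three per-script figures.
all_values = [v for vals in GEMMA4_PREMIUM.values() for v in vals]
mean_premium = mean(all_values)

for script, tax in script_tax.items():
    print(f"{script:9s} tax  {tax:.3f}")
print(f"Mean premium over {len(all_values)} languages: {mean_premium:.3f}")
```

Because 16 of the 19 languages use Latin script, the mean premium sits much closer to the Latin figure than to the Ethiopic or N'Ko figures.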

Fertility — Tokens per Word

Raw fertility F(L,T) from Table B.1 of the paper. English and French included as controls. Lower = more token-efficient.

Fertility scale: < 2.0 · 2.0–3.0 · 3.0–5.0 · 5.0–7.0 · 7.0–9.0 · 9.0+
Language Script Gemma 4 aya-exp. BLOOM Qwen3 o200k-H Llama 4 o200k-B Tekken DeepSeek Llama 3.1 cl100k
English Latin 1.23 1.22 1.24 1.25 1.22 1.23 1.22 1.26 1.23 1.23 1.23
French Latin 1.50 1.42 1.29 1.71 1.44 1.45 1.44 1.44 1.64 1.70 1.71
Swahili Latin 2.08 2.29 1.60 2.51 1.87 2.12 1.87 2.37 2.46 2.46 2.49
Yoruba Latin 2.55 2.76 1.69 2.93 2.26 2.64 2.26 3.01 3.00 2.89 3.14
Igbo Latin 2.27 2.51 1.82 2.55 1.73 2.14 1.73 2.57 2.53 2.53 2.59
Wolof Latin 1.91 1.94 1.83 2.09 1.85 1.90 1.85 2.06 2.02 2.05 2.08
Sesotho Latin 1.88 1.97 1.85 2.09 1.70 1.87 1.70 2.04 2.04 2.05 2.07
Lingala Latin 1.94 2.00 1.88 2.13 1.81 2.00 1.81 2.07 2.10 2.10 2.11
Akan / Twi Latin 2.02 2.19 1.89 2.26 1.91 2.13 1.91 2.54 2.24 2.59 2.60
Hausa Latin 1.85 1.96 1.91 2.12 1.65 1.85 1.65 2.09 2.07 2.12 2.14
Afrikaans Latin 1.77 1.74 1.99 1.99 1.65 1.74 1.65 1.82 1.90 1.96 1.97
Kinyarwanda Latin 2.65 2.72 2.15 2.90 2.29 2.67 2.29 2.80 2.86 2.86 2.88
Bambara Latin 2.49 2.54 2.28 2.59 2.41 2.50 2.41 2.99 2.61 3.02 3.07
Luganda Latin 2.79 2.86 2.59 3.03 2.52 2.80 2.52 2.94 2.97 2.97 3.01
Xhosa Latin 3.14 3.21 2.79 3.45 2.70 3.07 2.70 3.37 3.39 3.39 3.43
Shona Latin 2.92 3.03 2.84 3.35 2.62 2.91 2.62 3.17 3.24 3.24 3.33
Zulu Latin 3.18 3.30 2.86 3.59 2.75 3.16 2.75 3.48 3.52 3.53 3.58
Oromo Latin 2.92 2.98 3.03 3.21 2.56 2.96 2.56 3.13 3.18 3.17 3.19
Tigrinya Ethiopic 3.46 9.02 7.25 5.92 8.27 4.19 8.27 10.92 8.54 10.90 10.91
Amharic Ethiopic 3.04 9.84 7.91 6.45 8.97 3.74 8.97 11.94 9.33 11.91 11.92
N'Ko N'Ko 10.73 10.86 10.87 7.47 10.87 10.87 10.87 10.87 10.87 10.87 10.87

All fertility values from Table B.1 of paper (arXiv:2606.24460). Values are tokens per word; English ~1.23 is the baseline.

Four Metrics of Tokenization Equity

Each metric captures a different dimension of the cost penalty imposed on African language speakers when using frontier LLMs.

01
Fertility F(L,T)
The average number of subword tokens produced per word in language L by tokenizer T. A fertility of 3.0 means 3 tokens per word on average. For reference, English fertility on o200k_base, the GPT-4o tokenizer, is 1.22 in the fertility table.
F = n_tokens / n_words
02
Premium P(L,T)
The fertility of language L relative to English on the same tokenizer. A premium of 2.38× means processing that language costs 2.38× more in tokens — and therefore in API cost and latency — than English.
P = F(L,T) / F(English,T)
03
Chars per Token (CPT)
Average characters packed into each token. Higher CPT means each token carries more information. English CPT is ~4.8; Ethiopic languages often fall below 1.0, meaning tokens represent less than one character.
CPT = n_chars / n_tokens
04
Bytes per Token (BPT)
Average bytes packed per token. Captures the penalty from multi-byte Unicode scripts: UTF-8 encodes N'Ko with 2 bytes per character and Ethiopic with 3. A BPT of 1.0 in N'Ko means the tokenizer is falling back to byte-level encoding.
BPT = n_bytes / n_tokens
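
The four formulas above can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: toy_tokenize is a hypothetical stand-in that keeps ASCII words whole and falls back to one token per UTF-8 byte for everything else, words are split on whitespace, and characters and bytes are counted with whitespace removed. A real measurement would plug in an actual tokenizer in its place.

```python
from typing import Callable


def toy_tokenize(text: str) -> list[bytes]:
    """Hypothetical tokenizer: ASCII words stay whole, other words become bytes."""
    tokens: list[bytes] = []
    for word in text.split():
        if word.isascii():
            tokens.append(word.encode("utf-8"))
        else:
            # Byte-level fallback: one token per UTF-8 byte.
            tokens.extend(bytes([b]) for b in word.encode("utf-8"))
    return tokens


def equity_metrics(text: str, tokenize: Callable[[str], list]) -> dict:
    """Compute fertility, CPT and BPT for one text under one tokenizer."""
    n_tokens = len(tokenize(text))
    n_words = len(text.split())
    stripped = "".join(text.split())  # assumption: whitespace is not counted
    return {
        "fertility": n_tokens / n_words,
        "cpt": len(stripped) / n_tokens,
        "bpt": len(stripped.encode("utf-8")) / n_tokens,
    }


english = equity_metrics("the cat sat", toy_tokenize)
amharic = equity_metrics("ሰላም ለዓለም", toy_tokenize)  # 7 Ethiopic characters, 3 bytes each

# Premium: fertility relative to English on the same tokenizer.
premium = amharic["fertility"] / english["fertility"]

print(english)
print(amharic)
print(f"premium = {premium:.2f}x")
```

With this toy tokenizer the Amharic sample lands at a BPT of exactly 1.0 and a CPT below 1.0, the byte-fallback signature that the BPT and CPT definitions describe.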

How This Leaderboard Was Built

01
Parallel corpus selection
FLORES-200+ was used as the primary evaluation corpus — a professionally translated parallel dataset across all 20 languages — ensuring content differences do not confound language effects.
02
Fertility computation
Each language–tokenizer pair was run through the afri-fertility Python library (open source, Apache 2.0). For each pair, 1,012 sentences were tokenized and tokens-per-word ratios were computed with 95% confidence intervals.
03
Premium normalisation
Each language's fertility was divided by the English fertility on the same tokenizer. This isolates the tokenizer's language-specific penalty from overall tokenizer efficiency differences.
04
Open dataset release
All 616 language–tokenizer rows (fertility, premium, CPT, BPT with confidence intervals) are publicly available on HuggingFace under Apache 2.0 for independent reproduction and extension.
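
The fertility and premium steps above can be sketched as follows. This is not the afri-fertility API, which is not shown on this page: the function names, the chunk-based toy tokenizer, and the synthetic placeholder sentences are all hypothetical. The sketch also assumes fertility is averaged per sentence and that the 95% interval uses a normal approximation; the paper may aggregate differently.

```python
import math
import statistics


def chunk_tokenize(text: str, size: int = 4) -> list[str]:
    """Hypothetical tokenizer: split each word into chunks of up to `size` chars."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


def sentence_fertilities(sentences, tokenize):
    """Tokens per whitespace-separated word, one value per non-empty sentence."""
    return [len(tokenize(s)) / len(s.split()) for s in sentences if s.split()]


def mean_with_ci(values, z: float = 1.96):
    """Mean with a normal-approximation 95% confidence interval."""
    m = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


def premium(lang_sentences, english_sentences, tokenize) -> float:
    """Fertility of a language divided by English fertility, same tokenizer."""
    f_lang, _, _ = mean_with_ci(sentence_fertilities(lang_sentences, tokenize))
    f_en, _, _ = mean_with_ci(sentence_fertilities(english_sentences, tokenize))
    return f_lang / f_en


# Synthetic placeholder "sentences", not real language data.
short_words = ["aaaa bbbb cccc", "dddd eeee"]
long_words = ["aaaaaaaa bbbbbbbbbbbb", "cccccccc dddd"]

f_mean, f_low, f_high = mean_with_ci(sentence_fertilities(long_words, chunk_tokenize))
print(f"fertility {f_mean:.2f} (95% CI {f_low:.2f} to {f_high:.2f})")
print(f"premium {premium(long_words, short_words, chunk_tokenize):.2f}x")
```

Dividing by English fertility on the same tokenizer is what makes premiums comparable across tokenizers with different overall efficiency.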
Research Paper
The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs
This leaderboard is based on the first systematic, cross-tokenizer measurement of pre-inference computational overhead for African languages. It documents a structural penalty encoded directly into the subword vocabularies of frontier tokenizers.
Author: Olaoye Anthony Somide (DataLens Africa Research)
arXiv: 2606.24460
License: CC-BY 4.0
Dataset: CipherSenseAI/afri-fertility-results
Published: June 2026
DataLens Studio

Help fix the African language tax.

Better tokenizers require better training data. By annotating African language text through DataLens Studio, you directly contribute to corpora that tokenizer teams use to expand vocabulary coverage — reducing fertility penalties for millions of speakers.

1
Annotate African language text
Label text across 50+ African languages including those with high fertility penalties.
2
Build richer tokenizer training corpora
Your work feeds the datasets that allow tokenizers to learn African subword patterns.
3
Watch the premium drop on this leaderboard
Each percentage point reduction in premium lowers costs for millions of African language speakers.
DataLens Studio
African Language Annotation Platform
Text, audio & RLHF annotation tasks
50+ African languages supported
Earn while contributing to African AI
Quality-reviewed by African language experts

Every annotation is a step toward AI that doesn't tax African languages.