What is the Token Fertility Leaderboard?

The Token Fertility Leaderboard ranks 11 frontier LLM tokenizers on how efficiently they handle 20 African languages. It is based on the peer-reviewed study 'The African Language Tax' (arXiv:2606.24460) by Olaoye Anthony Somide. Lower fertility and lower premium scores indicate a more efficient tokenizer.

What is token fertility?

Token fertility F(L,T) measures the average number of subword tokens a tokenizer T produces per word in language L. A fertility of 2.0 means two tokens per word on average, twice the count of an ideal one-token-per-word tokenizer. Higher fertility means higher cost and slower inference.

Which tokenizer is best for African languages?

As of June 2026, Google Gemma 4's tokenizer leads with a mean premium of 2.38× across 19 African languages, the lowest of the 11 tokenizers evaluated. Meta's Llama 4 ranks second at 2.46×. OpenAI's cl100k_base is the worst at 3.31×.

Which African language suffers the worst tokenization penalty?

N'Ko, the script used for writing Manding/Bambara and several West African languages, carries the highest penalty on 8 of the 11 tokenizers, up to 8.92× the cost of English. Amharic and Tigrinya (both Ethiopic script) follow, with penalties ranging from 2.47× to 9.68× depending on the tokenizer; on Tekken, Llama 3.1, and cl100k_base they exceed the N'Ko penalty.

Why does the N'Ko script have such a high tokenization penalty?

N'Ko is a right-to-left script that is extremely rare in the training corpora of most tokenizers. As a result, the subword vocabulary contains almost no N'Ko tokens, forcing the tokenizer to fall back to byte-level representations and inflating the token count by up to 9× relative to English.

What is the tokenization premium?

The tokenization premium P(L,T) is the ratio of a language's fertility to English fertility on the same tokenizer. A premium of 2.38× means that language needs 2.38× as many tokens per word as English, and therefore costs proportionally more in API fees and inference time.

The Hidden Cost of African Language Tokenization.

Frontier LLMs cost up to 8.92× more to process African languages than English, even before a model runs a forward pass. This leaderboard ranks 11 tokenizers across 20 African languages and 3 scripts to reveal which frontier models impose the lowest tokenization costs, latency, and context penalties.


11 Tokenizers · 20 Languages · 3 Scripts · 616 Data Points

Rankings

Token Fertility Leaderboard

Lower mean premium = less tokenization overhead = lower API cost. Ranked ascending.

Rank	Tokenizer	Provider · Vocab	Fertility (tok/word)	Latin Tax	Ethiopic Tax	N'Ko Tax	Mean Premium*
👑1	Gemma 4	Google · 262,144 vocab	2.93	1.95×	2.64×	8.73×	2.38×
2	Llama 4	Meta · 200,000 vocab	3.01	1.96×	3.23×	8.86×	2.46×
3	BLOOM	BigScience · 250,680 vocab	3.21	1.76×	6.10×	8.75×	2.59×
4	Qwen3	Qwen / Alibaba · 151,643 vocab	3.30	2.14×	4.94×	5.96×	2.63×
5	o200k_harmony	OpenAI · 201,088 vocab	3.28	1.76×	7.08×	8.92×	2.70×
6	o200k_base	OpenAI · 200,019 vocab	3.28	1.76×	7.08×	8.92×	2.70×
7	aya-expanse	Cohere · 255,000 vocab	3.67	2.05×	7.72×	8.89×	3.00×
8	DeepSeek V3	DeepSeek · 128,000 vocab	3.73	2.15×	7.29×	8.87×	3.04×
9	Tekken	Mistral · 131,072 vocab	4.01	2.10×	9.06×	8.62×	3.18×
10	Llama 3.1	Meta · 128,000 vocab	4.03	2.18×	9.26×	8.82×	3.27×
11	cl100k_base	OpenAI · 100,277 vocab	4.07	2.22×	9.27×	8.82×	3.31×

* Mean premium = average tokens-per-word ratio vs English across 19 African languages (16 Latin-script, 2 Ethiopic, 1 N'Ko). Lower is better. All values from paper (arXiv:2606.24460). Published: 27 June 2026.
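The aggregation described in the footnote can be checked by hand. The sketch below recomputes the Gemma 4 row from the per-language premiums listed for Gemma 4 in the Language Breakdown table; it assumes the mean is unweighted over languages, as the footnote states, and the variable names are illustrative rather than taken from the paper's code.

```python
from statistics import mean

# Gemma 4 premiums copied from the per-language premium table (Table B.2 values).
gemma4_premium = {
    "Latin": [1.70, 2.07, 1.85, 1.55, 1.53, 1.58, 1.64, 1.51,
              1.44, 2.16, 2.03, 2.27, 2.56, 2.37, 2.59, 2.38],
    "Ethiopic": [2.82, 2.47],  # Tigrinya, Amharic
    "N'Ko": [8.73],
}

# Per-script "tax": mean premium within each script.
script_tax = {script: mean(vals) for script, vals in gemma4_premium.items()}

# Mean premium: unweighted mean over all 19 languages,
# not the mean of the three script-level taxes.
all_values = [v for vals in gemma4_premium.values() for v in vals]
mean_premium = mean(all_values)
mean_of_script_taxes = mean(script_tax.values())

for script, tax in script_tax.items():
    print(f"{script} tax: {tax:.3f}x")
print(f"Mean premium over {len(all_values)} languages: {mean_premium:.3f}x")
print(f"Mean of script taxes (a different quantity): {mean_of_script_taxes:.3f}x")
```

Because 16 of the 19 languages use Latin script, the per-language mean sits close to the Latin tax; averaging the three script taxes instead would give a much larger number dominated by N'Ko.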

[Charts: mean premium vs English for the top 5 tokenizers; lower bars = lower overhead. Panels: All African Languages; Latin Script Languages (17 Latin-script African languages); Ethiopic Script (Amharic & Tigrinya combined); N'Ko Script (Manding/Bambara in N'Ko script).]

Language Breakdown

The African Language Tax by Language

Tokenization premium vs English for all 19 African languages across all 11 tokenizers.


Language	Script	Gemma 4	aya-exp.	BLOOM	Qwen3	o200k-H	Llama 4	o200k-B	Tekken	DeepSeek	Llama 3.1	cl100k
Swahili	Latin	1.70×	1.88×	1.29×	2.00×	1.54×	1.73×	1.54×	1.88×	2.01×	2.00×	2.02×
Yoruba	Latin	2.07×	2.26×	1.36×	2.34×	1.85×	2.15×	1.85×	2.39×	2.45×	2.35×	2.55×
Igbo	Latin	1.85×	2.06×	1.47×	2.04×	1.42×	1.75×	1.42×	2.04×	2.07×	2.05×	2.11×
Wolof	Latin	1.55×	1.59×	1.48×	1.67×	1.52×	1.55×	1.52×	1.63×	1.65×	1.66×	1.69×
Sesotho	Latin	1.53×	1.61×	1.49×	1.67×	1.40×	1.53×	1.40×	1.62×	1.67×	1.67×	1.68×
Lingala	Latin	1.58×	1.64×	1.52×	1.70×	1.48×	1.63×	1.48×	1.64×	1.71×	1.71×	1.72×
Akan / Twi	Latin	1.64×	1.80×	1.52×	1.81×	1.57×	1.74×	1.57×	2.02×	1.83×	2.10×	2.11×
Hausa	Latin	1.51×	1.61×	1.54×	1.69×	1.35×	1.51×	1.35×	1.65×	1.69×	1.72×	1.74×
Afrikaans	Latin	1.44×	1.42×	1.60×	1.59×	1.35×	1.42×	1.35×	1.44×	1.55×	1.59×	1.60×
Kinyarwanda	Latin	2.16×	2.22×	1.73×	2.32×	1.88×	2.18×	1.88×	2.22×	2.33×	2.32×	2.34×
Bambara	Latin	2.03×	2.08×	1.83×	2.07×	1.98×	2.04×	1.98×	2.37×	2.13×	2.45×	2.49×
Luganda	Latin	2.27×	2.34×	2.08×	2.42×	2.07×	2.28×	2.07×	2.33×	2.42×	2.41×	2.44×
Xhosa	Latin	2.56×	2.63×	2.25×	2.75×	2.22×	2.50×	2.22×	2.67×	2.77×	2.76×	2.78×
Shona	Latin	2.37×	2.48×	2.28×	2.68×	2.15×	2.37×	2.15×	2.51×	2.64×	2.63×	2.70×
Zulu	Latin	2.59×	2.70×	2.31×	2.87×	2.26×	2.58×	2.26×	2.76×	2.87×	2.87×	2.90×
Oromo	Latin	2.38×	2.44×	2.44×	2.56×	2.10×	2.41×	2.10×	2.48×	2.60×	2.57×	2.59×
Tigrinya	Ethiopic	2.82×	7.38×	5.84×	4.73×	6.79×	3.42×	6.79×	8.66×	6.97×	8.85×	8.86×
Amharic	Ethiopic	2.47×	8.06×	6.37×	5.16×	7.37×	3.05×	7.37×	9.47×	7.61×	9.67×	9.68×
N'Ko	N'Ko	8.73×	8.89×	8.75×	5.96×	8.92×	8.86×	8.92×	8.62×	8.87×	8.82×	8.82×

All premium values from Table B.2 of paper (arXiv:2606.24460).

Fertility — Tokens per Word

Raw fertility F(L,T) from Table B.1 of the paper. English and French included as controls. Lower = more token-efficient.


Language	Script	Gemma 4	aya-exp.	BLOOM	Qwen3	o200k-H	Llama 4	o200k-B	Tekken	DeepSeek	Llama 3.1	cl100k
English	Latin	1.23	1.22	1.24	1.25	1.22	1.23	1.22	1.26	1.23	1.23	1.23
French	Latin	1.50	1.42	1.29	1.71	1.44	1.45	1.44	1.44	1.64	1.70	1.71
Swahili	Latin	2.08	2.29	1.60	2.51	1.87	2.12	1.87	2.37	2.46	2.46	2.49
Yoruba	Latin	2.55	2.76	1.69	2.93	2.26	2.64	2.26	3.01	3.00	2.89	3.14
Igbo	Latin	2.27	2.51	1.82	2.55	1.73	2.14	1.73	2.57	2.53	2.53	2.59
Wolof	Latin	1.91	1.94	1.83	2.09	1.85	1.90	1.85	2.06	2.02	2.05	2.08
Sesotho	Latin	1.88	1.97	1.85	2.09	1.70	1.87	1.70	2.04	2.04	2.05	2.07
Lingala	Latin	1.94	2.00	1.88	2.13	1.81	2.00	1.81	2.07	2.10	2.10	2.11
Akan / Twi	Latin	2.02	2.19	1.89	2.26	1.91	2.13	1.91	2.54	2.24	2.59	2.60
Hausa	Latin	1.85	1.96	1.91	2.12	1.65	1.85	1.65	2.09	2.07	2.12	2.14
Afrikaans	Latin	1.77	1.74	1.99	1.99	1.65	1.74	1.65	1.82	1.90	1.96	1.97
Kinyarwanda	Latin	2.65	2.72	2.15	2.90	2.29	2.67	2.29	2.80	2.86	2.86	2.88
Bambara	Latin	2.49	2.54	2.28	2.59	2.41	2.50	2.41	2.99	2.61	3.02	3.07
Luganda	Latin	2.79	2.86	2.59	3.03	2.52	2.80	2.52	2.94	2.97	2.97	3.01
Xhosa	Latin	3.14	3.21	2.79	3.45	2.70	3.07	2.70	3.37	3.39	3.39	3.43
Shona	Latin	2.92	3.03	2.84	3.35	2.62	2.91	2.62	3.17	3.24	3.24	3.33
Zulu	Latin	3.18	3.30	2.86	3.59	2.75	3.16	2.75	3.48	3.52	3.53	3.58
Oromo	Latin	2.92	2.98	3.03	3.21	2.56	2.96	2.56	3.13	3.18	3.17	3.19
Tigrinya	Ethiopic	3.46	9.02	7.25	5.92	8.27	4.19	8.27	10.92	8.54	10.90	10.91
Amharic	Ethiopic	3.04	9.84	7.91	6.45	8.97	3.74	8.97	11.94	9.33	11.91	11.92
N'Ko	N'Ko	10.73	10.86	10.87	7.47	10.87	10.87	10.87	10.87	10.87	10.87	10.87

All fertility values from Table B.1 of paper (arXiv:2606.24460). Values are tokens per word; English ~1.23 is the baseline.
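To make the context and cost penalty concrete, the sketch below converts a few fertility values from the table into the approximate number of words that fit in a fixed token budget. The 128,000-token window is a hypothetical figure chosen for illustration, not a value from the paper, and the helper names are assumptions.

```python
# Fertility values (tokens per word) copied from the table above.
FERTILITY = {
    ("English", "cl100k"): 1.23,
    ("Swahili", "cl100k"): 2.49,
    ("Amharic", "cl100k"): 11.92,
    ("Amharic", "Gemma 4"): 3.04,
}

CONTEXT_TOKENS = 128_000  # hypothetical context window, not from the paper


def words_that_fit(fertility: float, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Approximate number of words that fit in a fixed token budget."""
    return int(context_tokens / fertility)


def tokens_for(words: int, fertility: float) -> int:
    """Approximate token count (and so billed tokens) for a text of `words` words."""
    return round(words * fertility)


capacity = {key: words_that_fit(f) for key, f in FERTILITY.items()}

for (language, tokenizer), n_words in capacity.items():
    print(f"{language} on {tokenizer}: about {n_words:,} words per {CONTEXT_TOKENS:,} tokens")
```

Under these assumptions, the same window holds roughly a tenth as many Amharic words as English words on cl100k, while Gemma 4 recovers most of that gap for Amharic.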

Methodology

Four Metrics of Tokenization Equity

Each metric captures a different dimension of the cost penalty imposed on African language speakers when using frontier LLMs.

Fertility F(L,T)

The average number of subword tokens produced per word in language L by tokenizer T. A fertility of 3.0 means 3 tokens per word on average. English fertility on o200k_base, the GPT-4o tokenizer, is ~1.22.

F = n_tokens / n_words

Premium P(L,T)

The fertility of language L relative to English on the same tokenizer. A premium of 2.38× means processing that language costs 2.38× more in tokens — and therefore in API cost and latency — than English.

P = F(L,T) / F(English,T)

Chars per Token (CPT)

Average characters packed into each token. Higher CPT means each token carries more information. English CPT is ~4.8; Ethiopic languages often fall below 1.0, meaning tokens represent less than one character.

CPT = n_chars / n_tokens

Bytes per Token (BPT)

Average bytes packed per token. Captures the penalty from multi-byte Unicode scripts: in UTF-8, N'Ko characters take 2 bytes each and Ethiopic characters take 3. A BPT of 1.0 in N'Ko means the tokenizer is falling back to byte-level encoding.

BPT = n_bytes / n_tokens
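The four formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the afri-fertility implementation: it assumes words are whitespace-delimited and characters are Unicode code points, and `byte_fallback` is a hypothetical worst-case tokenizer that emits one token per UTF-8 byte.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TokenStats:
    fertility: float        # F = n_tokens / n_words
    chars_per_token: float  # CPT = n_chars / n_tokens
    bytes_per_token: float  # BPT = n_bytes / n_tokens


def token_stats(text: str, tokenize: Callable[[str], List]) -> TokenStats:
    """Compute F, CPT and BPT for one text under one tokenizer."""
    n_tokens = len(tokenize(text))
    n_words = len(text.split())          # assumption: whitespace-delimited words
    n_chars = len(text)                  # Unicode code points
    n_bytes = len(text.encode("utf-8"))
    return TokenStats(n_tokens / n_words, n_chars / n_tokens, n_bytes / n_tokens)


def premium(f_language: float, f_english: float) -> float:
    """P = F(L,T) / F(English,T), both measured on the same tokenizer."""
    return f_language / f_english


def byte_fallback(text: str) -> List[int]:
    """Hypothetical worst-case tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


nko = token_stats("\u07d2\u07de\u07cf", byte_fallback)             # 3 N'Ko letters, 2 bytes each
ethiopic = token_stats("\u12a0\u121b\u122d\u129b", byte_fallback)  # 4 Ethiopic letters, 3 bytes each

print(nko)
print(ethiopic)
print(f"Yoruba premium on Gemma 4: {premium(2.55, 1.23):.2f}x")
```

With pure byte fallback, BPT is exactly 1.0 and CPT drops below 1.0 for any multi-byte script, which is the signature the CPT and BPT descriptions refer to.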

The Research

How This Leaderboard Was Built

Parallel corpus selection

FLORES-200+ was used as the primary evaluation corpus — a professionally translated parallel dataset across all 20 languages — ensuring content differences do not confound language effects.

Fertility computation

Each language–tokenizer pair was run through the open-source afri-fertility Python library (Apache 2.0). For each pair, 1,012 sentences were tokenized and tokens-per-word ratios were computed with 95% confidence intervals.

Premium normalisation

Each language's fertility was divided by the English fertility on the same tokenizer. This isolates the tokenizer's language-specific penalty from overall tokenizer efficiency differences.

Open dataset release

All 616 language–tokenizer rows (fertility, premium, CPT, BPT with confidence intervals) are publicly available on HuggingFace under Apache 2.0 for independent reproduction and extension.
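The fertility computation and premium normalisation steps above can be sketched as follows. The page does not say how the confidence intervals were derived, so this example assumes a normal approximation over per-sentence ratios; the toy tokenizer and the two-sentence samples are illustrative stand-ins, not FLORES data or the afri-fertility API.

```python
import math
from statistics import mean, stdev
from typing import Callable, List, Sequence, Tuple


def fertility_with_ci(
    sentences: Sequence[str],
    tokenize: Callable[[str], List],
    z: float = 1.96,  # normal-approximation multiplier for a 95% interval (assumption)
) -> Tuple[float, float, float]:
    """Return (mean fertility, ci_low, ci_high) over per-sentence token/word ratios."""
    ratios = [len(tokenize(s)) / len(s.split()) for s in sentences if s.split()]
    m = mean(ratios)
    half_width = z * stdev(ratios) / math.sqrt(len(ratios))
    return m, m - half_width, m + half_width


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits each word into 3-character pieces."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


# Illustrative sentences only; the study used 1,012 sentences per pair.
english = ["the cat sat on the mat", "a dog ran to the barn"]
swahili = ["paka alikaa juu ya mkeka", "mbwa alikimbia hadi ghalani"]

f_en, *_ = fertility_with_ci(english, toy_tokenize)
f_sw, ci_low, ci_high = fertility_with_ci(swahili, toy_tokenize)

# Premium normalisation: divide by English fertility on the same tokenizer.
premium = f_sw / f_en

print(f"English fertility: {f_en:.3f}")
print(f"Swahili fertility: {f_sw:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
print(f"Swahili premium: {premium:.2f}x")
```

With only two sentences the interval is very wide; at 1,012 sentences per pair the same formula would produce a much tighter band. A bootstrap over sentences would be a reasonable alternative if the per-sentence ratios are strongly skewed.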

Research Paper

The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs

This leaderboard is based on the first systematic, cross-tokenizer measurement of pre-inference computational overhead for African languages. It documents a structural penalty encoded directly into the subword vocabularies of frontier tokenizers.

Author: Olaoye Anthony Somide (DataLens Africa Research)

arXiv: 2606.24460

License: CC-BY 4.0

Dataset: CipherSenseAI/afri-fertility-results

Code: CipherSenseAI/afri-fertility

Published: June 2026


DataLens Studio

Help fix the African language tax.

Better tokenizers require better training data. By annotating African language text through DataLens Studio, you directly contribute to corpora that tokenizer teams use to expand vocabulary coverage — reducing fertility penalties for millions of speakers.

Annotate African language text

Label text across 50+ African languages including those with high fertility penalties.

Build richer tokenizer training corpora

Your work feeds the datasets that allow tokenizers to learn African subword patterns.

Watch the premium drop on this leaderboard

Each percentage point reduction in premium lowers costs for millions of African language speakers.

DataLens Studio

African Language Annotation Platform

Text, audio & RLHF annotation tasks

50+ African languages supported

Earn while contributing to African AI

Quality-reviewed by African language experts


Every annotation is a step toward AI that doesn't tax African languages.
