Tokenization

Tokenization is the process of breaking text down into tokens, the smallest units a language model actually processes. In English, one token corresponds to roughly 4 characters (about 3/4 of a word), and subword algorithms such as BPE keep common words whole while splitting rarer words into smaller pieces.

  • Tokenization is the preprocessing step that splits human-readable text into tokens, the smallest units a model operates on.
  • Modern LLMs work in subword units rather than whole words or single characters, and the most widely used method is BPE (Byte Pair Encoding).
  • For English, OpenAI estimates that 1 token averages about 4 characters and 100 tokens map to roughly 75 words.
  • Tokens are both cost and context length: API pricing and how much a model can read at once are both measured in tokens.
  • Morphologically rich languages like Korean tend to consume more tokens than English for the same meaning, so token efficiency belongs in any GEO and content-operations plan.

What Tokenization Is

Tokenization breaks input text into tokens, the smallest units a language model works with. A model never reads raw letters or words: the tokenizer maps each token to an integer ID, and the model takes that numeric sequence as input. Tokenization is therefore the mandatory gateway every piece of text passes through before it reaches the model, and the same sentence can be split differently, and yield a different token count, under a different tokenizer.

Early approaches split text on whitespace into words or broke it down character by character, but word-level splitting cannot handle words it has never seen before (OOV, out-of-vocabulary), while character-level splitting produces sequences that grow far too long. Most modern LLMs adopt the compromise between the two: subword tokenization. Common words stay intact as a single token, while rare words and compounds are broken into smaller pieces.
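The trade-off between the two early approaches can be seen in a few lines of Python. This is a minimal sketch: the helper names `word_tokens` and `char_tokens` and the tiny training vocabulary are made up for illustration and do not come from any real tokenizer.

```python
import re

def word_tokens(text):
    """Word-level split: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    """Character-level split: every character, spaces included, is one token."""
    return list(text)

# Hypothetical vocabulary built from a tiny "training" sentence.
train_vocab = set(word_tokens("the model reads tokens"))

sentence = "the model reads tokenizers"
words = word_tokens(sentence)
oov = [w for w in words if w not in train_vocab]  # words never seen in training
chars = char_tokens(sentence)

print(words)                   # ['the', 'model', 'reads', 'tokenizers']
print(oov)                     # ['tokenizers']
print(len(words), len(chars))  # 4 26
```

The word-level split is short but has no entry for "tokenizers", while the character-level split never fails yet is more than six times longer. Subword tokenization sits between the two.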

The Leading Method: BPE (Byte Pair Encoding)

The most widely used subword algorithm today is BPE (Byte Pair Encoding). BPE began as a text-compression algorithm, was adapted to neural machine translation by Sennrich et al., and became a de facto standard for LLM tokenizers after OpenAI adopted it for its GPT family. According to the Hugging Face documentation, BPE is used in many models including GPT, GPT-2, RoBERTa, BART, and DeBERTa.

The training principle behind BPE is simple. It first takes every individual character that appears in the corpus as the base vocabulary, then repeatedly merges the pair of adjacent symbols that occurs most frequently (single characters at first, previously merged subwords later), gradually building up longer and longer subwords. Each merge adds one new entry to the vocabulary, and the process continues until the target vocabulary size is reached.

Consider the example from the Hugging Face documentation. If the word frequencies are ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5), the base character vocabulary and the first three merges are as follows.

Base vocabulary: ["b", "g", "h", "n", "p", "s", "u"]

Merge 1: ("u", "g") -> "ug"   # most frequent, appearing 20 times in total
Merge 2: ("u", "n") -> "un"   # 16 times
Merge 3: ("h", "ug") -> "hug" # 15 times
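The merge loop above can be reproduced with a short Python sketch. The function name `train_bpe` and its return shape are illustrative choices, not a library API. The sketch also assumes ties are broken by first occurrence (the behavior of `Counter.most_common`), which does not matter here because no tie occurs among the top pairs.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} table."""
    # Every word starts out split into single characters.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best, count = pair_counts.most_common(1)[0]
        merges.append((best, count))
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return merges, splits

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = train_bpe(word_freqs, num_merges=3)

for (left, right), count in merges:
    print(f"({left!r}, {right!r}) -> {left + right!r}  # {count} times")
print(splits["hugs"])  # ('hug', 's')
```

Running it yields the same three merges and counts listed above (20, 16, 15). A real tokenizer would keep merging until the target vocabulary size is reached rather than stopping after a fixed three steps.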

Once training is complete, new text is turned into tokens by applying normalization, pre-tokenization, a breakdown into individual characters, and the learned merge rules, in that order. A tokenizer built this way can represent words absent from the training data through combinations of characters and subwords, which largely solves the OOV problem. Characters missing from the base vocabulary still map to an unknown token unless the tokenizer works at the byte level.
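As a rough illustration of the last two steps (character breakdown and merge application), the sketch below encodes unseen words with the three merge rules from the Hugging Face example. Normalization and pre-tokenization are skipped, and the name `encode_word`, the `[UNK]` placeholder, and the ID assignment are assumptions made for this example only.

```python
# Merge rules in the order they were learned (from the example above).
MERGES = [("u", "g"), ("u", "n"), ("h", "ug")]

# Base characters plus merged subwords; ID 0 is reserved for unknown symbols.
VOCAB = ["[UNK]", "b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}

def encode_word(word, merges=MERGES, unk="[UNK]"):
    """Split a word into characters, then apply each merge rule in learned order."""
    symbols = [ch if ch in TOKEN_TO_ID else unk for ch in word]
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge in place
            else:
                i += 1
    return symbols

for word in ["bug", "hugs", "mug"]:
    tokens = encode_word(word)
    ids = [TOKEN_TO_ID[t] for t in tokens]
    print(word, tokens, ids)
# bug ['b', 'ug'] [1, 8]
# hugs ['hug', 's'] [10, 6]
# mug ['[UNK]', 'ug'] [0, 8]
```

"bug" never appeared in the training words, yet it is covered by the pieces "b" and "ug". "mug" shows the remaining gap: "m" is not in the base vocabulary, so it falls back to the unknown token.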

An intuitive example of this subword splitting appears in OpenAI's tiktoken documentation. tiktoken is the BPE tokenizer for OpenAI models, and it will often split the word "encoding" into "encod" and "ing", for instance. Doing so lets the model recognize common fragments such as frequently recurring stems and suffixes, making it easier to learn grammatical patterns.

Evidence and Numbers

The central figure in tokenization is how much text gets compressed into tokens. The English conversion rules OpenAI provides are as follows (OpenAI Help Center and the Tokenizer documentation).

  • 1 token ≈ 4 characters (English)
  • 1 token ≈ 3/4 of a word
  • 100 tokens ≈ 75 words

These are only rough estimates for English and shift with the language and the content. The tiktoken documentation explains that a BPE token corresponds to roughly 4 bytes on average. tiktoken also lists four properties of the BPE it defines: (1) it is reversible and lossless, so tokens can be converted back into the original text; (2) it works on arbitrary text even when that text was not in the training data; (3) the token sequence is shorter than the original bytes, so it compresses the text; and (4) it helps the model see common subwords. On performance, tiktoken states it runs 3-6x faster than the open-source tokenizers it was compared against. Available encodings include o200k_base for GPT-4o and cl100k_base for the preceding generation.
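The conversion rules above translate into a quick back-of-the-envelope estimator. This is only a sketch of the arithmetic: the function names are made up here, the ratios apply to English, and an exact count requires running the actual tokenizer (such as tiktoken) for the target model.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English: 1 token ≈ 4 characters

def estimate_tokens_from_chars(text):
    """Estimate token count from character length (English rule of thumb)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_tokens_from_words(text):
    """Estimate token count from word count: 1 token ≈ 3/4 of a word."""
    word_count = len(text.split())
    return math.ceil(word_count * 4 / 3)

sample = " ".join(["word"] * 75)  # 75 whitespace-separated words
print(estimate_tokens_from_words(sample))  # 100
print(estimate_tokens_from_chars(sample))  # 94
```

The two estimates disagree slightly because the sample words are shorter than average English words, which is a reminder that these ratios are approximations rather than guarantees.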

BPE entered neural machine translation with the 2016 paper by Sennrich, Haddow, and Birch, "Neural Machine Translation of Rare Words with Subword Units" (ACL 2016, pp. 1715-1725). The paper introduced a way to represent rare and unknown words as sequences of subword units, moving past the limits of a fixed vocabulary, and it went on to become the foundation for most modern LLM tokenizers.

Tokens vs. Words vs. Characters

| Unit | How It Splits | Strengths | Limitations |
| --- | --- | --- | --- |
| Word | Split on whitespace and punctuation | Intuitive, preserves units of meaning | Cannot handle OOV words, vocabulary explosion |
| Character | Broken into one character at a time | Tiny vocabulary, no OOV | Sequences grow too long and inefficient |
| Subword (BPE) | Pieces formed by merging frequent character pairs | Solves OOV, with moderate length and compression efficiency | Requires tokenizer training, efficiency varies by language |

SEO and GEO Implications

Tokens are not merely an internal concept; they tie directly to operating cost. There are four reasons tokens matter for generative search and LLM-based services.

  • Tokens = cost: Almost all LLM API pricing is billed by the number of input and output tokens. Structuring the same content to use fewer tokens delivers substantial savings at scale.
  • Tokens = context limit: The amount a model can process at once, the context window, is defined in tokens rather than characters. When you feed documents through RAG or have a long page summarized or quoted, content is truncated once the token budget is exceeded.
  • Token efficiency by language: English is relatively efficient at roughly 1 token per 4 characters, but morphologically rich languages, Korean among them, tend to consume more tokens for the same meaning. Multilingual content operations should account for these per-language token cost differences.
  • Chunking from a GEO standpoint: Before a generative search engine can cite your content, it typically slices the page into chunks for processing. Because chunk sizes are commonly measured in tokens, keeping each key piece of information within a single chunk-sized passage makes it easier for the AI to quote it as a whole. This is where AEO (Answer Engine Optimization) strategy, which places core answers in a short, self-contained form, meets token structure.
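The cost and chunking points above can be made concrete with a small sketch. Everything named here is hypothetical: `approx_tokens` uses the 4-characters-per-token rule instead of a real tokenizer, the prices passed to `estimate_cost` are placeholders rather than any provider's rates, and real retrieval pipelines may chunk quite differently.

```python
import math

def approx_tokens(text):
    """Rough English token estimate (1 token ≈ 4 characters); not a real tokenizer."""
    return math.ceil(len(text) / 4)

def chunk_paragraphs(paragraphs, max_tokens, count_tokens=approx_tokens):
    """Greedily pack whole paragraphs into chunks that stay within a token budget.

    A paragraph larger than the budget becomes its own oversized chunk, and
    the separators between paragraphs are not counted against the budget.
    """
    chunks, current, used = [], [], 0
    for para in paragraphs:
        need = count_tokens(para)
        if current and used + need > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += need
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def estimate_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Token-based cost: each side is billed per million tokens."""
    return (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)

paragraphs = ["A" * 40, "B" * 80, "C" * 40]  # about 10, 20, and 10 tokens
chunks = chunk_paragraphs(paragraphs, max_tokens=30)
print(len(chunks))  # 2

# Placeholder prices, for illustration only.
print(estimate_cost(1_000_000, 500_000, 2.0, 8.0))  # 6.0
```

Because paragraphs are never split, a short self-contained answer stays inside one chunk. Passing a real tokenizer's counting function as `count_tokens` would make the budget exact for a given model.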
