What is Tokenization?

Tokenization is the first step in how AI processes text. It's the process of breaking down text into smaller units called "tokens." These tokens are the basic building blocks that the AI model can understand and work with.

Think of tokens as puzzle pieces that make up your message. The AI needs to break down your words into these pieces before it can understand what you're saying.

[Figure: simple tokenization visualization]

How Words Become Tokens

Let's see how some common English text gets tokenized:

Example 1: Simple Words

"Hello world"
↓
Hello world

Common words are often single tokens.

Example 2: Longer Words

"Understanding transformers"
↓
Under standing transform ers

Longer words are often broken down into meaningful parts.

Example 3: Special Characters

"email@example.com"
↓
email @ example . com

Special characters are often separate tokens.

Example 4: Uncommon Words

"supercalifragilisticexpialidocious"
↓
super cal if rag il istic expial id ocious

Rare or very long words get broken into many smaller pieces.
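
To see how a real tokenizer handles these examples, you can run one yourself. The sketch below assumes the Hugging Face transformers package (pip install transformers) and the GPT-2 tokenizer; other models will split the same text differently.

```python
# Minimal sketch: inspect how GPT-2's tokenizer splits the example strings.
# Assumes the Hugging Face "transformers" package is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

examples = [
    "Hello world",
    "Understanding transformers",
    "email@example.com",
    "supercalifragilisticexpialidocious",
]

for text in examples:
    tokens = tokenizer.tokenize(text)   # subword strings ("Ġ" marks a leading space)
    ids = tokenizer.encode(text)        # the integer IDs the model actually sees
    print(f"{text!r} -> {tokens} ({len(ids)} tokens)")
```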

Why Tokenization Matters

🔍 Vocabulary Size

AI models have a fixed vocabulary of tokens (typically 50,000 to 100,000). Tokenization allows them to handle any word, even ones they've never seen before, by breaking them into familiar pieces.
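
As a quick illustration (again assuming the transformers package), you can read a tokenizer's vocabulary size directly; GPT-2's, for example, contains 50,257 tokens.

```python
# Check a tokenizer's vocabulary size. Assumes the Hugging Face
# "transformers" package is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)  # 50257 for GPT-2
```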

📊 Input Limits

Models have a maximum context window, a limit on how many tokens they can process at once (for example 2048, 4096, or 8192 tokens, with many newer models supporting far more). This is why long conversations or documents may need to be broken up.
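
Here is a sketch of counting tokens before sending text to a model, assuming the tiktoken package (pip install tiktoken); the 4096-token limit is just an illustrative number.

```python
# Count tokens to check a prompt against an (illustrative) context limit.
# Assumes the "tiktoken" package is installed.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
CONTEXT_LIMIT = 4096  # illustrative; the real limit depends on the model

def fits_in_context(text: str, limit: int = CONTEXT_LIMIT) -> bool:
    """Return True if the text's token count is within the limit."""
    return len(encoding.encode(text)) <= limit

print(fits_in_context("A short prompt."))  # True
```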

🌍 Multilingual Support

Subword tokenization lets AI work with many languages from a single vocabulary. Because most tokenizers are trained largely on English text, though, English typically needs fewer tokens to express the same content than languages like Japanese or Chinese.
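
You can see this effect by counting tokens for roughly equivalent phrases in different languages. A small sketch, again assuming the transformers package and the GPT-2 tokenizer (which was trained mostly on English text):

```python
# Compare token counts for similar greetings in different languages.
# Assumes the Hugging Face "transformers" package is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["Good morning", "おはようございます", "早上好"]:
    print(f"{text!r}: {len(tokenizer.encode(text))} tokens")
```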

💡 Efficiency

Good tokenization strikes a balance between keeping common words whole and breaking down rare words, making the model more efficient at processing language.

How Tokenization Works

Different AI models use different tokenization methods:

Byte Pair Encoding (BPE)

Used by: GPT models, RoBERTa

How it works: Starts with individual characters and merges the most common pairs until reaching the desired vocabulary size.

1. Start with characters: "l", "o", "w", "e", "r"
2. Merge the most common pair: "lo", "w", "e", "r"
3. Merge again: "low", "e", "r"
4. Final token: "lower"
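
The steps above can be written as a tiny training loop. This is a toy sketch on a made-up three-word corpus, not GPT-2's actual byte-level implementation, but the core idea is the same: repeatedly merge the most frequent adjacent pair of symbols.

```python
# Toy BPE training loop on a tiny made-up corpus.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace each occurrence of `pair` inside every word with one merged symbol."""
    merged = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Tiny "corpus": word -> frequency; each word starts as individual characters.
words = {tuple("lower"): 5, tuple("low"): 7, tuple("newer"): 3}

for step in range(4):
    pair_counts = get_pair_counts(words)
    best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
    words = merge_pair(words, best)
    print(f"step {step + 1}: merged {best} -> {sorted(words)}")
```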

WordPiece

Used by: BERT, DistilBERT

How it works: Similar to BPE, but instead of merging the most frequent pair, it chooses the merge that most increases the likelihood of the training data.

Often marks subword pieces with ## prefix.

"playing" โ†’ ["play", "##ing"]

"unknowable" โ†’ ["un", "##know", "##able"]

SentencePiece

Used by: T5, XLNet, and many multilingual models

How it works: Treats the text as a sequence of Unicode characters, including spaces, making it language-independent.

Works well across languages as it doesn't rely on language-specific rules.

Very useful for languages without clear word boundaries like Japanese.
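
Here is a small sketch of SentencePiece output via T5's tokenizer, assuming the transformers and sentencepiece packages are installed; SentencePiece marks word boundaries with the "▁" symbol rather than relying on whitespace pre-splitting.

```python
# Look at SentencePiece pieces produced by T5's tokenizer. Assumes the
# Hugging Face "transformers" and "sentencepiece" packages are installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

print(tokenizer.tokenize("Tokenization works across languages."))
# Pieces that begin with "▁" started a new word in the original text.
```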

Important:

The exact tokenization depends on the specific model. The examples above are simplified to show the general concept. Real tokenizers may split text differently depending on their training and vocabulary.

Try It Yourself

The best way to build intuition is to run a real tokenizer on your own text and inspect the pieces it produces.
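
A minimal interactive sketch, assuming the tiktoken package (pip install tiktoken): type a line of text to see its token pieces and count, and enter an empty line to stop.

```python
# Simple command-line tokenizer demo using OpenAI's tiktoken library.
# Assumes the "tiktoken" package is installed.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

while True:
    text = input("Text to tokenize (empty line to quit): ")
    if not text:
        break
    token_ids = encoding.encode(text)
    pieces = [encoding.decode([token_id]) for token_id in token_ids]
    print(f"{len(token_ids)} tokens: {pieces}")
```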

Note: Different models use different tokenizers, so the same text may be split differently from model to model.