Tokenization: Breaking Text into Pieces
How AI transforms your words into smaller chunks it can understand
What is Tokenization?
Tokenization is the first step in how an AI model processes text: the text is broken down into smaller units called "tokens." These tokens are the basic building blocks that the model can understand and work with.
Think of tokens as puzzle pieces that make up your message. The AI needs to break down your words into these pieces before it can understand what you're saying.
How Words Become Tokens
Let's see how some common English text gets tokenized:
Example 1: Simple Words
Common words are often single tokens.
Example 2: Longer Words
Longer words are often broken down into meaningful parts.
Example 3: Special Characters
Special characters are often separate tokens.
Example 4: Uncommon Words
Rare or very long words get broken into many smaller pieces.
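To see these four cases with a real tokenizer, here is a minimal sketch using the GPT-2 tokenizer from the Hugging Face transformers library (one choice among many; a different tokenizer will produce different splits, and the example strings here are just illustrations). Running it requires pip install transformers and a one-time download of the tokenizer files.

```python
from transformers import AutoTokenizer

# Load the GPT-2 tokenizer, which uses byte pair encoding (BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

examples = {
    "simple words":       "The cat sat on the mat",
    "longer word":        "unbelievable",
    "special characters": "Hello, world! :-)",
    "uncommon word":      "pneumonoultramicroscopicsilicovolcanoconiosis",
}

for label, text in examples.items():
    tokens = tokenizer.tokenize(text)
    # GPT-2 marks a leading space with the "Ġ" character in its token strings
    print(f"{label}: {text!r} -> {len(tokens)} tokens: {tokens}")
```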
Why Tokenization Matters
Vocabulary Size
AI models have a fixed vocabulary of tokens (typically somewhere between roughly 30,000 and 100,000). Tokenization lets them handle any word, even ones they've never seen before, by breaking it into familiar pieces.
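As a quick check, the sketch below prints the vocabulary sizes of two widely used tokenizers via the Hugging Face transformers library (the two model names are arbitrary choices for illustration):

```python
from transformers import AutoTokenizer

# GPT-2's vocabulary has 50,257 tokens; bert-base-uncased has 30,522
for name in ["gpt2", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tokenizer.vocab_size} tokens in vocabulary")
```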
Input Limits
Models have a maximum number of tokens they can process at once, often called the context window (for example 2,048, 4,096, or 8,192 tokens, with newer models supporting far more). This is why long conversations or documents may need to be broken up.
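A common practical task is counting tokens before sending text to a model. The sketch below does this with the GPT-2 tokenizer; GPT-2's limit is 1,024 tokens, so substitute whatever limit your model actually has:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
MAX_TOKENS = 1024  # GPT-2's context window; replace with your model's limit

def count_tokens(text: str) -> int:
    # add_special_tokens=False counts only the text's own tokens
    return len(tokenizer.encode(text, add_special_tokens=False))

document = "A long document would go here..."
n = count_tokens(document)
print(f"{n} tokens -", "fits" if n <= MAX_TOKENS else "needs to be split up")
```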
Multilingual Support
Subword tokenization lets AI models work with many languages, though some languages (like English) typically require fewer tokens than others (like Japanese or Chinese), partly because most tokenizers are trained largely on English text.
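You can see this effect by tokenizing roughly equivalent sentences in different languages. The sketch below uses the English-centric GPT-2 tokenizer, so the Japanese sentence breaks into many more tokens than the English one (a multilingual tokenizer would narrow the gap):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English":  "I like green tea.",
    "Japanese": "私は緑茶が好きです。",  # roughly "I like green tea."
}

for language, sentence in samples.items():
    print(f"{language}: {len(tokenizer.tokenize(sentence))} tokens")
```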
Efficiency
Good tokenization strikes a balance between keeping common words whole and breaking down rare words, making the model more efficient at processing language.
How Tokenization Works
Different AI models use different tokenization methods:
Byte Pair Encoding (BPE)
Used by: GPT models, RoBERTa
How it works: Starts with individual characters and merges the most common pairs until reaching the desired vocabulary size.
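The merge loop can be illustrated with a toy implementation over a tiny made-up corpus of four words. This is a teaching sketch of the idea, not how production tokenizers are built:

```python
from collections import Counter

# Tiny corpus: each word is a tuple of symbols with a frequency.
# "</w>" marks the end of a word so merges never cross word boundaries.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Each iteration adds one new symbol to the vocabulary; a real tokenizer
# repeats this until it reaches the target vocabulary size.
for step in range(5):
    best = most_frequent_pair(vocab)
    vocab = apply_merge(vocab, best)
    print(f"merge {step + 1}: {best[0]!r} + {best[1]!r}")
```

To tokenize new text, the learned merges are replayed in the same order on the characters of each word.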
WordPiece
Used by: BERT, DistilBERT
How it works: Similar to BPE but uses a different merging criterion based on likelihood.
Often marks subword continuations with a ## prefix:
"playing" → ["play", "##ing"]
"unknowable" → ["un", "##know", "##able"]
SentencePiece
Used by: T5, XLNet, and many multilingual models
How it works: Treats the text as a sequence of Unicode characters, including spaces, making it language-independent.
Works well across languages as it doesn't rely on language-specific rules.
Very useful for languages without clear word boundaries like Japanese.
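The sketch below uses the multilingual mT5 tokenizer, which is built on SentencePiece, to show the same behavior across languages. The model name google/mt5-small is an arbitrary choice, and running it requires both the transformers and sentencepiece packages. SentencePiece marks the start of each original word (i.e. a preceding space) with the "▁" character.

```python
from transformers import AutoTokenizer

# mT5's tokenizer is a SentencePiece model; "▁" marks where a space preceded the piece
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

for text in ["Tokenization works across languages", "これは日本語の文です。"]:
    print(text, "->", tokenizer.tokenize(text))
```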
Important:
The exact tokenization depends on the specific model. The examples above are simplified to show the general concept. Real tokenizers may split text differently depending on their training and vocabulary.
Try It Yourself
You can experiment with tokenization yourself to see how different pieces of text get split into tokens.
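A minimal command-line version of that experiment, again using the Hugging Face transformers library with the GPT-2 tokenizer as one example choice:

```python
from transformers import AutoTokenizer

# Any pretrained tokenizer works here; "gpt2" is just an example
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = input("Enter some text: ")
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(f"{len(tokens)} tokens:")
for token, token_id in zip(tokens, ids):
    # GPT-2 shows a leading space as "Ġ" in the token string
    print(f"  {token!r}  (id {token_id})")
```

As noted above, the exact splits you see depend entirely on which tokenizer you load.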