What Is Tokenisation?
In machine learning, tokenisation is the process of breaking text or other data into smaller units called tokens, which a model processes as its basic input and output elements.
Source: Machine-learning literature
Plain-language explanation
Tokenisation is the first step in how a language model reads and writes text. Model limits and usage costs are typically measured in tokens, and tokenisation choices can affect how different languages and scripts are handled, which makes tokenisation a fairness consideration for global deployments. (This is distinct from the unrelated data-security technique of replacing sensitive values with tokens.)
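As a rough illustration of the idea, the sketch below splits a sentence into tokens, maps each distinct token to an integer ID, and counts the tokens. It is a toy, not how any particular model works: production tokenisers generally use learned subword schemes, and the function names toy_tokenise and build_vocab are hypothetical names chosen for this example. The final lines use UTF-8 byte length only as a crude stand-in for how the same short greeting can take up more units in one script than in another.

```python
import re

def toy_tokenise(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

text = "the model reads the text."
tokens = toy_tokenise(text)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)       # ['the', 'model', 'reads', 'the', 'text', '.']
print(ids)          # [0, 1, 2, 0, 3, 4]
print(len(tokens))  # 6 tokens would count against a limit or a bill

# Assumption for illustration: byte length as a crude proxy for unit count.
english_bytes = len("hello".encode("utf-8"))   # 5
russian_bytes = len("привет".encode("utf-8"))  # 12
print(english_bytes, russian_bytes)
```

The repeated word "the" receives the same ID both times, which is the sense in which tokens are the model's basic input and output elements. The byte comparison hints at why tokenisation choices matter across languages: a scheme tuned to one script may spend more units on another for comparable content.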