AIRiskAware

What Is Tokenisation?

In machine learning, tokenisation is the process of breaking text or other data into smaller units called tokens, which a model processes as its basic input and output elements.

Definition

Tokenisation: in machine learning, the process of breaking text or other data into smaller units called tokens, which a model processes as its basic input and output elements.

Source: Machine-learning literature
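
To make the definition concrete, here is a minimal sketch of a toy tokeniser in Python. It is an illustration only: the splitting rule and the helper names `tokenise` and `build_vocab` are assumptions made for this example, and production language models generally rely on learned subword vocabularies rather than a simple rule like this one.

```python
import re


def tokenise(text: str) -> list[str]:
    # Toy rule (an assumption for illustration): each run of word
    # characters is one token, and each punctuation mark is one token.
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Give every distinct token an integer ID in order of first appearance.
    vocab: dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


text = "Models read tokens, not text."
tokens = tokenise(text)
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # the text split into tokens
print(ids)     # the integer IDs a model would actually receive
```

The point of the sketch is that a model never sees raw text. It sees a sequence of integer IDs, and the length of that sequence is what limits and costs are counted against.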

Plain-language explanation

Tokenisation is the first step in how a language model reads and writes text. Model limits and usage costs are typically measured in tokens, and tokenisation choices can affect how different languages and scripts are handled — a fairness consideration for global deployments. (This is distinct from the unrelated data-security technique of replacing sensitive values with tokens.)

Primary source: Machine-learning literature
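
The point about languages and scripts can be sketched with a deliberately simple scheme in which every UTF-8 byte counts as one token. This is an assumption chosen for illustration, not the scheme any particular model uses; the helper name `byte_tokens` and the sample greetings are likewise hypothetical.

```python
def byte_tokens(text: str) -> list[int]:
    # Simplified scheme (an assumption): one token per UTF-8 byte.
    return list(text.encode("utf-8"))


samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}

for language, greeting in samples.items():
    characters = len(greeting)
    tokens = len(byte_tokens(greeting))
    print(f"{language}: {characters} characters, {tokens} tokens")
```

Under this scheme the English greeting uses 5 tokens for 5 characters, the Russian one 12 tokens for 6 characters, and the Japanese one 15 tokens for 5 characters. Real tokenisers merge frequent byte or character sequences, so actual counts differ, but the same kind of imbalance can carry through to token limits and usage costs.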

Related terms

Large Language Model (LLM), Inference (AI), Embedding, Context Window
