Why model drift happens
Every AI model is trained on historical data that reflects the world as it was at the time the data was collected. The world does not stop changing after the model is deployed. Consumer behaviour changes. Economic conditions change. Regulatory requirements change. The composition of the population using a service changes. New products, services, and interactions emerge that were not in the training data. As the gap between the training data distribution and the real-world data distribution widens, the model's predictions become less accurate — this is model drift.
Data drift is the simpler form: the characteristics of the inputs to the model change over time. A fraud detection model trained on pre-pandemic transaction patterns will encounter a different distribution of transactions post-pandemic, with different locations, different merchants, and different transaction sizes. The model was not trained on this distribution and will usually perform worse on it.
Concept drift is more subtle and more damaging: the underlying relationship between the inputs and the correct output changes. A credit risk model trained when economic conditions were benign may have learned relationships that no longer hold when conditions deteriorate. The model receives the same types of inputs, but the correct output for those inputs has changed. The model cannot detect this on its own, because it only knows the world it was trained on.
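To make data drift concrete, the sketch below compares the distribution of a single numeric input at training time with its distribution in production, using the Population Stability Index (PSI). PSI is one common choice for this kind of comparison, not a method the text above prescribes. The function name, the ten equal-width bins, and the sample data are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,      # assumption: equal-width bins over the reference range
    eps: float = 1e-6,     # floor on bin fractions so log() never sees zero
) -> float:
    """PSI between a reference (training-time) sample and a current (production) sample."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot bin it")
    width = (hi - lo) / n_bins

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    # Sum over bins of (current - reference) * ln(current / reference).
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical transaction amounts: training-time sample vs. a shifted production sample.
training_amounts = [i / 100 for i in range(1000)]
production_amounts = [x + 5 for x in training_amounts]

psi_unchanged = population_stability_index(training_amounts, training_amounts)
psi_shifted = population_stability_index(training_amounts, production_amounts)
```

A PSI of zero means the two binned distributions match, and larger values mean a larger shift. An often-quoted rule of thumb treats values above roughly 0.25 as a substantial shift, but the right cut-off depends on the model and should be chosen deliberately. A check like this only looks at inputs, so it can flag data drift but not concept drift, which shows up only once outcomes (or proxies for them) are compared with predictions.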
Monitoring for model drift: the practical programme
Step 1: Define performance metrics at deployment. Before deploying an AI model, define the metrics that will be used to assess its performance in production: accuracy on a held-out test set, fairness metrics across demographic groups, business outcome metrics, and any regulatory compliance metrics. Document these metrics and their acceptable ranges.
Step 2: Establish monitoring infrastructure. Implement logging of model inputs, outputs, and ground truth outcomes (when available) in production. For models where ground truth is delayed (credit models where loan outcomes are only known months later), implement proxy metrics that can provide earlier signals of performance degradation.
Step 3: Set alert thresholds. Define the thresholds that will trigger a review: how much degradation in each metric before an alert is generated? Set these thresholds deliberately: too sensitive and you generate alert fatigue; too insensitive and you miss genuine drift.
Step 4: Conduct scheduled performance reviews. Beyond alert-triggered reviews, conduct scheduled performance assessments at defined intervals, at minimum quarterly for high-risk AI. These reviews should examine performance trends, not just point-in-time performance.
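As a rough illustration of how Steps 1, 3 and 4 fit together, the sketch below records a baseline and a tolerated drop for each metric, flags metrics that have fallen too far, and checks whether a metric has been declining across consecutive reviews. It assumes every metric is "higher is better". The metric names, baselines, thresholds and helper names are hypothetical and would be replaced by whatever was documented at deployment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MetricRange:
    baseline: float   # value documented at deployment (Step 1)
    max_drop: float   # largest tolerated drop below the baseline (Step 3)


def breached_metrics(
    ranges: Dict[str, MetricRange], observed: Dict[str, float]
) -> List[str]:
    """Names of metrics whose drop from baseline exceeds the alert threshold."""
    alerts = []
    for name, rng in ranges.items():
        if name not in observed:
            continue  # e.g. ground truth not yet available for this metric
        if rng.baseline - observed[name] > rng.max_drop:
            alerts.append(name)
    return sorted(alerts)


def is_declining(history: List[float], min_points: int = 3) -> bool:
    """True if the last `min_points` review values are strictly decreasing (Step 4)."""
    recent = history[-min_points:]
    return len(recent) == min_points and all(
        earlier > later for earlier, later in zip(recent, recent[1:])
    )


# Hypothetical documented ranges and one production snapshot.
documented = {
    "accuracy": MetricRange(baseline=0.92, max_drop=0.03),
    "approval_rate_parity": MetricRange(baseline=0.95, max_drop=0.05),
}
snapshot = {"accuracy": 0.88, "approval_rate_parity": 0.93}

alerts = breached_metrics(documented, snapshot)
accuracy_trend = is_declining([0.92, 0.91, 0.90, 0.89])
```

Here `alerts` contains only `accuracy`, because its drop of 0.04 exceeds the tolerated 0.03, while the parity metric is still within range. The trend check is kept separate from the threshold check so that a scheduled review can pick up a steady decline before any single value crosses its threshold.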