Data Drift
Data and concept drift are frequently mentioned in ML monitoring, but what exactly are they, and how are they detected? Furthermore, given the common misconceptions, are data and concept drift things to be avoided at all costs, or natural and acceptable consequences of running models in production? Read on to find out.
What Is It?
Perhaps the more common of the two is data drift, which refers to any change in the data distribution after the model has been trained. In other words, data drift occurs when the inputs a model is presented with in production no longer match the distribution it saw during training. This typically manifests as a change in the feature distribution: specific values for a given feature may become more common in production, while other values become less prevalent.

For example, consider an e-commerce company serving a lifetime value (LTV) prediction model to optimize its marketing efforts. A reasonable feature for such a model would be a customer's age. Now suppose the company changes its marketing strategy, perhaps by launching a new campaign targeted at a specific age group. The distribution of ages fed to the model would likely change, shifting the age feature's distribution and potentially degrading the model's predictive performance. This would be considered data drift.
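To make the age example concrete, here is a minimal sketch of one common way such a feature-level shift might be detected: a two-sample Kolmogorov–Smirnov test comparing the feature's training-time and production-time distributions. The synthetic age data, the significance threshold, and the drift scenario below are all illustrative assumptions, not a prescribed method; in practice the test would run against logged feature values.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Hypothetical "age" feature: training data drawn from a broad customer base.
train_ages = rng.normal(loc=40, scale=12, size=5_000)

# Production data after a (hypothetical) campaign targeting a younger
# segment shifts the distribution the model sees.
prod_ages = rng.normal(loc=28, scale=6, size=5_000)

# The KS statistic is the maximum distance between the two empirical CDFs;
# a small p-value suggests the distributions differ, i.e., drift.
statistic, p_value = ks_2samp(train_ages, prod_ages)

ALPHA = 0.05  # significance threshold, chosen here purely for illustration
if p_value < ALPHA:
    print(f"Drift detected in 'age' (KS={statistic:.3f}, p={p_value:.3g})")
else:
    print(f"No significant drift in 'age' (KS={statistic:.3f}, p={p_value:.3g})")
```

Running a check like this per feature on a schedule (say, daily) gives a simple early-warning signal; other distance measures, such as Population Stability Index or KL divergence, are often used the same way.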