Uncertainty for single predictions becomes more and more important in machine learning and is often a requirement at clients. Especially when the consequences of a wrong prediction are high, you need to know what the probability distribution of an individual prediction is. In order to calculate this, you (I at least) immediately think about using Bayesian methods. But, these methods also have their downsides. For example, it can be computationally expensive when dealing with large amounts of data or lots of parameters. What I didn’t know was that you actually can get similar results using frequentist methods. Data Scientist Yu Ri explains how you can calculate different types of uncertainty for regression problems using quantile regression and Monte Carlo dropout. All code used in this blog will be published on Github.
In order to calculate the uncertainty, you need to distinguish the type of uncertainty. You can define many different types of uncertainty, but I like the distinction between aleatoric and epistemic uncertainty. Aleatoric (also referred to aleatory) uncertainty is uncertainty in the data and epistemic uncertainty is the uncertainty in your model. With model uncertainty I do not mean uncertainty about the modelling approach. The decision between a LinearRegression or RandomForestRegressor for example is still up to you.
Figure 1: Plot showing a linear fit including aleatoric uncertainty boundary.
Aleatoric uncertainty (or statistical uncertainty) is the uncertainty in your data. This can be uncertainty caused by errors in measuring the data, or by the variability in the data. With variability in the data I mean the following. Let's say that you have one input feature being house area to predict the house price. It is very likely that there are different house prices in the data set with the same house area. This variance in the house price is defined as aleatoric uncertainty. In our plot above, the aleatoric uncertainty is equal to the mean plus or minus 2 times the standard deviation.
Figure 2: Plot showing multiple linear fits (epistemic uncertainty), which all fit reasonably well.
Epistemic uncertainty (or systematic uncertainty) is the uncertainty in the model. You can interpret this uncertainty as uncertainty due to a lack of knowledge. For example, I am uncertain about the number of people living in the Netherlands, but this information can be obtained. In data science epistemic uncertainty can be reduced by improving the model. In our example plotted above you can say that all lines fit the data reasonably well, but which line fits the data the best? Using linear models from SKlearn for example, we choose a model that performs best for a certain metric (global or local optimum) and ignore the epistemic uncertainty.
Now we know what types of uncertainty we have to deal with. To model these different uncertainties, we use two different techniques called quantile regression and Monte Carlo Dropout . When we have calculated both uncertainties, we can sample and approximate the posterior predictive distribution. Let's start with quantile regression. Read more.