While walking around on holiday in Mexico, I was listening to Lex Fridman’s Artificial Intelligence podcast with guest Tomaso Poggio, who said something that struck a chord with me:
How can you get something that really works when you have so much freedom?
It is trained on the images of ImageNet. These images have a size of 224×224 pixels with channels for Red, Green and Blue (RGB), which means there are 224×224×3 = 150,528 pixel activations, and hence an input of the same dimensionality. ImageNet furthermore has 1,280,000 training images spread over 1,000 labels/classes. If the images are perfectly balanced between the different classes, then in the ideal case there are 1,280 images per class.
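As a quick sanity check on these numbers, a few lines of Python (illustrative only, using the figures quoted above):

```python
# Input dimensionality of one ImageNet image: height x width x RGB channels.
height, width, channels = 224, 224, 3
input_size = height * width * channels
print(input_size)  # 150528

# Ideal number of training images per class, assuming a perfectly
# balanced split of the 1,280,000 images over 1,000 classes.
train_images, classes = 1_280_000, 1_000
per_class = train_images // classes
print(per_class)  # 1280
```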
Everything that classical machine learning teaches about dimensionality says that a model built up of 60 million learnable parameters, with a 150k-dimensional input and just 1,280 training samples per class, should never be able to learn properly, let alone generalize well.
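To make the imbalance concrete, we can compute how many learnable parameters the network has per training image (a back-of-the-envelope calculation, using the 60 million parameter count mentioned above):

```python
# Ratio of learnable parameters to available training images.
parameters = 60_000_000
train_images = 1_280_000
ratio = parameters / train_images
print(f"{ratio:.1f} parameters per training image")  # 46.9
```

Roughly 47 free parameters for every single training example: by classical reasoning, such a model should trivially overfit.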
In this blog post, I would like to dive into why the really advanced neural networks work so well even though they have vastly more parameters than training samples. To do this, we first need some background on the classical Curse of Dimensionality, as it is taught in classical textbooks.
Subsequently, I will take a look at neural networks and explain the Universal Approximation Theorem. Since the Universal Approximation Theorem only holds for single-layer neural networks, we will look at whether it is possible to extend this unique property to deeper networks. Once we have these building blocks in place, we will combine them to see how they can explain why giant neural networks can still learn.