Deep Learning Vs Traditional Computer Vision
Do we need to study traditional computer vision since deep learning can handle anything more efficiently?
These are good questions. Deep learning (DL) has certainly revolutionized computer vision (CV) and artificial intelligence in general. So many problems that once seemed improbable to be solved are solved to a point where machines are obtaining better results than humans. Image classification is probably the prime example of this. Indeed, deep learning is responsible for placing CV on the map in the industry, as I’ve discussed in previous posts of mine.
But deep learning is still only a tool of computer vision. And it certainly is not the panacea for all problems. So, in this post I would like to elaborate on this. That is, I would like to lay down my
arguments for why traditional computer vision techniques are still very much useful and therefore should be learnt and taught.
Let’s break the post up into the following sections/arguments:
- Deep learning needs big data
- Deep learning is sometimes overkill
- Traditional CV will help you with deep learning
But before I jump into these arguments, I think it’s necessary to first explain in detail what I mean by “traditional computer vision”, what deep learning is, and also why it has been so revolutionary.
Before We Start
Before the emergence of deep learning if you had a task such as image classification, you would perform a step called feature extraction. Features are small “interesting”, descriptive or informative patches in images. You would look for these by employing a combination of what I am calling in this post traditional computer vision techniques, which include things like edge detection, corner detection, object detection, and the like.
In using these techniques – for example, with respect to feature extraction and image classification – the idea is to extract as many features from images of one class of object (e.g. chairs, horses, etc.) and treat these features as a sort of “definition” (known as a bag-of-words) of the object. You would then search for these “definitions” in other images. If a significant number of features from one bag-of-words are located in another image, the image is classified as containing that specific object (i.e. chair, horse, etc.).
The difficulty with this approach of feature extraction in image classification is that you have to choose which features to look for in each given image. This becomes cumbersome and pretty much impossible when the number of classes you are trying to classify for starts to grow past, say, 10 or 20. Do you look for corners? edges? texture information? Different classes of objects are better described with different types of features. If you choose to use many features, you have to deal with a plethora of parameters, all of which have to be fine-tuned by you.
Well, deep learning introduced the concept of end-to-end learning where (in a nutshell) the machine is told to learn what to look for with respect to each specific class of object. It works out the most descriptive and salient features for each object. In other words, neural networks are told to discover the underlying patterns in classes of images.
So, with end-to-end learning you no longer have to manually decide which traditional computer vision techniques to use to describe your features. The machine works this all out for you. Wired magazine puts it this way:
If you want to teach a [deep] neural network to recognize a cat, for instance, you don’t tell it to look for whiskers, ears, fur, and eyes. You simply show it thousands and thousands of photos of cats, and eventually it works things out. If it keeps mis-classifying foxes as cats, you don’t rewrite the code. You just keep coaching it.
The image below portrays this difference between feature extraction (using traditional CV) and end-to-end learning:
Deep Learning Needs Big Data
First of all, deep learning needs data. Lots and lots of data. Those famous image classification models mentioned above are trained on huge datasets. The top three of these datasets used for
- ImageNet – 1.5 million images with 1000 object categories/classes,
- Microsoft Common Objects in Context (COCO) – 2.5 million images, 91 object categories,
- PASCAL VOC Dataset – 500K images, 20 object categories.
In this post we saw why deep learning has overtaken traditional computer vision methods and hence why we should study and learn the latter. Firstly, we looked at the problem of DL frequently requiring lots of data to perform well. Sometimes this is not a possibility and traditional computer vision can be considered as an alternative in these situations. Secondly, occasionally deep learning can be overkill for a specific task, or it is waste of time and resources. In such tasks, standard computer vision can solve a problem much more efficiently, and in short amount of time and in fewer lines of code than DL. Thirdly, knowing traditional computer vision can actually make you better at deep learning. This is because you can better understand what is happening under the hood of DL and you can perform certain pre-processing steps that will improve DL results.
In a nutshell, deep learning is just a tool of computer vision that is certainly not a panacea. Don’t only use it because it’s trendy now. Traditional computer vision techniques are still very much useful and knowing them can be really helpful and save you from a lot of problems.