Computer vision is growing as never before. Many factors have contributed to this, but in my view the main reasons are the following:
- Advancements in hardware
- The emergence of deep learning
- The advent of large datasets
- The increase in computer vision applications
Better and More Dedicated Hardware
One of the main reasons why image processing is such a difficult problem is that it deals with an immense amount of data. To process this data, you need memory and processing power. These have been increasing in size and power regularly for over 50 years.
Because of this, algorithms run faster, to the point where more and more tasks can now be run in real time (e.g. face recognition, object detection).
We have also seen an emergence and proliferation of dedicated pieces of hardware for graphics and image processing calculations. GPUs are the prime example of this. A GPU clock speed may generally be slower than a regular CPU’s, but it can still outperform one for these specific tasks.
Dedicated hardware has become so important to computer vision that many companies, including Nvidia, Intel, and AMD, have started designing and producing it.
- RTX 2080 Ti: 11 GB of VRAM
- Titan RTX: 24 GB of VRAM
- Tesla V100: 32 GB of VRAM
- Titan V: 12 GB of VRAM
These are among the best-known GPUs dedicated to computer vision.
The Emergence of Deep Learning
Deep learning, a subfield of machine learning, has been revolutionary for computer vision. Thanks to it, machines now achieve better results than humans on important tasks such as image classification (i.e. detecting which object is in an image).
Previously, if you had a task such as image classification, you would perform a step called feature extraction. Features are small "interesting", descriptive, or informative patches in an image. The idea is to extract as many of these as possible from images of one class of object (e.g. chairs, horses, etc.) and to treat these features as a sort of "definition" (known as a bag-of-words) of the object. You would then search for these "definitions" in other images. If a significant number of features from one bag-of-words are found in another image, the image is classified as containing that specific object (i.e. a chair, a horse, etc.).
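The bag-of-words idea above can be sketched in a few lines. This is a minimal toy version, assuming random vectors as stand-ins for real feature descriptors (e.g. SIFT vectors) instead of patches extracted from actual images: cluster the training features into "visual words", build a histogram of word occurrences per class as its "definition", and classify a new image by comparing its histogram to each definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(descriptors, k, iters=20):
    # Pick k random descriptors as the initial "visual words".
    words = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest word.
        dists = np.linalg.norm(descriptors[:, None] - words[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each word to the mean of its assigned descriptors.
        for j in range(k):
            if (labels == j).any():
                words[j] = descriptors[labels == j].mean(axis=0)
    return words

def bow_histogram(descriptors, words):
    # Count how often each visual word is the nearest match, then normalize.
    dists = np.linalg.norm(descriptors[:, None] - words[None], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(words)).astype(float)
    return hist / hist.sum()

# Toy stand-ins for real descriptors: "chair" features cluster near 0,
# "horse" features near 5 (purely synthetic, for illustration).
chair_train = rng.normal(0.0, 0.5, size=(200, 8))
horse_train = rng.normal(5.0, 0.5, size=(200, 8))

words = kmeans(np.vstack([chair_train, horse_train]), k=4)
chair_def = bow_histogram(chair_train, words)  # the "definition" of a chair
horse_def = bow_histogram(horse_train, words)

# Classify a new image by comparing its histogram to each definition.
test_features = rng.normal(0.0, 0.5, size=(50, 8))  # chair-like image
test_hist = bow_histogram(test_features, words)
dists = {cls: np.linalg.norm(test_hist - h)
         for cls, h in [("chair", chair_def), ("horse", horse_def)]}
prediction = min(dists, key=dists.get)
print(prediction)
```

In a real pipeline the descriptors would come from a detector/descriptor such as SIFT or ORB, and the vocabulary would contain hundreds or thousands of words, but the mechanics are the same.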
The difficulty with this approach is that you have to choose which features to look for in each image. This becomes complicated, and eventually intractable, when the number of classes you are trying to classify grows past 15 or 20. Do you look for corners? Edges? Texture information? Different classes of objects are better described by different types of features. And if you choose to use many features, you have to deal with a plethora of parameters, all of which have to be fine-tuned.
Well, deep learning introduced the concept of end-to-end learning, where the machine learns what to look for with respect to each specific class of object: it works out the most descriptive features for each object on its own. In other words, neural networks are told to discover the underlying patterns in classes of images.
The image below portrays this difference between feature extraction and end-to-end learning:
We can say that deep learning put computer vision on the map in industry. Without it, chances are computer vision would still be stuck in academia.
The Advent of Large Datasets
To learn the underlying patterns of classes of objects, a machine needs A LOT of data. That is, it needs large datasets. More and more of these have been emerging, and they have been instrumental in the success of deep learning and, therefore, of computer vision.
Before around 2012, a dataset was considered relatively large if it contained 100+ images or videos. Now, datasets exist with image counts in the millions. Collecting large amounts of data has become quick, simple, and cheap.
Here are some of the best-known image classification datasets currently used to train and test the latest state-of-the-art object classification/recognition models. They have all been meticulously hand-annotated by the open source community.
- ImageNet – 15 million images, 22,000 object categories.
- Open Images – 9 million images, 5,000 object categories.
- Microsoft Common Objects in Context (COCO) – 330K images, 80 object categories.
- PASCAL VOC Dataset – a few versions exist, 20 object categories.
- CALTECH-101 – 9,000 images with 101 object categories.
All of these datasets play a vital role in taking computer vision to the next level. The more data we have to train our models with, the better our results get.
The Increase in Computer Vision Applications
Faster machines, larger memories, and other advances in technology have increased the number of useful things machines can do for us in our lives. We now have autonomous cars (well, we're close to having them), drones, factory robots, cleaning robots – the list goes on. With the increase in such vehicles, devices, tools, and appliances has come an increase in the need for computer vision.
Robots like these can also spot problems such as incorrect labelling. Here's a picture of one of these robots at work:
Agriculture, too, is capitalizing on the growth of computer vision. iUNU, for example, is developing a network of cameras on rails to help greenhouse owners keep track of how their plants are growing.
There’s no need to mention autonomous cars here. We are constantly hearing about them on the news. It’s only a matter of time before we’ll be jumping into one.
Computer vision is definitely here to stay. In fact, it’s only going to get bigger with time.
Our planet has many problems that can be solved with computer vision, and these problems are pushing us to find ever more efficient ways to run our computer vision systems. Computer vision takes advantage of progress in hardware, datasets, and even software to keep growing at such a fast pace.