Look around and you will see a lot of day-to-day objects. You recognize them almost instantaneously and involuntarily. You don't have to stare at a table for a few minutes to understand that it is, in fact, a table. Machines, on the other hand, find this task very difficult. People have been working on this problem for decades, but have only been able to achieve an accuracy of around 65%. Why is it so hard for machines to recognize and categorize objects the way humans do? What's so difficult here? We do it every day and we get it right almost every single time. What's the missing link? This is actually the holy grail of computer vision!
How do humans do it?
Let's take a look at how humans recognize and categorize objects. The processing of visual data happens in the ventral visual stream, a hierarchy of areas in the brain that supports object recognition. Humans can easily recognize differently sized objects and put them in the same category. This happens because of the invariances we develop. Whenever we look at an object, our brain extracts its features in such a way that size, orientation, illumination, perspective, etc. don't matter. You remember an object by its shape and inherent features; it doesn't matter how the object is placed, how big or small it is, or which side is visible to you. There is a hierarchical build-up of invariances: first to position and scale, and then to viewpoint and more complex transformations requiring interpolation between several different object views.
We have cells in our visual cortex that respond to simple shapes like lines and curves. As we move along the ventral stream, we find more complex cells that respond to more complex stimuli like faces, cars, etc. Neurons along the ventral stream show an increase in receptive field size as well as in the complexity of their preferred stimuli. Humans take remarkably little time to recognize and categorize objects, which suggests some form of feedforward processing of information: the output of the cells at the current level of the ventral stream hierarchy is used directly by the next level. This speeds up the process by a huge factor.
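The feedforward idea, where each stage consumes only the output of the stage below it and features grow more complex as we move up, can be sketched in a few lines of Python. The stages here are invented toy stand-ins for illustration, not a model of real cortical cells:

```python
# A toy feedforward hierarchy: each stage only sees the output of the
# stage below it, and the features grow more complex as we move "up".
# All three stage functions are illustrative inventions.

def detect_edges(image):
    """Stage 1: respond to simple local contrast (like cells tuned to lines)."""
    return [[1 if a != b else 0 for a, b in zip(row, row[1:])]
            for row in image]

def pool_edges(edges):
    """Stage 2: pool neighboring edge responses into a coarser feature."""
    return [sum(row) for row in edges]  # per-row edge count

def categorize(pooled):
    """Stage 3: a crude decision based on the pooled features."""
    return "textured" if sum(pooled) > 2 else "flat"

image = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# Information flows strictly forward: image -> edges -> pooled -> label.
label = categorize(pool_edges(detect_edges(image)))
print(label)  # -> textured
```

The key property is that nothing flows backward: each call finishes before the next begins, which is what makes the overall pipeline so fast.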
Why is it hard for machines?
We know the mechanism by which visual data enters the human visual system and how it's processed. The problem is that we are still not exactly sure how our brain categorizes and organizes that data. So we try to extract features from an image and ask our machines to learn from them. There are variations like size, angle, perspective, occlusion, illumination, etc. The same object looks very different to a machine when presented from a different perspective, whereas humans will immediately recognize it from almost any viewpoint. One way to go would be to store all possible sizes, angles, perspectives, and so on, but this would be infeasible: it would take an enormous amount of space and time to recognize an object. If a chair is partially blocked, we will still identify it; machines will fail in this situation because, to them, it is now a new object.
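To see why comparing raw pixels fails and why invariant features help, consider this toy 1-D sketch: the same "object" at two different scales produces completely different raw samples, but a size-normalized descriptor (here, the relative positions of intensity jumps, an invented illustration) stays identical:

```python
# Toy illustration of scale invariance on 1-D "image" signals.
# The descriptor below is a made-up example, not a real algorithm.

def edge_positions(signal):
    """Positions where the signal changes, as fractions of its length.
    Dividing by the length normalizes away the object's size."""
    n = len(signal)
    return [round(i / n, 2)
            for i in range(1, n) if signal[i] != signal[i - 1]]

small = [0, 0, 1, 1, 0, 0, 0, 0]                              # object at one scale
large = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]      # same object, 2x larger

print(small == large)                                  # -> False (raw pixels differ)
print(edge_positions(small) == edge_positions(large))  # -> True  (descriptor matches)
```

A machine comparing raw pixels would call these two different objects; a machine comparing the normalized descriptor recognizes them as the same one. Real descriptors apply the same principle to 2-D images with many more kinds of invariance.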
How do machines currently do it then?
A few powerful features and algorithms have been formulated in the last decade to tackle this problem. These features are invariant to scale and rotation, and to some extent illumination. Examples include descriptors like SIFT, SURF, PHOG, and BRIEF, the FAST keypoint detector, and matching techniques like the Pyramid Match Kernel. Each feature descriptor has its own advantages and disadvantages. Researchers have tried to come up with formulations that are close to the human visual system so that recognition accuracy increases; if you are interested, you can read up on Gabor filters and the HMAX model. The current state-of-the-art technologies can achieve an accuracy of around 65-70%, depending on the dataset and conditions.
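To give a flavor of how descriptors like SIFT and PHOG work, here is a minimal sketch of their core idea: summarize a patch by a histogram of gradient orientations, which is unaffected by a uniform change in brightness. This is a simplified illustration of the principle, not the actual SIFT algorithm:

```python
import math

def orientation_histogram(patch, bins=8):
    """Histogram of gradient directions over a 2-D patch (simplified sketch).
    Gradients depend only on intensity differences, so adding a constant
    brightness to every pixel leaves the histogram unchanged."""
    hist = [0] * bins
    h, w = len(patch), len(patch[0])
    for y in range(h - 1):
        for x in range(w - 1):
            dx = patch[y][x + 1] - patch[y][x]   # horizontal gradient
            dy = patch[y + 1][x] - patch[y][x]   # vertical gradient
            if dx == 0 and dy == 0:
                continue                          # flat region, no orientation
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += 1
    return hist

patch = [
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 50, 50],
]
brighter = [[v + 100 for v in row] for row in patch]

# A uniform brightness change leaves the descriptor untouched.
print(orientation_histogram(patch) == orientation_histogram(brighter))  # -> True
```

The real SIFT descriptor adds keypoint detection, scale-space analysis, and spatial binning on top of this idea, which is where its scale and rotation invariance comes from.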
For a technology to become accepted, it should be fast and have sufficiently high accuracy. Generic object recognition is computationally very intensive, and the accuracy is still low. One of the very good examples of a successful application in this domain is Google Goggles, an app by Google and the first meaningful attempt towards bringing this technology to the masses. It's a great app, give it a try sometime! Having said that, I will also say that we are still far away from the day when you can just point your device at something and it completely understands what's happening around it.