Computers Can See Too

One of the most influential technologies of the last century has been the camera: it was the point at which our machines began to see. Photography has allowed us to capture the best and worst moments in history, which in turn has driven awareness, reflection and action. Its power and influence only grew with the advent of camera phones in the early 2000s. Suddenly, millions of people had the means to document everything from the horrors of life under dictators to their child’s first steps, right in their pockets. A few years later, with the growth of the mobile internet and social media, we not only had a way to document our lives, we also had the means to shout about it to (almost) the entire world. We glimpsed the power of all of this back in 2011 with the Arab Spring, when millions of oppressed people in the Middle East used their camera phones and Facebook accounts to help overthrow several powerful authoritarian regimes.

Today, almost half of the world’s population has access to a smartphone and an internet connection, and from Snapchat to YouTube almost all of the content we consume on a daily basis is visual. 60 million pictures are uploaded to Instagram every day and 300 hours of video are uploaded to YouTube every minute (yes, every minute). And as we move into an even more connected world with the Internet of Things, there are even more cameras in the “things” around us. What does all of this mean? It means that the cloud is filling up with exabytes (1 exabyte = 1,000,000 TB) of visual data waiting to be mined for insights that could help solve many of the world’s problems (sounds familiar, right?). Just think about it: if one image is worth a thousand words, how much can we learn about the world by analyzing trillions of images?

In the past, the most influential photos and videos have spread awareness about the world’s problems and led to reflective moments that in turn sparked action. But because both data and processing power were scarce, large amounts of visual data were never really analyzed at the macro level the way they are today. Now we can combine the huge amount of visual data we have with the processing power available to build Artificial Visual Intelligence, or Computer Vision: technology that can open up a world of new possibilities.

Vision is the ability that lets us identify and process what we see around us and act upon it. So what is Artificial Visual Intelligence? Think of Facebook’s face detection or Tesla’s impressive Autopilot technology. In the simplest terms, Artificial Visual Intelligence means giving a computer the ability to see and make sense of the world around it. Yes, that’s pretty cool, but only Zuckerberg and Musk can do this magic, right? Why would I ever use this technology?

No matter what you do or what industry you are in, it’s pretty clear that technology is going to be the future, and you have the choice to either be a disruptor or get disrupted in your space. Artificial Visual Intelligence, or Computer Vision, is a technology that has the potential to disrupt entire industries and solve major problems we face the world over. Imagine people in a developing country who lack healthcare facilities using computer vision to get a diagnosis of a skin disease, a farmer using it to diagnose the pests and diseases plaguing his crops, or even a blind man using a wearable device and computer vision to navigate the world around him. All of this can be done thanks to computer vision, and the possibilities are limited only by your imagination and creativity.

Now that you’re sold on computer vision and its potential, your next question is probably how to implement it and use it in your own projects. Well, this is where it gets a bit complicated. It’s not so simple to get a computer that thinks in 0s and 1s to analyze something as complex as an image and give intelligent insights about it.

One of the most basic use cases of computer vision is classification, where you want to classify the contents of an image as either one thing or another (does this image show an apple or an orange?). To do this you would have to build a Machine Learning model called a neural network, which would use a few thousand existing images of apples and oranges to learn the relationship between the pixels in an image and whether it shows an apple or an orange. But in addition to a pretty powerful machine, you need a good knowledge of Machine Learning and a large dataset of labeled images of apples and oranges. This is a very exciting technology with a lot of potential, and I definitely recommend that you learn it the hard way, especially if you love Computer Science and Math. But if you are looking to quickly add some computer vision to your own projects, then your best option is Google’s Vision API.

Google Vision API Demo

An API, or Application Programming Interface, is an interface that lets one system request data or services from another. In the case of the Vision API, you send your image to Google, Google runs it through its computer vision models, and you get back a list of labels, each with a confidence score between 0 and 1 (how confident the model is that the label applies to the image), as seen above.
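To make that exchange concrete, here is a minimal sketch in Python of a label detection request against the Vision API’s REST endpoint, using only the standard library. The function names are my own, and it assumes you authenticate with an API key from your own Google Cloud project; treat it as a starting point rather than the official client.

```python
import base64
import json
import urllib.request

# Public REST endpoint for Cloud Vision image annotation.
VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(image_bytes, max_results=10):
    """Build the JSON body for a LABEL_DETECTION request."""
    return {
        "requests": [
            {
                # The image is sent inline as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def parse_labels(response_json):
    """Return (label, score) pairs from the first entry of the response."""
    annotations = response_json["responses"][0].get("labelAnnotations", [])
    return [(a["description"], a["score"]) for a in annotations]


def detect_labels(image_path, api_key):
    """Send one image to the API and return its labels (needs network access)."""
    with open(image_path, "rb") as f:
        body = json.dumps(build_label_request(f.read())).encode("utf-8")
    request = urllib.request.Request(
        f"{VISION_URL}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return parse_labels(json.load(reply))
```

Calling detect_labels("photo.jpg", "YOUR_API_KEY") (both values are placeholders) would return a list of label and score pairs for that image. Google also publishes client libraries that wrap this same request if you would rather not build it by hand.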

This is an extremely powerful solution, because it lets you quickly integrate computer vision into your projects without worrying about any of the speed bumps I mentioned earlier. Since Google has indexed basically the entire internet’s images and videos, it has one of the largest visual datasets anywhere. Google also has numerous data centers packed with computing power, so processing power is not an issue either. And finally, some of the most talented engineers have already built the vision model, so you don’t need to worry about that. You might be wondering whether Google just hands out all of this for free. Don’t worry: the good guys at Google let you try out the service for free for the first 1000 requests per month (plenty for tinkering purposes), and after that they charge a few dollars for every 1000 requests you make.

Once you have the labels and confidence scores from the Vision API, the hard ML stuff is done, so you just need to figure out what you want to do and manipulate the data accordingly. When I first used Google Vision for one of my projects, I wanted to identify whether an image contained any kind of electronics or not. So I ran a bunch of images of electronics through Google Vision and compiled a comprehensive list of all the labels it returned. Then I built a basic algorithm that takes Google Vision’s response labels for any given image and checks which of them also appear in that comprehensive list. Finally, the algorithm performs a simple averaging step: it sums the confidence scores of the common labels and divides by the number of common labels, giving a final confidence score between 0 and 1 for whether the image contains electronics. If there are no common labels at all, the image is treated as not containing electronics.
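The scoring step described above can be sketched in a few lines of Python. The reference labels below are hypothetical stand-ins, not the list I actually compiled, and the input is assumed to be a list of label and score pairs of the kind the Vision API returns.

```python
# Hypothetical reference list; the real one was compiled by hand from API responses.
ELECTRONICS_LABELS = {"electronics", "laptop", "mobile phone", "gadget", "computer"}


def electronics_confidence(labels, reference=ELECTRONICS_LABELS):
    """Average the scores of returned labels that also appear in the reference list.

    `labels` is a list of (description, score) pairs.
    Returns 0.0 when none of the labels match the reference list.
    """
    # Compare case-insensitively, since label capitalization can vary.
    matched = [score for name, score in labels if name.lower() in reference]
    if not matched:
        return 0.0
    return sum(matched) / len(matched)
```

For example, given the pairs ("Laptop", 0.9), ("Table", 0.8) and ("Gadget", 0.7), only the first and last match, so the result is the mean of 0.9 and 0.7. Note that a plain average ignores how many labels matched: one weak match and five strong ones are treated alike, which is part of why this approach is crude.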

Clearly, this is not the most elegant solution to the problem, but at the time I did not have the data or the ML knowledge to build and train my own model, and this solution just works. I used Google’s solution because it is quite easy to set up and use, but if you want features like facial authentication, you can also try IBM’s Watson, Microsoft’s Azure or Amazon’s Rekognition (in my view the most powerful of the lot). Another very promising solution is Google’s AutoML engine, which is still in its alpha stage as I write this. Essentially, AutoML adds more customizability so that you can build your own models on top of Google’s existing service. With AutoML I could have created my own model using Google’s resources without having to jury-rig my own solution.

Computer Vision or Artificial Visual Intelligence is an extremely powerful and promising technology that definitely has a place in the future. While these APIs are a good way to get started, they can only give you a taste of the possibilities. If you are really serious about making a change in the world with technology or if you’re just passionate about technology or math, I highly encourage you to dive deep into the subject.

I hope you obtained some value and learnt something from this and I can’t wait to see your applications of this technology. Also, look forward to an explanation about Machine Learning in simple terms coming up soon!
