The graph above, obtained from Google Trends, shows the popularity of the term ‘Machine Learning’ on the internet over the last 5 years. According to the graph, the term has become extremely popular in the last few years, and if you have followed any of the happenings in the tech world, you would not be surprised. Machine Learning is one of the most talked about and hyped topics in tech. From Elon Musk to Sundar Pichai, everyone is talking about how Machine Learning is the future, and almost everyone in the tech space is rushing to get in on the action. But why is this? Why has Machine Learning become so popular in the last few years?
The answer to the recent popularity of ML is Big Data. From the birth of the human species roughly 300,000 years ago until the year 2005, humans had created 130 Exabytes of data, including the words we have spoken, the books we have written, the art we have created, and all the other content we have produced. This might not seem too impressive at first, but just keep reading… The Amazon rainforest covers around 1.4 billion acres, and each acre holds around 500 trees, which means there are around 700 billion trees in the entire rainforest. Hypothetically, if you cut down all of these trees, made paper out of them, and filled every page with words, those pages would hold only 1–2 Petabytes of data, and 1 Exabyte is 1,000 times the size of a Petabyte.
130 Exabytes should now sound more impressive. But what’s more impressive is what happened after 2005: by 2010 we had created 1,200 Exabytes, and by 2015, 7,900 Exabytes. There is clearly exponential growth in the amount of data we are collecting, and by 2020 the total is predicted to reach 40,900 Exabytes. As everything around us goes digital, more and more data is being created. What does all of this data mean? Well, data is an extremely powerful tool that can provide us with valuable insights to solve problems. Those insights can make data even more valuable than money. This is the reason why companies like Google and Facebook provide their services for free – the data they collect by providing a free service is worth more to them than any fee they could charge.
Traditionally, humans have analysed data and manually written down the insights they found in it. But as the amount of data grows beyond human analysis capabilities, extracting valuable insights from it becomes challenging. Human data scientists can only ever leverage a fraction of all our data. This is where Machine Learning comes in. Machine Learning gives us the potential to leverage all of the excess data we have to gain valuable insights about our world that can solve many of our problems. So what is Machine Learning?
In essence, Machine Learning is using Data to answer Questions. This is a great definition because it encapsulates the two main aspects of Machine Learning: Training and Predicting. We use data to train the machine and then we answer questions by predicting.
The diagram below illustrates the process behind the training part in Machine Learning. Through training, data can be used to create a Machine Learning model. To gain a deeper understanding of how the training process works, we must first look at what the data we input looks like.
The table below contains a very simple sample dataset of some users of a Social Network. The dataset contains information about the users’ salaries, their ages, and whether they purchased an item. The goal is to predict whether any given user will purchase the item, so that advertising on the social network can be better targeted. This data is a simplified version of the kind of data we would use to train a Machine Learning model, but its simplicity lets us easily understand how the data is used to create a model. So how can this data be used to create a model?
First we need to understand the difference between an Independent and a Dependent variable. You may be familiar with these terms from high school. All of the data we input in the training stage is always categorized into independent or dependent variables. Independent variables are the parameters or variables that we use to make the prediction. The dependent variable is the prediction we make based on the independent variables. This may not seem too clear yet, but we unconsciously categorize much of the data we use in our daily lives into independent and dependent variables. Let’s look at an example.
You are about to leave the house and are wondering whether to take an umbrella or not. So you look outside and decide you need one. How did you make this decision? You saw that the wind was strong and the clouds were dark, which led you to make the prediction that it would rain. In this case, the variables you used to make the prediction – the strong wind and the dark clouds – were the independent variables, and the prediction you made – that it would rain and you would need an umbrella – was the dependent variable.
This means that in the sample dataset shown above, the independent variables are the Age and the Estimated Salary of the user, and the dependent variable is whether the user purchased the item or not. Now that we know what independent and dependent variables are, understanding the training aspect of Machine Learning is quite simple. During training, all that happens is that the machine identifies the correlations between the independent and dependent variables to create a model. Since computers can’t think like us, this correlation is identified through mathematics, but that’s not important right now. This also means that the more training data we have, the better our model will perform. The essence of training is that:
During training the machine identifies the correlations between the Independent and Dependent Variables to create a model
The predicting stage is very simple, and it is where the real value of Machine Learning is obtained. Once the model has been created through training, we use it to predict future outcomes. As seen in the diagram below, this is done by inputting future data and predicting the outcome. Since the machine has not trained on this future data before, it simply applies the model it learnt to predict the dependent variable for that scenario. In a few words:
When predicting the machine uses the model to predict the outcomes for future data
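The two stages can be sketched in a few lines of Python. This is a deliberately naive model – a 1-nearest-neighbour classifier, where ‘training’ is simply storing the examples – on made-up numbers for the social-network dataset, not how a production system would be built, but it shows the train-then-predict flow end to end.

```python
import math

# Hypothetical training data (the values are made up for illustration):
# independent variables (age, estimated salary) -> dependent variable
# purchased (1) or not (0).
training_data = [
    ((25, 30000), 0),
    ((35, 60000), 0),
    ((45, 80000), 1),
    ((52, 110000), 1),
    ((23, 20000), 0),
    ((48, 95000), 1),
]

def train(data):
    # In a 1-nearest-neighbour model, 'training' just stores the examples;
    # the learnt correlations live in the stored data itself.
    return data

def predict(model, age, salary):
    # Predict the dependent variable for unseen independent variables by
    # finding the most similar past example.
    def distance(example):
        (ex_age, ex_salary), _ = example
        # Scale salary down so both features contribute comparably.
        return math.hypot(ex_age - age, (ex_salary - salary) / 1000)
    _, purchased = min(model, key=distance)
    return purchased

model = train(training_data)
print(predict(model, 50, 100000))  # 1: similar to past purchasers
print(predict(model, 24, 25000))   # 0: similar to past non-purchasers
```

A real project would reach for a library such as scikit-learn rather than hand-rolling this, but the idea is the same: past data in, model out, predictions on new data.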
The whole process behind Machine Learning has been designed to follow the human learning process. To understand why, let’s go back to the umbrella problem we looked at above. You gathered some data by looking out of the window and then identified the independent variables – the dark clouds and the strong winds. How did you know that this meant it was going to rain? You knew because you had previous experience of this situation. Ever since you were a little kid, you had seen that dark clouds and strong wind correlate with rain – you had unconsciously used data to train yourself and create a mental model. Then, when you needed to figure out whether it was going to rain, you collected some new data and applied it to the model you had created. You thought, ‘OK, there are dark clouds and the wind is strong; in the past this led to rain. So I think it’s going to rain.’ You were trained on past data, which allowed you to make predictions on future data. This is exactly what happens in Machine Learning: the machine identifies the patterns and then makes predictions based on what it learnt. So why does it matter that Machine Learning is so similar to the human learning process?
Let’s look at how we would approach this problem without Machine Learning. Without ML, we would first have to do some human research into which factors cause rain and record all of these variables. Then we would have to manually program every single one of these factors and how they affect the outcome. For example, we would have to program a rule like: if there are dark clouds and the wind is strong, then it will rain. This might not seem too difficult, but as problems get more complex and more independent variables affect the final outcome, the research and programming process becomes much more cumbersome.
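The manual approach might look like the sketch below. Every rule here is something a human had to research and type in by hand, and this toy version only handles two factors:

```python
# A hand-coded, rule-based approach: every factor and its effect on the
# outcome must be researched and programmed manually by a human.
def will_it_rain(dark_clouds: bool, strong_wind: bool) -> bool:
    # Just two factors already need an explicit rule; add humidity,
    # pressure, temperature, season... and the rules multiply quickly.
    return dark_clouds and strong_wind

print(will_it_rain(True, True))   # True -> take an umbrella
print(will_it_rain(True, False))  # False -> leave it at home
```

Every new independent variable means going back into this function and working out, by hand, how it interacts with all the existing rules.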
With Machine Learning, the research and programming part is not required, since the machine itself spots the correlations and creates the model. Furthermore, Machine Learning models can be even more accurate than humans, because they may identify correlations that humans overlook or cannot comprehend. For example, a machine could become better at diagnosing cancer by identifying correlations we have not found yet, simply because no human has ever analysed such a large dataset. Through Machine Learning and its bottom-up approach to solving problems with data, we can engineer scalable and highly accurate solutions that use data to improve everyone’s lives.
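To see the contrast, here is a toy ‘learner’ for the umbrella problem, using a handful of made-up past observations. Instead of a human writing the rule, the machine extracts it from the data by tallying which outcome each combination of independent variables most often led to:

```python
from collections import Counter, defaultdict

# Hypothetical past observations (invented for illustration):
# independent variables (dark_clouds, strong_wind) -> did it rain?
observations = [
    ((True, True), True),
    ((True, True), True),
    ((True, False), False),
    ((False, True), False),
    ((False, False), False),
    ((True, True), True),
]

def train(data):
    # Let the machine find the correlation itself: for each combination of
    # independent variables, remember which outcome occurred most often.
    tallies = defaultdict(Counter)
    for features, outcome in data:
        tallies[features][outcome] += 1
    return {features: counts.most_common(1)[0][0]
            for features, counts in tallies.items()}

model = train(observations)
print(model[(True, True)])   # True: the rule was learnt, not programmed
```

Nobody wrote ‘if dark clouds and strong wind, then rain’ anywhere; the correlation came out of the data, and feeding in more observations with more variables requires no new hand-written rules.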
Since this is a very high-level overview of Machine Learning, you might still not see the full power and potential of this technology. So let’s look at some powerful applications. Machine Learning and health data can be used to diagnose patients and predict possible illnesses. Crop data can be used with ML to identify pests and diseases plaguing crops and to improve the efficiency of agriculture. Machine Learning can even be used to predict the winner of the next World Cup.
The techniques and algorithms that power Machine Learning have been around since the early 1900s, but they are only picking up steam today because of our access to data and computational processing power. We live in a world where we are collecting thousands of Exabytes of data about everything and everyone around us, and we now have the power to obtain value from all of this data. The only limit to the potential of Machine Learning is your imagination.
I hope this gave you a better understanding of Machine Learning that you can use when engineering your own solutions to problems. I will go in depth into different Machine Learning techniques and algorithms in my next post.