What is Machine Learning?
Machine learning has been referred to as “the science of getting computers to act without being explicitly programmed.”
What is machine learning? How is ML different from AI? How exactly can a machine learn by itself? These are all common questions surrounding one of the biggest technology innovations of the 21st century: Artificial Intelligence. With use cases from predicting the stock market to curing cancer, machine learning and AI are already revolutionizing the world as we know it.
A term I’ve heard used that really encapsulates the machine learning process is that ML is “the science of getting computers to act without being explicitly programmed.” In this blog, we’re going to go over the components that make up machine learning, break down machine learning models and algorithms, and review common use cases in the cybersecurity and IT field.
Let’s start by clarifying the AI vs. ML terminology. Artificial Intelligence, or AI, is a general term we use when we talk about machines making decisions. This could range from recommending your next song on Spotify to predicting stock prices based on massive amounts of data. Machine Learning is one approach, or subset, of AI that is rooted in statistical algorithms that make data-driven decisions. Deep learning is a further subset of machine learning that uses neural networks to evaluate whether the model’s output was correct and adjust itself if necessary.
At its core, an ML algorithm parses data, learns from it, and then applies what it has learned to make informed decisions. More specifically, an algorithm takes in data and, through training, produces an output or a decision. Over time, the training process improves the overall quality of the output: each piece of information that is learned feeds back into the machine learning model in a continuous cycle of output and re-training.
There are three important components to a machine learning model: the data, the algorithm, and the training. By definition, training consists of learning new information, either by being taught or by experience. That is why an important part of building a machine learning model is deciding how information will be learned, and there are different approaches for different use cases. It is very important to understand the notions of supervised, unsupervised, semi-supervised, and reinforcement learning.
In supervised learning, the machine is given some prior knowledge – what is good or bad, what works, and what doesn’t. A good example of this is image classification, where we train the machine on a series of labeled pictures of dogs so it can learn what a dog is. Likewise, if we wanted the machine to classify a ‘cat,’ we would train it on a series of pictures of cats so that it can recognize and distinguish a dog vs. a cat in an image.
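To make the supervised idea concrete, here is a minimal sketch of training on labeled examples. Instead of real images, each “picture” is a made-up two-number feature vector (both the features and the numbers are hypothetical, purely for illustration); the training step averages the labeled examples, and prediction picks the closest label.

```python
# Toy supervised learning sketch: a nearest-centroid classifier.
# The "images" are hypothetical two-number feature vectors (e.g., ear
# pointiness, snout length). Real image classifiers work on pixels,
# but the core idea is the same: learn from labeled examples.

def train(examples):
    """Average the feature vectors for each label (the training step)."""
    sums, counts = {}, {}
    for features, label in examples:
        s = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in s] for label, s in sums.items()}

def predict(centroids, features):
    """Classify a new example by its closest label centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Labeled training data: (features, label) -- the "prior knowledge."
training_data = [
    ([0.9, 0.8], "dog"), ([0.8, 0.9], "dog"),
    ([0.2, 0.1], "cat"), ([0.1, 0.2], "cat"),
]
model = train(training_data)
print(predict(model, [0.85, 0.75]))  # → dog
```

The key point is that the labels (“dog,” “cat”) come from a human up front; the machine only generalizes from them.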
With unsupervised learning, the machine has no prior knowledge and is essentially starting from a blank slate. It still has data to work from, but it has no trained concept of what good or bad is. This kind of learning is used frequently in algorithms like clustering and anomaly detection. Unsupervised learning is good for finding hidden patterns in data where we don’t know exactly what we’re looking for, but we want to surface anomalies or relationships we didn’t know were there.
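A minimal sketch of the unsupervised idea: flag anomalies with no labels at all. Nothing in the data below is marked “good” or “bad” up front; the outlier emerges purely from the statistics of the data itself (the login counts are invented for illustration).

```python
# Unsupervised sketch: flag anomalies with no labeled examples.
# The outlier emerges purely from the structure of the data.

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# Hypothetical daily login counts for a user -- one day stands out.
logins = [10, 12, 11, 9, 10, 13, 11, 95]
print(find_outliers(logins))  # → [95]
```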
Semi-supervised learning is a combination of the two learning approaches because you’re giving the machine some data to work from, but it ultimately has to make its own decision. A classic example of this is facial recognition on video, where the machine needs a known picture (supervised) to match with a face in a video (unsupervised).
Lastly, reinforcement learning is about having a machine learn to make a sequence of decisions through trial and error. Rewards or penalties reinforce good or bad behavior on the way to its ultimate goal of solving some complex problem. Autonomous cars are an excellent example of reinforcement learning because a programmer cannot possibly account for every scenario on the road and must leave it to the machine to make decisions as it encounters them. This gives the algorithm the flexibility to decide on the fly based on feedback, like proximity sensors and objects on the road.
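The trial-and-error loop can be sketched with tabular Q-learning on a tiny made-up world (far simpler than anything a car would use, but the same reward-driven principle). States run 0 to 4, the agent starts at 0, and a reward is given only at state 4; through repeated episodes it learns on its own that moving right is the best policy.

```python
import random

# Reinforcement learning sketch: tabular Q-learning on a tiny 1-D world.
# No one programs the "move right" rule in -- rewards reinforce it.

ACTIONS = [-1, +1]          # move left or right
GOAL, N_STATES = 4, 5
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
random.seed(0)

for episode in range(200):
    state = 0
    while state != GOAL:
        # Explore sometimes; otherwise exploit the best known action.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else 0.0
        # Q-update: nudge the estimate toward reward + discounted future value.
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# The learned policy at every non-goal state ends up as "move right" (+1).
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```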
At this point, it’s important to note that how the machine learns the data is part of the training process that makes up the machine learning model. But models themselves are not to be confused with algorithms. As we mentioned previously, as you train an algorithm with data, it becomes a model. So, to put it another way, Models = Training (Algorithms plus data). The more data and time that runs the algorithm, the more fine-tuned it becomes, which leads to better decisions. And just like there are different learning methods for different use cases, so too are there different algorithms based on the need. We’re going to review the five most common algorithms, which include: Classification, Regression, Recommendation, Dimensionality reduction, and Clustering.
Let’s start off with the Classification algorithm, which, as its name indicates, is focused on classifying or categorizing data. In a cybersecurity context, this type of algorithm is one of the most common because the machine is classifying what is good or bad, clean or malicious. This kind of algorithm is common in email servers to classify spam messages. Part of the learning process would be to train the algorithm with some examples of spam messages: for example, having an unknown email address, too many exclamation marks, or sentences that sound like they were created by a bot. The training process would also include giving the algorithm examples of real emails: for example, emails that have a personal address, clear language, etc. Classification algorithms are also being used in the SIEM space, where a SIEM device takes in data from devices in your network and learns what normal behavior is to help identify outliers. Unlike traditional SIEMs without machine learning, where rules are static, the whole point of machine learning is that it learns and trains on its own. Over time, the more data the model trains on, the more accurate its decisions become.
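Here is a toy sketch of the spam-classification idea. Real spam filters use far richer features and probabilistic models; this simplified stand-in just counts how often each word appears in the spam vs. legitimate training emails (all of the sample messages are invented).

```python
from collections import Counter

# Classification sketch: a toy spam filter trained on labeled messages.

def train(messages):
    """Count word frequencies per class from labeled (text, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Score a new message by which class its words appear in more."""
    def score(label):
        return sum(counts[label][w] for w in text.lower().split())
    return "spam" if score("spam") > score("ham") else "ham"

training_emails = [
    ("win a free prize now!!!", "spam"),
    ("claim your free money now", "spam"),
    ("meeting notes from today", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
model = train(training_emails)
print(classify(model, "free prize money"))            # → spam
print(classify(model, "notes for the meeting tomorrow"))  # → ham
```

More training examples would sharpen the word counts, which is exactly the “more data, better decisions” loop described above.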
The clustering algorithm works in a similar way but for an entirely different purpose. With clustering, we’re focused on finding similarities between data points and grouping, or clustering, them. The main difference is that clustering discovers similarities between objects on its own, whereas classification sorts data into predefined classes. Say, for example, you have an e-commerce site and want to segment your user traffic for marketing purposes. Based on their cookies or traffic information, you can cluster them into groups like new customers, high-income earners, etc. Once users are grouped, it’s up to you to decide what to do with the data. Clustering has many different use cases, and it’s common to use it in combination with other algorithms: using our e-commerce example, once we have our user groups, we could feed them into a recommendation algorithm to make calculated recommendations based on income or demographics.
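The e-commerce grouping can be sketched with a minimal k-means, the classic clustering algorithm. The two features per user (visits per month, average order value) and all the numbers are hypothetical; note that no group labels are given anywhere, the clusters emerge from the data.

```python
# Clustering sketch: a minimal k-means grouping e-commerce users
# by two hypothetical features (visits per month, avg order value).

def kmeans(points, k, iterations=10):
    centroids = points[:k]  # naive initialization: first k points
    clusters = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda i: sum((a - b) ** 2
                                         for a, b in zip(p, centroids[i])))
            clusters[best].append(p)
        # Move each centroid to the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

users = [(2, 20), (3, 25), (2, 22),        # occasional, low spend
         (20, 200), (22, 210), (21, 190)]  # frequent, high spend
centroids, clusters = kmeans(users, k=2)
print(sorted(len(c) for c in clusters))    # → [3, 3]: two groups of three
```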
Next, we have the Regression algorithm, which is focused on predicting values based on past data. Or, put another way, knowledge about the existing data is used to predict new data. For example, say you had the details of every house sale over the last ten years: square footage, number of beds, sale price, and so on. This past data can be used as the input for the regression algorithm to predict future price values and trends. While this kind of algorithm is most popular in fields like medicine, stock markets, and real estate, the reality is that any industry can benefit from this kind of data mining. Credit card companies and banks use it for fraud detection, whereby your historical purchases serve as the training data. When something out of the norm is detected, an alert fires that the purchase may be fraudulent.
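The house-price example can be sketched with ordinary least squares on a single feature. The sale data below is made up (and chosen so price is exactly 100 × sqft + 50,000); real models would use many features and far more history.

```python
# Regression sketch: ordinary least squares on one feature
# (square footage -> sale price). Sample data is hypothetical.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Past sales: square footage and sale price.
sqft   = [1000, 1500, 2000, 2500]
prices = [150_000, 200_000, 250_000, 300_000]
slope, intercept = fit_line(sqft, prices)

# Predict the price of an 1800 sq ft house from the fitted line.
print(slope * 1800 + intercept)  # → 230000.0
```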
The regression algorithm is great at predicting from past data whose important factors we have already identified. But what if we don’t know what’s important? What if we have a lot of different data points and we’re not really sure what’s significant, or we want to identify outliers? The dimensionality reduction algorithm is exactly for this use case: finding outliers or significant factors in a large data set. Let’s take the stock market again as an example. If you’ve ever looked at a fundamentals breakdown, you’ll see there are lots of data points that can have a significant impact on the performance of the stock. The dimensionality reduction algorithm can take all these data points across the entire industry, over a specified number of years, and help identify which ones matter. The same concept can be used in threat hunting. A sudden 12% increase in CPU usage may be just another Windows update on a PC. But when combined with other data points, like the fact that the user also went to an uncategorized URL and had an old version of Firefox, it can point to something more serious. Dimensionality reduction is designed to find these patterns of suspicious behavior in a mountain of data that simply cannot be matched by human hands or static SIEM rules.
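A heavily simplified sketch of the idea: rank features by how much they actually vary and drop the low-signal ones. Real dimensionality reduction typically uses techniques like PCA; this variance-ranking stand-in (with invented per-host telemetry columns) only shows the core intuition of shrinking many data points down to the informative few.

```python
# Dimensionality-reduction sketch: keep the highest-variance columns.
# A simplified stand-in for PCA-style techniques; data is hypothetical.

def variance(column):
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)

def top_features(rows, names, keep=2):
    """Keep the `keep` feature names with the highest variance."""
    columns = list(zip(*rows))
    ranked = sorted(range(len(names)),
                    key=lambda i: variance(columns[i]), reverse=True)
    return [names[i] for i in ranked[:keep]]

# Hypothetical per-host telemetry; columns match the names below.
names = ["cpu_pct", "disk_io", "uptime_days"]
rows = [
    [12, 340, 30],
    [95, 310, 31],
    [14, 355, 30],
    [90, 330, 31],
]
print(top_features(rows, names))  # uptime barely varies, so it drops out
```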
As its name indicates, the recommendation algorithm is focused on making recommendations based on past data. The logic is simple: from past data about users, the algorithm finds trends such as “people who bought or viewed X also buy or view Y.” Algorithms like this power a multibillion-dollar industry and are used by YouTube, Amazon, and Facebook to make calculated recommendations based on who you are and what the algorithm thinks will keep you engaged. The recent Netflix documentary ‘The Social Dilemma’ talks in depth about these kinds of recommendation engines and the dangers they pose for society at large.
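The “bought X, also bought Y” logic can be sketched with simple co-occurrence counts over past purchase baskets. Production recommenders are far more sophisticated, but this is the seed idea (the purchase data is invented).

```python
from collections import Counter

# Recommendation sketch: "customers who bought X also bought Y,"
# driven by co-occurrence counts over past purchase baskets.

def also_bought(baskets, item):
    """Rank items that co-occur with `item` across purchase histories."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(i for i in basket if i != item)
    return [i for i, _ in counts.most_common()]

purchases = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"laptop", "usb hub"},
    {"phone", "phone case"},
]
print(also_bought(purchases, "laptop"))  # "mouse" first: it co-occurs most
```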
As we wrap up the algorithm discussion, it’s important to note that algorithms are not mutually exclusive. It’s common to combine, or ensemble, two or more algorithms to produce a final verdict.
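A minimal sketch of ensembling: several independent verdicts combined with a majority vote. The individual “detectors” here are imaginary; the point is only how separate outputs become one final decision.

```python
from collections import Counter

# Ensemble sketch: combine independent model verdicts by majority vote.

def majority_vote(verdicts):
    """Return the most common verdict among the individual models."""
    return Counter(verdicts).most_common(1)[0][0]

# Three hypothetical detectors scoring the same email.
verdicts = ["spam", "spam", "ham"]
print(majority_vote(verdicts))  # → spam: two of three models agree
```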
Let’s put it all together by looking at the high-level machine learning flow. Data comes in as the input to the machine learning model. Depending on the decision that needs to be made about that data, a model is selected. As we saw previously, a model is the product of training an algorithm with data. The more data that runs through the algorithm, the better trained the model becomes. The input data then comes out as a decision, which is sent off to something like an application that decides what to do with it.