
Applying Bayes Theorem to a Big Data World


In this article, we take a look at Bayes Theorem and how it can be used in big data projects such as predictive analysis.


Bayes Theorem

One of the challenges in analyzing Big Data is, of course, its volume: there is just so much of it. Mix in high velocity, or Fast Data, and the standard analytical methodologies for making sense of it break down, becoming cumbersome and ineffective.

Machine learning techniques that self-adjust and improve over time are a cost-effective alternative. Bayes, a machine learning methodology, is an effective tool for classifying or categorizing data as it streams in. It does not depend on modeling or on managing complex rule sets.

However, people often get confused when Bayes is described to them. The purpose of this article is to clear up that confusion and boil Bayes down to a few very simple concepts.

In Wikipedia, Bayes is defined as a way of determining “the probability of an event, based on prior knowledge of conditions that might be related to the event.” But Bayes has also been termed a "statistical method for classification." For instance, Wikipedia defines Naïve Bayes as a "family of simple probabilistic classifiers based on applying Bayes theorem." So is Bayes a way of predicting the future, or is it a way to classify? The answer is that it is both. You can classify what an event will most likely be in the future based on historical data from the past.
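To make the definition a little more concrete, here is a minimal sketch of the theorem itself, P(A|B) = P(B|A) * P(A) / P(B), written as a small Python function. The function name and all of the probabilities are hypothetical, chosen only for illustration.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("evidence probability must be positive")
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration:
p_rain = 0.30             # P(rain): share of past days on which it rained
p_low_given_rain = 0.90   # P(low pressure | rain)
p_low = 0.50              # P(low pressure) across all past days

p_rain_given_low = posterior(p_rain, p_low_given_rain, p_low)
print(f"P(rain | low pressure) = {p_rain_given_low:.2f}")  # prints 0.54
```

In words: knowing that the pressure is low raises the estimated chance of rain from the 30% prior to 54%, because low pressure was seen far more often on rainy days than on days overall.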

So how can we use Bayes in Computer Science to predict what will happen in the future? Or worded another way... how can we use Bayes in Computer Science to classify what something will be in the future?

To answer this question, one must understand that for a Bayes Classifier to work, it must first be "trained" with data. As we are in a Big Data world, we have lots of historical data to train it with. And it must be trained with data in which we "already know the answer." The input data we train with is referred to as "features." The classifications Bayes comes up with are referred to as "categories."

Predicting the Weather

Let's take an example: we wish to predict the weather. Specifically, we wish to predict one of the following categories of weather (rain, snow, or sunshine) based on certain atmospheric conditions, or features, such as temperature, air pressure, wind speed, and wind direction. In order to train with this data, we must know what happened in the past. So we would gather the features from past days along with our knowledge of what the weather (the category) actually was on those days. This data will be used to train our Bayes classifier.




Temperature | Air Pressure | Wind Speed | Wind Direction
40 degrees  | 1002.4 MB    | 10 mph     |
20 degrees  | 999 MB       | 40 mph     | North East
39 degrees  | 853 MB       | 20 mph     |
42 degrees  | 1200 MB      | 25 mph     |




Figure 1: Training data for the Bayes Classifier

As per Figure 1, if we train our Bayes Classifier with 12 months of data, from January to December of 2016, then the Classifier can be used to predict the type of days we will have in 2017. So let's assume that it's now 6:00 a.m. on January 1, 2017. We know the temperature, air pressure, wind speed, wind direction, etc. We now stream this data into our Bayes Classifier, much as we did when we were training it. Only this time, we are not telling it the category; we are asking for it. It's as if we're asking, "Hey Bayes, based on how these features determined the category of day in the past, what do you think today will be like? Will today be rain, snow, or sunshine?" To reiterate: when we trained our classifier, we told it what the day was like; now we are asking it to predict what the weather will be like.

The more data that is used to train the classifier, the more accurate it will generally become. So if we continue to train it with actual results in 2017, then what it predicts in 2018 should be more accurate. Also, when Bayes gives a prediction, it will attach a probability. So it may answer the above question as follows: "Based on past data, I predict with 60% confidence that it will rain today."

So the classifier is either in training mode or predicting mode. It is in training mode when we are teaching it. In this case, we are feeding it the outcome (the category). It is in predicting mode when we are giving it the features, but asking it what the most likely outcome will be.
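The two modes can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular product implements it: the class name `TinyNaiveBayes`, the coarse feature values ("cold", "strong", and so on), and the six training days are all hypothetical, and add-one smoothing is an assumption made here so that a value never seen with a category does not force its probability to zero.

```python
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Categorical Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.category_counts = Counter()
        # (category, feature name) -> counts of the values seen with it
        self.value_counts = defaultdict(Counter)
        # feature name -> every distinct value seen during training
        self.values_seen = defaultdict(set)

    def train(self, features, category):
        """Training mode: we supply the features AND the known outcome."""
        self.category_counts[category] += 1
        for name, value in features.items():
            self.value_counts[(category, name)][value] += 1
            self.values_seen[name].add(value)

    def predict(self, features):
        """Predicting mode: features only; returns (category, probability)."""
        total = sum(self.category_counts.values())
        scores = {}
        for category, count in self.category_counts.items():
            score = count / total  # prior P(category)
            for name, value in features.items():
                seen = self.value_counts[(category, name)][value]
                # smoothed estimate of P(value | category)
                score *= (seen + 1) / (count + len(self.values_seen[name]))
            scores[category] = score
        best = max(scores, key=scores.get)
        return best, scores[best] / sum(scores.values())


# Hypothetical training days (features plus the weather that occurred).
clf = TinyNaiveBayes()
clf.train({"temp": "cold", "wind": "strong"}, "snow")
clf.train({"temp": "cold", "wind": "calm"}, "snow")
clf.train({"temp": "mild", "wind": "strong"}, "rain")
clf.train({"temp": "mild", "wind": "strong"}, "rain")
clf.train({"temp": "mild", "wind": "calm"}, "sunshine")
clf.train({"temp": "warm", "wind": "calm"}, "sunshine")

# Predicting mode: no category is supplied this time.
category, probability = clf.predict({"temp": "cold", "wind": "strong"})
print(f"{category} with {probability:.0%} confidence")  # snow with 60% confidence
```

Calling `train` again with new days simply updates the counts, which is how continued training with 2017 results would feed into later predictions in this sketch.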

How Does Bayes Work?

You may be wondering how Bayes goes about determining its predictions. Without giving too elaborate an explanation (those can be found by Googling), let us say that Bayes comes up with its predictions by computing statistics on data from events that actually occurred, measuring how often each feature value accompanied each outcome. Using these statistics, it is able to make predictions and attach a probability of confidence to each one.
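As a rough sketch of what "statistical measures on real data" can mean, the snippet below estimates the three ingredients of Bayes' theorem by simple counting and then combines them. The ten days of history are hypothetical and use a single feature (wind) to keep the arithmetic easy to follow.

```python
from collections import Counter

# Hypothetical history: (wind, weather) pairs for ten past days.
history = [
    ("strong", "rain"), ("strong", "rain"), ("strong", "rain"),
    ("strong", "sunshine"), ("calm", "rain"),
    ("calm", "sunshine"), ("calm", "sunshine"), ("calm", "sunshine"),
    ("calm", "sunshine"), ("calm", "sunshine"),
]

n = len(history)
weather_counts = Counter(weather for _, weather in history)
wind_counts = Counter(wind for wind, _ in history)
pair_counts = Counter(history)

# Each probability is just a ratio of counts from the history.
prior = weather_counts["rain"] / n                                      # P(rain)
likelihood = pair_counts[("strong", "rain")] / weather_counts["rain"]   # P(strong | rain)
evidence = wind_counts["strong"] / n                                    # P(strong)

p_rain_given_strong = likelihood * prior / evidence
print(f"P(rain | strong wind) = {p_rain_given_strong:.2f}")  # prints 0.75
```

The result matches what the raw data says directly: three of the four strong-wind days were rainy. With several features, a Naive Bayes classifier multiplies one such likelihood per feature, under the simplifying assumption that the features are independent given the category.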

Bayes and jKool

jKool, also referred to as AutoPilot Insight, has incorporated Naïve Bayes into its analytics. We've found this to be a helpful tool for customers wishing to predict things such as customer sentiment or whether a new deployment of software is going to perform well or have issues. You can see an example repository utilizing Bayes by logging into jKool and viewing the sample Mobile repository.

You can make use of Bayes within jKool by creating Bayes Classifiers and specifying how to train them with learning data (hardcoded values) and/or learning queries. Once this is done, newly streamed data will be classified: the classification is written to the SetName field, and the Bayes probability of confidence is stored as a property.

jKool Sample Repository Demonstrating Bayes

Let's use the sentiment analysis in the sample Mobile repository to explain this further. In this example, we wish to predict if a customer will have positive or negative sentiment with a company based on the notes taken by the Customer Service Department. To do this, we train Bayes to know what a "customer with negative sentiment" is by querying for data about the customer immediately before they cancelled an account. And we train Bayes to know what a "customer with positive sentiment" is by querying for data about the customer immediately before they placed an order with us. Training, in this example, uses a known outcome and assumes that customers that have cancelled their accounts have negative sentiment and customers that place orders have positive sentiment.

This is demonstrated in jKool's Sample Dashboard for Bayes Classification. The Viewlet queries below show what was used to train Bayes:

  • Get Event Fields tokenize(message,' ') where name='cancel.account'
  • Get Event Fields tokenize(message,' ') where name='place.order'

jKool uses the tokenized words of the message field. In this example, the message field holds customer service notes. So to teach Bayes what a customer with positive sentiment looks like, we tokenize the customer service notes taken immediately before an order is placed, and to teach Bayes what a customer with negative sentiment looks like, we tokenize the customer service notes taken immediately before an account is cancelled.
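Outside of jKool, the same idea can be sketched in plain Python. This is not jKool's implementation or query language: the `tokenize` helper below only mimics the spirit of `tokenize(message,' ')`, lowercasing is an added assumption, and the notes and labels are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(message):
    """Split a note on spaces and lowercase it (lowercasing is an assumption)."""
    return [word for word in message.lower().split(" ") if word]


# Hypothetical customer service notes, labeled by the event that followed them.
labeled_notes = [
    ("Customer angry about crash", "negative"),         # before cancel.account
    ("App crash again wants refund", "negative"),       # before cancel.account
    ("Customer happy with fast delivery", "positive"),  # before place.order
    ("Happy with new feature", "positive"),             # before place.order
]

# Training: count how often each word appears under each sentiment.
word_counts = defaultdict(Counter)
for note, sentiment in labeled_notes:
    word_counts[sentiment].update(tokenize(note))

print(word_counts["negative"]["crash"])  # 2
print(word_counts["positive"]["happy"])  # 2
```

These per-sentiment word counts are the "features" in this example; a classifier can later turn them into probabilities for new notes.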

It's important to note that client-side instrumentation must be set up to stream the data into jKool in this manner (with customer service notes being retrieved and placed in the message field when accounts are cancelled or orders placed). This instrumentation is done with minimal code using jKool open-source collectors.

The Viewlet in Figure 2 demonstrates how newly streamed data is classified. For newly streamed data, the customer service notes are tokenized and passed through the classifier. Based on past results and probabilities, the classifier predicts whether the customer will have positive or negative sentiment. Figure 2 shows the breakdown of happy and unhappy users.
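For readers who want to see the classification step spelled out, here is a minimal multinomial Naive Bayes sketch in Python. It is an assumption-laden illustration rather than jKool's code: the word counts, the note counts, and the `classify` function are hypothetical, and add-one smoothing and log probabilities are choices made here for numerical safety.

```python
import math

# Hypothetical per-sentiment word counts, as training might have produced.
word_counts = {
    "negative": {"crash": 2, "angry": 1, "refund": 1, "customer": 1},
    "positive": {"happy": 2, "fast": 1, "delivery": 1, "customer": 1},
}
# Hypothetical number of training notes seen for each sentiment.
note_counts = {"negative": 2, "positive": 2}


def classify(message):
    """Return (sentiment, probability) for a new customer service note."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_notes = sum(note_counts.values())
    log_scores = {}
    for sentiment, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(note_counts[sentiment] / total_notes)  # log prior
        for word in message.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score.
            smoothed = (counts.get(word, 0) + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        log_scores[sentiment] = score
    # Convert log scores back into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {s: math.exp(v - top) for s, v in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


sentiment, confidence = classify("angry customer reports crash")
print(sentiment, round(confidence, 2))
```

With these hypothetical counts the note is classified as negative, since "angry" and "crash" appeared only in notes that preceded a cancellation.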

Figure 2: Predicting Customer Sentiment

Figure 3: 100% of users utilizing the app Mobile Orders version 3.1 are at risk for leaving. The users are all on iOS V10.2 and are distributed across several carriers. The second half of the chart shows satisfied users.

The Viewlet in Figure 3 predicts that customers running iOS version 10.2 with app version 3.1 are likely to cancel their accounts. This would give the company that created the app an indication that there may be an issue (such as a bug) with version 3.1 of the app running on iOS version 10.2. It is also clear from the chart that the problem occurs across all carriers and thus is not carrier-specific.


Bayes Theorem is an effective technique for classifying data in real time and for predicting future behavior based on historical results. By combining classification with prediction, Bayes can be valuable in understanding current customer sentiment, potential customer actions, and many other types of observational data, either at rest or in motion.

For more information, take a look at AutoPilot ITOA Analytics.


Published at DZone with permission of
