Remember what building a model for statistical analysis was like back in the day? Man, that stuff was hard. Weeks – even months – of wrangling complicated C or C++ code, painstakingly building a sprawling beast of a program. And if one tiny piece was wrong, the crushing realization that you’d have to pull the whole thing apart or maybe even start again.
Thankfully, things have moved on since then. Rather than creating a static monolith that you’re stuck with, warts and all, it’s now possible to create a system for analyzing data that’s constantly adapting, evolving, and improving. One of the best ways to do that is with R.
What Exactly Is R?
R is a richly featured (if laconically named) programming language used for all kinds of data science, statistics, and visualization projects. R’s popularity — two million users and counting! — is partly down to it being an open-source language with an extensive support network of forums and free tutorials, allowing the data-curious to train themselves. But the low barrier to entry is far from the only draw. R is powerful and versatile; so much so that everyone from Facebook to Google uses it — the former to analyze feeds and the latter to assess the effectiveness of its ads.
What Does This Have to Do With BI?
One of the great things about R is that it can be integrated with BI platforms to help developers and analysts get the most out of business-critical data. This creates ever-smarter ways to investigate what’s working in your current operations and strategies and to assess how future business decisions could pan out, ranging from statistical techniques like K-means clustering to predictive models such as linear regression. In the case of Sisense’s BI platform, it also allows you to build and run statistical models using Sisense data, automatically updating these as new information flows into the model.
How to Do It: The 4 Stages of an R and Business Analytics Project
As the Hollywood screenwriting mantra goes: show, don’t tell. So, let’s take a look at how you might run a real Business Analytics project using R and real data. For this demonstration, we’ll use a simple example. Imagine you’re analyzing your company’s sales strategy. You want to ensure your sales reps target the right people in their customers’ businesses to maximize their conversion rate. Using that information, you also want to look at their leads and predict their chances of closing the deal.
For this, we’re going to focus on the authority level of the contact, the value of the deal, and whether the deal went through.
There are four steps to this process:
- Exploratory data analysis: Evaluating which predictive variables you should use.
- Data preparation: Engineering features and munging data.
- Build and train model: Iteratively constructing and honing the model for analysis.
- Scoring: Running the model using fresh data to predict the outcome of future deals.
Okay, let’s dive in.
Step 1: Exploratory Data Analysis
The first question is: Which variable is most useful in determining future success?
If you look at the visualization above, you’ll see that the graph on the left shows deal size vs. level of authority within the company, while the one on the right shows conversion rate vs. authority level — how often you close the deal compared with who you’re talking to in the organization. There’s not much variation in the left-hand graph, suggesting that deal size isn’t affected by who you talk to. But on the right, there’s significant variation. It’s clear from this data that if you’re talking to upper management or executives, you’re much more likely to close the deal.
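The charts themselves aren’t reproduced here, but the summaries behind them are easy to compute in R. Here’s a minimal sketch using a hypothetical `deals` data frame — the column names and values are invented for illustration, not taken from the article’s dataset:

```r
# Hypothetical deals data: who we talked to, deal size, and whether it closed.
deals <- data.frame(
  seniority = c("Employee", "Manager", "Upper Manager", "Executive",
                "Employee", "Manager", "Upper Manager", "Executive"),
  amount    = c(10000, 12000, 11000, 11500, 9500, 10500, 12500, 11000),
  converted = c(0, 0, 1, 1, 0, 1, 1, 1)
)

# Average deal size by authority level (the left-hand graph)
aggregate(amount ~ seniority, data = deals, FUN = mean)

# Conversion rate by authority level (the right-hand graph)
aggregate(converted ~ seniority, data = deals, FUN = mean)
```

In this toy data, deal size barely moves with seniority, while the conversion rate climbs sharply for upper managers and executives — the same pattern the two graphs show.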
Step 2: Data Preparation
The next thing to consider is: what do you need to do to “munge” the data, i.e., prepare the data for analysis?
In this case, there’s an obvious problem. The variable we’re working with (seniority) is a categorical variable. It’s text, not a numerical value. That’s not much use in a linear regression model. You can’t add or multiply the word “employee” or “manager.” This means you need to find a way to represent this numerically. The solution is to convert your text data into indicator variables, also known as dummy variables. Instead of listing the job title for each entry on the list, you introduce a column that says, for example: “Is this an employee, yes or no?” and then put a 1 for yes, or a 0 for no.
The sharp-eyed among you may have noticed that, although there are four levels of seniority (employee, manager, upper manager, executive), we only use three indicator variables in the regression. Without going into too much detail, this comes down to linear algebra: including all four columns would make them perfectly collinear (the “dummy variable trap”), so we drop one to avoid multicollinearity. In effect, one level serves as the baseline, and the coefficients for the other three variables are interpreted relative to that baseline. Ta-da!
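If you’re curious how R does this itself, `model.matrix()` expands a factor into indicator columns and drops the first level as the baseline automatically. A quick sketch (the level names here are just the four seniority levels from our example):

```r
# A factor with four seniority levels; "Employee" is listed first,
# so model.matrix() treats it as the baseline and drops its column.
seniority <- factor(c("Employee", "Manager", "Upper Manager", "Executive"),
                    levels = c("Employee", "Manager", "Upper Manager", "Executive"))

# One intercept column plus three 0/1 indicator columns -- no "Employee" column.
dummies <- model.matrix(~ seniority)
dummies
```

Note that there is no `seniorityEmployee` column in the result: that level is the baseline, exactly as described above.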
For this demo, we’ve pre-prepared the dummy variables, but you can do that yourself by adding additional fields to the data inside the Sisense platform, then applying whatever logic, functions or tools you need. The beauty of this is that once done (and set up to display as a 1 or a 0), this will automatically apply to any new data that’s fed into the platform. It automatically updates, refreshing and enriching your model without you having to do any additional prep.
Step 3: Build and Train Model
One of the satisfying things about working with R is that it’s incredibly easy to create a linear model or a generalized linear model. In fact, fitting a linear model takes a single function call: lm(). Fitting a generalized linear model is just as simple: glm(). Not only are these functions easy to call, but under the hood they hand off to compiled C and Fortran routines, making them lightning-quick, too.
Okay, now it’s time to define your outcome variable or response variable. In our case, that’s “conversion” (as in, did the sale convert or not?) and it’s a binary variable — a straightforward yes or no. Using the model we’ve created, we’ll now predict whether a sale will convert, based on the seniority level of the person you’re talking to.
To recap: Our response variable is conversion, our predictor variables are the indicator/dummy variables for seniority (the 1s and 0s prepared in Sisense), and, because our response variable is binary, we’ll use binary regression — more commonly known as logistic regression. This will let us model the outcome as a number between 0 and 1, indicating the likelihood of conversion. In other words, we’re building a model that helps us predict the chance of success based on historical results. And because you can save this to your repository and re-apply it as new data arrives, it’s totally reusable and will adapt with the data and insights you give it to work with. Which leads us to…
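Putting that all together, the model fit is one call to glm(). A minimal sketch — the column names and toy training data below are invented for illustration, standing in for the 1/0 fields prepared in Sisense:

```r
# Toy training data: three indicator (dummy) columns for seniority,
# with "employee" as the implicit baseline, plus the binary outcome.
train <- data.frame(
  is_manager       = c(0, 1, 0, 0, 1, 0, 0, 0),
  is_upper_manager = c(0, 0, 1, 0, 0, 1, 0, 0),
  is_executive     = c(0, 0, 0, 1, 0, 0, 1, 0),
  conversion       = c(0, 1, 1, 1, 0, 0, 0, 1)
)

# family = binomial turns glm() into a logistic regression:
# the fitted values are probabilities between 0 and 1.
model <- glm(conversion ~ is_manager + is_upper_manager + is_executive,
             data = train, family = binomial)
summary(model)
```

The resulting `model` object is what you save and re-apply as new data arrives — it carries everything the next step needs.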
Step 4: Scoring
Hooray! You’ve built a working model for statistical analysis. Now, how to use this effectively to make predictions on new data?
Luckily, it’s incredibly easy to make predictions in R. Once you’ve built your model using lm() or glm(), you have a model object, and you can use that model object to score new data with the predict function. Under the hood, this multiplies the fitted coefficients against your new data. The predict function takes two arguments: the first is the model object you created earlier; the second is your new data. As new data flows in, all you need to do is take that information about the customer and feed it into the predict function. This runs the information through the model and, based on what you know about this customer’s seniority within their company, gives you a number between 0 and 1. This score tells you the probability of closing the deal based on the authority level of the person you’re talking to.
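Here’s what scoring looks like end to end, as a self-contained sketch — the training data and field names are the same hypothetical ones used above, and `type = "response"` asks predict() for a probability rather than raw log-odds:

```r
# Refit the toy model from Step 3 (hypothetical data and column names).
train <- data.frame(
  is_manager       = c(0, 1, 0, 0, 1, 0, 0, 0),
  is_upper_manager = c(0, 0, 1, 0, 0, 1, 0, 0),
  is_executive     = c(0, 0, 0, 1, 0, 0, 1, 0),
  conversion       = c(0, 1, 1, 1, 0, 0, 0, 1)
)
model <- glm(conversion ~ is_manager + is_upper_manager + is_executive,
             data = train, family = binomial)

# A fresh lead: the contact is an upper manager.
new_lead <- data.frame(is_manager = 0, is_upper_manager = 1, is_executive = 0)

# First argument: the model object. Second: the new data.
# type = "response" returns the probability of conversion, between 0 and 1.
prob <- predict(model, newdata = new_lead, type = "response")
prob
```

Each new row that flows in gets scored the same way — pass it to predict() and read off the closing probability.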
And there you have it! A working prediction model that automatically updates with new data, ensuring you always have the latest business intelligence to hand.