
How Auth0’s Data Team uses R and Python


In this article, we discuss how Auth0's Data team uses R and Python for analysis, reporting, machine learning, and deep learning.


Our Data team is responsible for crunching, reporting, and serving data. The team also does data integrations with other systems and creates machine and deep learning models.

With this post, we intend to share our favorite tools, which have proven to run on billions of records. Scaling processes in real-world scenarios is a hot topic among newcomers to data science.

R or Python?

Well... both! 

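R is a GNU project designed as a language for statistical computing. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the early 1990s as an implementation of the S language, which was developed at Bell Laboratories.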

Python, first released in 1991 by Guido van Rossum, is a general-purpose language with a focus on code readability.

Both R and Python are highly extensible through packages. We mainly use R for our data processes and ML projects, and Python for integrations and Deep Learning projects.

Our stack is R with RStudio, and Python 3 with Jupyter Notebook.   


R vs Python

RStudio is an open-source IDE that can browse the data and objects created during a session, create plots, and debug code, among many other features. It also offers an enterprise-ready edition.

Jupyter is an open-source notebook environment designed to work with Julia, Python, and R. Today, it is widely used by data scientists to share their analyses. Recently, Google created "Colab," a Jupyter Notebook environment that runs in the cloud and stores notebooks in Google Drive.

Is R Capable of Running In Production?

Yes. We run several heavy data preparations and predictive models every day, every hour, and every few minutes with R. 

We use Airflow, an open-source project created by Airbnb, as an orchestrator. It is a robust, Python-based task scheduler that lets us schedule processes, assign priorities, create rules, keep detailed logs, and more.

For development, we still run scripts in the form Rscript my_awesome_script.R. Airflow lets us run chained processes with many complex dependencies, monitors the current state of each one, and fires alerts to Slack if anything goes wrong. This is ideal for running import jobs that populate a data warehouse with fresh data every day.
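To make the idea of chained processes concrete, here is a minimal sketch of dependency-ordered execution in plain Python. It does not use Airflow's API; the task names and the assumption that each task lives in a script called `<task>.R` are hypothetical and only serve to illustrate how a scheduler decides what can run after what.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "import_events": set(),
    "import_accounts": set(),
    "prepare_features": {"import_events", "import_accounts"},
    "score_model": {"prepare_features"},
}


def build_command(task_name):
    """Return the Rscript invocation for a task (script names are assumed)."""
    return ["Rscript", f"{task_name}.R"]


def execution_plan(deps):
    """Order tasks so that every task comes after all of its dependencies."""
    return [build_command(task) for task in TopologicalSorter(deps).static_order()]


plan = execution_plan(dependencies)
for command in plan:
    # A real runner would call subprocess.run(command, check=True) here
    # and send an alert if the script exits with an error.
    print(" ".join(command))
```

An orchestrator such as Airflow adds what this sketch leaves out: schedules, retries, logging, monitoring, and alerting.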

Do We Have a Data Warehouse?

Yes, and it's huge! It runs on Amazon Redshift, a suitable option if scaling is a priority. Visit their website to learn more about it.


Amazon Redshift

R connects directly to Amazon Redshift thanks to the rauth0 package, which uses the redshiftTools package, developed by Pablo Seibelt.   

Generally, data is uploaded from R to Amazon Redshift using redshiftTools. The data can come either from plain files or from DataFrames created during the R session.

We use Python to import and export unstructured data, since R does not currently have useful libraries for that task.

We have experimented with JSON libraries in R, but the result is much worse than using Python in this scenario. For example, using RJSONIO, a dataset is automatically transformed into an R DataFrame, with little control over how the transformation is done. This is only useful for very simple JSON structures; anything more complex is very difficult to manipulate in R compared to Python.
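As an illustration of the kind of control Python gives over this transformation, the sketch below flattens a nested JSON record into a single-level row using only the standard library. The `flatten` helper, the dotted key names, and the sample record are assumptions made for this example, not part of our pipeline.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level; lists are kept as JSON strings."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing keys with the parent name.
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # Explicit choice: store arrays as text instead of exploding rows.
            flat[full_key] = json.dumps(value)
        else:
            flat[full_key] = value
    return flat


# Made-up sample record, for illustration only.
raw = '{"user": {"id": 7, "geo": {"country": "AR"}}, "tags": ["a", "b"], "active": true}'
row = flatten(json.loads(raw))
print(row)
```

Each decision (how to name nested keys, what to do with arrays) is explicit in the code, which is the control that is hard to get from an automatic conversion to a DataFrame.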

Data Preparation Using R

We have two scenarios: data preparation for data engineering and data preparation for machine learning/AI.

One of the biggest strengths of R is the tidyverse, a set of packages built by many talented developers, some of whom work at RStudio, Inc. The packages share a common API and a common philosophy for working with data. We will cover an example in the next section.


Dplyr and Tidyverse

The tidyverse, especially the dplyr package, contains a set of functions that make exploratory data analysis and data preparation quite comfortable.

For certain data preparation and visualization tasks, we use the funModeling package. It was the seed for an open-source book I published some time ago: Data Science Live Book.

The book covers some of the good practices we follow for deploying models in production, dealing with missing data, handling outliers, and more.

Does R Scale?

One of the key advantages of dplyr is that it can be run on databases, thanks to another package with a pretty similar name: dbplyr.   

This way, we write R syntax (dplyr), and it is "automagically" converted to SQL that then runs in production. In some cases, the conversion from R to SQL cannot be done automatically. For those cases, we can still mix SQL syntax into our R code.

Take the following dplyr syntax as an example: 
```r
flights %>%
  group_by(month, day) %>%
  summarise(delay = mean(dep_delay))
```

Generates:   

```sql
SELECT month, day, AVG(dep_delay) AS delay
FROM nycflights13::flights
GROUP BY month, day
```

This way, dbplyr makes it transparent to the R user whether they are working with objects in RAM or in a remote database.

Many people don't know that key pieces of the R ecosystem are written in C++, integrated through the Rcpp package.
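To see what the generated query computes, here is a small sketch that runs the same aggregation against an in-memory SQLite database from Python. The table is simply named `flights`, the rows are made up for illustration, and an `ORDER BY` is added so the output order is deterministic; none of this reflects the real nycflights13 data or our Redshift setup.

```python
import sqlite3

# Made-up rows (month, day, dep_delay) used only to illustrate the query.
rows = [
    (1, 1, 10.0),
    (1, 1, 20.0),
    (1, 2, 5.0),
    (2, 1, 0.0),
    (2, 1, 30.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (month INTEGER, day INTEGER, dep_delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)

# Same shape as the SQL produced from the dplyr pipeline above.
query = """
SELECT month, day, AVG(dep_delay) AS delay
FROM flights
GROUP BY month, day
ORDER BY month, day
"""
result = conn.execute(query).fetchall()
conn.close()

for month, day, delay in result:
    print(month, day, delay)
```

The database does the grouping and averaging, and only the small summarized result travels back to the client, which is why pushing the work down to SQL scales well.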

How Do We Share the Results?

Mostly in Tableau. We have some integrations with Salesforce. In addition, we do have some reports deployed in Shiny, especially the ones that need complex customer interaction.

Shiny allows custom reports to be built using simple R code without having to learn JavaScript, Python, or other frontend and backend languages. Through a "reactive" interface, the user can input parameters, and the Shiny application reacts to them and redraws the report.

In contrast with tools like Tableau, Domo, and Power BI, which are more "drag and drop," the programmatic nature of Shiny apps lets developers build almost anything they can imagine, including things that are difficult or impossible in other tools.


For ad hoc reports (HTML), we use R Markdown, which shares some functionality with Jupyter Notebook. It lets us write a script containing an analysis and render it as a dashboard, a PDF report, a web-based report, or even a book!

Machine Learning/AI

We use both R and Python. For Machine Learning projects, we mainly use the caret package in R. It provides a high-level interface to many machine learning algorithms, as well as to common tasks in data preparation, model evaluation, and hyperparameter tuning.
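The sketch below illustrates, in plain Python, the general idea behind resampling-based tuning that a package like caret automates: evaluate each candidate hyperparameter value with k-fold cross-validation and keep the one with the lowest error. It is not caret's API; the toy k-nearest-neighbours model, the function names, and the synthetic data are all assumptions made for this example.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and split them into n_folds roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)


def cv_rmse(data, k, n_folds=5, seed=0):
    """Average RMSE of k-NN regression over the held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(data), n_folds, seed):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        errors = [(knn_predict(train, data[i][0], k) - data[i][1]) ** 2
                  for i in held_out]
        scores.append(mean(errors) ** 0.5)
    return mean(scores)


def tune(data, grid, n_folds=5):
    """Return the grid value with the lowest cross-validated RMSE."""
    results = {k: cv_rmse(data, k, n_folds) for k in grid}
    best = min(results, key=results.get)
    return best, results


# Synthetic data: y = 2x plus a small alternating offset.
data = [(x, 2 * x + (1 if x % 2 else -1)) for x in range(20)]
best_k, scores = tune(data, grid=[1, 2, 5, 10])
print(best_k, scores)
```

In caret, the equivalent steps are handled by `train()` together with a resampling specification from `trainControl()` and a tuning grid, so the loop above never has to be written by hand.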

For Deep Learning, we use Python, specifically Keras with TensorFlow as the backend. Keras is an API for building complex neural networks. These can easily scale by training them in the cloud, on services like AWS.

Nowadays, we are also doing some experiments with the fastai library for NLP problems.   

Summing up!

Open-source languages are leading the way in data. R and Python have strong communities, and there are free, top-notch resources for learning them.

Here, we wanted to share the not-so-common approach of using R for data engineering tasks, along with our favorite R and Python libraries, how we share results, and some of the practices we follow every day.

We think the most important stages in a data project are data analysis and data preparation. Choosing the right approach can save a lot of time and help the project scale.

We hope this post encourages you to try some of the suggested technologies and rock your data projects! 

Find me on LinkedIn and Twitter. Any questions? Leave them in the comments.



Published at DZone with permission of Pablo Casas, DZone MVB. See the original article here.
