
Big Data Python: 3 Big Data Analytics Tools


It's no secret that Python is frequently used in the world of Big Data. See how you can get started with three Python tools/libraries to tear through your own big data.


In this post, we will discuss three awesome big data Python tools to increase your big data programming skills using production data.

Introduction

In this article, I assume that you are running Python in its own environment using virtualenv, pyenv, or some other variant.

The examples in this article make use of IPython, so make sure you have it installed if you would like to follow along.

$ mkdir python-big-data
$ cd python-big-data
$ virtualenv ../venvs/python-big-data
$ source ../venvs/python-big-data/bin/activate
$ pip install ipython
$ pip install pandas
$ pip install pyspark
$ pip install scikit-learn
$ pip install scipy

Now let's get some data to play around with.

Python Data

Throughout this article, I will use some sample data for the examples.

The Python data that we will be using are actual production logs from this website, collected over a couple of days. This data isn't technically big data yet because it is only about 2 MB in size, but it will work great for our purposes.

I have to beef up my infrastructure a bit in order to get big data sized samples (> 1 TB).

To get the sample data you can use git to pull it down from my public GitHub repo: admintome/access-log-data

$ git clone https://github.com/admintome/access-log-data.git

The data is a simple CSV file: each line represents an individual log entry, the fields are separated by commas, and text fields are wrapped in single quotes:

2018-08-01 17:10,'www2','www_access','172.68.133.49 - - [01/Aug/2018:17:10:15 +0000] "GET /wp-content/uploads/2018/07/spark-mesos-job-complete-1024x634.png HTTP/1.0" 200 151587 "https://dzone.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"'

Here is the schema for a log line, using the field names we will pass to Pandas later: datetime, source, type, log.
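To see how one line splits into those fields, here is a minimal sketch that uses only Python's standard csv module. The sample string is the line shown above with the user agent shortened, and the field names are the ones used in the Pandas example, so treat both as illustration and not as part of the data set.

```python
import csv
import io

# The sample line from above, with the user agent shortened for readability.
sample = (
    "2018-08-01 17:10,'www2','www_access',"
    "'172.68.133.49 - - [01/Aug/2018:17:10:15 +0000] "
    "\"GET /wp-content/uploads/2018/07/spark-mesos-job-complete-1024x634.png HTTP/1.0\" "
    "200 151587 \"https://dzone.com/\" \"Mozilla/5.0 (KHTML, like Gecko)\"'"
)

# quotechar="'" keeps the comma inside the quoted log field from splitting it.
reader = csv.reader(io.StringIO(sample), quotechar="'")
fields = next(reader)

# Field names follow the headers used in the Pandas example.
names = ["datetime", "source", "type", "log"]
record = dict(zip(names, fields))

for name in names:
    print(f"{name}: {record[name][:60]}")
```

The single-quote quote character matters because the raw log text itself contains commas and double quotes.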

Now that we have the data we are going to use, let's check out three big data Python tools.

Because of the complexity of the many operations that can be performed on data, this article will focus on demonstrating how to load our data and get a small sample of the data.

For each tool listed, I will give links to find out more information.

Python Pandas

The first tool we will discuss is Python Pandas. As its website states, Pandas is an open source Python Data Analysis Library.

Let's fire up IPython and do some operations on our sample data.

import pandas as pd

headers = ["datetime", "source", "type", "log"]
df = pd.read_csv('access_logs_parsed.csv', quotechar="'", names=headers)

The load takes about a second. Entering df at the prompt then prints the DataFrame, ending with this summary line:

[6844 rows x 4 columns]

In [3]: 

As you can see, we have about 7,000 rows of data, and Pandas found four columns, which match the schema described above.

Pandas created a DataFrame object representing our CSV file automatically! Let's check out a sample of the imported data with the head() method.

In [11]: df.head()
Out[11]: 
           datetime source        type                                                log
0  2018-08-01 17:10   www2  www_access  172.68.133.49 - - [01/Aug/2018:17:10:15 +0000]...
1  2018-08-01 17:10   www2  www_access  162.158.255.185 - - [01/Aug/2018:17:10:15 +000...
2  2018-08-01 17:10   www2  www_access  108.162.238.234 - - [01/Aug/2018:17:10:22 +000...
3  2018-08-01 17:10   www2  www_access  172.68.47.211 - - [01/Aug/2018:17:10:50 +0000]...
4  2018-08-01 17:11   www2  www_access  141.101.96.28 - - [01/Aug/2018:17:11:11 +0000]...

There is a ton you can do with Python Pandas and Big Data. Python alone is great for munging your data and getting it prepared, and with Pandas you can do the analytics in Python as well. Data scientists typically use Pandas together with IPython to interactively analyze large data sets and gain meaningful business intelligence from that data. Check out the Pandas website for more information.
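As one small example of that kind of munging, the sketch below pulls the client IP, HTTP status, and response size out of the raw log column with a regular expression. The two rows are hypothetical and shortened; they only mimic the layout of the sample file, and the regular expression is an assumption that fits this log format and not a general access-log parser.

```python
import pandas as pd

# Two hypothetical rows in the same four-column layout as the sample file.
rows = [
    ["2018-08-01 17:10", "www2", "www_access",
     '172.68.133.49 - - [01/Aug/2018:17:10:15 +0000] "GET /a.png HTTP/1.0" 200 151587'],
    ["2018-08-01 17:11", "www2", "www_access",
     '162.158.255.185 - - [01/Aug/2018:17:11:02 +0000] "GET /missing HTTP/1.0" 404 512'],
]
df = pd.DataFrame(rows, columns=["datetime", "source", "type", "log"])

# Turn the timestamp strings into real datetime values.
df["datetime"] = pd.to_datetime(df["datetime"])

# Named groups become column names; every extracted value is still a string.
pattern = r'^(?P<ip>\S+) .*?" (?P<status>\d{3}) (?P<size>\d+)'
parts = df["log"].str.extract(pattern)

# Convert the numeric pieces so they can be counted and summed.
parts["status"] = parts["status"].astype(int)
parts["size"] = parts["size"].astype(int)

print(parts)
print(parts["status"].value_counts())
```

Once the fields are separate columns, the usual Pandas operations such as value_counts() or groupby() apply directly.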

PySpark

The next tool we will talk about is PySpark. This is a library from the Apache Spark project for Big Data Analytics.

PySpark gives us a lot of functionality for Analyzing Big Data in Python. It comes with its own shell that you can run from the command line.

$ pyspark

This loads the pyspark shell.

(python-big-data)[email protected]:~/Development/access-log-data$ pyspark
Python 3.6.5 (default, Apr 1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
2018-08-03 18:13:38 WARN Utils:66 - Your hostname, admintome resolves to a loopback address: 127.0.1.1; using 192.168.1.153 instead (on interface enp0s3)
2018-08-03 18:13:38 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-08-03 18:13:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.6.5 (default, Apr 1 2018 05:46:30)
SparkSession available as 'spark'.
>>>

And when you start the shell, you also get a web GUI to see the status of your jobs; simply browse to http://localhost:4040 and you will get the PySpark Web GUI.

Let's use the PySpark Shell to load our sample data.

dataframe = spark.read.format("csv").option("header","false").option("mode","DROPMALFORMED").option("quote","'").load("access_logs.csv")
dataframe.show()

PySpark will give us a sample of the DataFrame that was created.

>>> dataframe.show()
+----------------+----+----------+--------------------+
|             _c0| _c1|       _c2|                 _c3|
+----------------+----+----------+--------------------+
|2018-08-01 17:10|www2|www_access|172.68.133.49 - -...|
|2018-08-01 17:10|www2|www_access|162.158.255.185 -...|
|2018-08-01 17:10|www2|www_access|108.162.238.234 -...|
|2018-08-01 17:10|www2|www_access|172.68.47.211 - -...|
|2018-08-01 17:11|www2|www_access|141.101.96.28 - -...|
|2018-08-01 17:11|www2|www_access|141.101.96.28 - -...|
|2018-08-01 17:11|www2|www_access|162.158.50.89 - -...|
|2018-08-01 17:12|www2|www_access|192.168.1.7 - - [...|
|2018-08-01 17:12|www2|www_access|172.68.47.151 - -...|
|2018-08-01 17:12|www2|www_access|192.168.1.7 - - [...|
|2018-08-01 17:12|www2|www_access|141.101.76.83 - -...|
|2018-08-01 17:14|www2|www_access|172.68.218.41 - -...|
|2018-08-01 17:14|www2|www_access|172.68.218.47 - -...|
|2018-08-01 17:14|www2|www_access|172.69.70.72 - - ...|
|2018-08-01 17:15|www2|www_access|172.68.63.24 - - ...|
|2018-08-01 17:18|www2|www_access|192.168.1.7 - - [...|
|2018-08-01 17:18|www2|www_access|141.101.99.138 - ...|
|2018-08-01 17:19|www2|www_access|192.168.1.7 - - [...|
|2018-08-01 17:19|www2|www_access|162.158.89.74 - -...|
|2018-08-01 17:19|www2|www_access|172.68.54.35 - - ...|
+----------------+----+----------+--------------------+
only showing top 20 rows

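Again we can see that there are four columns in our DataFrame, which match our schema. Because we loaded the file without a header row, Spark named them _c0 through _c3. A DataFrame is simply a tabular representation of the data (in Spark, one that is evaluated lazily and can be distributed across a cluster) and can be thought of as a database table or Excel spreadsheet.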

Now on to our last tool.

Python SciKit-Learn

Any discussion on big data will invariably lead to a discussion about Machine Learning. And, luckily for us, Python developers have plenty of options to make use of Machine Learning algorithms.

Without going into too much detail on Machine Learning, we need to get some data on which to perform machine learning. The sample data I have provided in this article doesn't work well as-is because it is not numerical data. We would need to manipulate the data and convert it to a numerical format, which is beyond the scope of this article. For example, we could group the log entries by time to get a DataFrame with two columns, the minute and the number of logs in that minute:

+------------------+---+
| 2018-08-01 17:10 | 4 |
+------------------+---+
| 2018-08-01 17:11 | 1 |
+------------------+---+

With our data in this form, we could use Machine Learning to predict the number of visitors we are likely to get at some future time. But, as I mentioned, that is outside the scope of this article.

Luckily for us, SciKit-Learn comes with some sample data sets! Let's load some sample data and see what we can do.

In [1]: from sklearn import datasets

In [2]: iris = datasets.load_iris()

In [3]: digits = datasets.load_digits()

In [4]: print(digits.data)
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]

This loads two datasets, iris and digits, that are commonly used to try out classification algorithms.
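To give a feel for what classification looks like, here is a minimal sketch that trains a classifier on the digits data and scores it on held-out samples. The choice of SVC, its gamma and C values, and the 75/25 split are assumptions made for illustration, not recommendations from this article.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()

# Hold out a quarter of the images so the score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# SVC with these settings is an assumption; any classifier could be used here.
clf = SVC(gamma=0.001, C=100.0)
clf.fit(X_train, y_train)

# score() returns the fraction of held-out images classified correctly.
accuracy = clf.score(X_test, y_test)
print(f"accuracy on held-out digits: {accuracy:.3f}")
```

Each row of digits.data is a flattened 8x8 image, which is why this data set works without the numerical conversion our log data would need.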

Check out the SciKit-Learn Basic Tutorial for more information.

Conclusion

Given these three Python Big Data tools, Python is a major player in the Big Data game along with R and Scala.

I hope that you have enjoyed this article. If you have then please share it. Also please comment below.

If you are new to Big Data and would like to know more then be sure to register for my free Introduction to Big Data course at AdminTome Online-Training.

Also be sure to see other great Big Data articles on AdminTome Blog.


Topics:
python, big data, pyspark, scikit-learn, pandas, tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
