
Using Python for Big Data Workloads (Part 2)


Check out this continuation of the series on how, where, and why to use Python for Machine Learning, Deep Learning, and Big Data workloads.


Why should you use Python for Big Data workloads? We have already discussed a few reasons, but here are several more. Some surveys show Python nearing 50% adoption as the language of choice for Machine Learning.

  1. Deep Learning: TensorFlow, Keras, and PyTorch.

  2. OpenCV Python bindings.

  3. NLTK.

  4. PySpark.

  5. Apache Arrow, Parquet, and other project support.

  6. Apache Beam 2.0 support for Python.

  7. Speech recognition.

  8. API support.

  9. scikit-learn and other cool Machine Learning libraries.

  10. Utilities abound, just a pip install away.

Here's a cool GitHub example of using Keras with OpenCV and Python for face detection.

Step one is to install OpenCV with Python.
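If you don't want to build from source, pip install opencv-python pulls in prebuilt wheels on most platforms. Here's a minimal face-detection sketch (not the linked GitHub project) using the Haar cascade that ships with OpenCV; photo.jpg is a hypothetical local image:

# A minimal face-detection sketch with OpenCV's bundled Haar cascade.
# Assumes opencv-python is installed; "photo.jpg" is a hypothetical local image.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect faces and draw a green box around each one.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", img)
print("Detected %d face(s)" % len(faces))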

There's big news from Google: Apache Beam 2.0 has been released, and Python is now supported. You can do streaming with Flink, Spark, and more using Python:

pip install apache-beam 

After a simple pip install, you can run Beam jobs:

python -m apache_beam.examples.wordcount --input MANIFEST.in --output counts 
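For a sense of what the Python SDK itself looks like, here's a minimal word-count pipeline sketch on the default DirectRunner; words.txt and the counts output prefix are hypothetical names:

# A minimal Apache Beam word-count sketch (Python SDK, DirectRunner).
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("words.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "Pair" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: "%s: %d" % kv)
     | "Write" >> beam.io.WriteToText("counts"))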

Check out the details on speech recognition with Python, Python support for Apache Arrow and Parquet, and some cool Spark SQL and UDF code written in Python.
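As a taste of the Spark SQL piece, here's a short PySpark UDF sketch of my own (a minimal example, assuming a local SparkSession rather than the code linked above):

# A small PySpark UDF sketch: uppercase a column with a plain Python function.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("python-udf-example").getOrCreate()

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Wrap an ordinary Python lambda as a Spark SQL UDF.
upper_udf = udf(lambda s: s.upper(), StringType())
df.withColumn("name_upper", upper_udf(df["name"])).show()

spark.stop()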

OpenCV has a great Python library and tons of fun examples that work with robots, cars, and drones.

Cool Python image utilities abound for all types of graphics and image manipulation.
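For example, Pillow (the maintained PIL fork) handles most everyday image chores; this sketch assumes a hypothetical local input.jpg:

# A tiny Pillow sketch: grayscale conversion plus a thumbnail resize.
from PIL import Image

img = Image.open("input.jpg")
thumb = img.convert("L").resize((128, 128))
thumb.save("thumbnail.png")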

APIs are everywhere for Python! Here's an example using Spotify. There are also libraries for Facebook, Twitter, Instagram, Google services, Amazon services, Microsoft services, and tons of other feeds and services.
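As a sketch of what that looks like, here's an artist search against the Spotify Web API via the spotipy wrapper; the client ID and secret are placeholders you would get from the Spotify developer dashboard:

# A spotipy sketch: search for an artist using client-credentials auth.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

auth = SpotifyClientCredentials(client_id="YOUR_CLIENT_ID",
                                client_secret="YOUR_CLIENT_SECRET")
sp = spotipy.Spotify(client_credentials_manager=auth)

results = sp.search(q="Radiohead", type="artist", limit=1)
for artist in results["artists"]["items"]:
    print(artist["name"], artist["followers"]["total"])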

TensorFlow, NLTK, and Stanford CoreNLP all have Python interfaces, and TextBlob offers many handy utilities and helpers. They're all easy to install, and most are well-documented.
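Here's a quick TextBlob sketch (it assumes the corpora have been fetched with python -m textblob.download_corpora):

# A quick TextBlob sketch: sentiment and noun phrases from a sentence.
from textblob import TextBlob

blob = TextBlob("Python makes big data workloads surprisingly pleasant.")
print(blob.sentiment)      # polarity and subjectivity scores
print(blob.noun_phrases)   # noun phrases found in the text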

Python also has scikit-learn, which packs in a ton of great Machine Learning goodies.
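A minimal sketch of the workflow: train and score a random forest on the built-in iris dataset.

# A minimal scikit-learn sketch: random forest on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))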

Python runs everywhere: Windows, macOS, Linux, and lots of devices. Here's an example on an ASUS Tinker Board.

There are a lot of Python libraries that run everywhere. Here are a few I recommend installing on every platform (a single pip command for the whole set follows the list):

  • NumPy.

  • SciPy.

  • NLTK.

  • Wheel.

  • Pandas.

  • Matplotlib.

  • PyTorch.

  • TensorFlow.

  • TextBlob.

  • spaCy.
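One pip command covers the whole list (using the package names as published on PyPI; note that PyTorch installs as torch, and prebuilt wheels may not exist for every device):

pip install numpy scipy nltk wheel pandas matplotlib torch tensorflow textblob spacy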

