
Big Data Performance: Part I

Maximize your cluster. Choose a solid current distribution of your software that all works together. Follow documented best practices.


Maximize your cluster. Choose the right hardware or vendor instance; having more RAM and more CPUs really matters. If you are going to be running TensorFlow, Deeplearning4j, MXNet, PyTorch, or other deep learning packages, then GPUs matter a lot.
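Before committing to an instance type, it's worth verifying what a candidate node actually offers. A minimal sketch using standard Linux tools (nvidia-smi is only present where NVIDIA drivers are installed):

# Cores, memory, and disk on a candidate worker node
nproc
free -g
df -h

# List attached GPUs, if any (requires NVIDIA drivers)
nvidia-smi -L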

Choose a solid current distribution of your software that all works together. Hortonworks HDP 2.5 includes the latest tested and stable versions of Hadoop, YARN, Spark, Hive, HBase, and other must-have technologies. You need this solid base.
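On an HDP node, you can confirm which component versions the stack actually ships. A quick check, assuming the hdp-select tool that HDP installs:

# Stack versions installed on this node (HDP-specific tool)
hdp-select versions

# Which version each component (hadoop, hive, spark, ...) points to
hdp-select status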

Follow Apache, Hortonworks, Microsoft, HPE, Databricks, IBM, and other best practices that are documented.

Tune your application's cluster frameworks, like Spark. This talk is excellent for boosting Spark performance.
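As a hedged sketch of what that tuning looks like at submit time, here are common Spark settings; the class and jar names are hypothetical placeholders, and 200 shuffle partitions is only a starting point, not a recommendation:

# Kryo is usually faster than Java serialization; shuffle partitions
# should roughly match the parallelism your cluster can deliver.
# com.example.MyApp and myapp.jar are hypothetical placeholders.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=200 \
  myapp.jar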

Make sure your applications are well-written, unit-tested, integration-tested (Hadoop mini-clusters are nice here), tuned, and documented. Often, applications that are not peer-reviewed or properly tested will have serious flaws when running in parallel. It's easy to forget that in Spark, closures are shipped to executors, so variables and other items you reference may not actually run on the cluster the way you expect.

Make sure you deploy your apps to the YARN cluster and not just to one node or your local machine. You need the power of the cluster to maximize performance. Check the running instances and the Spark history server to make sure you used the available JVM RAM and instances. It's possible that you have 2,000 cores available and are running on only four. If you have it available in your YARN queue, use it. Administrators need to make sure that resources are allocated appropriately so that mission-critical jobs get what they need when they run as scheduled. Often, though, the cluster is underutilized. Check Grafana, the Spark history server, the YARN UI, logs, and Ambari to see when your cluster is idle, and run your jobs then!
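To see what the cluster is doing right now and to explicitly claim more of your queue, here is a sketch using standard YARN and Spark commands (the executor counts and sizes are illustrative, not recommendations):

# See what is currently running and how busy the nodes are
yarn application -list -appStates RUNNING
yarn node -list

# Ask for more of the queue explicitly (numbers are illustrative)
spark-submit \
  --master yarn \
  --num-executors 50 \
  --executor-cores 4 \
  --executor-memory 8g \
  myapp.jar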

Look at other companies and groups with similar use cases and read what they did. Very often, industry-specific best practices are shared in the community. Look at this excellent presentation on Spark for Energy.

Reduce, reuse, recycle. Most problems are not unique. Check DZone, GitHub, Hortonworks Community, and other community resources to solve your problem. For example, there are a ton of great functions for you to use in Hive, Pig, and Spark in Apache HiveMall.

To install HiveMall, check out this quick primer. The official Apache releases are not out yet, but you can download this stable release. Install this in your Hadoop cluster.

# Download the HiveMall function definitions and jar
cd /opt/demo
wget https://github.com/myui/hivemall/releases/download/v0.4.2-rc.2/define-all-as-permanent.hive
wget https://github.com/myui/hivemall/releases/download/v0.4.2-rc.2/hivemall-core-0.4.2-rc.2-with-dependencies.jar
chmod 777 *

# Copy the jar to HDFS so Hive can load it from any node
hadoop fs -mkdir -p /apps/hivemall
hadoop fs -put hivemall-core-0.4.2-rc.2-with-dependencies.jar /apps/hivemall

# Launch the Hive CLI
hive

-- Point Hive at the jar in HDFS
set hivevar:hivemall_jar=hdfs:///apps/hivemall/hivemall-core-0.4.2-rc.2-with-dependencies.jar;

-- Create and switch to the database first, so the sourced script
-- registers its permanent functions there
CREATE DATABASE IF NOT EXISTS hivemall;
USE hivemall;

source /opt/demo/define-all-as-permanent.hive;
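A quick sanity check that the install worked; hivemall_version() is one of the UDFs the sourced script defines, though the exact output varies by release:

# Back at the shell: confirm the permanent functions registered
hive -e "USE hivemall; SELECT hivemall_version();"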

See the HiveMall documentation for details on installing its functions in your Hive space.

