Big Data Performance: Part I


Maximize your cluster. Choose a solid current distribution of your software that all works together. Follow documented best practices.


Maximize your cluster. Choose the right hardware or vendor instance; having more RAM and more CPUs really matters. If you are going to be running TensorFlow, Deeplearning4j, MXNet, PyTorch, or other deep learning packages, then GPUs matter a lot.

Choose a solid current distribution of your software that all works together. Hortonworks HDP 2.5 includes the latest tested and stable versions of Hadoop, YARN, Spark, Hive, HBase, and other must-have technologies. You need this solid base.

Follow Apache, Hortonworks, Microsoft, HPE, Databricks, IBM, and other best practices that are documented.

Tune your application cluster frameworks like Spark. This talk is excellent for boosting Spark performance.
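As a minimal sketch of what explicit tuning looks like, here is a spark-submit invocation with resource settings spelled out instead of left at the defaults. The jar path, class name, and sizes are assumptions for illustration; the flags themselves are standard Spark options.

```shell
# Hypothetical example: submit a Spark app to YARN with explicit resources.
# Paths, class name, and sizes are placeholders -- adjust for your cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 20 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.sql.shuffle.partitions=200 \
  --class com.example.MyApp \
  /opt/demo/myapp.jar
```

Setting executor counts and memory explicitly is what lets you use the whole cluster rather than whatever small default your distribution ships with.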

Make sure your applications are well-written, unit-tested, integration-tested (Hadoop mini-clusters are nice here), tuned, and documented. Often, applications that are not peer reviewed or properly tested will have serious flaws when running in parallel. It's easy to overlook details in Spark, such as how closures and variables are serialized to executors, that can result in code not actually running on the cluster.

Make sure you deploy your apps to the YARN cluster and not just to one node or your local machine. You need the power of the cluster to maximize performance. Check the running instances and Spark history to make sure you used the available JVM RAM and instances. It's possible you have 2,000 cores available and are running on only four. If you have it available in your YARN queue, use it. Administrators need to make sure resources are allocated appropriately so that mission-critical jobs get what they need when they run as scheduled. However, the cluster is often underutilized. Check Grafana, Spark history, the YARN UI, logs, and Ambari to see when your cluster is idle. Run then!
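A quick way to spot idle capacity from the command line is the standard YARN CLI. These commands ship with Hadoop distributions; the queue name here is an assumption.

```shell
# List running YARN applications and see what is actually consuming the cluster.
yarn application -list -appStates RUNNING

# Show per-node capacity versus what is currently allocated.
yarn node -list -showDetails

# Check configured capacity and current utilization of a queue
# ("default" is a placeholder -- use your own queue name).
yarn queue -status default
```

If the node report shows most memory and vcores unallocated while your job crawls, your application is not asking for the resources it could be using.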

Look at other companies and groups with similar use cases and read what they did. Very often, best practices that are vertically specific are shared in the community. Look at this excellent presentation on Spark for Energy.

Reduce, reuse, recycle. Most problems are not unique. Check DZone, GitHub, Hortonworks Community, and other community resources to solve your problem. For example, there are tons of great functions for you to use in Hive, Pig, and Spark in Apache HiveMall.

To install HiveMall, check out this quick primer. The official Apache releases are not out yet, but you can download this stable release. Install this in your Hadoop cluster.

cd /opt/demo
wget https://github.com/myui/hivemall/releases/download/v0.4.2-rc.2/define-all-as-permanent.hive
wget https://github.com/myui/hivemall/releases/download/v0.4.2-rc.2/hivemall-core-0.4.2-rc.2-with-dependencies.jar
chmod 777 *

hadoop fs -mkdir -p /apps/hivemall
hadoop fs -put hivemall-core-0.4.2-rc.2-with-dependencies.jar /apps/hivemall

Then, launch the Hive CLI:

hive

set hivevar:hivemall_jar=hdfs:///apps/hivemall/hivemall-core-0.4.2-rc.2-with-dependencies.jar;

source /opt/demo/define-all-as-permanent.hive;

USE hivemall;

See the HiveMall documentation for details on installing it in your Hive environment.
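Once the functions are registered, you can sanity-check the install from a fresh Hive session. This is a sketch that assumes the install steps above succeeded; hivemall_version() is a real HiveMall UDF.

```shell
# Verify the permanent HiveMall functions are visible in a new session.
# Should print the installed HiveMall version if the setup worked.
hive -e "USE hivemall; SELECT hivemall_version();"
```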



Opinions expressed by DZone contributors are their own.
