Big Data Performance: Part I
Maximize your cluster. Choose a solid current distribution of your software that all works together. Follow documented best practices.
Maximize your cluster. Choose the right hardware or vendor instance; having more RAM and more CPUs really matters. If you are going to be running TensorFlow, Deeplearning4j, MXNet, PyTorch, or other deep learning packages, then GPUs matter a lot.
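If you are provisioning GPU nodes, verify that the devices and drivers are actually visible on each node before you schedule deep learning work there. A minimal sketch, assuming NVIDIA GPUs with the vendor driver installed:

# Confirm the OS sees the GPU hardware (assumes NVIDIA cards)
lspci | grep -i nvidia

# Show driver version, GPU models, memory, and current utilization
nvidia-smi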
Choose a solid current distribution of your software that all works together. Hortonworks HDP 2.5 includes the latest tested and stable versions of Hadoop, YARN, Spark, Hive, HBase, and other must-have technologies. You need this solid base.
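To confirm what is actually installed on a given node, you can check the selected stack and component versions from the command line. A quick sketch, assuming the standard HDP 2.x tooling is on the path:

# Show which HDP stack version each component is pointed at on this node
hdp-select status

# Confirm individual component versions
hadoop version
hive --version
spark-submit --version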
Follow the documented best practices from Apache, Hortonworks, Microsoft, HPE, Databricks, IBM, and others.
Tune your application cluster frameworks, like Spark. This talk is excellent for boosting Spark performance.
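Much of that tuning happens at submit time. Here is a minimal sketch of submitting a job to YARN with explicit executor sizing and the Kryo serializer; the class name, jar, and all of the numbers are placeholder values to adjust for your own cluster and queue:

# Example values only; size executors to your YARN queue and data volume
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 20 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --class com.example.MyJob \
  my-job.jar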
Make sure your applications are well-written, unit-tested, integration-tested (Hadoop mini-clusters are nice here), tuned, and documented. Applications that are not peer-reviewed or properly tested often have serious flaws when run in parallel. It's easy to forget things in Spark that can result in variables and other work not actually running on the cluster.
Make sure you deploy your apps to the YARN cluster and not just to one node or your local machine. You need the power of the cluster to maximize performance. Check the running instances and the Spark history server to make sure you are using the available JVM RAM and executor instances. It's possible to have 2,000 cores available and run on only four. If you have capacity available in your YARN queue, use it. Administrators need to allocate resources appropriately so that mission-critical jobs get what they need when they run on schedule. Often, though, the cluster is underutilized. Check Grafana, the Spark history server, the YARN UI, logs, and Ambari to see when your cluster is idle, and run your jobs then.
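A few command-line checks will tell you how busy the cluster and your queue actually are. A sketch, assuming the YARN client tools from Hadoop 2.7+ and a queue named "default" (a placeholder; substitute your own queue):

# Applications currently running on the cluster
yarn application -list -appStates RUNNING

# Live view of per-application resource usage (Hadoop 2.7+)
yarn top

# Capacity and current usage for a queue ("default" is a placeholder name)
yarn queue -status default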
Look at other companies and groups with similar use cases and read what they did. Very often, best practices that are vertically specific are shared in the community. Look at this excellent presentation on Spark for Energy.
Reduce, reuse, recycle. Most problems are not unique. Check DZone, GitHub, Hortonworks Community, and other community resources to solve your problem. For example, there are tons of great functions for you to use in Hive, Pig, and Spark in Apache Hivemall.
# Download the Hivemall jar and the DDL script that registers its functions
cd /opt/demo
wget https://github.com/myui/hivemall/releases/download/v0.4.2-rc.2/define-all-as-permanent.hive
wget https://github.com/myui/hivemall/releases/download/v0.4.2-rc.2/hivemall-core-0.4.2-rc.2-with-dependencies.jar
chmod 777 *

# Put the jar in HDFS so Hive can load the permanent functions from it
hadoop fs -mkdir -p /apps/hivemall
hadoop fs -put hivemall-core-0.4.2-rc.2-with-dependencies.jar /apps/hivemall

# From the Hive CLI, point at the jar and register the functions
hive
set hivevar:hivemall_jar=hdfs:///apps/hivemall/hivemall-core-0.4.2-rc.2-with-dependencies.jar;
source /opt/demo/define-all-as-permanent.hive;
CREATE DATABASE IF NOT EXISTS hivemall;
USE hivemall;
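As a quick sanity check that the functions registered, you can call one of the Hivemall UDFs; a sketch, assuming the DDL script above completed without errors:

# Confirm the Hivemall functions are callable from the hivemall database
hive -e "USE hivemall; SELECT hivemall_version();"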