When I was a kid, my parents would sometimes let me stay up late to watch The Tonight Show With Johnny Carson. I loved his bit as Carnac the Magnificent, who could predict unknown answers to unknown questions. Before the age of the internet, Carnac, with his feathered ruby-red turban, was America’s first big data scientist.
The spirit of predicting and uncovering the unknown is alive and well. In this post, based on nearly a decade of IT monitoring experience as well as user and partner feedback, I thought I’d share tips on how to accurately predict what your Hadoop analytics engine is up to — before you experience an outage.
1. Automate Your Metric Collection and Discovery
There's a lot going on in a Hadoop cluster, and attempting to reach each node or component individually and manually to gather data doesn't scale. In my previous Hadoop post, I listed examples of different technology ecosystem metrics that are important to Hadoop performance. One way to bring all of those metrics into your monitoring platform is through auto-discovery. During each data collection cycle, you open an HTTP connection to the NameNode and ResourceManager (and, as needed, to individual DataNodes and NodeManagers) to retrieve resource metrics for HDFS and YARN. You can set policies for how deep the collection goes and for the interval between auto-discovery runs, say every five minutes. You can collect performance data, relationships (associations), and events for the cluster, the NameNodes and DataNodes, the ResourceManager, and the NodeManagers.
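To make the collection cycle concrete, here is a minimal sketch of polling HDFS and YARN metrics over HTTP. It uses the NameNode's JMX endpoint and the ResourceManager's REST API; the hostnames are hypothetical, the ports shown are the Hadoop 3 defaults (older versions differ), and the exact metric names can vary by Hadoop version.

```python
"""Sketch of one auto-discovery collection cycle for HDFS and YARN.
Hostnames are hypothetical; ports are Hadoop 3 defaults (9870, 8088)."""
import json
from urllib.request import urlopen

NAMENODE = "http://namenode.example.com:9870"            # hypothetical host
RESOURCE_MANAGER = "http://resourcemanager.example.com:8088"
POLL_INTERVAL_SECONDS = 300  # a five-minute collection cycle


def extract_fsnamesystem(jmx_payload: dict) -> dict:
    """Pull key HDFS health metrics out of a NameNode /jmx response."""
    beans = jmx_payload.get("beans", [])
    fs = next(b for b in beans
              if b.get("name") == "Hadoop:service=NameNode,name=FSNamesystem")
    return {
        "capacity_used": fs["CapacityUsed"],
        "capacity_remaining": fs["CapacityRemaining"],
        "under_replicated_blocks": fs["UnderReplicatedBlocks"],
        "blocks_total": fs["BlocksTotal"],
    }


def collect_once() -> dict:
    """One collection cycle: open HTTP connections and gather metrics."""
    query = "Hadoop:service=NameNode,name=FSNamesystem"
    with urlopen(f"{NAMENODE}/jmx?qry={query}") as resp:
        hdfs = extract_fsnamesystem(json.load(resp))
    with urlopen(f"{RESOURCE_MANAGER}/ws/v1/cluster/metrics") as resp:
        yarn = json.load(resp)["clusterMetrics"]
    return {"hdfs": hdfs, "yarn": yarn}


if __name__ == "__main__":
    print(collect_once())  # would run every POLL_INTERVAL_SECONDS in a loop
```

A real collector would loop on the interval, walk the discovered node list, and push the results into the monitoring platform; this sketch only shows the per-cycle HTTP fetch and parse.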
2. Centralize Monitoring With Dashboards for At-a-Glance Health Indicators
I’ve already covered why you should use a centralized monitoring platform. To recap: a comprehensive view of thousands of health, performance, and availability metrics for Hadoop clusters and nodes, as well as historical details for in-depth troubleshooting, helps you pinpoint and resolve issues faster. Given the size and scale of these clusters, it still surprises me that teams don’t have enterprise-class monitoring for Hadoop. The monitoring dashboards truly bring your data layer into this decade.
3. Add Alerts on Critical Metrics and Recommendations to Ensure Compliance
Beyond insight into more metrics, dashboards, and reports, administrators should use intelligent alerts and recommendations. These, along with embedded expertise on how to resolve the issue, improve troubleshooting and optimize performance across the Hadoop environment. Detailed alerts and notifications enable administrators to pinpoint the source of problems before they have a major impact on your analytics applications.
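The idea of alert rules with embedded remediation advice can be sketched as simple threshold checks. The metric names, thresholds, and remediation hints below are illustrative, not a real platform's rule set:

```python
"""Sketch of threshold alerting on Hadoop metrics with remediation hints.
Metric names and thresholds are illustrative assumptions."""

# Each rule: metric name, comparison, threshold, and a remediation hint
# (the "embedded expertise" attached to the alert).
ALERT_RULES = [
    ("hdfs_capacity_used_pct", ">", 85.0,
     "Add DataNodes or rebalance before HDFS fills up."),
    ("under_replicated_blocks", ">", 0,
     "Check for dead DataNodes; HDFS re-replicates once they return."),
    ("yarn_pending_apps", ">", 20,
     "Queue backlog: review scheduler queues and NodeManager capacity."),
]


def evaluate_alerts(metrics: dict) -> list:
    """Return (metric, value, hint) for every rule that fires."""
    fired = []
    for name, op, threshold, hint in ALERT_RULES:
        value = metrics.get(name)
        if value is None:
            continue  # metric not collected this cycle
        if (op == ">" and value > threshold) or \
           (op == "<" and value < threshold):
            fired.append((name, value, hint))
    return fired
```

Running `evaluate_alerts({"hdfs_capacity_used_pct": 92.5, "under_replicated_blocks": 0})` fires only the capacity rule, handing the administrator both the offending value and a next step.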
4. Share Reports Among Teams and Senior Management
It’s important to avoid blind spots. Reports for Hadoop performance should cover capacity, utilization, health, availability, and OS and Java configuration. If your monitoring platform can export and distribute reports out of the box to key stakeholders like DevOps, DBAs, IT admins, and executives, even better!
5. Monitor Your Cloud and Virtualized Infrastructure
Often, an underlying problem in the virtual or cloud layer can become a bigger issue, causing performance bottlenecks — or worse, downtime. Make sure your Hadoop monitoring platform also collects data on your infrastructure and relates it to the relevant Hadoop components.
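Relating infrastructure data to Hadoop components amounts to keeping a topology map: which Hadoop roles run on which virtual or cloud host. A minimal sketch, with hypothetical hostnames and a hand-built mapping standing in for what a monitoring platform would discover automatically:

```python
"""Sketch of correlating infrastructure-layer events with the Hadoop
components they affect. Hostnames and topology are hypothetical."""

# Which Hadoop roles run on each virtual/cloud host (discovered topology).
HOST_TOPOLOGY = {
    "vm-host-01": ["NameNode", "ResourceManager"],
    "vm-host-02": ["DataNode", "NodeManager"],
    "vm-host-03": ["DataNode", "NodeManager"],
}


def impacted_components(infra_event: dict) -> list:
    """Map an infrastructure event (e.g. host CPU saturation, datastore
    latency) to the Hadoop roles running on the affected host."""
    return HOST_TOPOLOGY.get(infra_event["host"], [])
```

With this mapping, a hypervisor alert on `vm-host-01` is immediately flagged as a threat to the NameNode and ResourceManager, i.e., to the whole cluster, rather than surfacing as an unrelated infrastructure ticket.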
For any organization, Hadoop offers a scalable, flexible big data ecosystem. While the infrastructure is appealing for data management, it can be challenging to monitor, making it difficult to see how your Hadoop clusters are performing across the ecosystem and, in turn, to ensure data integrity. Luckily, you don’t have to be Carnac to have visibility into key metrics and health monitoring for the likes of HDFS and YARN. Just make sure inside the envelope is the license key to a centralized monitoring platform.