
HDF 3.0 for Utilities (Part 1)

See how utilities can benefit from new streaming technologies, based on what I learned working at a real-time utilities monitoring startup.

· Big Data Zone ·
Utility companies have unique needs based on their infrastructure, equipment, and large SAP installations. This is common across electric, gas, water, solar, and steam. When I worked at a real-time utilities monitoring startup, we saw first-hand how difficult it is to install devices at various types of plants, field locations, offices, warehouses, and other exotic locations.

Often, the most valuable data is locked away in specialized SCADA systems and is not available to central IT for machine learning, predictive analytics, or even basic reporting. In many cases, a manual export is the only source of data, and it yields only occasional, disconnected summaries.

Today's enterprises, including utilities, need real-time access to all streams of data. No one wants to wait until Amazon and/or Google become utilities. It's time to get your data, analyze it mid-stream, and land it for machine learning and deep learning.

One use case is ingesting data from drones that inspect hard infrastructure. This data consists of images, GPS coordinates and other metadata encoded inside the images, and often readings from additional sensors for LIDAR, temperature, air pressure, and humidity. For a detailed solution, see my talk at a recent Oracle conference and this meetup presentation. In one part of the flow, built with Apache NiFi, I analyze each image with TensorFlow image recognition. This could be expanded to use im2txt from TensorFlow, which generates a short text description of the image. That is useful for reporting and for anomaly detection.

Another use case is Twitter and other social media data, which can be integrated and correlated with other sources via location. This data is often used only for sentiment analysis. For utilities, it can also be used for real-time alerting and for replying to reported outages. Using machine learning, you can determine how confident you should be in a tweet's validity, as there is a lot of noise, liars, fakers, bots, and garbage in social media streams.

One method I recommend is using supervised machine learning to create white, gray, and black lists of tweeters. Tweets from anyone on the black list are filtered out and ignored entirely. The white list holds high-priority tweeters such as first responders, government agencies, professional news media, and other reputable sources. The gray list is for normal people who fit the profile, location, and characteristics of a legitimate reporter of an issue. You can also match social media user IDs and information against internal customer data to see whether your own customers are reporting an issue, and send those reports straight to customer service. I would recommend that all utilities add social media account information to the billing/status portal profiles of all customers, both corporate and residential.
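
The list-based triage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the handles, the `CUSTOMER_HANDLES` lookup, the `Triage` record, and the `triage` function are all hypothetical, and the lists are hard-coded here where a real flow would populate them from a supervised model.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lists. In a real flow these would be produced and refreshed
# by a supervised model, not hard-coded.
WHITE_LIST = {"city_fire_dept", "county_emergency_mgmt"}
GRAY_LIST = {"local_resident_42"}
BLACK_LIST = {"spam_bot_9000"}

# Hypothetical link between social media handles and internal customer IDs.
CUSTOMER_HANDLES = {"jane_doe_nj": "CUST-1001"}


@dataclass
class Triage:
    action: str                 # "drop", "alert", "customer_service", or "review"
    priority: str               # "none", "high", "normal", or "low"
    customer_id: Optional[str]  # set when the handle matches a known customer


def triage(handle: str) -> Triage:
    """Decide what to do with a tweet based on who sent it."""
    if handle in BLACK_LIST:
        return Triage("drop", "none", None)
    customer_id = CUSTOMER_HANDLES.get(handle)
    if handle in WHITE_LIST:
        return Triage("alert", "high", customer_id)
    if customer_id is not None:
        # Known customers go straight to customer service.
        return Triage("customer_service", "normal", customer_id)
    if handle in GRAY_LIST:
        return Triage("review", "normal", None)
    # Unknown senders are kept, but at low priority until they are classified.
    return Triage("review", "low", None)
```

Checking the black list first means a blocked account is dropped even if it also appears in the customer lookup; whether that is the right policy is a business decision, not something this sketch settles.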

A third use case is ingesting sensor data directly from edge devices using Apache MiNiFi.

In HDF 3, there are several features that enable all of these use cases and more for utilities and all enterprises.   

  • HDF 3 Apache NiFi 1.2 supports running queries on live data streams for easy filtering.

  • HDF 3 Streaming Analytics Manager supports real-time streaming with live queries and stream joins for complex event processing.

  • HDF 3 Schema Registry allows for easy type conversion and record manipulation across thousands of different kinds of data with evolving schemas, without a long code-and-deploy cycle.
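
To make the idea of querying records mid-stream concrete, here is a minimal Python sketch that runs SQL over one batch of sensor readings using the standard library's `sqlite3` module. It is an analogy, not NiFi's API: the `query_batch` helper, the `readings` table, and the field names are assumptions made up for this illustration.

```python
import sqlite3


def query_batch(records, sql):
    """Load one batch of sensor readings into an in-memory table and query it."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    try:
        # Hypothetical schema for the illustration.
        conn.execute(
            "CREATE TABLE readings ("
            "device_id TEXT, temperature_c REAL, humidity_pct REAL)"
        )
        conn.executemany(
            "INSERT INTO readings VALUES "
            "(:device_id, :temperature_c, :humidity_pct)",
            records,
        )
        return [dict(row) for row in conn.execute(sql)]
    finally:
        conn.close()


# Made-up sample batch, standing in for records arriving from edge devices.
batch = [
    {"device_id": "pump-1", "temperature_c": 71.5, "humidity_pct": 40.0},
    {"device_id": "pump-2", "temperature_c": 93.2, "humidity_pct": 38.5},
    {"device_id": "valve-7", "temperature_c": 88.0, "humidity_pct": 55.1},
]

# Keep only the readings that need attention before landing the data.
hot = query_batch(
    batch,
    "SELECT device_id, temperature_c FROM readings "
    "WHERE temperature_c > 85 ORDER BY device_id",
)
```

The point of the pattern is that the filter is a declarative query rather than hand-written code, so changing the threshold or the selected fields does not require a new build and deploy.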

By using these techniques, you can have a flexible, agile, real-time streaming Big Data solution that does not require laborious, error-prone hand-coding and manual deploy cycles that lead to delays and issues. You can now visually develop streaming microservices on this next-generation streaming platform. Utilities will be leaping past the manual coding of big data in MapReduce, Spark, and other second-generation tools.

Topics:
big data, hdf 3.0, utilities, data streaming, real-time data
