Cloud computing is one of the big three trends impacting IT architectures today. What some may not realize is that an underlying connected data architecture is not only essential for cloud, but sits at the confluence of all three trends. Here’s why.
The first big trend is IoT. According to BI Intelligence, we can now expect 24B connected devices by 2020 and almost $6 trillion spent on IoT solutions in the next five years. This dynamic is fueling a new wave of predictive retail, factory automation, and connected everything applications for cars, tractors, cities, patients, energy meters, oil wells, and more. Cloud is a natural place to process, analyze, and store this IoT data since much of it is created outside of the data center.
The second big trend is cloud computing itself. Cloud computing enables new levels of business agility for IT, developers, and data scientists while providing a pay-as-you-go model of unlimited scale and no upfront hardware costs to get started. Cloud services running on Microsoft Azure or Amazon Web Services are especially good for ephemeral use cases where you want to spin up some analytic jobs, get the results, and then shut things down so you can manage your costs. Long running, always-on use cases for streaming analytics, online applications, and interactive reporting and analytics enrich this ephemeral world and typically extend into a “hybrid” architecture that spans the data center and cloud. Hybrid cloud strategies are common. According to The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016, “37% told us they plan to increase their investment in big data cloud by 5% to 10%, while another 14% plan to increase it by more than 10%.”
The third big trend is Big Data and analytics, which intertwines with the other two. We are seeing the rise of modern data applications that run not only in the data center but also leverage the public cloud for data prep and transformation, data science, exploratory analytics, machine learning, and more. That represents petabytes of data moving through data centers and across public cloud services like Microsoft Azure and Amazon Web Services every day.
Many customers I talk to are defining digital transformation initiatives that sit at the convergence of these three big trends, with data residing at the very heart of it all. By this I mean the ability to use all of the data in the enterprise, cloud, and IoT ecosystem, whether it is historical data sitting in lakes or databases or live data streaming in the here and now (or anywhere in between). The goal is continuous insights that not only deliver but also automate data-informed business decisions.
Digital transformation is not hyperbole. Connected tractors are transforming heavy equipment manufacturers into farming analytics providers. Smart meters are enabling stodgy utility companies to transform how they manage energy and deliver value that’s customized for each household.
It’s Not Just Hybrid, It’s Connected!
While the new wave of applications is inherently hybrid (that is, they span the data center and cloud), what matters most is that the data architecture be connected. Thinking in terms of a connected architecture is essential because it implies that your systems are inherently able to, well, connect: connect the legacy and IoT worlds, and connect data in the data center with data in the cloud and data at the outermost edge (think remote oil rigs). It also implies connecting to open source innovation from projects like Apache Hadoop, Spark, NiFi, Kafka, and Storm, all part of The Apache Software Foundation!
That’s not to say that connected data architectures ignore the architectures of the past, such as mainframes and relational databases, or Web and SaaS. But the fact is that connected data is now at the center of transformation initiatives and the new world of user-centric applications.
What do I mean by ‘a new world’? The best way to explain this is via connected cars, but it applies across all industries.
For an autonomous connected car to drive, many types of apps need to work together for the benefit of the car and driver, based on a common data logic. Besides the obvious ones, driving and navigation, there may be apps for scheduled and predictive maintenance, design innovation, warranty, pricing, insurance premiums, weather, sensor management, connected cities, recalls, infotainment, route optimization, and so on. In fact, there may be hundreds of these new data-driven apps, delivered by different companies in industries ranging from insurance to government, infotainment, software, and the auto industry itself, all of which are part of the value chain of the car.
As a result, the concept of the modern app changes. In the connected model, there is no single app that is the source of truth. Software needs to act on data where it is born, where it flows, and where it ultimately comes to rest, in a way that delivers value continuously. The car is always moving. The ‘app’ might be a stream processing application underpinned by Apache Storm and Kafka. It may be a series of code snippets running in Apache NiFi and MiNiFi (the edge agent for NiFi) that acts on the data from the point of inception, analyzes it, and sets its priority from a data logistics perspective so it is delivered to downstream apps that want or need to derive further value. The purpose of a connected data architecture is, therefore, to provide the data logic that enables these modern apps to deliver continuous insights in a way that is secure, reliable, manageable, and flexible.
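To make the edge-prioritization idea a little more concrete, here is a minimal sketch in plain Python. It is only an illustration of the pattern an MiNiFi-style edge agent might apply, not real NiFi or MiNiFi code; the record fields, sensor types, and thresholds are all invented for the example.

```python
# Hypothetical sketch of edge-side prioritization: score each sensor
# reading at the point of inception so downstream apps receive the
# most urgent data first. Field names and thresholds are illustrative
# assumptions, not a real NiFi/MiNiFi API.

def prioritize(reading):
    """Assign a delivery priority to a raw sensor reading."""
    if reading["type"] == "collision_sensor" and reading["value"] > 0.9:
        return "critical"   # deliver immediately to safety apps
    if reading["type"] == "engine_temp" and reading["value"] > 110:
        return "high"       # predictive-maintenance apps care soon
    return "bulk"           # batch later for historical analytics

def route(readings):
    """Group readings by priority for downstream delivery."""
    queues = {"critical": [], "high": [], "bulk": []}
    for r in readings:
        queues[prioritize(r)].append(r)
    return queues

queues = route([
    {"type": "collision_sensor", "value": 0.95},
    {"type": "engine_temp", "value": 120},
    {"type": "odometer", "value": 42_000},
])
```

In a real deployment the "queues" would be topics or flow-file relationships, but the design point is the same: the priority decision is made where the data is born, not after it lands in the data center.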
So, it’s not just about a single app sitting in a data center or the cloud. Sure, some of the data might be coming in from the cloud. Some from the cars. Some, such as manufacturing line data tracking operational issues or defect/warranty use cases, might flow directly into your data center. These apps are data driven, connected, components of a larger picture, and they reside wherever analysis is optimal: at the point of streaming, in the cloud, in the data center, at the edge, and in the machine itself.
The point is you are going to have ‘apps’ scattered everywhere. Edge analytics or stream analytics apps, historical analysis apps, or machine learning apps. You want them to act on, analyze, and drive value from the data in the most opportune place. The connected data architecture manages the data flow and logistics, the metadata, and the corresponding security and governance policies. That’s the connective tissue.
So digital transformation is fueled by connected data logic that is able to deliver those continuous insights — the car experience — via a composition of apps that any participant in the overall value chain can tap into.
So, if you are out there considering this, what are the attributes of a connected architecture?
First, it’s an architecture that can acquire and deliver data anywhere. This is not just about moving data from point A to point B; it’s also about actively managing the data logistics. Think FedEx for data delivery: you not only receive the product you purchased, but you can also track every step of the package delivery process. It’s about getting data, with provenance, to where it needs to be, when it needs to be there. This kind of continuous data delivery is vital for driving continuous insights across a range of apps — from stream processing at the edge, to data prep, query, and analysis in the cloud, to historical analysis in the data lake or ERP system in the data center.
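The “FedEx for data” idea can be sketched in a few lines of plain Python: each record carries a provenance trail of the hops it has passed through. This is only a toy illustration of the concept (systems like Apache NiFi record provenance events automatically and far more richly); the function and hop names are invented for the example.

```python
# Sketch of provenance-aware data delivery: every record accumulates a
# trail of (hop, timestamp) entries, so any consumer can audit where
# the data came from and when. All names here are illustrative.
import time

def make_record(payload, origin):
    """Create a record stamped with its point of origin."""
    return {"payload": payload, "provenance": [(origin, time.time())]}

def forward(record, hop_name):
    """Deliver the record to the next hop, appending to its trail."""
    record["provenance"].append((hop_name, time.time()))
    return record

rec = make_record({"temp_c": 87}, origin="oil-rig-7/sensor-12")
rec = forward(rec, "edge-gateway")
rec = forward(rec, "cloud-ingest")
rec = forward(rec, "data-lake")

# The trail reads like a package-tracking page for the data.
hops = [hop for hop, _ in rec["provenance"]]
```

The payoff is the same as with package tracking: when an analytic result looks wrong, you can walk the trail back to the exact hop, and the exact moment, where the data changed hands.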
Second, you need dynamic governance and security policies, with business metadata always available in a way that works for always-on applications as well as ephemeral apps in the cloud that come and go. It’s about being able to tag differing types and structures of data sets and treat them in a consistent way. For example, tagging files, tables, columns, and cells across various systems with a PII (personally identifiable information) tag, and setting up security policies based on that tag so you can manage who is allowed to access such data. This ability to enforce policies in a dynamic, automated way is important in the world of connected data.
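A simplified way to picture tag-based policies is the sketch below. It is a plain-Python toy with invented table, column, and role names, not the actual behavior of any product (in practice, tools such as Apache Atlas and Apache Ranger provide this kind of tagging and enforcement); the point is that one policy keyed on a tag governs access consistently across many data sets.

```python
# Sketch of tag-based access control: columns are tagged once (e.g. PII),
# and a single policy keyed on the tag governs access everywhere that
# tag appears. All names here are illustrative assumptions.

column_tags = {
    "customers.email": {"PII"},
    "customers.city": set(),
    "meters.reading_kwh": set(),
}

# One policy per tag, not per column: only these roles may read PII.
policies = {"PII": {"compliance_officer", "data_steward"}}

def can_read(role, column):
    """Allow access unless a tag on the column restricts it to other roles."""
    for tag in column_tags.get(column, set()):
        if tag in policies and role not in policies[tag]:
            return False
    return True

allowed = can_read("analyst", "customers.city")       # untagged column
blocked = can_read("analyst", "customers.email")      # PII-tagged column
steward = can_read("data_steward", "customers.email")
```

Because the policy hangs off the tag rather than off each individual column, newly tagged data in any connected system is governed the moment the tag is applied — which is what makes the approach workable for ephemeral cloud apps that come and go.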
Third, it has to be fueled by the ecosystem of open innovation. By this, I mean the dozens of open projects now in Apache. The Apache Software Foundation continues to be an impressive hub for a wide range of innovative data processing projects, starting with Apache Hadoop over 10 years ago and continuing with Apache Spark, Hive, HBase, Storm, Kafka, NiFi, Zeppelin, and dozens more.
Fourth, it needs to span any delivery model: data center, cloud, or, for many enterprises, a hybrid of both. And it needs to do so in an open way, working with any brand of public cloud and within a hybrid cloud strategy.
Fifth, to be truly transformational, it must enable hundreds or thousands of use cases (not just tens): ephemeral use cases such as data prep, query and analysis, in-memory analysis, advanced statistical modeling and machine learning, NoSQL data storage, and real-time event processing. And it needs to do so while providing the always-on security and governance services today’s enterprise requires.
Above all, it needs to be easy to deploy and quick to deliver value. You should be able to get started quickly and graduate to increasingly sophisticated use cases in a crawl, walk, run manner.
Enterprises see the IoT data tsunami coming at them, and many realize they need to evolve their skills. At the same time, the cloud business case has been proven for many, and lines of business are pushing IT, so they want cloud in all its forms. They know they need to get started with big data in a cloud world, and they realize that the combination of cloud, IoT, and Big Data analytics can be transformational.