Welcome back to my blogging adventure. When we left off in my Cybersecurity Architecture series, Cybersecurity Architecture: All about Sensors, we started to touch on some of the requirements our sensors place on the modern data plane. Today, we are going to dive deep into those requirements and walk through how we can leverage Hortonworks DataFlow to help address our modern data plane needs.
Before we dive in, let's review the high-level conceptual architecture and see how this modern data plane interacts with the rest of the design. Looking at the below diagram, today's focus is on the red arrow labeled Data Plane.
Modern Data Plane
The great thing about a conceptual architecture design is that we can clearly see the separation of concern and the interfaces between the conceptual components. Looking at the red arrow, we see it has five interface points:
- Sensors
- Data Lake
- Analytics Engine
- Rules Engine
- Automated Response
Great! Let's dive in and see how these components interact and see if we can continue to detail out our requirements for each component.
As we discussed in my last article, All About Sensors, our sensor network is deployed in a distributed manner, as close as possible to the applications, data, and systems. From the perspective of the data plane, we want the sensors to connect safely and securely so as to maintain the chain of custody for this activity data. Unlike designs on the drawing board, in real life things change. We need the ability for the data plane to reach back and send messages, such as reconfiguration requests, back to the sensor to adjust to these changes. Connectivity between these sensors and the data plane may be intermittent or limited in bandwidth, so a queuing and data priority forwarding strategy embedded into the sensor is required.
- Bidirectional: The interface between the sensor and the data plane needs to be bidirectional. The data plane receives data from the sensor, and the data plane sends messages back that allow for dynamic configuration of the sensor based on changing needs.
- Authentication: Both the data plane and sensor must be able to authenticate the other. The data plane needs to authenticate the sensor to identify the source of data and maintain chain of custody of the data being passed in. The sensor needs to authenticate the data plane to prevent unauthorized disclosure of the data it is feeding into the data plane and to determine whether it should listen for incoming configuration messages. Once the coarse-grained sensor-to-data-plane authentication is established, fine-grained authentication at the message level needs to be supported as necessary, as detailed in authorization below.
- Authorization: Both the data plane and sensor require fine-grained authorization. Using the authenticated identities, each message type needs the ability to be authorized. This allows the data plane to determine the type of data it will accept from an individual sensor, and the sensor to determine if the identity is authorized to push a configuration message.
- Confidentiality: To enable the creation of a secure data channel across potentially untrusted communication links, both the data plane and sensor must ensure the confidentiality of the data while in transit and at rest. This includes network communication, file storage, and in memory confidentiality as the sensor and data plane needs to be enabled to run in a wide range of environments such as public/private cloud, virtualization, and containerization in multi-tenant environments.
- Integrity: To support chain of custody, the handoff of the data between the data plane and sensor must not only maintain the integrity of the data — positive audited validation that the data sent wasn't modified in transit or storage — it needs to pass the audit trail from when the data was first created to when it was passed on to the data plane.
- Availability: The interface between the sensor and the data plane must assume that the connection is untrusted, unreliable, and limited in bandwidth. This interface must ensure the data is available from creation to handoff through some form of store and forward and/or intelligent routing.
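To make the availability requirement concrete, the queuing and priority forwarding strategy described above can be sketched in a few lines of Python. This is a minimal, purely illustrative sketch, not part of Hortonworks DataFlow; the class and field names (`SensorMessage`, `StoreAndForwardQueue`, `priority`) are invented for this example. The idea is simply that a sensor buffers messages while its uplink is down and drains them highest-priority-first when connectivity returns.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class SensorMessage:
    priority: int                       # lower number = higher priority
    payload: dict = field(compare=False)

class StoreAndForwardQueue:
    """Buffers sensor data while the uplink is unavailable and forwards
    the highest-priority messages first once connectivity returns."""
    def __init__(self):
        self._heap = []

    def enqueue(self, msg: SensorMessage):
        heapq.heappush(self._heap, msg)

    def flush(self, send):
        """Drain the queue in priority order via the supplied transport."""
        while self._heap:
            send(heapq.heappop(self._heap))

# A low-priority heartbeat queued before a high-priority alert:
q = StoreAndForwardQueue()
q.enqueue(SensorMessage(priority=5, payload={"type": "heartbeat"}))
q.enqueue(SensorMessage(priority=1, payload={"type": "alert"}))

sent = []
q.flush(sent.append)
# The alert is forwarded before the heartbeat despite arriving later.
```

In a real deployment the `send` callable would be a secured transport to the data plane, and the buffer would be persisted so queued data survives a sensor restart.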
The data lake is already a well-documented concept with mature architectures available. Let's focus on the interface between the data plane and the data lake, as this creates a significant departure from the existing data lake architectures.
Existing Hadoop-based platform architectures make several implicit assumptions about how users interact with the platform, such as developmental research versus production applications. While this was perfectly acceptable in a research mode, as we move to a modern data application architecture, we need to bring modern application concepts back to the Hadoop ecosystem.
For example, existing Hadoop architectures tightly couple the user interface with the source of data. This is done for good reasons that apply in a data discovery research context, but it causes significant issues in developing and maintaining a production application. We see this in some of the popular user interfaces such as Kibana, Banana, Grafana, etc. Each user interface is directly tied to a specific type of data lake and imposes schema choices on that data.
The reason modern application architectures evolved to use the basic Model-View-Controller (MVC) pattern is to address this issue of tight coupling and maintain separation of concern. In a scalable application, the user interface has no concept of a data source — it requests and responds with data to a service and doesn't know or care where that data came from. The data could be delivered from storage, compiled from multiple data sources, pulled from a live stream, or even computed on demand — the user interface doesn't and shouldn't care. The great thing about leveraging Hortonworks DataFlow to enable a modern data plane is that we can take a hybrid approach that provides the stability and scalability of the MVC approach and the immediacy of data access design of the Hadoop architecture.
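The decoupling argument above can be sketched in code. In this hypothetical Python example (the interface and class names are invented for illustration), the "view" depends only on an abstract service; whether events come from a data-lake query or a live stream is invisible to it:

```python
from abc import ABC, abstractmethod

class EventService(ABC):
    """Controller-facing interface: the view asks for events and never
    learns whether they came from storage, a stream, or computation."""
    @abstractmethod
    def recent_events(self, limit: int) -> list:
        ...

class DataLakeEventService(EventService):
    """Serves events from stored records (stands in for a data-lake query)."""
    def __init__(self, records):
        self._records = records

    def recent_events(self, limit):
        return self._records[-limit:]

class LiveStreamEventService(EventService):
    """Serves events straight off a stream (stands in for a live consumer)."""
    def __init__(self, stream):
        self._stream = stream

    def recent_events(self, limit):
        return [next(self._stream) for _ in range(limit)]

def render_dashboard(service: EventService):
    # The "view": depends only on the interface, not the backing store.
    return [e["id"] for e in service.recent_events(limit=2)]

stored = DataLakeEventService([{"id": 1}, {"id": 2}, {"id": 3}])
live = LiveStreamEventService(iter([{"id": 10}, {"id": 11}]))

stored_view = render_dashboard(stored)   # [2, 3]
live_view = render_dashboard(live)       # [10, 11]
```

Swapping the backing implementation requires no change to the view, which is exactly the loose coupling the hybrid data plane approach preserves.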
- Authentication and Authorization: To leverage the data lake as a forensic repository, all access to the data lake must be authenticated and authorized.
- Confidentiality: Because raw data may be captured, sensitive information, such as Personally Identifiable Information (PII), Personal Health Information (PHI), financial information, or account details such as credential secrets, may be collected. This information must be protected from unauthorized disclosure.
- Integrity: As a forensic repository, the data lake must maintain the integrity of the data to full chain-of-custody and non-repudiation standards.
- Availability: As data is centralized from point security solutions into a consolidated data lake as the authoritative repository of information, the data lake becomes a business-critical application data store with the requisite high availability and recovery requirements.
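One common way to meet the integrity requirement above is a hash chain over the audit trail: each entry carries the hash of its predecessor, so any after-the-fact modification is detectable. The following Python sketch is purely illustrative (the function names and record fields are invented here), not a description of how any particular data lake implements chain of custody:

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel "previous hash" for the first entry

def chain_entry(record: dict, prev_hash: str) -> dict:
    """Build an audit entry whose hash covers both the record and the
    hash of the previous entry, linking the chain together."""
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    return {"record": record, "prev": prev_hash, "hash": digest}

def verify(chain: list) -> bool:
    """Recompute every link; any tampered record breaks verification."""
    prev = GENESIS
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

first = chain_entry({"event": "sensor-capture"}, GENESIS)
second = chain_entry({"event": "data-plane-handoff"}, first["hash"])
intact = verify([first, second])          # True: chain is unbroken

second["record"]["event"] = "tampered"
tampered = verify([first, second])        # False: modification detected
```

Production systems would add signatures over the hashes for non-repudiation, but the linking principle is the same.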
The interface between the data plane and the analytics engine is complex, with many different bidirectional data flows. These data flows change frequently, as the system must adapt to the enterprise's environment, new analytic use cases, new workflows, and response rules. This pushes us toward a loosely coupled interface that can be changed quickly without code development. While the individual data flow interfaces between the analytic engine and the data plane will be measured in the hundreds in a mature cybersecurity analytic platform, the flows can be categorized into three main types:
- Sensor data ingestion. As new data flows into the system, data that is relevant to the specific streaming analytic use cases flows into the analytic ingestion topology for processing and streaming-time decisioning.
- Model training. Data flows from the data lake back through the data plane and into the analytic engine, rather than the analytic engine directly accessing the data lake. This allows for production-like testing and training of the models, as the historical data arrives at the analytic engine in the same manner as new streaming data.
- Scoring model decision access. With the goal of wide integration into the enterprise IT systems and applications, scoring decisioning models can be exposed to IT systems as enterprise web services such as adaptive authentication scoring engines.
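As a toy illustration of the three categories above, a data plane's classification step might look like the following Python sketch. The message fields and destination names here are invented for the example; in Hortonworks DataFlow this routing would be expressed as flow configuration rather than code:

```python
def route(message: dict) -> str:
    """Classify a message into one of the three analytic-engine flow
    categories: ingestion, model training, or scoring access."""
    kind = message.get("kind")
    if kind == "sensor_event":
        # New streaming data headed for streaming-time decisioning.
        return "analytic-ingestion-topology"
    if kind == "historical_replay":
        # Data-lake history replayed through the data plane for training.
        return "model-training"
    if kind == "score_request":
        # An enterprise IT system asking a scoring model for a decision.
        return "scoring-web-service"
    # Anything unrecognized is parked for inspection.
    return "dead-letter"

destination = route({"kind": "historical_replay", "id": "evt-42"})
# destination == "model-training"
```

The point of the sketch is the loose coupling: adding a fourth flow category means adding a routing rule, not rewriting the analytic engine or the sensors.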
The interface between the data plane and analytic engine shares the same requirements expressed above in the sensor and data lake sections.
The response rules engine is the middleware between the analytics engine, automated response, and workflow component. The data plane maintains the principles of loose coupling and separation of concerns between these components. While the analytic engine's scoring models produce the predictive analytic result, it is the combination of workflow and rules engine that determines the prescriptive response. Let's walk through the lifecycle of an automated response use case to see how these four components interact.
- The analytic engine processes the streaming data through its analytic models and makes scoring and enriched data streams available.
- This scoring and enriched data flows through the data plane to both the workflow and dashboard components.
- The dashboard and workflow components act as the controller in the MVC modern application architecture model by formatting the data for visual interface consumption and receiving and responding to user interface feedback from the user.
- The user interface acts as the security gateway and abstraction layer between the application logic and the display medium, abstracting mobile application, desktop application, and web client application differences from the cybersecurity application logic. It displays content to the system users and responds to their feedback. The user interface allows the user to configure response workflow, including automated response rules.
- The configured automated response rules are sent to the data plane and rules engine for implementation.
- The data plane is configured to take the alert feed configured by the workflow and send trigger events to the rules engine.
- The rules engine receives configuration details from the workflow and events from the analytics engine via the data plane, and triggers response decisions.
- These decisions are sent to the data plane for delivery to the actual security components for triggered response, e.g., firewall API integration, case management systems, etc.
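The rules engine's role in the lifecycle above can be sketched as follows. This is a hypothetical Python illustration (the `ResponseRule` and `RulesEngine` names, fields, and action strings are all invented for this example), showing only the shape of the interaction: rules arrive from the workflow, scored events arrive from the analytics engine via the data plane, and decisions flow back out for delivery:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ResponseRule:
    """One workflow-configured automated-response rule."""
    name: str
    condition: Callable[[dict], bool]
    action: str                 # e.g. "firewall_block", "open_case"

class RulesEngine:
    """Receives rule configurations from the workflow component and
    scored events from the analytics engine (via the data plane), and
    emits response decisions back to the data plane for delivery."""
    def __init__(self):
        self._rules = []

    def configure(self, rule: ResponseRule):
        self._rules.append(rule)

    def evaluate(self, event: dict) -> list:
        # Every matching rule produces a decision for the data plane.
        return [
            {"rule": r.name, "action": r.action, "event": event["id"]}
            for r in self._rules
            if r.condition(event)
        ]

engine = RulesEngine()
engine.configure(ResponseRule(
    name="high-risk-login",
    condition=lambda e: e.get("score", 0) > 0.9,
    action="firewall_block",
))

decisions = engine.evaluate({"id": "evt-1", "score": 0.95})
# One decision: block via the firewall integration.
```

Keeping the conditions and actions as configuration, rather than code, is what lets the prescriptive response change as quickly as the enterprise environment does.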
Like our sensor network, automated response represents all the different security controls and application integration points available. They are deployed in a distributed manner, as close as possible to the applications, data, and systems. From the perspective of the data plane, we want the automated response to connect safely and securely so as to maintain the chain of custody for this activity data. Bidirectional communication is necessary to validate that the response request was received and to feed back new follow-on activity as the automated response triggers.
Okay, after that long walk through the middle of our conceptual architecture for cybersecurity, it should be clear that the modern data plane is the core element that glues our architecture together. It provides the separation of concerns and loose coupling that help with maintainability and scaling of the platform, insulates the platform from the constantly changing data and mess that is the outside world, and acts as a security barrier between the platform and the user interfaces that consume that data. Choosing the right technologies to enable the modern data plane is critical, and Hortonworks DataFlow is well-positioned to meet these requirements. For the curious and diligent readers who caught that there are two additional interfaces, workflow and dashboards, we will cover them in a future article as part of user interfaces.