Reactive Data Platforms, Real-Time Analytics, and Scale
Streaming capabilities are getting faster to meet the demands of users for real-time analytics. See what role Reactive programming plays in this.
Thanks to Alex Silva, Principal Data Architect at Pluralsight, for sharing his insights on the current and future state of the company’s data platform. Alex will be speaking during the Reactive Summit in Austin, Texas on Thursday, October 19, on “Designing A Reactive Real-Time Data Platform: Architecture and Infrastructure Challenges.”
Q: How is your company using a data platform to provide solutions to clients?
A: Pluralsight provides unlimited online developer, creative, and IT courses. We use our data platform to collect product usage data for enterprise customers to use in their analytics portal and to drive product features that rely heavily on data, such as personalization.
Q: What do you see as the most important elements of the Data Platform?
A: Most of us have heard about big data before. To me, big data revolves around storing and analyzing massive amounts (hundreds of terabytes or even petabytes) of data at rest. Think something like files in HDFS and Hadoop.
Fast data, on the other hand, adds components of velocity and motion to the mix. As such, fast data has very different characteristics, requiring a different set of technologies to harness its value. This is where our data platform comes in: we enable users to analyze, decide, and act on data as fast as it arrives.
Applying the four tenets of the Reactive Manifesto (responsive, resilient, elastic, and message-driven) across our infrastructure has been pivotal in achieving the right level of responsiveness, asynchronicity, and elasticity across our entire data architecture.
Q: Which programming languages, frameworks, and tools do you or your company use to develop the data platform?
A: Currently we rely heavily on Scala, Akka, Apache Kafka, and Apache Spark, including several of the submodules these projects offer. For instance, we use Akka HTTP to provide a set of endpoints that abstract interaction with the underlying data stores in the platform. On the streaming side, we use Apache Spark, Apache Kafka, and Akka Streams. We look at the characteristics of each data replication request to determine which streaming framework is best suited to that use case.
Q: How has the development and deployment on the platform changed recently?
A: We leverage several open source libraries and frameworks and have recently decided to open source some of our internal frameworks as well. Hydra is the primary example of this initiative for us. We developed Hydra to be a real-time streaming and data replication platform that "unbundles" the receiving, transforming, and transporting of data streams.
Internally, we use Hydra to handle the ingestion of several thousand events across a variety of formats and use cases, with producers ranging from bounded contexts, database tables, and message exchanges to product data.
We plan on open sourcing the streaming side of the platform, a project named Hydra Streams, in the next few months as well.
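To illustrate what "unbundling" the receiving, transforming, and transporting of data can mean in practice, here is a minimal sketch in Python. It is not Hydra's API: the stage types, `run_pipeline`, `add_source`, and the sample events are all hypothetical names chosen for this example, and in-memory lists stand in for real sources and sinks.

```python
import json
from typing import Callable, Iterable, List

# Hypothetical stage types; these names are illustrative and not part of Hydra.
Receiver = Callable[[], Iterable[str]]
Transformer = Callable[[dict], dict]
Transport = Callable[[dict], None]


def run_pipeline(receive: Receiver, transform: Transformer, transport: Transport) -> int:
    """Pull raw events, transform each one, and hand it to the transport.

    Each stage is passed in separately, so any one of them can be
    swapped without touching the other two.
    """
    count = 0
    for raw in receive():
        event = json.loads(raw)      # receiving: decode the raw payload
        transport(transform(event))  # transforming, then transporting
        count += 1
    return count


# Example wiring with in-memory stand-ins for a real source and sink.
incoming = ['{"user": "a", "course": "scala"}', '{"user": "b", "course": "akka"}']
delivered: List[dict] = []


def add_source(event: dict) -> dict:
    """Hypothetical transformation: tag each event with where it came from."""
    return {**event, "source": "product-usage"}


processed = run_pipeline(lambda: incoming, add_source, delivered.append)
```

Because the three stages only meet inside `run_pipeline`, replacing the list-backed transport with, say, a message-queue producer would not require changes to the receiving or transforming code.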
Q: What kind of security techniques and tools do you find most effective for the data platform?
A: Since the vast majority of interaction between users and the platform is mediated through APIs, we can control authentication and authorization using well-known approaches like LDAP, OAuth, and API tokens to verify credentials and grant permissions. Kafka also supports authenticating connections to brokers from clients (producers and consumers) using SSL or Kerberos.
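As a rough illustration of the Kafka side, the sketch below assembles the security-related client properties for the two options mentioned: SSL with a client certificate, or Kerberos over an encrypted connection. The property keys are standard Kafka client configuration names. The helper function, the file paths, the passwords, and the `kafka` service name are placeholders assumed for this example, not Pluralsight's settings.

```python
from typing import Dict


def kafka_client_security_props(truststore_path: str,
                                truststore_password: str,
                                keystore_path: str = "",
                                keystore_password: str = "",
                                use_kerberos: bool = False) -> Dict[str, str]:
    """Build security properties for a Kafka producer or consumer.

    Hypothetical helper: only the dictionary keys are real Kafka settings.
    """
    # The truststore lets the client verify the broker's certificate.
    props = {
        "ssl.truststore.location": truststore_path,
        "ssl.truststore.password": truststore_password,
    }
    if use_kerberos:
        # Kerberos (SASL/GSSAPI) for authentication, TLS for encryption.
        props["security.protocol"] = "SASL_SSL"
        props["sasl.mechanism"] = "GSSAPI"
        # Assumed value; it must match the principal name the brokers run as.
        props["sasl.kerberos.service.name"] = "kafka"
    else:
        # SSL client authentication: the client presents its own certificate.
        props["security.protocol"] = "SSL"
        props["ssl.keystore.location"] = keystore_path
        props["ssl.keystore.password"] = keystore_password
    return props


# Placeholder paths and passwords, for illustration only.
ssl_props = kafka_client_security_props(
    "/path/to/truststore.jks", "changeit",
    keystore_path="/path/to/keystore.jks", keystore_password="changeit")
kerberos_props = kafka_client_security_props(
    "/path/to/truststore.jks", "changeit", use_kerberos=True)
```

In a real deployment the Kerberos option also needs a JAAS login configuration and a keytab or ticket cache, which are left out here.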
Q: What are some real-world problems being solved by the data platform?
A: Some examples include:
- Data replication across bounded contexts.
- Ingestion of real-time and batch events.
- Real-time streaming analytics at large scale.
- Building reusable streaming data pipelines for data engineering and analytics teams.
- Prediction and recommendation algorithms.
Q: What are the most common issues you see affecting the development and deployment on the platform?
A: Some issues include:
- Making the transition from batch data processing pipelines to continuously running streaming models brings a new set of challenges around previously established data paradigms, and these can be difficult to overcome.
- Currently, we provide both APIs and DSLs to interface with the platform. Finding the right balance between ease of use and empowering users has been a unique challenge for us.
- On the development side, we use frameworks and tools that are constantly upgraded, usually at a very aggressive pace. This requires the team to be in “learning mode” at all times, and a product development lifecycle that can react quickly to changes in the fast data ecosystem.
Q: Do you have any concerns regarding the current state of the platform?
A: No. However, since the data universe is very rich and complex, it is easy to get caught up in implementation details. As a team, we have to constantly remind ourselves that developing an infrastructure around streaming data in near real time requires some significant tradeoffs, and that searching for the perfect solution to a problem can lead to analysis paralysis.
Q: What’s the future for the platform — where do the greatest opportunities lie?
A: These are a few of the areas we will be focusing on next:
- Better data governance and metadata collection. Currently, we use Avro as our data storage format, and while it allows us to store some basic metadata within our streams, we need better tooling around data discovery. Part of our upcoming efforts will involve exploring richer metadata storage protocols, including robust data lineage and governance constructs across all the pillars of our data ecosystem.
- Easier-to-use streaming APIs. We are researching improvements to our DSLs and a federation layer on top of the streaming frameworks we support.
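For readers unfamiliar with how an Avro schema can carry basic metadata, the following Python sketch may help. Avro schemas are JSON documents: `doc` is a standard attribute, and the specification permits extra attributes as metadata as long as they do not affect serialization. The schema itself, the custom `owner` and `pii` attributes, and the `describe` helper are invented for this example and do not come from Pluralsight's platform.

```python
import json

# Hypothetical schema: the record, its fields, and the "owner" and "pii"
# attributes are illustrative only.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "CourseViewed",
  "namespace": "example.events",
  "doc": "Emitted when a user opens a course.",
  "owner": "data-platform",
  "fields": [
    {"name": "userId", "type": "string", "doc": "Opaque user identifier.", "pii": true},
    {"name": "courseId", "type": "string", "doc": "Identifier of the course."}
  ]
}
"""

# Attributes the Avro specification defines for record schemas; any other
# key is treated here as custom metadata.
SPEC_RECORD_KEYS = {"type", "name", "namespace", "aliases", "fields", "doc"}


def describe(schema: dict) -> dict:
    """Collect the descriptive metadata embedded in a record schema."""
    custom = {k: v for k, v in schema.items() if k not in SPEC_RECORD_KEYS}
    return {
        "name": f'{schema["namespace"]}.{schema["name"]}',
        "doc": schema.get("doc", ""),
        "custom": custom,
        "pii_fields": [f["name"] for f in schema["fields"] if f.get("pii")],
    }


summary = describe(json.loads(SCHEMA_JSON))
```

A helper like this shows the limit the answer alludes to: the metadata travels with the schema, but discovering and searching it across many streams still needs separate tooling.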
Q: What do developers need to keep in mind when working on developing and deploying on the data platform?
A: Some topics that come to mind:
- Have a passion for learning!
- Be ready to integrate and support different tools and frameworks.
- Keep calm and let it fail: Failures are inevitable; develop and react to them accordingly.
- Don’t get too attached to your code because implementations will change.
- When in doubt, overcommunicate!