Why Multi-Tenancy Is Critical for Your Data Pipeline
Bringing multi-tenancy to data pipelines addresses key bottlenecks to help keep up with the needs of users and apps for faster access to more data. Check out how Yahoo! has been addressing this need!
Multi-tenancy is not a new concept — in the early days of computing in the 1960s, time-sharing emerged as a way to share the cost of expensive computer hardware among multiple departments or even companies. That initial concept went through many evolutions that ultimately led to multi-tenancy. According to Wikipedia, multi-tenancy means, "a software architecture in which a single instance of software runs on a server and serves multiple tenants" — a contrast with monolithic, single-tenant applications which can only service one user at a time.
The cost efficiencies possible with multi-tenancy made it appealing, but what has brought it into the mainstream is the rise of software-as-a-service (SaaS) applications and cloud computing. SaaS and cloud applications often need to service thousands and even hundreds of thousands of clients, something that would be nearly impossible without a multi-tenant architecture; the cost and complexity of operating and scaling without multi-tenancy would be crippling.
Not Just for SaaS Anymore: Multi-Tenancy Inside the Enterprise
While we tend to associate multi-tenant applications with SaaS applications and public cloud providers, multi-tenancy has also become critical within an enterprise. Modern application architectures based on microservices, containers, and APIs are impossible to build without multi-tenancy. Not only that, but the advantages of multi-tenancy that have made it overwhelmingly common in SaaS and in the cloud are also compelling inside the enterprise:
Cost savings: Having multiple users share the same software component or application brings significant cost savings — not only hardware cost savings but also operational cost savings. That is particularly valuable in the common case where workload demands are highly variable.
Collaboration and sharing: Data needs to be shared among teams within an enterprise to enable decision-making and customer-facing applications. Multi-tenancy makes data sharing and collaboration between teams much easier and breaks down silos of information by consolidating systems.
Operational simplicity: Sharing an application among multiple users makes it dramatically easier to deploy and manage. Updates can be applied once, with all users benefiting. This is very different from single-tenant systems, which need to be updated and managed individually.
Multi-Tenancy for the Data Pipeline
Given the exploding number of data sources, applications, and users, multi-tenancy is especially relevant to data processing. It has been an important feature of databases for some time, with enterprise-class relational database systems (RDBMS) providing sophisticated security frameworks, configuration options, and tuning parameters to support access by many users and applications.
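One widely used pattern behind that kind of shared database access is a discriminator column that scopes every query to a single tenant. The sketch below illustrates the idea in Python with SQLite; the table and tenant names are made up for illustration, and this is a generic pattern, not the mechanism of any particular enterprise RDBMS:

```python
import sqlite3

# A single shared table serves multiple tenants; a tenant_id column
# (hypothetical schema) acts as the discriminator.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (tenant_id TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("finance", "q3-report"), ("sports", "scores"), ("finance", "q4-plan")],
)

def events_for(tenant):
    # Scoping every read by tenant_id lets one table serve many tenants
    # while preventing cross-tenant reads at the application layer.
    rows = conn.execute(
        "SELECT payload FROM events WHERE tenant_id = ?", (tenant,)
    ).fetchall()
    return [payload for (payload,) in rows]

print(events_for("finance"))  # ['q3-report', 'q4-plan']
```

Real RDBMS deployments layer roles, grants, and row-level security on top of this basic idea, but the core of multi-tenant data access is the same: one shared system, logically partitioned per tenant.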
However, the data pipeline that feeds these databases has failed to adapt. Most data pipelines today are still batch-oriented, processed by single-tenant systems that laboriously process data via a monolithic application. As enterprises rush to accelerate the delivery of data insights to decision-makers and applications, the data pipeline has become a critical bottleneck.
Traditional ETL (extract, transform, and load) systems, still the most common way to prepare data for databases, are a prime example of this challenge. Designed and architected long before today’s explosion of data and use cases, these systems process data batch by batch, one job at a time. Newer “big data” systems often leveraged for data transformation (Hadoop and Spark being prime examples) were also not architected for multi-tenancy, so access to them is typically indirect and limited. These systems are usually owned and managed by central teams, in part because they were not designed to support multi-tenant access.
The same needs apply to the messaging and stream-processing systems, such as Apache Storm, Heron, Apache Kafka, and Apache Pulsar, that have become an increasingly important part of the data pipeline. They are being asked to move and process more data faster than ever, yet many offerings were not architected for the multi-tenancy needed to accomplish this easily and efficiently. For systems that do not support multi-tenancy, it becomes necessary to deploy and manage more and more instances, creating ever more complex data pipelines. Without multi-tenancy, application teams become more siloed, less data sharing can happen, and the cost of operating and maintaining data pipelines at enterprise scale grows.
In Practice: Multi-Tenancy in the Data Pipeline at Yahoo!
It is incredibly hard, even practically impossible, to take a piece of software that was not designed to be multi-tenant and retrofit it to support multi-tenancy. Multi-tenancy needs to be part of the application architecture; it cannot simply be bolted onto a single-tenant architecture after the fact.
Given that, how are enterprises addressing these bottlenecks and bringing multi-tenancy to their data pipelines? Yahoo! is one example of an enterprise that faced this challenge. As a massive enterprise with multiple teams and product lines, Yahoo! had demanding requirements for moving data to applications and users. Applications like Yahoo! Mail, Yahoo! Finance, Yahoo! Sports, Flickr, and Yahoo!'s Gemini ads platform deal with large amounts of data that need to be efficiently moved and processed. Creating multiple forks and duplicated deployments of the data pipeline for various users or functions was simply not acceptable: that would have made it almost impossible to share real-time data across departments, creating silos, and it would have been prohibitively expensive for the organization.
For Yahoo!, multi-tenancy was critical to solving this challenge. As they detailed in a blog post about how they developed the Pulsar messaging system (since open sourced and currently undergoing incubation at The Apache Software Foundation), creating a multi-tenant messaging system to support their data pipelines required a number of key architectural features:
A distributed architecture to support the necessary scale of data producers and consumers.
A modular architecture to isolate the performance impact of reads, writes, message storage, and message handling.
A programming model to logically separate applications, messages, and topics.
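The third point, logical separation, is visible in Pulsar's topic naming scheme, where every topic name carries its tenant and namespace: {domain}://{tenant}/{namespace}/{topic}. The sketch below parses that scheme to show how tenancy is part of a topic's identity; the specific tenant and topic names are made up for illustration:

```python
from typing import NamedTuple

class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str      # the tenant owning the topic
    namespace: str   # an administrative grouping within the tenant
    topic: str       # the topic itself

def parse_topic(full_name):
    # Pulsar topic names follow {domain}://{tenant}/{namespace}/{topic},
    # so tenant and namespace boundaries are encoded in every topic name.
    domain, _, rest = full_name.partition("://")
    tenant, namespace, topic = rest.split("/", 2)
    return TopicName(domain, tenant, namespace, topic)

# Hypothetical topic for a mail-notification stream:
name = parse_topic("persistent://mail/notifications/new-message")
print(name.tenant, name.namespace)  # mail notifications
```

Because tenant and namespace are baked into the naming model, policies such as quotas, permissions, and isolation can be applied per tenant or per namespace without separate deployments per team.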
Building multi-tenancy into their solution allowed Yahoo! to deploy it globally across more than ten datacenters. In addition to making it possible to avoid data silos and duplicate data pipelines, the multi-tenancy in Pulsar (as further described in this blog post) was a critical factor in helping them to achieve:
Greater than 100 billion messages/day published
More than 1.4 million topics
Average publish latency across the service of less than 5 ms
Like Yahoo!, other data-driven enterprises are also turning to new approaches and new technologies and architectures to modernize their data pipelines. Bringing multi-tenancy to their data pipelines is addressing key bottlenecks so that they can keep up with the needs of users and applications for faster access to more data.