Business Intelligence in Microservices: Improving Performance
An in-depth review of the different ways to improve your microservices.
Do you know why microservice design is so popular in the development of BI tools? The answer is clear: it helps to build scalable and flexible solutions. But microservice architecture has a serious drawback: its performance usually needs considerable improvement.
The FreshCode team faced this problem too, and I’ve decided to show how we coped with it. This article was written together with the FreshCode CTO and is based on our recent case of developing a reporting microservice. Here you will find its technical scheme and estimates, as well as a list of tools for on-premise and SaaS products.
Meet Microservice Design: What Should You Care About?
If you wonder why the microservice style is so popular, think about recent IT trends. The demand for Agile and DevOps practices led to the popularity of microservices. Today such major players as Uber, Airbnb, and Netflix use microservices to solve their business problems.
The best way to explain what microservice design means is to compare it with a common monolithic app. A monolithic system uses one process for all the logic. Meanwhile, a microservice system includes several separate processes. They usually are:
In a monolith, any change in the system leads to the deployment of a new version of the server part of the system. Let’s consider the concept in detail.
Microservice Design In Detail
Microservice design means building an app as a set of services, but the definition is vague. I can single out 4 features that a microservice usually has:
- The decentralized control of languages and data.
- Responsibility for a specific business need.
- Use of automatic deployment.
- The presence of endpoints.
In the picture below you can see microservice design compared to a monolith app.
Scalability in Microservices
One of the main benefits of microservice design is its scalability. You can scale individual services without changing the whole system. So, you save resources and keep the app less complex. One of the most famous cases that proves this is the Netflix user base. The company had to cope with a growing subscriber database, and microservice design was a great solution for scaling it.
Each microservice needs its own database. Otherwise, you can’t use all the benefits of the modularization pattern. The variety of databases leads to challenges in the reporting process. We will discuss the problem later.
Microservice design speeds up app development and allows us to launch the product earlier. Each part can be rolled out separately, which makes the deployment of microservices quicker and easier.
Pros of Microservices
- The possibility of convenient horizontal system scaling.
- Increased development team members' productivity.
- Simplification of the debugging and maintenance processes.
- The ability to work in smaller teams and use an Agile approach.
- Flexibility in continuous integration and deployment.
Cons of Microservices
Despite all these benefits, microservice architecture has its own drawbacks. They come from the necessity of operating many systems and completing various tasks in a distributed environment. So, the main microservice pitfalls are:
- The complexity of microservice design forces developers to plan and act more carefully.
- External API communication in a microservice architecture raises the risk of attacks.
- Sometimes it’s difficult to switch between services in the development and deployment processes.
Reporting In Microservice System
We worked on a legacy EdTech project. The system was very complex and included many microservices. Its main parts were:
- A sophisticated financial and billing system.
- Multi-organisation structure for large group entities.
- Workflow management tool for business processes.
- Integrated bulk email, SMS and live chat.
- Online system for surveys, quizzes, examination.
- Flexible assessment and learning management system.
FreshCode joined the project at the stage of migrating to a new interface. The product was being prepared for a global launch. The microservice system was supposed to process large amounts of data. As for the target audience, the app was developed for:
- Large education networks that manage hundreds of campuses.
- Governments that have up to 200k schools, colleges, and universities.
Meanwhile, the EdTech app design was convenient both for large education networks and for a small school of about 100 students.
So, the FreshCode development team faced the problem of managing and improving the performance of a complex microservice architecture. It should be mentioned that the client wanted to build both SaaS and self-hosted systems. We chose the technical solutions with this fact in mind.
Improving Performance In Microservices
The process of generating reports required engagement with different services, which caused performance issues. That’s why the FreshCode team decided to optimize the app architecture by creating a separate reporting microservice. It received data from all the databases, saved it, and transformed it into custom reports.
In the picture below you can see the scheme of the reporting microservice system and the technologies for its implementation.
Yellow marks all the microservices in the system. Each of them has its own database. The reporting module tracks all changes in them with the help of a messaging system. Then, it stores the new data in its own report database.
6 Steps Of The Microservice Implementation
Let’s look at the 6 main parts of the reporting system, the technologies that can be used for each of them, and the best solutions.
Change Data Capturing (CDC)
CDC tracks every single change (insert, update, delete) and performs some logic on it. There were 3 possible tools for the first step of implementing the microservice reporting system.
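To make the idea concrete, here is a minimal sketch of change capture in plain Python. It is an assumption-heavy illustration, not how the tools below work: real CDC tools usually read the database transaction log, while the hypothetical `TrackedTable` class here simply wraps writes and emits a `ChangeEvent` for each insert, update, and delete.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


@dataclass(frozen=True)
class ChangeEvent:
    op: str                # "insert", "update" or "delete"
    table: str
    key: Any
    before: Optional[Row]  # row state before the change (None for inserts)
    after: Optional[Row]   # row state after the change (None for deletes)


class TrackedTable:
    """Hypothetical in-memory table that emits a ChangeEvent for every write."""

    def __init__(self, name: str, on_change: Callable[[ChangeEvent], None]) -> None:
        self.name = name
        self._rows: Dict[Any, Row] = {}
        self._on_change = on_change

    def upsert(self, key: Any, row: Row) -> None:
        before = self._rows.get(key)
        self._rows[key] = dict(row)
        op = "insert" if before is None else "update"
        self._on_change(ChangeEvent(op, self.name, key, before, dict(row)))

    def delete(self, key: Any) -> None:
        before = self._rows.pop(key, None)
        if before is not None:  # deleting a missing row is not a change
            self._on_change(ChangeEvent("delete", self.name, key, before, None))


# The "logic performed on each change" is just appending to a list here.
events: List[ChangeEvent] = []
students = TrackedTable("students", events.append)
students.upsert(1, {"name": "Ann", "campus": "North"})
students.upsert(1, {"name": "Ann", "campus": "South"})
students.delete(1)
```

Each event carries both the old and the new row state, so a downstream consumer can rebuild the current data or keep the full history.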
1. Apache NiFi
It allows creating a simple CDC pipeline without any coding. Apache NiFi has a lot of built-in processors and supports data routing, transformation, and system mediation logic.
- Support of cluster mode and easy scaling.
- Built-in processors for publishing to Kafka and Kinesis (PublishKafka, PutKinesisStream).
- Implementation of custom processors in any JVM language.
- User-friendly UI.
- No predefined data format for messaging between processors.
- Supports only JVM languages.
- The quality of the default processors isn’t perfect.
- No Oracle CDC processor.
2. StreamSets Data Collector
A popular open-source solution for continuous big data ingestion in a microservice reporting system. Its main advantages are the simple creation of data pipelines and the support of many widespread technologies.
- Built-in AWS S3, Kinesis, Kafka, Oracle, Postgres processors.
- Open-source software can be adjusted for your needs.
- Simple and convenient UI.
- Support of most of the popular tools.
- It’s a new solution that is still actively developing.
- It’s a little bit difficult to start working with StreamSets Data Collector.
The innovative ELT architecture has an easy-to-use interface. It is built specifically for Amazon Redshift, Google BigQuery and Snowflake.
- A proprietary tool.
- Support of the development team.
- Well-tested solution.
- Only a few databases can be used with this tool.
- ELT architecture doesn’t suit every project.
Oracle was the main database of our microservice reporting system, so we chose StreamSets Data Collector because it supports Oracle CDC out of the box.
Messaging System
A messaging system allows sending messages between computer systems, as well as setting publishing standards for them.
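As a rough mental model of a log-based messaging system, consider the sketch below. The `Topic` class is hypothetical and lives in memory; it only illustrates the idea that messages are appended to a log and every consumer group keeps its own read position, so the reporting module and any other consumer can read the same changes independently.

```python
from collections import defaultdict
from typing import Any, Dict, List


class Topic:
    """Hypothetical append-only log; each consumer group tracks its own offset."""

    def __init__(self) -> None:
        self._log: List[Any] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: Any) -> int:
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group: str, max_messages: int = 10) -> List[Any]:
        """Return the next unread messages for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


changes = Topic()
changes.publish({"service": "billing", "op": "insert", "id": 7})
changes.publish({"service": "surveys", "op": "update", "id": 3})

reporting_batch = changes.poll("reporting")  # the reporting module reads both changes
audit_batch = changes.poll("audit")          # another group reads them independently
```

A second `changes.poll("reporting")` returns an empty list, because that group has already consumed everything published so far.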
1. Apache Kafka
One of the most famous tools for real-time analytics. Apache Kafka has high throughput and reliability characteristics.
- High throughput, fault tolerance, durability.
- Great scalability, high concurrency.
- Batch mode, native computation over streams.
- A great choice for the on-premise microservice reporting system.
- Requires DevOps knowledge for the correct setup.
- No built-in monitoring tool.
2. AWS Kinesis
It simplifies collecting, processing, and analyzing streaming data. Amazon Kinesis offers key capabilities for cost-effective processing of streaming data at any scale.
- Easy to manage and scale.
- Great integration with other AWS services.
- Almost no DevOps effort.
- Built-in monitoring and alert system.
- It needs some cost optimizations.
- It can’t be used for on-premise software.
Although Apache Kafka required a bit more effort to deploy and set up, we used it as a cost-efficient on-premise solution.
Streaming Computation Systems
A streaming computation system is a high-performance system that analyzes multiple data streams from many sources. It helps to prepare data before ingestion, so it’s possible to denormalize or join the streams and add any extra info if needed.
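The denormalize-and-enrich step can be sketched with a plain Python generator. This is only an illustration under assumptions: the event fields (`student_id`, `campus_id`) and the `enrich_enrollments` function are invented for the example, and a real streaming engine would also handle state, windows, and failures.

```python
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def enrich_enrollments(events: Iterable[Event], campuses: Dict[int, Event]) -> Iterator[Event]:
    """Join each enrollment event with its campus record and flatten the result."""
    for event in events:
        campus = campuses.get(event["campus_id"], {})
        yield {
            **event,
            "campus_name": campus.get("name", "unknown"),
            "region": campus.get("region", "unknown"),
        }


# Hypothetical lookup data and input stream.
campuses = {10: {"name": "North", "region": "EU"}}
stream = [
    {"student_id": 1, "campus_id": 10},
    {"student_id": 2, "campus_id": 99},  # no matching campus record
]
enriched = list(enrich_enrollments(stream, campuses))
```

After this step every event already contains the campus name and region, so the report database doesn’t need to join them at query time.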
1. Spark Streaming
It brings Apache Spark’s language-integrated API to stream processing, so it allows writing streaming jobs the same way we write batch jobs.
- Stateful exactly-once semantics out of the box.
- Fault-tolerance, scalability.
- In-memory computation.
- Pretty expensive to use.
- Manual optimization.
- No built-in state management.
2. Apache Flink
It is useful for stateful computations over unbounded and bounded data streams. Apache Flink runs in all common cluster environments and performs computations at in-memory speed.
- Exactly once state consistency.
- SQL on Stream & Batch Data.
- Low latency, scalability, fault-tolerance.
- Support of a very large state.
- It requires high programming skills.
- It has a complicated architecture.
- The Flink community is smaller than Spark’s, but it is growing.
3. Apache Samza
The scalable data processing engine for real-time analytics that can be used in a microservice reporting system.
- It can maintain a large state.
- Low latency, high throughput, mature and tested at scale.
- Fault-tolerant and high performance.
- At-least-once processing guarantee.
- Lack of advanced streaming features (watermarks, sessions, triggers).
4. AWS Kinesis Services
The set of tools includes Data Firehose, Data Analytics, and Data Streams. As a result, it helps to build powerful stream processing without implementing any custom code.
- Pay only for what you use.
- The easiest way to process data streams in real-time with SQL.
- Handle any amount of streaming data.
- No way to use on-premise.
- The cost in a high-load environment will be higher compared to other solutions, but development and maintenance costs may be less.
- Complicated to customize.
AWS provides a great set of tools for ETL and data processing. It’s a good starting point. But there is no way to deploy it on custom servers, which is why it doesn’t fit on-premise solutions.
Apache Flink is the most feature-rich and performant solution. It allows storing a large application state (multi-terabyte). But it requires more developers to be involved, and you have to deploy it yourself.
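The streaming tools above differ in their processing guarantees (exactly-once versus at-least-once). With at-least-once delivery, the same message may arrive twice, so the consumer has to tolerate duplicates. Here is a minimal sketch of one common way to do that, an idempotent consumer that remembers processed message IDs. The `IdempotentConsumer` class and the message layout are assumptions for illustration; a real system would have to persist the seen IDs together with the state.

```python
from typing import Dict, List, Set, Tuple

Message = Tuple[str, str, int]  # (message_id, account, amount)


class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered several times."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = {}
        self._seen: Set[str] = set()

    def handle(self, message: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        message_id, account, amount = message
        if message_id in self._seen:
            return False
        self.totals[account] = self.totals.get(account, 0) + amount
        self._seen.add(message_id)
        return True


consumer = IdempotentConsumer()
delivered: List[Message] = [
    ("m1", "acc-1", 100),
    ("m2", "acc-1", 50),
    ("m1", "acc-1", 100),  # redelivery of m1 after a simulated retry
]
applied = [consumer.handle(m) for m in delivered]
```

The redelivered message is skipped, so the account total stays at 150 instead of 250.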
Data Lake
A data lake is the central repository of integrated data from one or more disparate sources. It stores current and historical data in one single place, so we can use it for creating analytical reports, machine learning, etc.
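One possible way to keep both current and historical data in an object store is to write every change event under a key partitioned by source, table, and date, and never overwrite existing objects. The sketch below assumes this layout; the key scheme, the `store_event` function, and the in-memory `lake` dictionary standing in for a bucket are all hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def object_key(source: str, table: str, event_time: datetime, event_id: str) -> str:
    """Build a date-partitioned key so history can be scanned by day."""
    utc = event_time.astimezone(timezone.utc)
    return (
        f"raw/{source}/{table}/"
        f"year={utc:%Y}/month={utc:%m}/day={utc:%d}/{event_id}.json"
    )


# A dict stands in for the bucket: key -> object body.
lake: Dict[str, bytes] = {}


def store_event(source: str, table: str, event_time: datetime,
                event_id: str, payload: Dict[str, Any]) -> str:
    """Write the event once; existing objects are never overwritten."""
    key = object_key(source, table, event_time, event_id)
    lake.setdefault(key, json.dumps(payload, sort_keys=True).encode("utf-8"))
    return key


key = store_event(
    "billing", "invoices",
    datetime(2020, 3, 5, 23, 30, tzinfo=timezone.utc),
    "evt-1", {"invoice_id": 7, "amount": 120},
)
```

With such keys, a report job can read a single day or a single source by listing a key prefix instead of scanning the whole store.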
1. AWS S3
The object storage service offers industry-leading scalability, data availability, security and performance.
- Easy to integrate with other AWS services.
- Designed for 99.999999999% (11 9’s) of data durability.
- Cost-effective for rarely accessed data.
- Has an open source implementation with full API support.
- High network pricing.
- S3 has had availability issues in the past, but that isn’t a problem for a data lake.
2. Apache Hadoop
HDFS, the Hadoop Distributed File System, is the primary data storage system used by Hadoop applications. It allows storing and processing large amounts of data.
- Efficiently works with huge amounts of data.
- Integration with many analytical and operational tools (Impala, Hive, HBase, etc).
- Complicated to deploy and manage.
- Monitoring and high availability have to be set up separately.
We decided to start with AWS S3. It has an open-source implementation, which is why we could integrate it into the on-premise microservice reporting system.
Report Database
1. AWS Aurora
According to AWS, it is up to 5 times faster than standard MySQL databases and up to 3 times faster than standard PostgreSQL databases.
- Pretty fast SQL database.
- High Availability and Durability.
- Fully Managed.
- Easy to scale.
- Poor performance for analytical reports in big data projects.
- The smallest available instance is too big, but it can easily be replaced with plain PostgreSQL.
2. AWS Redshift
AWS claims that Redshift delivers 10 times faster performance than other data warehouses. It uses machine learning, massively parallel query execution, and columnar storage on high-performance disks.
- May run queries on external S3 files.
- Easy to set up, use and manage.
- Columnar storage.
- It doesn’t enforce uniqueness.
- It can’t be used as a live app database.
- It’s mostly useful for running aggregations on large amounts of data.
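To see why columnar storage suits aggregation, compare the two layouts in this small sketch. It is a toy illustration in plain Python, not a description of Redshift internals: the sample records and the `to_columns` helper are assumptions.

```python
from typing import Any, Dict, List

# Row-oriented layout: each record keeps all of its fields together.
rows: List[Dict[str, Any]] = [
    {"student_id": 1, "campus": "North", "fee": 1200},
    {"student_id": 2, "campus": "South", "fee": 900},
    {"student_id": 3, "campus": "North", "fee": 1500},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot records into a column-oriented layout: one list per field."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# The row layout touches every whole record to sum a single field.
total_from_rows = sum(row["fee"] for row in rows)
# The column layout reads only the one column the aggregate needs.
total_from_columns = sum(columns["fee"])
```

Both totals are equal, but the columnar version reads one contiguous list and ignores the other fields, which is what makes aggregations over wide tables cheaper.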
3. Kinetica
A vectorized, columnar, memory-first database designed for analytical (OLAP) workloads. Kinetica automatically distributes any workload across CPUs and GPUs for optimal results.
- Fast aggregation performance; runs on both GPU and CPU.
- Supports materialized join views and can update them incrementally.
- GPU instances still cost a lot.
- No way to join data between different partitions.
4. Apache Druid
It generally works well with any event-oriented, clickstream, time series, or telemetry data, especially streaming datasets from Apache Kafka. Druid provides exactly once consumption semantics from Apache Kafka and is commonly used as a sink for event-oriented Kafka topics.
- Druid can be deployed in any *NIX environment on commodity hardware.
- Best for interactive dashboards with full drill-down capabilities.
- Stores only pre-aggregated data.
- Isn’t perfect for custom reports that may be built by users.
- Works only on time series data.
- No full join support.
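The pre-aggregation trade-off mentioned in the list above can be sketched as follows. This is a simplified illustration of ingestion-time aggregation (often called rollup), with an assumed event layout and a hypothetical `rollup` function; it is not Druid code.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str, float]  # (epoch_seconds, page, duration)


def rollup(events: Iterable[Event], bucket_seconds: int = 3600) -> Dict[Tuple[int, str], Dict[str, float]]:
    """Aggregate raw events into (time bucket, page) cells at ingestion time."""
    buckets: Dict[Tuple[int, str], Dict[str, float]] = {}
    for ts, page, duration in events:
        bucket_start = ts - ts % bucket_seconds  # truncate to the start of the hour
        cell = buckets.setdefault((bucket_start, page), {"count": 0, "total": 0.0})
        cell["count"] += 1
        cell["total"] += duration
    return buckets


events = [
    (3600, "quiz", 2.0),
    (3700, "quiz", 4.0),
    (7300, "quiz", 1.0),
    (3650, "chat", 5.0),
]
hourly = rollup(events)
```

Four raw events become three aggregated cells. Dashboards over hours and pages get faster, but the individual events are gone, which is why user-defined reports on arbitrary fields are harder to build on top of such data.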
All of these databases are amazing. But our client’s goal was to create reports based on all data from all microservices. So, the development team considered AWS Aurora the best choice for this task. It simplified the workflow a lot.
The reporting microservice was responsible for storing information about data objects and the relations between them. It was also responsible for managing security and for generating the reports themselves, since these reports were based on the chosen data objects.
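One way to picture report generation from chosen data objects is a small metadata-driven query builder. The sketch below is an assumption, not the project’s actual code (which was written in NodeJS): the `DATA_OBJECTS` and `RELATIONS` registries, the table names, and `build_report_sql` are invented for illustration, and SQLite stands in for the report database.

```python
import sqlite3
from typing import Any, Dict, List, Tuple

Field = Tuple[str, str]  # (data object, column)

# Hypothetical metadata: data objects with their columns, and relations between them.
DATA_OBJECTS: Dict[str, List[str]] = {
    "students": ["id", "name", "campus_id"],
    "campuses": ["id", "title"],
}
RELATIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("students", "campuses"): ("campus_id", "id"),
}


def build_report_sql(base: str, joined: str, fields: List[Field],
                     filters: Dict[Field, Any]) -> Tuple[str, List[Any]]:
    """Build a parameterized query from chosen fields, one relation, and filters."""
    # Only identifiers registered in the metadata may appear in the SQL text.
    for obj, col in list(fields) + list(filters):
        if col not in DATA_OBJECTS.get(obj, []):
            raise ValueError(f"unknown field: {obj}.{col}")
    left_col, right_col = RELATIONS[(base, joined)]
    select = ", ".join(f"{obj}.{col}" for obj, col in fields)
    sql = (f"SELECT {select} FROM {base} "
           f"JOIN {joined} ON {base}.{left_col} = {joined}.{right_col}")
    params: List[Any] = []
    if filters:
        sql += " WHERE " + " AND ".join(f"{obj}.{col} = ?" for obj, col in filters)
        params = list(filters.values())
    sql += f" ORDER BY {fields[0][0]}.{fields[0][1]}"
    return sql, params


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campuses (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, campus_id INTEGER);
INSERT INTO campuses VALUES (1, 'North'), (2, 'South');
INSERT INTO students VALUES (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cy', 1);
""")

sql, params = build_report_sql(
    "students", "campuses",
    fields=[("students", "name"), ("campuses", "title")],
    filters={("campuses", "title"): "North"},
)
report = conn.execute(sql, params).fetchall()
```

Filter values are passed as query parameters, and field names are checked against the metadata, so a user-built report can’t inject arbitrary SQL.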
We prepared two variants of the technological stack for the microservice reporting system. As for the SaaS product on AWS, we used:
- StreamSets for CDC.
- Apache Kafka as a messaging system.
- AWS S3 as a data lake.
- AWS Aurora as a report database.
- Amazon ElastiCache as an in-memory data store.
The reporting microservice was written in Node.js. You can see rough estimates for the SaaS solution in the table below.
Such infrastructure was the most appropriate for the client’s requirements. Its main advantage was how easily AWS services could be replaced with self-hosted solutions. It allowed us to avoid code/logic duplication for different deployment schemas.
For the on-premise version we used Minio, PostgreSQL, and Redis in place of S3, Aurora, and ElastiCache respectively. Their APIs were fully compatible, so we didn’t have any significant problems in the microservice reporting system at all.
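Because the self-hosted replacements speak the same protocols, the difference between the two deployments can live in configuration only. Here is a minimal sketch of that idea. All hostnames, the `Backends` dataclass, and the profile names are hypothetical, and the SaaS profile assumes the PostgreSQL-compatible edition of Aurora and the Redis flavor of ElastiCache.

```python
from dataclasses import dataclass
from typing import Dict
from urllib.parse import urlparse


@dataclass(frozen=True)
class Backends:
    object_store_endpoint: str  # S3-compatible API endpoint
    report_database_url: str    # PostgreSQL wire protocol
    cache_url: str              # Redis protocol


# Hypothetical endpoints: only the addresses differ between the profiles.
PROFILES: Dict[str, Backends] = {
    "saas": Backends(
        object_store_endpoint="https://s3.example.com",
        report_database_url="postgresql://aurora.example.com:5432/reports",
        cache_url="redis://elasticache.example.com:6379/0",
    ),
    "on_premise": Backends(
        object_store_endpoint="http://minio.internal.example:9000",
        report_database_url="postgresql://postgres.internal.example:5432/reports",
        cache_url="redis://redis.internal.example:6379/0",
    ),
}


def load_backends(profile: str) -> Backends:
    """Pick connection settings for a deployment; the calling code stays the same."""
    try:
        return PROFILES[profile]
    except KeyError:
        raise ValueError(f"unknown deployment profile: {profile!r}") from None


saas = load_backends("saas")
on_premise = load_backends("on_premise")

# Both profiles use the same protocols, so one client code path serves both.
same_protocols = (
    urlparse(saas.report_database_url).scheme == urlparse(on_premise.report_database_url).scheme
    and urlparse(saas.cache_url).scheme == urlparse(on_premise.cache_url).scheme
)
```

The application code depends only on the `Backends` settings, so switching from SaaS to on-premise doesn’t require a second code path.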
Our team solved the client’s technical challenges. The reporting microservice module was effective and convenient. It was capable of:
- Generating clear and convenient reports.
- Providing many standard reporting templates.
- Adding a large number of filters.
- Customizing report interface.
As a result, the FreshCode client improved the microservice reporting system and achieved these goals:
- To update the app’s architecture and design.
- To improve the product by adding new features.
- To optimize performance, increase flexibility and scalability.
If you are interested in solving the same problem or have any other technical challenges, contact our team. We provide free expert advice for startups, small businesses, and enterprises. Check the FreshCode portfolio to find out other interesting projects.
Would you like to read more case-based articles? Let me know in the comments below and stay in touch!
Published at DZone with permission of Artem Barmin. See the original article here.