Apache Kafka's Code Under the Scanner
We ran Kafka through the Embold static code analysis tool to see the results.
Apache Kafka is an open-source distributed event streaming platform built for data-driven applications that need to handle data in real time. LinkedIn open-sourced Kafka in 2011. Thousands of companies, including Airbnb, Netflix, and LinkedIn, use it to process real-time data. Kafka provides several APIs for processing data streams in real time with low latency and high throughput. With it, applications can publish (write), subscribe to (read), store, and process streams of events according to their use case. It communicates using a binary protocol over TCP. Because Kafka is open source under the Apache License 2.0, we can examine its code to explore its inner workings and structure, which we did with the free static code analysis tool Embold.
The results are surprisingly interesting.
This article is for educational purposes only.
After the scan, the Repository Dashboard summarises the entire codebase at a glance:
- Overall Rating System
- Lines of Code (LOC)
- Languages used in the repo, with Java the most used
- Code issues
Overall Rating System
The Overall Rating System is a numerical representation of the software's quality that factors in vulnerabilities, code issues, anti-patterns, and duplication blocks. Kafka's overall rating is 2.63 out of 5, a fairly good value on the scale of -5 to 5. The rating summarises the entire system in a single metric.
Code Issues are errors or warnings that might lead to the software crashing. It is always better to follow best practices while writing code. Embold reports code issues and suggests how to fix them. Below are some of the issues Embold found in Kafka's code.
- Non-Private Field Access In Synchronized Block
- Resource Leak
Explicitly throwing a NullPointerException can lead the user to assume the error was raised by the virtual machine rather than deliberately by the code. To avoid this confusion, we can throw IllegalArgumentException instead. The description of AvoidThrowingNPE is shown below. It points the user to the common best practice for resolving this issue.
Description of AvoidThrowingNPE
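As a minimal sketch of the suggestion above, the two styles can be compared side by side. The class and method names here are hypothetical and are not taken from Kafka's source.

```java
import java.util.Locale;

// Hypothetical example; TopicNames is not a Kafka class.
final class TopicNames {
    private TopicNames() {}

    // Flagged style: the NullPointerException is thrown by hand,
    // so it looks like a failure raised by the JVM itself.
    static String normalizeFlagged(String topic) {
        if (topic == null) {
            throw new NullPointerException("topic");
        }
        return topic.trim().toLowerCase(Locale.ROOT);
    }

    // Suggested style: the exception type says the caller passed a bad argument.
    static String normalize(String topic) {
        if (topic == null) {
            throw new IllegalArgumentException("topic must not be null");
        }
        return topic.trim().toLowerCase(Locale.ROOT);
    }
}
```

A caller that catches IllegalArgumentException can then distinguish a rejected argument from an accidental null dereference elsewhere in the code.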
Non-Private Field Access In Synchronized Block
This code issue falls under CWE-820: Missing Synchronization, which covers access to shared resources without proper synchronization and the insecure behavior that can result. Accessing a non-private field in a synchronized block or method might lead to only partial synchronization, because other code can reach the field without holding the lock. The description shown during code review suggests making the field private and/or final to resolve this. Since it can cause security issues, it is flagged as Critical.
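The difference can be sketched with two small counters. Both classes are hypothetical illustrations, not Kafka code, and the private lock object is one common way to apply the advice rather than the only one.

```java
// Flagged style: the field is package-visible, so other classes can read or
// write it without ever taking the lock that increment() uses.
class LeakyCounter {
    int count;

    synchronized void increment() {
        count++;
    }
}

// Suggested style: the field is private and the lock is private and final,
// so every access to count goes through a synchronized block.
final class SafeCounter {
    private final Object lock = new Object();
    private int count;

    void increment() {
        synchronized (lock) {
            count++;
        }
    }

    int get() {
        synchronized (lock) {
            return count;
        }
    }
}
```

With SafeCounter, no code outside the class can touch count directly, so the synchronization cannot be bypassed by accident.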
The Resource Leak issue falls under CWE-404 and CWE-459, both of which concern resources that are not properly released or cleaned up. Closing resources in a finally block or using try-with-resources is an effective way to make sure every resource is closed. When this issue is raised, it may indicate that a resource is not properly closed in a finally block or with try-with-resources.
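A minimal sketch of the try-with-resources form follows. The FirstLine class is a hypothetical helper written for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper; not part of Kafka.
final class FirstLine {
    private FirstLine() {}

    // The reader declared in the try header is closed automatically when the
    // block exits, whether it returns normally or throws.
    static String read(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.readLine();
        }
    }
}
```

Closing the BufferedReader also closes the Reader it wraps, so the caller does not need a separate finally block.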
The Signature Declare Throws Exception rule flags methods whose signature declares the generic Exception type. Because the type is so broad, callers cannot always determine which exception is actually thrown. It is better to declare the specific exception the method is expected to throw (for example, IOException).
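The two signatures can be contrasted in a short sketch. ConfigLoader and its methods are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Kafka.
final class ConfigLoader {
    private ConfigLoader() {}

    // Flagged style: callers must catch or rethrow the generic Exception
    // and cannot tell which failures to expect.
    static String loadFlagged(Path path) throws Exception {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Preferred style: the signature names the specific checked exception.
    static String load(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}
```

Both methods do the same work. Only the second tells the caller that an I/O failure is the one checked error to handle.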
Anti-patterns are bad programming practices that can make a system complex and difficult to maintain and scale. Finding them during development helps avoid problems later, when the application is extended with new features. Each anti-pattern is assigned a severity level (low, medium, or high) depending on its impact. Embold evaluates the design and architecture of a code base by detecting anti-patterns at the function, class, and component levels.
Let's analyze some of the common anti-patterns identified in the Apache Kafka source code. You can learn more about the anti-patterns supported by Embold here.
Test Hungry indicates components that need more unit/integration tests to cover all their execution paths. Breaking large functions into small chunks helps avoid this anti-pattern.
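As a rough illustration of splitting a function into small chunks, the sketch below parses a host:port string with one short helper per concern. The Endpoint class is hypothetical and does not appear in Kafka.

```java
// Hypothetical example; Endpoint is not a Kafka class.
final class Endpoint {
    final String host;
    final int port;

    private Endpoint(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Small orchestration method: it only splits the text and delegates.
    static Endpoint parse(String text) {
        int colon = text.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected host:port but got " + text);
        }
        return new Endpoint(parseHost(text.substring(0, colon)),
                            parsePort(text.substring(colon + 1)));
    }

    // Each helper has few execution paths, so a handful of unit tests covers it.
    static String parseHost(String raw) {
        String host = raw.trim();
        if (host.isEmpty()) {
            throw new IllegalArgumentException("host must not be empty");
        }
        return host;
    }

    static int parsePort(String raw) {
        int port;
        try {
            port = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("port is not a number: " + raw, e);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return port;
    }
}
```

Because parseHost and parsePort can be tested on their own, full path coverage needs far fewer test cases than one long method with the same branches nested together.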
Shotgun Surgery shows that a class or its methods have high incoming coupling, i.e. many other classes depend on this class for their operation. In Sensor.java, the incoming dependency count sits at nearly 450, which means the Sensor class or its methods are called from nearly 450 modules/components.
Incoming Dependency Graph of Sensor class
Message Chain denotes outgoing coupling: a method depends on methods in other modules/components.
In MirrorCheckpointConnector.java, the flagged method calls other methods frequently, i.e. it has high outgoing coupling. This can be avoided by moving the method to the class that contains most of the data the method uses.
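The sketch below shows the general shape of a message chain and of the refactoring described above, using the classic form of the smell (a caller navigating through several objects). The classes are hypothetical, and whether Embold flags exactly this shape is an assumption.

```java
// Hypothetical domain classes; none of these come from Kafka.
final class Address {
    private final String city;
    Address(String city) { this.city = city; }
    String city() { return city; }
}

final class Customer {
    private final Address address;
    Customer(Address address) { this.address = address; }
    Address address() { return address; }
    // Delegate so callers need not know that a Customer holds an Address.
    String city() { return address.city(); }
}

final class Order {
    private final Customer customer;
    Order(Customer customer) { this.customer = customer; }
    Customer customer() { return customer; }
    // The logic now lives next to the data it uses.
    String shippingCity() { return customer.city(); }
}

final class ShippingLabel {
    private ShippingLabel() {}

    // Chained style: this caller is coupled to Order, Customer, and Address.
    static String chained(Order order) {
        return "Ship to " + order.customer().address().city();
    }

    // Refactored style: this caller is coupled to Order only.
    static String direct(Order order) {
        return "Ship to " + order.shippingCity();
    }
}
```

Both methods return the same label, but a change to how Customer stores its address would break only the chained version.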
Metrics include the factors that are considered for rating the Module/Component. The rating system ranges from -5 to 5.
Let's take the metrics of OffsetFetchResponse.java and analyze its implications.
Metrics of OffsetFetchResponse.java
The metrics of a component/module include factors such as LOC, number of methods, and complexity, along with the following:
- Coupling Between Objects - includes both incoming and outgoing coupling between objects.
- Depth of Inheritance Hierarchy - indicates the inheritance relationship between the modules/components.
- Lack of Cohesion of Methods - indicates how little cohesion there is between the methods.
- Access to Foreign Data (ATFD) - the number of external classes whose data a given class accesses, either directly through data members or via accessor methods.
- Executable LOC - executable statements, excluding comments and empty lines.
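To make one of these metrics concrete, the sketch below computes a depth of inheritance by walking superclass links with reflection. Counting java.lang.Object as depth 0 is an assumption made here; Embold's exact counting convention may differ.

```java
// Hypothetical helper; the counting convention is an assumption.
final class InheritanceDepth {
    private InheritanceDepth() {}

    // Counts superclass links from the given type up to java.lang.Object,
    // so Object itself (and any interface) has depth 0.
    static int of(Class<?> type) {
        int depth = 0;
        for (Class<?> c = type.getSuperclass(); c != null; c = c.getSuperclass()) {
            depth++;
        }
        return depth;
    }
}
```

For java.util.ArrayList, the walk visits AbstractList, AbstractCollection, and Object, so the method returns 3.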
Code duplication can make a code base highly unmaintainable, which is why Embold treats it as a separate issue. Ideally, each piece of functionality should be the single responsibility of one class. If duplication appears in many parts of the code, locating and fixing an error becomes incredibly complex, due to:
- Bloating of the system, with many duplicated blocks of code
- Evolution of clones, which multiplies the errors they carry
Duplication across the entire codebase is 4.97%, which is quite low. Kafka seems to follow the DRY principle almost perfectly. The remaining duplication can be reduced by moving each duplicated block into a function and calling that function from all the affected locations.
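A minimal sketch of that refactoring is shown below. The Greeting class is hypothetical; it stands in for any two methods that used to repeat the same validation block.

```java
// Hypothetical example; Greeting is not a Kafka class.
final class Greeting {
    private Greeting() {}

    // The block that was previously copied into both public methods
    // now lives in a single helper.
    private static String clean(String name) {
        if (name == null || name.trim().isEmpty()) {
            throw new IllegalArgumentException("name must not be blank");
        }
        return name.trim();
    }

    static String hello(String name) {
        return "Hello, " + clean(name);
    }

    static String goodbye(String name) {
        return "Goodbye, " + clean(name);
    }
}
```

A bug in the validation now has to be fixed in one place only, rather than in every copy.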
A heat map represents data through color-coding, with different colors standing for different values. In Embold, the data points are components. Using the heat map, we can find individual components and their ratings, which are based on dependencies, issues, and duplication, with -5 as the lowest rating and 5 as the highest.
HeatMap of Apache Kafka
The module dependency bubble graph shows the dependencies between the modules of the entire code base. In the diagram below, we can see that the clients module acts as a hub for some modules, meaning that other modules depend on it (an incoming arrow indicates a dependency on the module).
bubble graph of module
The nodes are modules, and each edge carries a number giving the count of dependencies between the two modules it connects. The graph above rates module dependency, with 5 meaning highly decoupled code and values nearer to -5 meaning highly coupled code.
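The idea of a hub can be sketched by counting incoming edges in a small dependency map. Apart from the name clients, the modules and edges below are invented placeholders and do not reflect Kafka's real dependency data. The sketch also ignores edge weights and counts only which modules depend on which.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper with made-up sample data.
final class ModuleGraph {
    private ModuleGraph() {}

    // Maps each module to the modules it depends on (outgoing edges).
    static Map<String, List<String>> sample() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("moduleA", Arrays.asList("clients"));
        deps.put("moduleB", Arrays.asList("clients", "moduleA"));
        deps.put("moduleC", Arrays.asList("clients"));
        deps.put("clients", Collections.<String>emptyList());
        return deps;
    }

    // Counts, for every module, how many other modules depend on it (incoming edges).
    static Map<String, Integer> incomingCounts(Map<String, List<String>> deps) {
        Map<String, Integer> incoming = new LinkedHashMap<>();
        for (String module : deps.keySet()) {
            incoming.put(module, 0);
        }
        for (List<String> targets : deps.values()) {
            for (String target : targets) {
                incoming.merge(target, 1, Integer::sum);
            }
        }
        return incoming;
    }
}
```

In this sample, clients has the highest incoming count, which is what makes it appear as a hub in a graph of this kind.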
KPI of Apache Kafka
Key Performance Indicators (KPIs) assess how well the software meets its end goal by providing insights into its security, efficiency, robustness, and so on. They help in deciding where to focus in order to get a secure and scalable application, and they let us monitor and measure the efficiency of the system. All KPIs are estimated from the code issues, anti-patterns, security findings, duplication, and so on. The KPI summary shows that:
- Efficiency is fairly low compared with the other indicators. It can be improved by removing duplicated code, fixing the reported issues, and following best practices.
- Robustness comes next. It indicates how fault-tolerant the system is, i.e. how well it handles faults without crashing entirely. The lower score might be due to issues such as null pointers, throwing generic exceptions, or not handling exceptions properly.
You can learn more about KPIs here.
Scanning a repository like Kafka with a static analysis tool can help you take your coding to the next level. Good coding is an art that can be honed and improved. Most importantly, have fun with it!
If you enjoyed this post or want to give feedback, we'd love to hear your thoughts in the comments!