Looking Under the Hood of Apache Pulsar: How Good Is the Code?
Ever wondered what's under the hood of some well-known open source software? Get a cup of your favourite beverage and read on.
Join the DZone community and get the full member experience.Join For Free
Apache Pulsar (incubating) is an enterprise-grade publish-subscribe (aka pub-sub) messaging system that was originally developed at Yahoo. Pulsar was first open-sourced in late 2016, and is now undergoing incubation under the auspices of the Apache Software Foundation. At Yahoo, Pulsar has been in production for over three years, powering major applications like Yahoo! Mail, Yahoo! Finance, Yahoo! Sports, Flickr, the Gemini Ads platform, and Sherpa, Yahoo’s distributed key-value store.
One of the primary goals of the open-source software movement is to allow all users to freely modify software, use it in new ways, integrate it into a larger project or derive something new based on the original. The larger idea being that through open collaboration, we can make a software the best version of itself. So, in the spirit of innovation and improvement, we peeked behind the curtain and ran Apache Pulsar through the free static code analysis tool Embold. The results are very interesting, both from an observer's and a learner's point of view.
1. Quality Gate: Embold's Quality gate is "closed" for Pulsar mainly due to the Code Issues and Code Duplication. The overall rating score is certainly not the best, but sits at an innocuous 1.76 (5 being the best on the scale).
2. Code Issues: And so let's move on to some of the code issues found. The critical and high Code issues count in the modules pulsar-broker and pulsar-functions are close to 150. That's a high number worth exploring. Time to do a deep-dive.
TheadLeak: This issue is fairly common across the entire codebase. To quote an occurrence, we see that in the Compactor class, shutdown() is missing for the executor service that manages the threadpool, indicating a thread resource leak. The tool recommends call the shutdown() once done to ensure there is no thread resource leaks.
Resource Leak: In the RuntimeUtils class, we see that close() on the resource rd is not guarded by a finally block. This means, that if there's an exception while reading the data, the resource will never get closed, indicating a leak.
Throwing Null Pointer Exception: In FunctionCacheManagerImpl, NullPointerException is thrown which in high likelihood will be perceived as ambiguous, as it might also be related to many other things in the system. The tool recommends throwing IllegalArgumentException instead.
NonprivatefieldaccessINSyncrhonizedBlock: This issue points to a potential synchronization issue, as a non-private field can be directly accessed from outside without the knowledge that it has to only be synchronized access.
CompileRegularExpressionsOnce: The tool found this issue to be very prevalent in the codebase. Compiling the same regex pattern at every usage can unnecessarily add to the overhead. The fix to this would be to compile it only once at the beginning.
AvoidSyncroinizedAtMethodLevel is another issue that is pretty common in the code base. Synchronizing an entire method can impact the performance, as a method generally undergoes many modifications. This can lead to the method having a section of code that need not be synchronized. The tool recommended adding synchronization at the block level, so you scope down to the only code that needs the syncronization.
In a few places, the object references are compared with ==, which is a not the recommended way. Using equals() is more accurate here.
Other issues such as Use os SysOut at many places unnecessarily takes away the points from being called a production-grade code.
Not using final for the field whose value won't change, is another low hanging fruit we found that can be addressed easily to better the production quality.
3. Code Duplication: The main culprit that stands out here is pulser-common.With a relatively high duplicate to non-duplicate code ratio, it calls for immediate correction.
4. Metrics: The metric violations are reasonably well placed, except for Number of Attributes, Number of Methods and Lack of cohesion Cohesion of method, which each account for around 20%.
5. Hotspots: Digging a little deeper, we see that given the size of the module pulsar-broker, it has a relatively high critical hotspots count compared to its peer module pulsar-common, which has a relatively lower critical hotspots count. We also see that these components are already at a high risk score for critical hotspots, with an average overall score of close to -3, which is concern that needs to be addressed at the earliest.
6. Design Issues Analysis: Quite a few components across the modules in the codebase are affected by the Global Butterfly anti-pattern. Looking at pulsar-common module, the most dominating anti-patterns are Global Butterfly and fat interface. God Class anti-pattern which sticks out at a count of 40, needs immediate attention as it can make the code very in-cohesive and bloat the technical debt.
Did you enjoy reading this review of Apache Pulsar's backend component? Please let me know in the comments if you'd like to see more such reviews of some of your favorite open-source projects. Here's a link to Embold is you would like to try it out yourself: https://embold.io/
Opinions expressed by DZone contributors are their own.