
Analyzing Codebases: Part I


With a few tools and some basic analysis, we can start sniffing around a huge codebase and quickly get to the places where we suspect most of the traffic is going.


In Taking Over a Project: Part I, we discussed the tasks necessary when assessing and taking over existing projects: producing an easy-to-maintain project and capturing its important aspects through valuable communication with the previous project owners.

In this post, we are going to learn how to use Git together with a few more powerful tools that can provide invaluable information, possibly more than the previous project owners can. We'll take code metrics from the Kubernetes open-source project, which will serve as our test case for measuring code complexity and finding hotspots.

Git is a powerful tool for analyzing existing codebases. While we can learn a lot from the codebase analysis tools that already exist on GitHub, there are many additional utilities we can run with Git to identify interesting aspects of our code and its history.

In Cyclomatic Complexity and Lines of Code: Empirical Evidence of a Stable Linear Relationship, Graylin Jay, Joanne E. Hale, Randy K. Smith, David Hale, Nicholas A. Kraft, and Charles Ward study the relationship between simple code measurements such as lines of code (LOC) and more elaborate complexity metrics such as cyclomatic complexity (CC). They carried out a large empirical study of the relationship between LOC and CC for a sample population that crossed languages, methodologies, and programming paradigms, and found that the linearity of the relationship between the two measurements has been severely underestimated. In other words, LOC, despite being an extremely simple metric, is an excellent measurement for detecting code complexity.
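Because LOC tracks cyclomatic complexity so well, even the crudest line count is a useful first-pass complexity probe. A minimal sketch over toy files (not real Kubernetes sources), showing the same signal cloc reports in a more refined form:

```shell
# Generate two toy "source files" of different sizes, then rank them by
# line count, largest first (dropping wc's "total" summary line).
mkdir -p demo-src
printf 'x := 1\n%.0s' $(seq 1 120) > demo-src/kubelet_demo.go
printf 'x := 1\n%.0s' $(seq 1 30)  > demo-src/helpers_demo.go

wc -l demo-src/*.go | grep -v ' total$' | sort -rn
```

The file with the most lines floats to the top; on a real project, that ranking alone is often enough to shortlist complexity suspects.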

We are going to utilize forensic psychology techniques adapted to software development by Adam Tornhill in his book, Your Code as a Crime Scene: Use Forensic Techniques to Arrest Defects, Bottlenecks, and Bad Design in Your Programs. Specifically, in this post, we'll focus on tool setup and very basic complexity measurements that will allow a better analysis of our code.

Figure 1. The linear relationship between LOC and CC, based on empirical evidence from papers such as "Cyclomatic Complexity and Lines of Code: Empirical Evidence of a Stable Linear Relationship."

Analysis Plan

  1. Clone Kubernetes, Code Maat and its scripts, and cloc.

  2. Run LOC complexity metrics.

  3. Run code change frequency metrics.

  4. Integrate LOC and code frequency metrics for global complexity.

  5. Run module coupling metrics.

  6. Identify hotspots.

  7. Plot clustering and coupling.

The Analysis

I don’t like installing stuff and cluttering my development laptop, so I’m going to use Docker whenever possible to keep my machine clean for these experiments.

Let’s start by cloning Code Maat and compiling it into a standalone JAR:

$ git clone https://github.com/adamtornhill/code-maat.git
$ cd code-maat
$ docker run -it --rm -v "$PWD":/usr/src/code-maat -w /usr/src/code-maat clojure lein uberjar
$ java -jar target/code-maat-1.0-SNAPSHOT-standalone.jar

We are using Docker’s clojure image to compile the Code Maat JAR, avoiding a local installation of Clojure and its build tool (lein).

Let’s run a Code-Maat generated JAR:

java -jar target/code-maat-1.0-SNAPSHOT-standalone.jar 
Invalid argument:  Invalid --version-control specified: Supported options are: svn, git, git2, hg, p4, or tfs.
This is Code Maat, a program used to collect statistics from a VCS.
Version: 1.0-SNAPSHOT

Usage: program-name -l log-file [options]

  -l, --log LOG                                         Log file with input data
  -c, --version-control VCS                             Input vcs module type: supports svn, git, git2, hg, p4, or tfs
  -a, --analysis ANALYSIS                      authors  The analysis to run (abs-churn, age, author-churn, authors, communication, coupling, entity-churn, entity-effort, entity-ownership, fragmentation, identity, main-dev, main-dev-by-revs, messages, refactoring-main-dev, revisions, soc, summary)

We have successfully compiled a Clojure project without having Clojure installed locally. Yay!

Now let’s clone Kubernetes sources:

$ git clone https://github.com/kubernetes/kubernetes
Cloning into 'kubernetes'...
remote: Counting objects: 393070, done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 393070 (delta 0), reused 0 (delta 0), pack-reused 393065
Receiving objects: 100% (393070/393070), 358.03 MiB | 288.00 KiB/s, done.
Resolving deltas: 100% (257040/257040), done.
Checking connectivity... done.
Checking out files: 100% (11962/11962), done.

To feed Code Maat, we first produce a formatted log of the Kubernetes Git history (the format below matches the excerpt that follows):

$ cd kubernetes
$ git log --pretty=format:'[%h] %an %ad %s' --date=short --numstat --after=2016-01-01 > k8s-git.log

Let's view the head of this file:

$ head k8s-git.log 
[83a77fa] Kubernetes Submit Queue 2016-12-12 Merge pull request #38299 from kargakis/calculate-unavailable-correctly
[b2047ad] Kubernetes Submit Queue 2016-12-12 Merge pull request #38608 from wojtek-t/logrotate_in_kubemark
[9439453] Wojciech Tyczynski 2016-12-12 Increase single logfile size in kubemark
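Each log line is `[hash] author date subject`, with numstat lines for changed files interleaved. As a taste of what Code Maat's authors analysis automates, here is a toy tally of commits per author over a sample of that format (sample data, not the real log):

```shell
cat > sample-git.log <<'EOF'
[83a77fa] Kubernetes Submit Queue 2016-12-12 Merge pull request #38299
[b2047ad] Kubernetes Submit Queue 2016-12-12 Merge pull request #38608
[9439453] Wojciech Tyczynski 2016-12-12 Increase single logfile size in kubemark
EOF

# Drop the hash, keep everything up to the date field (the author name),
# then count commits per author, busiest first.
awk '{ $1 = ""
       sub(/ [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*/, "")
       sub(/^ /, "")
       print }' sample-git.log | sort | uniq -c | sort -rn
```

On these three lines, "Kubernetes Submit Queue" comes out on top with two commits, which already hints at how much merge-bot traffic a project like this carries.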

Let’s get a brief summary of the Git traffic on Kubernetes since January 1, 2016:

$ alias maat="java -jar ../code-maat/target/code-maat-1.0-SNAPSHOT-standalone.jar"
$ maat -l k8s-git.log -c git -a summary


Since January 1, 2016, 698 authors (some of which may be automated robots, such as the merge bot) have contributed to the Kubernetes project, producing 111K commits. That’s what we call a project with some development going on!

Distribution of changes across entities:
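This distribution comes from counting how many commits touched each file, which is essentially what Code Maat's revisions analysis does. A miniature version over numstat-style lines (added, deleted, path; sample data only):

```shell
# Each line mimics `git log --numstat` output: added, deleted, file path.
cat > sample-numstat.log <<'EOF'
3 1 pkg/kubelet/kubelet.go
10 2 pkg/kubelet/kubelet.go
5 5 pkg/api/types.go
1 1 pkg/kubelet/kubelet.go
EOF

# Count how many commits touched each file: the per-entity revision count.
awk '{ print $3 }' sample-numstat.log | sort | uniq -c | sort -rn
```

The file touched by the most commits tops the list, giving us the change-frequency half of the hotspot equation.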


Kubelet is the module with the highest developer activity: 330 changes. That's not too surprising, as Kubelet is indeed one of the core components of Kubernetes; however, it may raise some questions about its stability, since the relative number of commits is a surprisingly good predictor of defects and design issues. Looking at the source code of kubelet.go, we find that the last update was 16 minutes ago!

It’s 2,185 LOC. That's quite a lot!
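This is exactly why kubelet.go qualifies as a hotspot candidate: it is both large and frequently changed. A sketch of combining the two signals into a crude hotspot score (all numbers besides kubelet.go's rough figures from this post are made up):

```shell
# Per-file revision counts (e.g. from Code Maat) and LOC (e.g. from cloc).
cat > revs.csv <<'EOF'
entity,n-revs
docs/README.md,15
pkg/api/types.go,120
pkg/kubelet/kubelet.go,330
EOF
cat > loc.csv <<'EOF'
entity,loc
docs/README.md,120
pkg/api/types.go,4100
pkg/kubelet/kubelet.go,2185
EOF

# Join on file name and score each file as revisions * LOC; big files that
# change often float to the top.
tail -n +2 revs.csv | sort > revs.sorted
tail -n +2 loc.csv  | sort > loc.sorted
join -t, revs.sorted loc.sorted \
  | awk -F, '{ printf "%s,%d\n", $1, $2 * $3 }' \
  | sort -t, -k2,2 -rn > hotspots.csv
cat hotspots.csv
```

With these inputs, kubelet.go (330 × 2,185 = 721,050) outranks a bigger but calmer file, which is the whole point of crossing the two metrics.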

Digging deeper inside its source code, we see this:

// Kubelet is the main kubelet implementation.
type Kubelet struct {
    kubeletConfiguration componentconfig.KubeletConfiguration

    hostname      string
    nodeName      types.NodeName
    dockerClient  dockertools.DockerInterface
    runtimeCache  kubecontainer.RuntimeCache
    kubeClient    clientset.Interface
    iptClient     utilipt.Interface
    rootDirectory string
    // ...
}

So, Kubelet has plenty of direct dependencies, including a conceptual one on Docker (the dockerClient field). That's not a perfect sign; I would expect some dependency injection at least! Then again, this is a rather big file, and Kubelet is one of the core components, so obviously a lot is going on here.

Let’s look at the most recent change to kubelet.go, from 17 hours ago:

A rename from release_1_5 to clientset.

Note: According to the docs, Kubernetes used to keep multiple releases in the main repo; now that they use client-go for versioning, they no longer need to write the revision explicitly, so this change is a refactor and cleanup.

In the same way, we see that:

2 days ago: refactor/cleanup
2 days ago: refactor/cleanup
6 days ago: logic (optimisation: reduce max amount of time for container runtime to come up from 5 minutes to 30 seconds. Nice, it looks like things are getting faster here).
7 days ago: refactor/cleanup
8 days ago: refactor/cleanup
9 days ago: logic (optimisation, use experimental-kernel-memcg-notification to get kernel notifications about memory eviction threshold instead of polling. Nice, we are optimising).
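The change list above can be reproduced with git log scoped to a single path, e.g. `git log --date=relative --pretty=format:'%ad: %s' -- pkg/kubelet/kubelet.go` inside the checkout. A self-contained demonstration on a throwaway repository:

```shell
# Build a tiny repository with two commits, then read its history back,
# newest first, the same way we inspected kubelet.go's.
rm -rf demo-repo && mkdir demo-repo && cd demo-repo
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m 'refactor/cleanup'
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m 'optimisation: reduce container runtime wait'
git log --pretty=format:'%s'
cd ..
```

The latest commit prints first, so classifying recent changes as refactor vs. logic, as we did above, is just a matter of reading down this list.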


In this post, we have used a few tools and some very basic analysis techniques from Your Code as a Crime Scene: Use Forensic Techniques to Arrest Defects, Bottlenecks, and Bad Design in Your Programs. We can now begin sniffing around a huge codebase of 1.8M LOC and get to the places where we suspect most of the traffic is going, which means a better investment of our time: we should concentrate on those modules. We immediately see that Kubelet gets lots of traction, so even without knowing the internals of Kubernetes, we could guess that it's a main module and one we should focus on. These techniques can, of course, be applied to any project.

In the next post, we are going to dive deeper into code analysis and get a better understanding of where the hotspots are.



