{{announcement.body}}
{{announcement.title}}

Apache Hadoop Code Quality: Production VS Test

DZone 's Guide to

Apache Hadoop Code Quality: Production VS Test

In this article, we walk through results of PVS-Studio's Static Analyzer's look at the Apache Hadoop code base.

· Big Data Zone ·
Free Resource

In order to get high-quality production code, it's not enough just to ensure maximum coverage with tests. No doubt, great results require the main project code and tests to work efficiently together. Therefore, tests have to be paid as much attention to as source code. A decent test is a key success factor, as it will catch regression in production. Let's take a look at PVS-Studio static analyzer warnings to see the importance of the fact that errors in tests are no worse than the ones in production. Today's focus: Apache Hadoop. 

About the Project

Those who were formerly interested in Big Data have probably heard or worked with the Apache Hadoop project. In a nutshell, Hadoop is a framework that can be used as a basis for building and working with Big Data systems.   

Hadoop consists of four main modules; each of them performs a specific task required for a Big Data analytics system:

  • Hadoop Common.       
  • MapReduce.        
  • Hadoop Distributed File System.       
  • YARN.

About the Check

As shown in the documentation, PVS-Studio can be integrated into a project in a variety of ways: 

  • Using the Maven plugin.      
  • Using the Gradle plugin.    
  • Using the Gradle IntellJ IDEA.     
  • Directly using the analyzer.   

Hadoop is based on the Maven build system; therefore, there were no obstacles with the check.

After I integrated the script from the documentation and edited one of the pom.xml files (there were modules in dependencies that were not available), the analysis started!

After the analysis was completed, I chose the most interesting warnings and noticed that I had the same number of warnings in production code and in tests. Normally, I don't consider analyzer warnings from tests. But when I divided them, I couldn't leave 'tests' warnings unattended. "Why not take a look at them," I thought, "because bugs in tests might also have adverse consequences." They can lead to incorrect or partial testing, or even to mishmash. 

After I selected the most intriguing warnings, I divided them by the following groups: production, test and the four main Hadoop modules. And now I'm glad to offer the review of analyzer warnings.  

You may also like: A Kafka Tutorial for Everyone, no Matter Your Stage in Development.

Production Code

Hadoop Common

V6033 An item with the same key 'KDC_BIND_ADDRESS' has already been added. MiniKdc.java(163), MiniKdc.java(162)   

Java
   

A value added twice in a HashSet is a very common defect when checking projects. The second addition will be ignored. I hope this duplication is just a needless tragedy. What if another value was meant to be added?

MapReduce

V6072 Two similar code fragments were found. Perhaps, this is a typo and the localFiles variable should be used instead of localArchives.

  • LocalDistributedCacheManager.java(183). 
  • LocalDistributedCacheManager.java(178). 
  • LocalDistributedCacheManager.java(176).
  • LocalDistributedCacheManager.java(181).
Java
   

The V6072 diagnostic sometimes yields some interesting findings. The purpose of this diagnostic is to detect the same type of code fragments resulted from copy-paste and replacement of one-two variables. In this case, some variables have even been left "underchanged."

The code above demonstrates this. In the first block, the localArchives variable is used in the second similar fragment — localFiles. If you study this code with due diligence, rather then quickly run through it, as it often happens while code reviewing, you'll notice the fragment, where the author forgot to replace the localArchives variable. 

This gaffe can lead to the following scenario:

  • Suppose, we have localArchives (size = 4) and localFiles (size = 2).       
  • When creating the array, localFiles.toArray(new String[localArchives.size()]), the last two elements will be null (["pathToFile1", "pathToFile2", null, null]).
  • Then, org.apache.hadoop.util.StringUtils.arrayToString will return the string representation of our array, in which the last files names will be presented as "null" ("pathToFile1, pathToFile2, null, null").    
  • All this will be passed further, and God only knows what kind of checks there are for such cases =).
    V6007 Expression 'children.size() > 0' is always true. Queue.java(347)  
 
Java


Due to the fact that the number of elements is checked for 0 separately, the further check children.size() > 0 will always be true. 

HDFS

V6001 There are identical sub-expressions 'this.bucketSize' to the left and to the right of the '%' operator. RollingWindow.java(79).

Java
 

YARN

V6067 Two or more case-branches perform the same actions. TimelineEntityV2Converter.java(386), TimelineEntityV2Converter.java(389).     
 
Java


The same code fragments are in two case branches. It's just all over the place! In the prevailing number of cases, this is not a real error, but just a reason to think about refactoring the switch statement. This is not true for the case at hand. Repeated code fragments set the value of the variable preemptedVcoreSeconds. If you look closely at the names of all variables and constants, you will probably conclude that in the case, if metric.getId() == APP_MEM_PREEMPT_METRICS, the value must be set for the preemptedMemorySeconds variable, not for preemptedVcoreSeconds. In this regard, after the switch statement, preemptedMemorySeconds will always remain 0, while the value of preemptedVcoreSeconds might be incorrect. 

V6046 Incorrect format. A different number of format items is expected. Arguments not used: 2. AbstractSchedulerPlanFollower.java(186)  
Java


The planQueueName variable isn't used when logging. In this case, either too much is copied or the format string isn't finished. But I still blame the good old copy-paste, which in some cases is just great to shoot yourself in the foot.

Test Code

Hadoop Common

V6072 Two similar code fragments were found. Perhaps, this is a typo and 'allSecretsB' variable should be used instead of 'allSecretsA'.

TestZKSignerSecretProvider.java(316), TestZKSignerSecretProvider.java(309), TestZKSignerSecretProvider.java(306), TestZKSignerSecretProvider.java(313).      

Java


And again the V6072. Look closely at variables allSecretsA and allSecretsB.

V6043 Consider inspecting the 'for' operator. Initial and final values of the iterator are the same. TestTFile.java(235).  

Java


A test that's always green? =). A part of the loop, which is a part of the test itself, will never get executed. This is due to the fact that the initial and final counter values are equal in the for statement. As a result, the condition i < start will immediately become false, leading to such behavior. I ran through the test file and lept to the conclusion that  i < (start + n) had to be written in the loop condition.  

MapReduce  

V6007 Expression 'byteAm < 0' is always false. DataWriter.java(322)

Java
   

The condition byteAm < 0 is always false. To figure it out, let's give the code above another look. If the test execution reaches the operation byteAm -= headerLen, it means that byteAm >= headerLen. From here, after subtraction, the byteAm value will never be negative. That's what we had to prove.  

HDFS

V6072 Two similar code fragments were found. Perhaps, this is a typo and the normalFile variable should be used instead of normalDir. TestWebHDFS.java(625), TestWebHDFS.java(615), TestWebHDFS.java(614), TestWebHDFS.java(624)

 
Java
 

Believe it or not, it's a V6072 again! Just follow the variable normalDir and normalFile

V6027 Variables are initialized through the call to the same function. It's probably an error or un-optimized code. TestDFSAdmin.java(883), TestDFSAdmin.java(879).    

Java
 

In this fragment, highestPriorityLowRedundancyReplicatedBlocksStr and highestPriorityLowRedundancyECBlocksStr are initialized by the same values. Often, it should be so, but this is not the case. Variables' names are long and similar, so it's not surprising that the copy-pasted fragment wasn't changed in any way. To fix it, when initializing the highestPriorityLowRedundancyECBlocksStr variable, the author has to use the input parameter highestPriorityLowRedundancyECBlocks. In addition, most likely, they still need to correct the format line.   

V6019 Unreachable code detected. It is possible that an error is present. TestReplaceDatanodeFailureReplication.java(222).    

Java


The analyzer complains that the i++ counter in the loop can't be changed. This means that in the for (int i = 0; i < slowwriters.length; i++) {....} loop no more than one iteration will execute. Let's find out why. In the first iteration, we link the thread with the file that corresponds to slowwriters[0] for further reading. Next, we read the file contents via the loop for (int j = 0, x;; j++).

  • If we read something relevant, we compare the read byte with the current value of the j counter though assertEquals (if the check is not successful, we fail out of the test).      
  • If the file is checked successfully, and we get to the end of the file (we read -1), the method exits. 

Therefore, whatever happens during the check of slowwriters[0], it won't get to checking subsequent elements. Most likely, a break has to be used instead of return.    

YARN   

V6019 Unreachable code detected. It is possible that an error is present. TestNodeManager.java(176)

Java
   

In this situation, the Assert.fail method will interrupt the test and the stack trace won't be printed in case of an exception. If the message about the caught exception is enough here, it's better to remove printing the stack trace to avoid confusion. If printing is necessary, you just need to swap them.

Many similar fragments have been found:  

  • V6019 Unreachable code detected. It is possible that an error is present. TestResourceTrackerService.java(928).       
  • V6019 Unreachable code detected. It is possible that an error is present. TestResourceTrackerService.java(737).       
  • V6019 Unreachable code detected. It is possible that an error is present. TestResourceTrackerService.java(685).       

V6072 Two similar code fragments were found. Perhaps, this is a typo and the publicCache variable should be used instead of usercache.

TestResourceLocalizationService.java(315),

TestResourceLocalizationService.java(309),

TestResourceLocalizationService.java(307),

TestResourceLocalizationService.java(313)

Java


And finally, V6072 again =). Variables to view the suspicious fragment: usercache and publicCache.    

Conclusion

Hundreds of thousands of lines of code are written in development. Production code is usually kept clean from bugs, defects, and flaws (developers test their code, the code is reviewed, and so on). Tests are definitely inferior in this regard. Defects in tests can easily hide behind a "green tick." As you've probably got from today's warnings review, a green test is not always a successful check.    

This time, when checking the Apache Hadoop code base, the static analysis turned out to be highly needed both in production code and tests that also play an important role in development. So, if you care about your code and test quality, I suggest that you set sights on static analysis.

Further Reading

Topics:
java ,apache hadoop ,pvs-studio ,apache ,programming ,bugs ,static analysis ,big data

Published at DZone with permission of Maxim Stefanov . See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}