What Happened When PVS-Studio Checked ELKI in January
We take a look at the bugs that one team of researchers and testers found when the used PVS-Studio to go through the open source ELKI library.
Join the DZone community and get the full member experience.Join For Free
If you feel like the New Year just came, and you missed the first half of January, then all this time you've been busy looking for tricky bugs in the code you maintain. It also means that our article is what you need. PVS-Studio has checked the ELKI open source project to show you errors that may occur in the code, how cunningly they can hide there, and how you can deal with them.
What Kind of Library Is ELKI?
The abbreviation ELKI stands for Environment for DeveLoping KDD-Applications Supported by Index-Structures. This project is written in Java and is designed for data mining. Most users of this library are students, researchers, data scientists, and software engineers. No wonder, since this library was developed for research only.
The algorithms included in the library are mainly related to cluster analysis and outlier detection that may indicate experimental errors. Every year, there are more and more references to research conducted using ELKI. On the developers' official website, you can find a whole list of scientific works that were conducted via this library. From year to year, this list is constantly growing and being supplemented, and the topics of scientific works are getting more diverse.
The only obstacle to using this library may be the AGPL license, which requires the full source code of the application to be open to any user. In other words, you won't be able to use this library in commercial projects with closed source code.
Let's Start Checking
The ELKI library contains 2,630 Java files with 186,444 lines of code, excluding comments. Therefore, the project seems to be quite small compared to some open source giants.
However, the small size of the library does not guarantee the absence of errors. As you know, errors can occur in any project, regardless of their scale. That's why we are here now! Let's see my top 10 most interesting warnings found when viewing the report.
V6001 There are identical sub-expressions
bounds[j + 1] to the left and to the right of the
!= operator. CLIQUEUnit.java(252)
if block of the
checkDimensions method, the value of
bounds[j + 1] is checked for inequality with its own value. Naturally, this condition will always be false, and one of the boundaries will never be checked. That is, the
checkDimensions method may return
true when checking, even if the boundaries of the arrays do not match.
The correct check in the
if block must look this way:
updates parameter is not used inside the constructor body. DataStoreEvent.java(60)
removals parameter is not used inside the constructor body. DataStoreEvent.java(60)
Two warnings were received for this code snippet. Here, three parameters are passed to the
DataStoreEvent constructor, two of which are not used for initialization. Looking closer at the code in the constructor, it becomes clear that the programmer used copy-paste and forgot to correct the names of variables that should have been involved in initialization.
If you go even further and take a closer look at the
DataStoreEvent class, you can see three methods in it that are actively using the above constructor.
Due to incorrect initialization in the constructor, the behavior of these methods will also differ from what is expected.
The correct code should look as follows:
?: operator, regardless of its conditional expression, always returns one and the same value, '0.5'. ClusterHullVisualization.java(173), ClusterHullVisualization.java(173)
In this code snippet, the value of the
baseopacity variable will always be 0.5, regardless of what the value of the
flat variable is. It's clear that there should have been two different values, but due to inattention, the author of the code forgot to correct one of them.
The mistake is simple and very disappointing. It immediately strikes your eye if you exclude lines of code that are not related to this variable from the method. But it's not an easy task to find a mistake in a method consisting of almost 90 lines. It looks like this is exactly what became the obstacle for the programmer.
In Pursuit of Something Beyond the Limits
1 is out of bounds. GeneratorStatic.java(104)
computeMean method, the code author checks that the size of the
points collection is equal to one. If so, the developer tries to return a collection element with an index of
1 from the method. Since indexing starts from zero, an
IndexOutOfBoundsException is inevitable.
The correct version of the code must look like this:
Is it Possible to Divide by Zero?
V6020 Divide by zero. The range of the
referenceSetSize denominator values includes zero. PreDeConNeighborPredicate.java(138)
Every programmer knows that you can't divide by zero. Every programmer tries to avoid the occurrence of such an error in their code. And it works in most cases. But not always.
In this code snippet, everything depends on the size of the
neighbors parameter passed to the
computeLocalModel method. The code author checks the
neighbors size and cuts off the values that are less than zero by checking in the
if statement, but isn't taking any actions if the size is zero. Since the check of
referenceSetSize < 0 doesn't make any sense, and the message for logging also says something about an empty set, all of that seems like a typo. The check of
referenceSetSize == 0was most likely to be here.
If an empty
neighbors container is passed in this method anyway, the division by zero will occur in the
for loop. We can only hope that this will never really happen.
V6062 Possible infinite recursion inside the
setInitialMeans method. Predefined.java(65), Predefined.java(66)
This one-line method is a true recursion. It's difficult to guess what exactly the code author wanted to write here. The only line of this method probably should have looked as follows:
Since there is another method with the same name in this class, the programmer most likely wanted to pass data for initialization to it, but, in the end, something went wrong. This is what the body of the second method looks like, by the way:
V6094 The expression was implicitly cast from
int type to
double type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A = (double)(X) / Y;. ProbabilityWeightedMoments.java(130)
for loop of this code snippet, the
(n - i - j + 1) / (n - j + 1) expression is converted implicitly from the
int type to the
double type. In this case, the loss of accuracy may be quite different: from a few digits after the decimal point to a complete nullification if the number modulo is less than one. Probably, it's not exactly the expected behavior given that the
xmom array is of the
double type. The
(n - i - j + 1) / (n - j + 1) expression helps to prove that the programmer meant something completely different here. Let's say
n - j + 1 equals
x. Then we get the expression:
(x - i) / x. This expression will always give 0 for integer division unless we start going into negative ranges. But since the values of n in this code snippet are always greater than zero, we can conclude that the programmer did not intend to use integer division here.
Not to lose precision, an explicit conversion to the
double type is required:
Out of Bounds
V6079 Value of the
splitpoint variable is checked after use. Potential logical error is present. KernelDensityFittingTest.java(97), KernelDensityFittingTest.java(97)
In this code fragment, in the
while loop, first, the
fulldata array element with the
splitpoint index is compared with the value of the
splitval variable, and only then we check whether the
splitpoint value is smaller than the size of the array itself. These two checks in the
while loop need to be changed, otherwise you can very easily find yourself outside the array's bounds.
V6019 Unreachable code detected. It is possible that an error is present. Tokenizer.java(172)
V6007 Expression 'c != '\n'' is always true. Tokenizer.java(169)
Two diagnostics decided to act together this time. They both issued warnings. The V6019 diagnostic pointed to an unreachable code fragment:
++sstart, and V6007 pointed to a condition in the
if statement that will always be true.
Why will there always be
true in the
if block? The answer is very simple. In this statement, several conditions are checked at once:
c !=' ' or
c != '\n', or
c != '\r', or
c != '\t'. Whatever the input data, some of those will be true. Even if one of the checks is
false, the next check will return
true, and because of the
|| (or) operator, the condition in
if will eventually be true. The condition in the
if block will always be true. Therefore, the
break statement, which prematurely ends the
while loop, will trigger. As a result, the increment of the
sstart variable will never be executed. This is exactly what the V6019 diagnostics noticed and started sounding the alarm.
Most likely, the programmer wanted to write something like this:
Override, but Check
V6009 Function 'equals' receives an odd argument. An object
other.similarityFunction is used as an argument to its own method. AbstractSimilarityAdapter.java(91)
The code author decided to override the
equals method in the
AbstractSimilarityAdapter class. If it is assumed that objects in the program will be stored in containers, or checked for equality, the
equals override is required. However, the whole author's idea was spoiled by the method's last line, in which
equals is called for the same object. As a result, even the most common comparison will be incorrect.
The code was probably meant to be like this:
I'd like to note that this error pattern is often found in programs in different languages, and not only in Java. We've got an article, 'The Evil Within the Comparison Functions,' on this topic. Check it out if you're interested in finding out why programmers often make mistakes in fairly simple functions designed to compare two objects.
As you can see, mistakes may appear in any project. Some of them are simple and disappointing, others are cunning and tricky. And it doesn't matter whether your project is small or large. It doesn't matter if you're a professional developer or just starting to write code. Mistakes can appear anywhere in your program and hide safely from your eyes.
For every developer and every project, static code analysis is an indispensable tool. As essential as a code review or unit tests. That's why we recommend using static code analysis in your work right now, so that in the New Year your code will become better and clearer than in the previous one.
It's not the first time we are checking packages related to calculations, statistics, and scientific research. Every time we do, we worry about mistakes in the code of such projects. Can we trust the data calculated using these libraries?
Of course it's incredibly difficult to write code without mistakes, however, libraries used as building blocks in scientific, medical, and research programs should be checked as carefully as possible. The code should be as clean as possible. After all, every mistake can lead to very unpleasant consequences in the future. In any case, static analysis will prove to be extremely useful for such projects.
We have already covered this topic in some of our other articles, so if this question is interesting for you, don't hesitate to check out some of them:
Published at DZone with permission of Irina Polynkina. See the original article here.
Opinions expressed by DZone contributors are their own.