What Happened When PVS-Studio Checked ELKI in January
We take a look at the bugs that one team of researchers and testers found when the used PVS-Studio to go through the open source ELKI library.
Join the DZone community and get the full member experience.
Join For FreeIf you feel like the New Year just came, and you missed the first half of January, then all this time you've been busy looking for tricky bugs in the code you maintain. It also means that our article is what you need. PVS-Studio has checked the ELKI open source project to show you errors that may occur in the code, how cunningly they can hide there, and how you can deal with them.
What Kind of Library Is ELKI?
The abbreviation ELKI stands for Environment for DeveLoping KDD-Applications Supported by Index-Structures. This project is written in Java and is designed for data mining. Most users of this library are students, researchers, data scientists, and software engineers. No wonder, since this library was developed for research only.
The algorithms included in the library are mainly related to cluster analysis and outlier detection that may indicate experimental errors. Every year, there are more and more references to research conducted using ELKI. On the developers' official website, you can find a whole list of scientific works that were conducted via this library. From year to year, this list is constantly growing and being supplemented, and the topics of scientific works are getting more diverse.
The only obstacle to using this library may be the AGPL license, which requires the full source code of the application to be open to any user. In other words, you won't be able to use this library in commercial projects with closed source code.
Let's Start Checking
The ELKI library contains 2,630 Java files with 186,444 lines of code, excluding comments. Therefore, the project seems to be quite small compared to some open source giants.
However, the small size of the library does not guarantee the absence of errors. As you know, errors can occur in any project, regardless of their scale. That's why we are here now! Let's see my top 10 most interesting warnings found when viewing the report.
Unchecked Boundaries
V6001 There are identical sub-expressions bounds[j + 1]
to the left and to the right of the !=
operator. CLIQUEUnit.java(252)
private boolean checkDimensions(CLIQUEUnit other, int e) {
for(int i = 0, j = 0; i < e; i++, j += 2) {
if (dims[i] != other.dims[i]
|| bounds[j] != other.bounds[j]
|| bounds[j + 1] != bounds[j + 1]) {
return false;
}
}
return true;
}
In the if
block of the checkDimensions
method, the value of bounds[j + 1]
is checked for inequality with its own value. Naturally, this condition will always be false, and one of the boundaries will never be checked. That is, the checkDimensions
method may return true
when checking, even if the boundaries of the arrays do not match.
The correct check in the if
block must look this way:
xxxxxxxxxx
bounds[j + 1] != other.bounds[j + 1]
Lazy Constructor
V6022 The updates
parameter is not used inside the constructor body. DataStoreEvent.java(60)
V6022 The removals
parameter is not used inside the constructor body. DataStoreEvent.java(60)
xxxxxxxxxx
public DataStoreEvent(DBIDs inserts, DBIDs removals, DBIDs updates) {
super();
this.inserts = inserts;
this.removals = inserts;
this.updates = inserts;
}
Two warnings were received for this code snippet. Here, three parameters are passed to the DataStoreEvent
constructor, two of which are not used for initialization. Looking closer at the code in the constructor, it becomes clear that the programmer used copy-paste and forgot to correct the names of variables that should have been involved in initialization.
If you go even further and take a closer look at the DataStoreEvent
class, you can see three methods in it that are actively using the above constructor.
xxxxxxxxxx
public static DataStoreEvent insertionEvent(DBIDs inserts) {
return new DataStoreEvent(inserts, DBIDUtil.EMPTYDBIDS, DBIDUtil.EMPTYDBIDS);
}
public static DataStoreEvent removalEvent(DBIDs removals) {
return new DataStoreEvent(DBIDUtil.EMPTYDBIDS, removals, DBIDUtil.EMPTYDBIDS);
}
public static DataStoreEvent updateEvent(DBIDs updates) {
return new DataStoreEvent(DBIDUtil.EMPTYDBIDS, DBIDUtil.EMPTYDBIDS, updates);
}
Due to incorrect initialization in the constructor, the behavior of these methods will also differ from what is expected.
The correct code should look as follows:
xxxxxxxxxx
this.inserts = inserts;
this.removals = removals;
this.updates = updates;
Opaque Variable
V6012 The ?:
operator, regardless of its conditional expression, always returns one and the same value, '0.5'. ClusterHullVisualization.java(173), ClusterHullVisualization.java(173)
xxxxxxxxxx
public void fullRedraw() {
....
boolean flat = (clusters.size() == topc.size());
// Heuristic value for transparency:
double baseopacity = flat ? 0.5 : 0.5;
....
}
In this code snippet, the value of the baseopacity
variable will always be 0.5, regardless of what the value of the flat
variable is. It's clear that there should have been two different values, but due to inattention, the author of the code forgot to correct one of them.
The mistake is simple and very disappointing. It immediately strikes your eye if you exclude lines of code that are not related to this variable from the method. But it's not an easy task to find a mistake in a method consisting of almost 90 lines. It looks like this is exactly what became the obstacle for the programmer.
In Pursuit of Something Beyond the Limits
V6025 Index 1
is out of bounds. GeneratorStatic.java(104)
xxxxxxxxxx
public double[] computeMean() {
// Not supported except for singletons.
return points.size() == 1 ? points.get(1) : null;
}
In the computeMean
method, the code author checks that the size of the points
collection is equal to one. If so, the developer tries to return a collection element with an index of 1
from the method. Since indexing starts from zero, an IndexOutOfBoundsException
is inevitable.
The correct version of the code must look like this:
xxxxxxxxxx
return points.size() == 1 ? points.get(0) : null;
Is it Possible to Divide by Zero?
V6020 Divide by zero. The range of the referenceSetSize
denominator values includes zero. PreDeConNeighborPredicate.java(138)
xxxxxxxxxx
protected PreDeConModel computeLocalModel(DoubleDBIDList neighbors, ....) {
final int referenceSetSize = neighbors.size();
....
// Shouldn't happen:
if(referenceSetSize < 0) {
LOG.warning("Empty reference set –
should at least include the query point!");
return new PreDeConModel(Integer.MAX_VALUE, DBIDUtil.EMPTYDBIDS);
}
....
for(int d = 0; d < dim; d++) {
s[d] /= referenceSetSize;
mvVar.put(s[d]);
}
....
}
Every programmer knows that you can't divide by zero. Every programmer tries to avoid the occurrence of such an error in their code. And it works in most cases. But not always.
In this code snippet, everything depends on the size of the neighbors
parameter passed to the computeLocalModel
method. The code author checks the neighbors
size and cuts off the values that are less than zero by checking in the if
statement, but isn't taking any actions if the size is zero. Since the check of referenceSetSize < 0
doesn't make any sense, and the message for logging also says something about an empty set, all of that seems like a typo. The check of referenceSetSize == 0
was most likely to be here.
If an empty neighbors
container is passed in this method anyway, the division by zero will occur in the for
loop. We can only hope that this will never really happen.
Endless Initialization
V6062 Possible infinite recursion inside the setInitialMeans
method. Predefined.java(65), Predefined.java(66)
xxxxxxxxxx
public void setInitialMeans(List<double[]> initialMeans) {
this.setInitialMeans(initialMeans);
}
This one-line method is a true recursion. It's difficult to guess what exactly the code author wanted to write here. The only line of this method probably should have looked as follows:
xxxxxxxxxx
this.setInitialMeans(initialMeans.toArray(new double[0][0]));
Since there is another method with the same name in this class, the programmer most likely wanted to pass data for initialization to it, but, in the end, something went wrong. This is what the body of the second method looks like, by the way:
xxxxxxxxxx
public void setInitialMeans(double[][] initialMeans) {
double[][] vecs = initialMeans.clone(); // TODO: deep copy?
this.initialMeans = vecs;
}
Lost Weight
V6094 The expression was implicitly cast from int
type to double
type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A = (double)(X) / Y;. ProbabilityWeightedMoments.java(130)
xxxxxxxxxx
public static <A> double[] alphaBetaPWM(...., final int nmom) {
final int n = adapter.size(data);
final double[] xmom = new double[nmom << 1];
double aweight = 1. / n, bweight = aweight;
for(int i = 0; i < n; i++) {
....
for(int j = 1, k = 2; j < nmom; j++, k += 2) {
xmom[k + 1] += val * (aweight *= (n - i - j + 1) / (n - j + 1));
xmom[k + 1] += val * (bweight *= (i - j + 1) / (n - j + 1));
}
}
return xmom;
}
In the for
loop of this code snippet, the (n - i - j + 1) / (n - j + 1)
expression is converted implicitly from the int
type to the double
type. In this case, the loss of accuracy may be quite different: from a few digits after the decimal point to a complete nullification if the number modulo is less than one. Probably, it's not exactly the expected behavior given that the xmom
array is of the double
type. The (n - i - j + 1) / (n - j + 1)
expression helps to prove that the programmer meant something completely different here. Let's say n - j + 1
equals x
. Then we get the expression: (x - i) / x
. This expression will always give 0 for integer division unless we start going into negative ranges. But since the values of n in this code snippet are always greater than zero, we can conclude that the programmer did not intend to use integer division here.
Not to lose precision, an explicit conversion to the double
type is required:
xxxxxxxxxx
xmom[k + 1] += val * (aweight *= (double) (n - i - j + 1) / (n - j + 1));
xmom[k + 1] += val * (bweight *= (double) (i - j + 1) / (n - j + 1));
Out of Bounds
V6079 Value of the splitpoint
variable is checked after use. Potential logical error is present. KernelDensityFittingTest.java(97), KernelDensityFittingTest.java(97)
xxxxxxxxxx
public final void testFitDoubleArray() throws IOException {
....
int splitpoint = 0;
while(fulldata[splitpoint] < splitval && splitpoint < fulldata.length) {
splitpoint++;
}
....
}
In this code fragment, in the while
loop, first, the fulldata
array element with the splitpoint
index is compared with the value of the splitval
variable, and only then we check whether the splitpoint
value is smaller than the size of the array itself. These two checks in the while
loop need to be changed, otherwise you can very easily find yourself outside the array's bounds.
Unreachable Code
V6019 Unreachable code detected. It is possible that an error is present. Tokenizer.java(172)
V6007 Expression 'c != '\n'' is always true. Tokenizer.java(169)
xxxxxxxxxx
public String getStrippedSubstring() {
int sstart = start, ssend = end;
while(sstart < ssend) {
char c = input.charAt(sstart);
if(c != ' ' || c != '\n' || c != '\r' || c != '\t') {
break;
}
++sstart;
}
....
}
Two diagnostics decided to act together this time. They both issued warnings. The V6019 diagnostic pointed to an unreachable code fragment: ++sstart
, and V6007 pointed to a condition in the if
statement that will always be true.
Why will there always be true
in the if
block? The answer is very simple. In this statement, several conditions are checked at once: c !=' '
or c != '\n'
, or c != '\r'
, or c != '\t'
. Whatever the input data, some of those will be true. Even if one of the checks is false
, the next check will return true
, and because of the ||
(or) operator, the condition in if
will eventually be true. The condition in the if
block will always be true. Therefore, the break
statement, which prematurely ends the while
loop, will trigger. As a result, the increment of the sstart
variable will never be executed. This is exactly what the V6019 diagnostics noticed and started sounding the alarm.
Most likely, the programmer wanted to write something like this:
xxxxxxxxxx
if(c != ' ' && c != '\n' && c != '\r' && c != '\t')
Override, but Check
V6009 Function 'equals' receives an odd argument. An object other.similarityFunction
is used as an argument to its own method. AbstractSimilarityAdapter.java(91)
xxxxxxxxxx
public boolean equals(Object obj) {
if(obj == null) {
return false;
}
if(!this.getClass().equals(obj.getClass())) {
return false;
}
AbstractSimilarityAdapter<?> other = (AbstractSimilarityAdapter<?>) obj;
return other.similarityFunction.equals(other.similarityFunction);
}
The code author decided to override the equals
method in the AbstractSimilarityAdapter
class. If it is assumed that objects in the program will be stored in containers, or checked for equality, the equals
override is required. However, the whole author's idea was spoiled by the method's last line, in which equals
is called for the same object. As a result, even the most common comparison will be incorrect.
The code was probably meant to be like this:
xxxxxxxxxx
return this.similarityFunction.equals(other.similarityFunction);
I'd like to note that this error pattern is often found in programs in different languages, and not only in Java. We've got an article, 'The Evil Within the Comparison Functions,' on this topic. Check it out if you're interested in finding out why programmers often make mistakes in fairly simple functions designed to compare two objects.
Let's Recap
As you can see, mistakes may appear in any project. Some of them are simple and disappointing, others are cunning and tricky. And it doesn't matter whether your project is small or large. It doesn't matter if you're a professional developer or just starting to write code. Mistakes can appear anywhere in your program and hide safely from your eyes.
For every developer and every project, static code analysis is an indispensable tool. As essential as a code review or unit tests. That's why we recommend using static code analysis in your work right now, so that in the New Year your code will become better and clearer than in the previous one.
In Conclusion
It's not the first time we are checking packages related to calculations, statistics, and scientific research. Every time we do, we worry about mistakes in the code of such projects. Can we trust the data calculated using these libraries?
Of course it's incredibly difficult to write code without mistakes, however, libraries used as building blocks in scientific, medical, and research programs should be checked as carefully as possible. The code should be as clean as possible. After all, every mistake can lead to very unpleasant consequences in the future. In any case, static analysis will prove to be extremely useful for such projects.
We have already covered this topic in some of our other articles, so if this question is interesting for you, don't hesitate to check out some of them:
Published at DZone with permission of Irina Polynkina. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments