Even with unit tests, integration tests, stress tests, there are times when you put something out into the wild and something starts happening that you have not seen before. Pushed out an app a few months ago. It exposes a web service interface. Calls into it include some ‘evidence‘ that is handed over to a bayesian inference engine and a probability is computed. The tests we had done showed the computations were extremely fast and clean. Made some load tests using JMeter, even went beyond that and made a separate war file that could be used to trigger invocations of the service complete with real ‘samples.‘ Everything was looking great. Then, the app got pushed out onto a cluster.
Suddenly, the admin from the deployment team shows up to the weekly status meeting saying that the CPU was ending up pinned after some time, and once it went nuts, it never abated.
The first thought that came into my mind was ‘why is this a. not resulting in a failure, and b. the computations are timeboxed so if the performance was degrading, it would eventually stop returning scores. But it wasn‘t. It was continuing to function properly, happily pumping out log messages about each layer of the probability calculation.
My first inclination was that there was something wrong with the container. That maybe the garbage collector was going into a death spiral at some point. But even that didn‘t make a lot of sense, because that should also have produced a degradation of service. Figured it was time for a profiler.
Last time I looked at using profilers, I put my hands on Glassbox, which is interesting, but ultimately pretty bare for someone whose used serious tools ( JProbe was king in the first half decade of Java dominance). Then I talked to a few people on the team about profiling and one of them said we would need YourKit. It had already been coming to the fore when I last did Java profiling. So I went to the site and looked at it. Seemed very impressive. But I was thinking ‘just setting this up is going to take a lot of work, duplicating this, especially since we‘d already done that and not found this condition. Then a great thing happened: I mentioned this to the onsite admin and he dipped into my favorite I word: initiative, and just went and installed it on the server.
For the first week or so, we spent time talking about JSON errors from XStream. There were a lot of them. I was having a hard time believing that a parsing failure would make a thread go zombie. Eventually, I was reading about the features again one day and saw that snapshots could be taken and saved, but seriously complete ones. I sent a request for one. While this was all going on, one of the times the machine was hung up, I asked the admin to see what the threads were doing and there was a method from the gaussian kernel estimator that performed a binary search. suddenly it seemed that there was a way to a zombie thread. Started looking at the binary search thinking ‘either this has a failure condition where it can get stuck in a loop‘ or, not being threadsafe, it was experiencing a race condition in the rare cases where concurrent probability computations had overlapping features.
As fate would have it, this kernel estimator was the one piece of open source code I used in my implementation of bayes. I found it in WEKA, a framework I used first on a project a decade ago. The code was ugly but didn‘t seem to have any obvious faults with it. There were no tests. So I pulled the code out and wrote some. Could not produce a failure. So I figured it must be a race condition. So I synchronized the method and thought this would surely squelch the zombies. I did the old trick of making a package in my own archive and putting the source in it, and adding some logging.
To my amazement, after deploying the war, the app ran fine for a while, then it went zombie. Finally, I got a dump with this zombie session and was able to see all the threads and it was clear that the binary search was where things were popping the rails. So I added log messages showing literally every comparison (>, =, <). (For a time, it seemed that the = (double) might be the problem because there were doubles involved…)
With the new logging in place, the war was deployed and a few hours later, it went berserk. I got a dump and the log was huge: 1.25M lines in it. I started going through it with vi, first from the top, then the bottom. At the bottom there was only one line, repeated ad nauseum showing what the middle and current values were. Right away it smacked me as strange that that message could appear without any intervening logs. I did :set number on vi and started doing my own binary search until I got to where the logs went over to the singularity. Then I got the thread id. Decided at this point to clip off all the dead messages, then use grep to move only my thread messages into another file and went back into vi. Again, jumped around until I found where the first message was that had middle and current values equal to the ones that were repeated ad nauseum. When I finally found it, the line say ‘computing the probability of: NaN‘. My Moby Dick moment immediately made me wonder: am I the idiot or did WEKA get me here?
Ironically, the question goes to one of my favorite schisms in programming: the one between contract programming (e.g. DbC) vs. defensive programming (Code Complete). I come down strongly on Meyer‘s side. Suddenly, I realized, this was very similar to Meyer‘s brilliant interpretation of the Ariane 5 crash report. One of Meyer's core arguments about contracts is that clearly typing is not a sufficient means of expressing constraints. In this case, the implications of leaving it to typing is that the calling interface says the function takes a double and returns one, but in reality, NaN is a double. (Of course, the counter question is why try to get an NaN from a pdf [probability density function]. The answer to this is simple: features are computed values too so unless we can be guaranteed to not get back an NaN, this is liable to happen.)
Amazingly, one hour after fixing this, a dev on one of our other projects came to me and told me that he just figured out that a problem he had found was an NaN problem (super strange synchronicity..).
The second one ended up being easier to unravel because we could just pull the code into a failing unit test. The long and short of that one was we were using CI Bayes to perform bayesian text classification. I am going to do a longer post about CI Bayes. The funny thing about that is I had done something almost the same as that project's author (the ServerSide's Joe Ottinger), but my implementation was in Objective-C. I am going to redo mine in Java because the Java in CI-Bayes is just beyond perplexing to me, and has some serious issues. Meantime, the path to figuring out why we were bombing out with an NaN using it was the following: 1. we suspected that asking it to classify items that had a lot of text was the problem, 2. we embedded log statements to see where it was breaking, which got us to zero in on a part of the project that had a loop going through the tokens found in the item to be classified. It was assigning a starting probability of .5 to the item (the idea being that in a classification problem, with no evidence the prior is it could be or might not be the target classification). After staring at it for a few minutes, the dev I was pairing with said 'if you perform a product on .5, doesn't this eventually drop out of the bottom of double?' I burst out in laughter: 'it's Zeno's Paradox! I‘m pretty sure this logic is wrong. I will also leave this to another post (reader exercise), but the reality is if your probability problem falls over with more evidence, the computation is setup wrong. Now, that said, the literature on probability is loaded with writing about overfitting a problem, but that is something totally different: it refers to too many features of some part of a problem which then drowns out the effects of other features. Anyway, this is wrong, broken, and represents a weirdly similar predicament to the other bug: code that has no tests that exercise constraints and boundaries is going to produce situations like this.
But the bigger frame for these two is that perhaps for all our tools and conscientiousness about testing, there are fundamental language problems that are not magically dispatched. In some ways Agitar was an attempt to close this: it would look for gaps between the weak expressions of simple typed interfaces and insert automatically generated tests, but that failed. It's doubtful that Meyer's crusade for Design by Contract will ultimately gain traction. Something like it needs to be in the next language, whatever that is. I'm sure I've done a blog post about contract vs. defensive programming. One could argue that even the defensive doctrine would fault the providers in both cases here: the idea is to defend yourself from misuse. One of the interesting things about the first bug was that the binary search, by going zombie, was what took down the server. The solution would be to say 'if not in range throw new IllegalArgumentException.‘ One of Meyer‘s key arguments is that even though contracts and defensive may often end in the same place, the contract wins the argument because it gets there much faster and with less pain.