Do Not Stop Threads!

Follow this story from 2006 about losing a $20,000 contract due to calling Thread.stop() and how no one realized what was actually happening.


I dedicate this article to László Fekete, my former boss and director at T-Mobile Hungary. He plays a significant role in this story as he was the one who made the decision to cancel our contract. I must acknowledge that he made the right call, and it was the correct course of action.

However, I also remember some instances where he seemed less concerned about his health, disregarding his blood pressure and cholesterol levels, despite my concerns, which we discussed a few times. Sadly, László passed away in 2017 at the young age of 57 due to a heart attack. It’s a stark reminder of the importance of taking care of our well-being and not neglecting warning signs.

Now, as I find myself at the same age László was when he left us, it serves as a poignant reminder of the fragility of life and the need to prioritize our health and well-being.

Introduction, Topic

I am 57, and I recently made some bad moves, and now my back aches. I cannot sit for long, so I suddenly had ample time on my hands to watch YouTube videos.

During my exploration, I stumbled upon an impressive channel called ThePrimeTime. The creator of this channel is a remarkable young individual who possesses wisdom beyond his years. His videos exhibit a profound understanding of technology, which captivates me.

I appreciate how he simply sits and discusses other videos or articles without feeling the need to over-explain things. It’s a "take it or leave it" approach. Those who comprehend his content gain valuable insights, and those who don’t: sorry.

I very much enjoy it when I understand what he says and feel that probably not many do. It is a smug, somewhat arrogant feeling, and one should be careful with it.

Also, I could hardly find any of his statements I would strongly disagree with. Sometimes I feel we could have some discussion, but generally, I can agree to, or accept his points. Go and watch him!

Recently I saw a video where he commented on an article about how someone almost accidentally corrupted PayPal in its early days. I will not retell it here: you can watch it yourself. It is a story with lots of technical details you can learn from.

Being 57 does not only mean backache. It also means that I have seen and done a few things that I sometimes tell younger people in the office about. Why not write articles about these? So I decided to write a few articles about things that I have seen and done and that I think are worth sharing.

And here we go.

Disclaimer

Most of the story is true and based on real events.

Stopping Threads

As I said, I have time to watch YouTube videos. I came across a short, one-minute video about how to stop a thread, which you should not do. The video does say that you should not, and gives one sentence on why, but one minute is too short to explain the reasons.

I know why you should not stop a thread, and not only from what the documentation says. It cost me $20,000 in lost revenue in 2006, when the yearly GDP per capita in my country was less than that.

Background

I started programming in 1980. My father was a professor at TU Budapest in Hungary and had access to a TI-30 calculator. It was a programmable calculator. I remember trying to write a program to crack an RSA-encoded text that had been published as a challenge. The prime numbers were only 10 digits long, but the calculator had a 1024-step program memory and its registers were perhaps 16-bit integers, so I had to implement multi-precision arithmetic in my code.

I never succeeded with this one, but the exposure to programming "infected" me. I was 14. Later I programmed the Swedish ABC80, the Hungarian C64 clone, the Hungarian VT-1080z that resembled the Enterprise computer, the Sinclair ZX Spectrum, and many others. At that time we programmed whatever we could get our hands on. My Unix exposure was minimal because the department I was volunteering with had VAX VMS machines.

I graduated from TU Budapest as an electrical engineer and started to work as a sales rep for Digital Equipment Corporation in Hungary in 1991. That does not fit a programming career, does it? At the time, paid programming in Hungary was mainly crafting bookkeeping applications in DBase, and it did not pay well. I was already married and had a child, with twins on the way, so I needed a respectable wage. You can live your hobby as a profession only if you can afford it. My priorities were different.

I kept programming in C and Perl at that time as a hobby. I even wrote a small book in Hungarian about Perl, the first of its kind, and many people learned Perl programming from it. So much so that when Larry Wall visited the Budapest Perl conference in the late 90s, I was invited as a keynote speaker. The title of my talk was "Forbid Perl," and I argued that Perl makes you so productive that it eliminates the need for too many other programmers, and therefore its use for real applications has to be forbidden. I was saying that in front of the father of Perl, who was sitting in the first row. I intended it as humor, but a few decades later, I see that I was right. At the time, I did not see the benefit of the overhead of professional software development versus hacking something together in Perl. It is not a trait of the language per se, but Perl was usually used to script things in a hacky way.

I left DEC in 1999 and joined index.hu as CIO. It was a small startup, the first online-only news site, founded by a few university friends of mine. We wanted to make history and get rich. We achieved the first one.

I also programmed the advertisement engine of the site, which is a story on its own.

When the dot-com bubble burst, we had to lay off people and restructure the operation from investment-driven growth to sustainable operation. I learned a lot of things there, but those were management lessons, not programming ones. The last step was to hand in my own notice, and I left the company in 2001.

Then I started to work for T-Mobile, but they did not hire me as a programmer. I had no prior professional experience and "hobby programming" did not count. I was hired as a project manager.

Working in that position, I even ignited the development of a reformed project management methodology, but this was not my cup of tea. Five years later, my brother suggested that we create our own company. He was a one-sixth owner of a small software development company, and the other five developers were moving in the SQL and stored-procedure direction. My brother thought that Java development was more interesting and had better prospects, so he wanted to start a new company.

Why we decided to go in the Java direction and not Microsoft is again another topic that deserves an article on its own. It was more a political and philosophical decision than a technical one. I will write an article about that later, as well as about why we chose to trade in our old Linux and Windows machines for MacBooks with macOS. These are interesting topics because people approach such decisions based on belief, and it can lead to heated discussions. Not now.

We started the company in 2006. One of our first clients was T-Mobile. We knew the people there, they knew me, and they needed an advertisement engine. I had written the one for index.hu, and it was still in production six years later, delivering millions of HTTP responses per day. Not only was it by far the highest-traffic web server in the country, but it was also the most reliable one.

Much later at a conference, a speaker said that back in the day, they checked their Internet connection by pinging the ad server of index.hu. Other sites could be down, but if the ad server was not reachable, it was more likely that they had a connection problem. He did not know I was sitting there in the audience. It was a great feeling hearing that. That ad server ran for nine years uninterrupted and without any code modification.

Thread Stopping AdServer

So we got the contract to develop an ad server for T-Mobile. The contract size was around $30,000. I did not know any Java at that time. I had limited OOP experience. I had mainly programmed in C and Perl, and not commercially. But I was a good programmer, or at least I thought so.

We created the application in Java while I was learning it. The users were authenticated, and we had a backing database with user data. The ad engine had to select the ads based on the mobile subscription, number of used minutes, phone type, and other parameters.

We used PostgreSQL as the database in the dev environment and Hibernate on Tomcat. An advertisement had to be displayed within two seconds. If the selection process ran longer, a default ad was displayed. To achieve this, we executed the selection logic in a separate thread using the ExecutorService and waited on a Future object. We also used the database connection pool available from the Hibernate library.
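A minimal sketch of this timeout pattern, not the original code. The class name, the `selectAd` method, and the ad strings are hypothetical; only `ExecutorService`, `Future.get` with a timeout, and the fallback to a default ad come from the description above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class AdSelectionTimeout {
    static final String DEFAULT_AD = "default-ad"; // hypothetical fallback ad

    // Runs the selection on a worker thread and waits for it up to timeoutMillis.
    static String selectAd(ExecutorService executor, Callable<String> selection, long timeoutMillis) {
        Future<String> future = executor.submit(selection);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The request thread gives up waiting; the worker keeps running in the background.
            return DEFAULT_AD;
        } catch (ExecutionException e) {
            // The selection itself failed; fall back to the default ad as well.
            return DEFAULT_AD;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
            return DEFAULT_AD;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            System.out.println(selectAd(executor, () -> "targeted-ad", 2000));
            System.out.println(selectAd(executor, () -> {
                Thread.sleep(5000); // simulates a selection that is too slow
                return "too-late-ad";
            }, 200));
        } finally {
            executor.shutdownNow(); // interrupts the sleeping worker so the demo exits promptly
        }
    }
}
```

Note that on a timeout the worker is not stopped by anything in `selectAd`: the request thread only stops waiting for it.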

We manually tested the application, and it worked fine. We ran some load tests, and it worked fine. But I wanted to deliver perfect software, so I decided to play a bit with the case when the selection times out. In that case, the request-serving thread sends a response, but the selection thread is still running, putting a useless load on an already overloaded system. We could call 'stop' on the thread.

We tested this scenario, and it worked fine. The connection pool realized that the thread had been stopped, closed the connection, and created a new one in these cases. I knew that production would use an Oracle database and that the connection pool would also be the one provided by Oracle. We did not have a test environment with these components; therefore, I decided not to use this performance-saving trick in the production system. But I was proud of my code, and I did not want to delete the line stopping the thread. Instead, I put it into an if statement that was never true, with a comment something like:

Java
 
// this 'if' is always false but I keep it here to show that I know how to stop a thread
if( true ){
    thread.stop();
}


By now you probably have a clue, unless you skimmed over that line without noticing that the ACTUAL value is 'true'. The code went into production and worked fine. It worked fine for a while, except when the load went up.

When the load went up, the application started to deliver the default ad. The weird thing was that after the load went down, the application still delivered the default ad. Operations had to restart the application to make it work again. We did not have a clue what was going on, and we responded by suggesting an increase in hardware capacity. More capacity was clearly needed to handle the peak load, but there was evidently another problem as well. We tried to ignore it. Being a small company, we were already occupied with the next project. Putting new hardware into service in a large corporation does not happen overnight. The service had to be restarted a few times every day. The issue went back and forth between us and the project manager until he escalated it, and we could not ignore it anymore.

We had the log files, and we started to investigate. The log clearly showed that the application allocated a connection from the pool when a selection started. The log also showed that the connection was returned to the pool when the selection finished even when the selection timed out. I strongly believed that this could not be the problem, especially because we did not stop the threads in the case of a timeout.

At least that was what I thought.

We added more logging to the code and deployed it to production, which made it a bit slower and the client even less happy, but it was needed. There were log items for each request and response, so we knew when a request timed out, the connection ID, the thread ID, and so on. The log was huge, and I wrote Perl scripts to analyze it. It took a week and a lot of diagrams until I realized that whenever a thread timed out, its connection ID never appeared later in the log. The connection never returned to the pool, even though the library falsely reported that it had. But why? We did not stop the threads, and yet the log showed that these threads always stopped a few milliseconds after the selection timed out.

This was the first clue. It seemed fishy. When a selection using a few SQL selects timed out, why was it always only a little bit late? The fact that we first tried to increase the timeout from two seconds to two and a half seconds shows how clueless we were. It only made the timed-out threads finish in two and a half seconds plus a few milliseconds. Always the timeout plus a few milliseconds.
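The check that finally exposed the leak can be sketched in a few lines. This is not the original Perl, and the log format is invented for illustration: each line is assumed to be an event name (`ACQUIRE`, `RELEASE`, or `TIMEOUT`) followed by a connection ID. The class and method names are hypothetical as well.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LeakedConnectionFinder {

    // Returns the connection IDs that timed out and were never acquired again afterwards.
    static Set<String> findLeaked(List<String> logLines) {
        Set<String> suspects = new LinkedHashSet<>();
        for (String line : logLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // not an event line in the assumed format
            }
            String event = parts[0];
            String connectionId = parts[1];
            if (event.equals("TIMEOUT")) {
                suspects.add(connectionId);
            } else if (event.equals("ACQUIRE")) {
                // Handed out again later, so it really did return to the pool.
                suspects.remove(connectionId);
            }
            // RELEASE lines are deliberately ignored: the log may claim a release that never happened.
        }
        return suspects;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "ACQUIRE c1", "RELEASE c1",
                "ACQUIRE c2", "TIMEOUT c2", "RELEASE c2",
                "ACQUIRE c1", "RELEASE c1");
        System.out.println(findLeaked(log)); // prints [c2]
    }
}
```

The point of the design is to trust only what happens later (a connection being handed out again) rather than the `RELEASE` message itself.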

"Didn’t you leave the code in that stops the thread?" asked my brother.

"Of course I didn’t. See, it is in an if statement that is never true."

"No. That is what the comment says." — he replied. — "But the code is there, and it stops the thread."

I had looked at that code hundreds of times, blindly, during the previous two weeks. I read the comment and skipped the code. I read what I wanted to be there and not what really was there.

This time I deleted the line and the comment, and we deployed the code. It worked fine, unlike our relationship with the client. They canceled our contract for the further development of the ad server. We had lost a $20,000 contract, and we were told that we would never get any contract from them again. I could not blame them.

This "never" lasted three years. Then, partnering with another company, we delivered a system they used to electronically sign four million invoices every month. Do you remember what my very first program was on that TI-30 calculator? That delivery is one I am not ashamed of. I learned a lot during those three years.

Conclusion

There are many things to learn from this story.

Don’t Stop Threads

Even though you technically can stop threads, you MUST not. If you MUST not, then why experiment with it? You can tell the thread that it may stop when it is ready to. You can use some shared state that the thread checks periodically, so that it stops when it can do so safely. Calling interrupt() on a thread is a good way to tell the thread that it should stop. The documentation lists a lot of things that may happen when you call stop() on a thread, but reading about it is one thing, and having it happen to you is another.
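A minimal sketch of the cooperative approach, assuming the work runs as a `Future` on an `ExecutorService`. The class, the `slowSelection` task, and the latch that stands in for "give the connection back" are all hypothetical. As far as I know, recent JDKs (20 and later) do not even honor `Thread.stop()` anymore and throw `UnsupportedOperationException` instead.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CooperativeCancel {

    // Hypothetical slow selection: it works in small steps and can be interrupted between them.
    static String slowSelection(CountDownLatch cleanedUp) throws InterruptedException {
        try {
            for (int step = 0; step < 100; step++) {
                Thread.sleep(50); // stands in for one unit of work; throws if the thread is interrupted
            }
            return "targeted-ad";
        } finally {
            // The worker itself releases what it holds (in real code: return the connection to the pool).
            cleanedUp.countDown();
        }
    }

    // Returns true if the worker ran its cleanup within a second after the wait ended.
    static boolean timeOutAndCancel(long timeoutMillis) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch cleanedUp = new CountDownLatch(1);
        Callable<String> task = () -> slowSelection(cleanedUp);
        Future<String> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // calls interrupt() on the worker thread; never Thread.stop()
        } catch (ExecutionException e) {
            // The selection failed on its own; there is nothing left to cancel.
        }
        boolean cleaned = cleanedUp.await(1, TimeUnit.SECONDS);
        executor.shutdown();
        return cleaned;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cleanup ran after cancel: " + timeOutAndCancel(200));
    }
}
```

The interrupt is only a request: the worker decides where it is safe to quit, and its `finally` block runs on the worker's own terms. A shared `volatile boolean` flag checked in the loop would work similarly for code that never blocks.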

Everybody has to burn their hands a few times. The cleverer you are, the less you need to burn your hands. There are some Mucius Scaevolas out there, not learning from their mistakes. Do not be one.

Logs are Only Logs

Logs contain what the application says it is doing, not what is really happening. Programmers write bugs, including misleading log messages. Even when you use a high-reputation library, you can still face bugs.

Comments Can Be Dangerous

Comments can be dangerous. Comments are in English and no matter how much of a nerd you are, your eyes will read the human text first. In this case, non-native English speakers may have a slight advantage. If the comment is outdated, misleading, or plain wrong, it may lead the maintainers' eyes away from the code.

A good comment does not explain what the code does. The code precisely describes that. You should explain why it does what it does and how other parts of the code should use and interface with it.

In this case, not having any comment before the if statement, or just:

Java
 
// we can switch experimental thread stopping on and off here
if( true ){
    thread.stop();
}

would have been better. My wisdom today says to delete both the line and the comment. If you want to keep the line as a legacy, do it in a separate branch or tag it in the version control system.

You Do Not Know When You Are Stupid

At that point, writing my first commercial application, I was at the peak of my Dunning-Kruger curve. You do not know when you are there. If you feel you are an expert, you know everything, and you are the best: be very careful. You are probably at that dangerous peak. Don’t stay there: climb down on the right side and start climbing the long, peak-less slope to the right, always with a healthy level of self-doubt.

Customer Is Always Right

When the customer says that you are wrong, you are wrong. They complained that the application did not come back from the overloaded state, and our first response was to ask for more hardware. Technically, we were right. If the system never gets into the overloaded state, then there is no problem with it not getting back to normal. However, you can see how arrogant this standpoint was. This was probably the number one reason we lost the contract.

We learned from this mistake. We made, and learned from, many more mistakes after that, and this is a process that I have not finished yet. Learning from mistakes may be the most enduring thing in my life, and I think it is important for everyone. I have many similar stories, so if you liked this one, leave a comment and give some feedback to let me know that I should write more.

Published at DZone with permission of Peter Verhas, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

