A Developer's Perspective: Responding to the Call
Responding to emergencies as a developer
If you believe existing processes can keep major issues from reaching your user base, you'll want to read this article: an exposé of how a reported issue went unaddressed for several months and only surfaced after the application was deployed.
My friend Greg has been a professor for most of his career. One idea he has conveyed to me is to think of an IT organization as a fire department: you drive by and notice firefighters tending to routine tasks or simply enjoying a peaceful day, but in every case they are ready to respond to an unexpected emergency. The theory is that everyone is glad they are there, but happier when they are not fighting constant emergencies.
To Greg, a happy organization is one in which the IT department is focused on its feature (or planned) work, not constantly forced to change course to react to unexpected situations.
Rolling Back Time
Months ago, my CleanSlate team was in the process of converting a very large repository into a series of services, each destined for its own repository. Leveraging Maven, our team extracted the common components into another repository to serve as a shared library for the services. During that time, Russell (mentioned in my "I Want My Code to Be Boring" article) discovered some design decisions that raised concerns.
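The extraction pattern described here is the standard Maven shared-library approach: the common code is published as its own versioned artifact, and each service declares it as a dependency. A minimal sketch (the coordinates below are hypothetical placeholders, not the client's actual ones):

```xml
<!-- In each service's pom.xml: depend on the extracted shared library. -->
<!-- groupId/artifactId/version are illustrative, not the real project's. -->
<dependency>
    <groupId>com.example.platform</groupId>
    <artifactId>shared-components</artifactId>
    <version>1.0.0</version>
</dependency>
```

Because the shared library lives in its own repository and is consumed as a built artifact, each service can upgrade to a new version on its own schedule rather than sharing source directly.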
Issues were logged, and our team communicated this information to the appropriate people. We were told that the issues would be handled before the services were deployed as part of a cloud transformation.
Since these items were somewhat abstract, our team really did not have an easy way to know if the fixes were made. At the time, everything seemed okay; we were focused on other aspects of the project.
Leveraging a holiday weekend, the cloud transformation was planned, scheduled, and attempted. The database was moved into the cloud, the applications and services were started in their new container-based instances, and the entire application was ready to be validated. As one might expect, some minor issues were noted and fixed without any major setbacks.
Then, we found out that one of those items we originally reported was not addressed.
With no one at the client positioned to understand the technologies at play, I received a call from Darren (of "Things I Learned From a Guy Named Darren") to see if I could look into the situation. I jumped in to assist the team. After all, prior analysis had determined that reverting the go-live weekend would ultimately cost the customer $500,000.
In a span of about nine hours, I was able to figure out what was not working correctly. From there, I worked with Russell to build and inject the necessary solution programmatically, allowing a cloud-based configuration to function without any issues. With the updated code running locally, I was able to validate the fix against technologies I had never once used in my entire IT career.
The go-live weekend was implemented as planned. All was well.
Sound the Alarms
About a week into the new environment, remote facilities, which utilize a critical component of the application, began to log tickets about missing or incorrect functionality. In short, the application was not working as expected.
When I was approached about these items, I thought back to the tickets Russell had logged several months earlier. Reviewing the situation led me to conclude that none of the items reported by our team had been addressed. Worse, none of these items were part of any test plan, so the gap went undetected, allowing broken functionality to be delivered to the application's user base of tens of thousands of end users.
Three Days Ago
Within a day, the decision was made to pull me into the fold to look into the situation. I would handle one of the three issues that were not working as expected. (All three issues stemmed from what Russell had noted in his original discovery.)
Sifting through volumes of documentation, I quickly realized that this aspect of the tooling was never considered "in spec" with what the original framework was designed to do. In fact, years ago, someone at (or working for) the client had downloaded the source code for an open-source framework, changed it dramatically, and built a JAR file that was included in the project. Years later, that framework had evolved, but the client's copy was never kept in sync. Little documentation existed to describe the changes that had been made.
That legacy version of the open-source framework, contained within the JAR file, would not run on an updated version of Java and caused run-time exceptions within the Spring framework hosting the application. Upgrading to a current version of the framework worked without any issues, except that all of the customizations were gone, and those customizations tied directly to the three issues considered urgent to resolve.
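When auditing an opaque JAR against a newer JVM, one quick first check is the class-file major version stamped into every compiled class (52 = Java 8, 55 = Java 11, 61 = Java 17); a JVM refuses to load classes built for a newer release than it supports. This is an illustrative sketch, not the actual diagnostic used on this project:

```java
public class ClassVersionCheck {

    /**
     * Returns the class-file major version from the first 8 bytes of a
     * compiled .class file. Class files begin with the magic number
     * 0xCAFEBABE, then a 2-byte minor version and a 2-byte major version.
     */
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8
                || (classBytes[0] & 0xFF) != 0xCA || (classBytes[1] & 0xFF) != 0xFE
                || (classBytes[2] & 0xFF) != 0xBA || (classBytes[3] & 0xFF) != 0xBE) {
            throw new IllegalArgumentException("Not a class file");
        }
        // Bytes 6-7 hold the major version, big-endian.
        return ((classBytes[6] & 0xFF) << 8) | (classBytes[7] & 0xFF);
    }

    public static void main(String[] args) {
        // Simulated header of a class compiled for Java 8 (major version 52).
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        int major = majorVersion(header);
        System.out.println("major=" + major + " (Java " + (major - 44) + ")");
    }
}
```

Note the limitation: this only catches classes compiled for a JVM newer than the one running. The reverse failure described above, where old and heavily modified framework code breaks on a new JVM (for example, by relying on internals that were later removed), surfaces as run-time exceptions instead, which is exactly what made it hard to diagnose.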
This time twenty hours ago, I still had no clue how I was going to resolve an issue that Russell had reported months earlier, an issue we thought had been resolved and tested but that instead fell off the radar of the cloud transformation. The team at the client, some of whom had been in a supporting role for 10 years, also had no insight into this aspect of the application.
No one on our team knew who wrote this code (its history now obscured by a conversion to Git), so reaching out to the original author(s) for a knowledge transfer, or recruiting them to help, was not an option.
When I started, I drafted a list of potential causes in an empty document in my Sublime editor. In the 29 hours since being assigned the task, I had managed to wire up the code so the functionality could be called from the application's UI, but I still had not determined the cause of the issue. More concerning, my list of ideas in that Sublime document was fully exhausted.
I was exhausted too. The time was 11:42 pm, and I was closing in on 41.5 project hours in four days.
I was about to tell Darren (who was working on another of the issues) over Skype that I had no idea how I was going to fix the situation. We had a status call planned for 11 am, and I was about to shift my focus to figuring out how to convey this negative information.
Then, while looking through the native source code, I saw something stick out that I had not seen or noticed before.
It gave me an idea.
My Final Attempt
Like when Dateline plays the recording of a 911 call or a police interview with the prime suspect, I figured I should present the Skype dialog between Darren and me at this point:
jv [11:42 PM] I think I just found the issue!
Darren [11:47 PM] yeah yeah yeah!!!
jv [11:48 PM] Stand by for all the builds and stuff.
jv [12:00 AM] that is crazy stuff, man
Four long minutes later:
jv [12:04 AM] I FREAKIN’ FIXED IT!!!!!!
Darren [12:05 AM] congrats!!!
jv [12:05 AM] Oh my!
jv [12:05 AM] The good news is … it will work.
Before I reached the fix, I had created a very controlled sample set of data to process: data that was pristine and perfect, so I could rule out any issues stemming from years of data manipulation. This way, I could focus on fixing the core issue and worry about any data issues later. My last Skype message to Darren before I decided to get some rest was simple:
jv [12:06 AM] Now to figure out if it will work with the current state of the data.
Implementing the Fix
Getting an early start at 6:15 am, I spent most of the next morning validating there were no data issues, pulling out temporary changes I had injected into the very large codebase, and polishing the fixes required for each impacted repository. From there, I explained the situation and the fixes I had employed. A PR was created for each repository and sent for review.
By 3:30 pm that afternoon, all the changes were merged into the develop branch, and the code was deployed to the DEV environment for validation and review. I stopped working for the week at 4:40 pm, having worked 51.5 hours in five days.
The unexpected work from this cloud transformation has given me a greater appreciation of my friend Greg's perspective. I would wager that a large number of emergency calls could be prevented if best practices and safeguards were part of a standard routine. If you have your HVAC system checked periodically, for example, there is less of a chance that something will go wrong to the degree that the fire department must be called.
IT is no different. When Russell created the issues months ago, we expected the appropriate action would be taken. Even after the work slid off the radar, regression tests or validation efforts should have discovered the issue long before the planned go-live weekend. Instead, those safeguards were missed, which led to multiple emergencies that needed to be addressed.
Have a really great day!
Opinions expressed by DZone contributors are their own.