
Guess, Don't Measure

By Nikita Salnikov-Tarnovski · Nov. 12, 2014 · Java Zone

I have lost count of how many times I have recommended starting all performance tuning activities by setting goals and measuring a baseline instead of tuning random bits of your source code. In this post I will give you a counterexample that just happened to us.

We have been revamping our own internal monitoring tooling to make sure our production deployment is fast and error-free. For this we rebuilt our logging and monitoring tooling, which now, among other things, includes a bunch of Logstash services aggregating logs into a central log management solution and a Grafana–Kibana combo to make sense of the logs.

Last week we were finally ready to ship the new monitoring solution to production. Immediately we started receiving data about how (badly) some of our public services behaved. For example, acquiring the list of JVMs Plumbr is monitoring was one of the operations whose performance was not satisfactory:

[Chart: performance measurement of the /jvms operation]

As seen from the above, 10% of the queries to the /jvms URL were taking multiple seconds to complete. This was clearly violating the SLA we had set for this particular operation.

In parallel with this data surfacing from our monitoring solution, our product owner had filed a bug in our issue tracker complaining that the list of JVMs for a particular customer with 25 different JVMs was taking "forever" to render. "Forever" was in reality seven seconds, but the important part is that the problem was actually affecting real users and was reported by real users as well.

As yet another parallel activity, one of our engineers, while implementing completely unrelated functionality, stumbled upon a poorly designed query. The exact details are not too relevant, but what was once a simple query had over time evolved into yet another example of the N+1 query problem in action, as sketched below. He brought the problem to me and got permission to rewrite the design. As it happened, the query in question was the very same query fetching the list of JVMs.
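
To make the anti-pattern concrete, here is a minimal sketch in plain JDBC. The table and column names (jvm, account_id, name) are made up for illustration and are not Plumbr's actual schema; the point is only the difference in database round trips.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class JvmListExample {

    // N+1 variant: one query for the ids, then one more query per JVM.
    // For a customer with 25 JVMs this means 26 round trips to the database.
    static List<String> jvmNamesNPlusOne(Connection conn, long accountId) throws SQLException {
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement("SELECT id FROM jvm WHERE account_id = ?")) {
            ps.setLong(1, accountId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
            }
        }
        List<String> names = new ArrayList<>();
        for (long id : ids) { // the extra N queries
            try (PreparedStatement ps = conn.prepareStatement("SELECT name FROM jvm WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        names.add(rs.getString(1));
                    }
                }
            }
        }
        return names;
    }

    // Rewritten variant: everything comes back in a single round trip.
    static List<String> jvmNamesSingleQuery(Connection conn, long accountId) throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement("SELECT name FROM jvm WHERE account_id = ?")) {
            ps.setLong(1, accountId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString(1));
                }
            }
        }
        return names;
    }
}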

Today, during our regular monitoring review, we spotted a sudden drop in the execution times for this operation, as seen in the following chart. What used to look like a multi-second operation was now reduced to a couple of hundred milliseconds:

[Chart: performance improvement of the /jvms operation]

Our first reaction to this data was, "Great, the optimization our engineer made is clearly working; look at this 10x drop in response times." Proud of this, we summoned the engineer in question to the meeting to give him feedback and a pat on the back, showing him the impact of his work.

When we explained the chart to the engineer, he just sat there, silent and confused. Five more minutes of confusion and embarrassment later, it became apparent that the change containing his optimization had not even been merged to the production branch yet, so there was no way this performance improvement was related to his work.

This was now awkward in many ways. We had a clear sign of a performance improvement and no idea why it had happened. The doubt went so far that we suspected the functionality had to be broken, but numerous checks against the service demonstrated it was behaving exactly as it was supposed to.

And then it struck me. Yesterday I had been dealing with some data integrity issues and had discovered a missing foreign key in our storage. I added the key and an index on it. Checking the continuous-integration logs revealed that this change had been deployed to production exactly when the performance improvement appeared. So we had found the actual reason for the change in response times.
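
For context, the kind of change involved looks roughly like the following sketch; the table, column, constraint, and index names are hypothetical, not our actual schema. The effect is that queries filtering on the foreign-key column can use an index lookup instead of a full table scan.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class AddMissingForeignKey {

    // Hypothetical migration; all identifiers below are illustrative only.
    static void migrate(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // Restore referential integrity between jvm.account_id and account.id
            st.executeUpdate("ALTER TABLE jvm ADD CONSTRAINT fk_jvm_account "
                    + "FOREIGN KEY (account_id) REFERENCES account (id)");
            // The index is what actually speeds up queries filtering on account_id
            st.executeUpdate("CREATE INDEX idx_jvm_account ON jvm (account_id)");
        }
    }
}

Depending on the database, adding the foreign key alone may or may not create an index on the referencing column, which is why the index is added explicitly in this sketch.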

And still, here I stand, feeling a mix of surprise and embarrassment. On one hand, it is awesome that we managed to patch a performance problem before its impact even became fully evident. On the other hand, the whole situation contradicts my core belief, the one where I always say "measure, don't guess." Considering that actions speak louder than words, I now have some food for thought.


Published at DZone with permission of Nikita Salnikov-Tarnovski, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
