A Plan for Performance Bugs in 10 Steps: When Managers Want Answers Now

Catch these bugs!

You have probably experienced this: you ship a new release, everything works, but in production it turns out to be much slower than expected. It behaves completely differently. Customers complain. Managers want you to wave a magic wand to conjure the problem away. It's not that easy.

Performance is elusive. Where to start? What to promise?

No developer likes performance fixing. It all works, finally, and now you have to go through all your code again. You will have to rewrite, you will break things, and elegant code may become ugly and inflexible. All too often, you put a lot of dedication into an improvement, and when it runs, you find out it does not help at all, or even makes things worse.

Randomly fixing inefficiencies until it's fast enough makes planning impossible and results uncertain. It is tempting to just wrap a cache around it and hope that does it. It may not be enough. More and faster hardware might cost a lot of money and may not be enough either.

So, managers want a plan. Here's that plan.

The Ten Steps

  1. Measure.
  2. Fix the flat tire.
  3. Always attack the number one time-waster.
  4. Do less.
  5. Reduce data quantities.
  6. Reduce roundtrip penalties.
  7. Low-level optimization.
  8. Hybridize.
  9. Cache.
  10. Optimize the perception.

Before You Start

Often, intuition makes us work in reverse, starting with step ten, hoping a minimum effort will do. This is a very long, slow and frustrating road to success. To be fast, go from one to ten.

Tip: if users complained, pay them a visit and watch what they do.
Often the thing that really annoys them is very different from the posted complaint.

It's Never What You Think

Performance problems are very counter-intuitive; only in hindsight is the cause obvious, much like a regular bug. Why is it so counter-intuitive? We humans assume that if something is complex in our mind, it's hard for the computer too. To a computer, all that matters is whether the work requires a large number of machine instructions.

Set a Target

Set a performance target to focus the team on measures that matter, so that nobody tinkers with things that are known to be slow but yield less than a 1% gain.

This target is not a promise. Only after measuring can you prudently predict what is feasible with the time, people, and money you have. Profound speed-ups require profound changes and profound decisions about people and budgets. Cheap silver bullets and free magic wands don't exist.

Don't Confuse Performance and Scalability

Performance is an individual user experience. Scalability is improving parallel throughput.

Scaling up is a waste of money if the system is slow with one user. If you improve performance, you get more throughput on your current infrastructure as a bonus. First improve performance for a good user experience, then scale up for good throughput.

Step 1: Measure

Since it is never what you think, do not assume. Measure. Let the measurements guide you where to look next.

Measurements can be very simple. Advanced profiler software is nice, but not necessary. Just understand where in your code time is spent, and why.

You will have to dive into the code, or achieve underwhelming results.

For your language, find the function that returns a clock with micro- or nanosecond resolution, and simply log the elapsed time for a piece of code suspected to be slow. Start at a high level and work your way down to a low level. Remove the log lines afterward.

Run the same code a couple of times with the same input. The measurements will vary, mostly due to OS housekeeping. After a few tries, a minimum time emerges. This is your baseline performance, the number you are after.

You can make calculations with that number, to predict what is achievable with optimizations. This helps to choose what to work on.
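
As a rough illustration of the minimum-of-several-runs measurement described above, here is a minimal Python sketch. The helper `min_elapsed_us` and the stand-in workload `suspected_slow_code` are hypothetical names made up for this example; only `time.perf_counter_ns()` is a real standard-library call.

```python
import time

def min_elapsed_us(func, repeats=5):
    """Run func several times and return the fastest run in microseconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter_ns()          # monotonic clock, nanosecond units
        func()
        elapsed_us = (time.perf_counter_ns() - t0) / 1_000
        best = min(best, elapsed_us)         # the minimum filters out OS noise
    return best

def suspected_slow_code():
    # Hypothetical stand-in for the code under suspicion.
    return sorted(range(50_000), key=lambda n: -n)

baseline = min_elapsed_us(suspected_slow_code)
print(f"baseline: {baseline:.0f} µs")
```

The printed number will differ per machine; what matters is that it stabilizes after a few repeats.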

Tip: Only use measurements from non-overload situations. Wait times during overload are great for a "Dramatic Improvements" show, but useless for making technical decisions. During overload, you are measuring stress-effects, errors being handled and queues brimful of waiting events. It says nothing about the speed of your code.

Make a Simple Stopwatch

You will find certain iterations that are expensive, and you have to figure out why. If you log every iteration, you get a huge log that tells you little.

Take the 30-minute effort to write a simple "stopwatch" object: a tiny library that keeps a sum and count of lap times, with a minimum and maximum. When finished, let it write a log line with the count, minimum, maximum, and average lap time:

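import time  # Minimal Python sketch. Remake and complete this in your language

class StopWatch:                      # Keeps lap count and sum of µsec lap times
    def __init__(self):
        self.laps, self.sum, self.min, self.max = 0, 0.0, float("inf"), 0.0
        self.t_prev = time.perf_counter()
    def start(self):                  # Set t_prev
        self.t_prev = time.perf_counter()
    def lap(self):                    # laps += 1; update sum, min, max, t_prev
        now = time.perf_counter()
        usec = (now - self.t_prev) * 1e6
        self.laps, self.sum = self.laps + 1, self.sum + usec
        self.min, self.max = min(self.min, usec), max(self.max, usec)
        self.t_prev = now
    def finish(self):                 # Log laps, sum, min, max, average lap time
        avg = self.sum / max(self.laps, 1)
        print(f"laps={self.laps} sum={self.sum:.0f} min={self.min:.0f} max={self.max:.0f} avg={avg:.1f} (µs)")

stopwatch = StopWatch()
for item in range(1000):              # some iteration
    stopwatch.start()
    sum(range(100))                   # ... suspected slow stuff ...
    stopwatch.lap()
stopwatch.finish()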


You will find that such a tiny piece of code is priceless and highly reusable. There are many cases where simply knowing the count, minimum, maximum, and average tells you what's going on. And if you fix something, you will see how much time you saved.

Performance theory says a median and 95/99th percentile is better info, but StopWatch would need to fill and analyze a sortable list, which is CPU intensive, influencing adjacent measurements.

Use Unit Tests to Prevent New Surprises

Simple change: Let your unit test's assert function (or a wrapper) log elapsed microseconds, and store times of each unit test of each run in a simple database or machine-readable file. In the delivery chain, generate a graphic that tracks the speed of each assert call over your unit test history. Use a logarithmic vertical axis. Jump-ups are an early warning of performance degradation, and jump-downs are signs of improvement.

It might cost you a day to make, but it will be a big help for everyone ever after.
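
A minimal sketch of the idea, assuming a hypothetical wrapper named `timed_assert` and a CSV file as the machine-readable store. The file name, the `run_id` label, and the two sample checks are made up for illustration; a real project would hook this into its own test framework.

```python
import csv
import os
import tempfile
import time

def timed_assert(name, check, log_path, run_id):
    """Run check(), log its elapsed microseconds to a CSV file, then assert."""
    t0 = time.perf_counter_ns()
    result = check()
    elapsed_us = (time.perf_counter_ns() - t0) / 1_000
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([run_id, name, f"{elapsed_us:.1f}"])
    assert result, f"{name} failed"

# Hypothetical log location and test cases.
log_path = os.path.join(tempfile.mkdtemp(), "unit_test_times.csv")
timed_assert("sorts_small_list", lambda: sorted([3, 1, 2]) == [1, 2, 3], log_path, run_id="build-1")
timed_assert("sum_is_correct", lambda: sum(range(10)) == 45, log_path, run_id="build-1")

with open(log_path, newline="") as f:
    rows = list(csv.reader(f))       # one row per assert: run, test name, µs
```

Each build appends its rows, so a later charting step can plot elapsed time per test name across runs.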

Step 2: Fix the Flat Tire

There is no point discussing other solutions when the root problem is obvious. If you have a flat tire, fix the tire first, then worry about speed.

For example, if a randomly read database runs on a mechanical hard drive, that's a flat tire. Or a script that crashes and restarts all the time. Or insufficient hardware. Or a ridiculously inefficient algorithm. Etcetera.

Even if the fix itself won't improve much, it stands in the way of doing things right and must be dealt with first.

Pick up small inefficiencies along the way that are easy to fix, such as unintended casts between int, float, string, and datetime, or code you can easily move out of a loop. Don't waste time measuring or discussing them; just fix them. The win may be negligible now, but when the big time-wasters are gone, small ones make a difference.

Step 3: Always Attack the Number One Time Waster

When you measure, you will quickly find the top five things that take the most time. Always fix the biggest time-waster, measure again, then fix whatever pops up next as the biggest, and so on. If you start with smaller culprits, it's very hard to see whether your work makes a difference at all.

Consider this fictitious example: the percentage of time spent in parts of the code before and after a 10x speed-up of the top five time-wasting functions.

Function            Before: 100 events/s    After: 395 events/s
ParseAAABBB         50%                     19.8%
FindReplaceToken    25%                     9.9%
GetUsableItem       4%                      1.6%
UpdateMyStructure   3%                      1.2%
InitMyStructure     1%                      0.4%
all other things    17%                     67.2%

What Is Happening?

  • The system is only 3.95x faster overall, even though each of the top five functions was sped up 10x.
  • With each improvement, all time percentages shift in counter-intuitive ways.
  • The number one time-waster, ParseAAABBB, still dominates the named functions, and all other effort seems futile.
  • The more you optimize, the more "all other things" dominates, which is not one thing but a sum of hundreds of tiny time-wasters, each of which is a lot of work to fix.
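
The 3.95x figure follows directly from the "before" percentages. The short Python sketch below redoes the arithmetic; the variable names are made up for this example, and the numbers are the fictitious ones from the table.

```python
# Percentage of time per function before optimization (fictitious example).
before = {
    "ParseAAABBB": 50,
    "FindReplaceToken": 25,
    "GetUsableItem": 4,
    "UpdateMyStructure": 3,
    "InitMyStructure": 1,
    "all other things": 17,
}
speedup = 10  # applied to every function except "all other things"

# Time per function after optimization, in units of "percent of the old total".
after_time = {
    name: pct if name == "all other things" else pct / speedup
    for name, pct in before.items()
}
total_after = sum(after_time.values())         # 8.3 + 17 = 25.3
overall_speedup = 100 / total_after            # how much faster the whole is
events_per_second = 100 * overall_speedup      # starting from 100 events/s

# New share of each function in the (smaller) total.
after_pct = {name: 100 * t / total_after for name, t in after_time.items()}

print(f"overall speed-up: {overall_speedup:.2f}x, {events_per_second:.0f} events/s")
for name, pct in after_pct.items():
    print(f"{name:18s} {pct:5.1f}%")
```

The untouched 17% caps the overall gain: even an infinite speed-up of the top five could never exceed 100 / 17, roughly 5.9x.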

This is why performance optimization can be so discouraging. Attacking the top five made a significant leap, but each 1% extra is a project. Well... that's how it looks.

Is the number one really reduced to the minimum? Did you try to call ParseAAABBB ten times less often? Or give it ten times less data to process? Does it repeatedly call slow low-level functions? If you find more ways to reduce the biggest time-waster, the small time-wasters become worth fixing.

To keep the performance project encouraging and effective, constantly attack the biggest time waster.

Step 4: Do Less

The most effective way of speeding up is to simply let the computer do less.

As a developer you sometimes forget that real code execution is layers of repetitions of repetitions of repetitions. Like a fractal, wherever you zoom in: more repetitions. A few hundred lines of code could cause a billion CPU instructions.

The biggest gain is saving on higher-level calls because they represent the biggest amounts of repetitions.

This means:

  • Do more checks before calling functions iteratively.
  • Break elegant, powerful code up into smaller parts to save on unnecessary calls.
  • Move code outside of loops so fewer precautions are needed inside loops.
  • With iterative lookups, test the usual suspects (first, last, previous) before even iterating at all.
  • If many functions try to find out the same thing, find it out once and pass the result as a parameter.
  • Split functions into a high-level version with all the checks and smart stuff for single random calls, and a low-level version that checks nothing, to be called from iterations.
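
The last point, splitting a function into a checked and an unchecked version, might look like the following Python sketch. All function names here are hypothetical, and the normalization itself is just a placeholder for real work.

```python
def normalize_checked(value, lo, hi):
    """High-level version: validates everything, meant for single ad-hoc calls."""
    if not isinstance(value, (int, float)):
        raise TypeError("value must be a number")
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return _normalize_unchecked(value, lo, hi - lo)

def _normalize_unchecked(value, lo, span):
    """Low-level version: checks nothing, meant to be called from loops."""
    return (value - lo) / span

def normalize_all(values, lo, hi):
    """Check the range once, compute the span once, then loop cheaply."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    span = hi - lo                               # moved outside the loop
    return [_normalize_unchecked(v, lo, span) for v in values]

single = normalize_checked(5, 0, 10)
batch = normalize_all([0, 5, 10], 0, 10)
```

The batch version assumes its input values are already known to be numeric; that assumption is what makes skipping the per-item checks acceptable.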

Often a lower-level design decision forces you to do unnecessary things at a higher level. Fixing that means restructuring, which can be a lot of work. But you're fixing a design flaw, and that's always a good thing.

Step 5: Reduce Data Quantities

By just making the quantity of data smaller at the beginning of a processing chain, every next layer processes it faster, because there is just less work to do. Work involves:

  • Tune the lowest level requests for fewer objects.
  • Tune the lowest level requests for fewer properties per object.
  • Use advanced widgets that efficiently fetch data when scrolling or navigating.
  • Avoid exact totals.

Limiting and Paging

Many APIs support limiting (max_results=100) or paging (offset=30&size=10). A simple change, surprising gain. You do need UI widgets that support server-side paging or infinite scrolling.
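
A minimal sketch of client-side paging in Python. The function `fetch_page` is a hypothetical stand-in for a real API call with `offset` and `size` parameters; here it slices an in-memory list so the example is self-contained.

```python
from itertools import islice

calls = []  # records each simulated roundtrip, for illustration only

def fetch_page(offset, size):
    """Hypothetical stand-in for an API request with offset and size parameters."""
    calls.append((offset, size))
    server_items = [f"item-{n}" for n in range(95)]   # pretend this is server-side
    return server_items[offset:offset + size]

def iter_items(page_size=10):
    """Yield items lazily, fetching the next page only when it is needed."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:     # a short page means there is no more data
            return
        offset += page_size

# The UI only needs 25 items to fill the first screen.
first_screen = list(islice(iter_items(), 25))
calls_for_first_screen = len(calls)
```

Only as many pages are requested as the consumer actually reads, which is the behavior an infinite-scrolling widget relies on.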

Reduce Data Properties

Try to find ways to reduce the number of properties per object. First see if the query or API can omit properties through parameters. If it can't, instruct the (JSON) parser to skip the parts you don't need.

Many REST APIs' GET or POST queries can only return "all" properties of the queried objects. That is great if you need to display them all, but a waste if you just need a title and an ID. For every object in a list, the back-end needs time to gather and encode all these properties you don't want, and you need time to parse this unwanted data into properties you don't use.

Avoid Exact Totals

Not providing the exact number of items is a big time saver. When count queries differ per user or imply a search, you must run the query all the way through to find the exact number, while "9+" or "more..." is often answer enough for the user.
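
A small Python sketch of a capped count. The names `badge_count` and `unread_messages` are hypothetical; the generator stands in for any lazy query that yields matches one by one.

```python
from itertools import islice

def badge_count(matches, cap=9):
    """Return an exact count up to cap, or 'cap+' after looking at cap + 1 matches."""
    seen = sum(1 for _ in islice(matches, cap + 1))   # stops early
    return f"{cap}+" if seen > cap else str(seen)

def unread_messages(total):
    # Hypothetical lazy query: yields one match at a time.
    for n in range(total):
        yield n

few = badge_count(unread_messages(3))
many = badge_count(unread_messages(1_000_000))
```

With a million matches, only ten are ever produced. In a SQL back-end, the comparable trick would be a row limit on the counting query, assuming your database supports one.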

Improve Storage

Why not deploy data storage that is completely specialized for your type of data patterns? Next to the good old RDBMS, there are NoSQL flavors, caches, key/value stores, column stores, graph stores, time-series stores, reverse-index stores, Hadoop, and data lakes. There are hundreds, and many of them are fast and scalable.

Reducing data quantities is relatively simple and high yield because it saves CPU in each layer of your application. Changing storage type is very effective but not trivial.

Step 6: Reduce Roundtrip Penalties

Disk access is a thousand to a million times slower than memory access, and network access can be even worse. Disk and network have a big latency penalty for each individual call.

Files: Bundle I/O or avoid files altogether. Use mmap() if possible (see the tips below).

Databases: bundle queries and updates, run a node of the database locally.

Networking: in browsers, consider SSE (server-sent events) instead of AJAX, or rethink your protocols altogether.

Microservice architectures: Prevent accumulating wait times of services calling each other internally: use redundant properties, make mutating transactions asynchronous. If requests are not order-critical and have similar speed, fire them simultaneously with threads.

Preventing latency is relatively complex to implement, but it's a very powerful time saver.
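
As a sketch of firing independent requests simultaneously, the Python example below compares sequential and threaded calls. The function `call_service` and the service names are hypothetical; `time.sleep` simulates the network roundtrip so the example runs without a network.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_service(name, latency_s=0.05):
    """Hypothetical stand-in for a network call; sleep simulates the roundtrip."""
    time.sleep(latency_s)
    return f"{name}: ok"

services = ["profile", "orders", "recommendations", "notifications"]

# One after another: the wait times accumulate.
t0 = time.perf_counter()
sequential = [call_service(s) for s in services]
sequential_s = time.perf_counter() - t0

# All at once: total wait is roughly the slowest single call.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(services)) as pool:
    parallel = list(pool.map(call_service, services))   # results keep input order
parallel_s = time.perf_counter() - t0

print(f"sequential: {sequential_s * 1000:.0f} ms, parallel: {parallel_s * 1000:.0f} ms")
```

This only helps when the requests do not depend on each other's results.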

Step 7: Low-level Optimization

Now that you have shrunk the giants, you're ready for the real bare-metal, geeky, microsecond-shaving operations. Only do this for functions called many times in a row.

Saving 0.1 ms doesn't seem like much, but in a function called 10,000 times per event it adds up to 1 second, the difference between terrible and acceptable.

Example: machine learning code is mostly slowish Python, but 99.99% of the time is spent in a few heavily optimized C++ low-level vector math calls of TensorFlow or PyTorch. It hardly matters that the remaining 0.01% is slowish.

Compiled languages: Write simple, straightforward code. All optimizer flags on. Use const and preprocessor statements to tell the optimizer what to rely on.

Non-compiled languages: make the deepest, most repetitive loops into very small and simple functions with as few variables as possible, so that code and variables stick around in Level 1 cache, close to the CPU.

If your deepest loop has many if/else or switch/case statements, try a branch table.
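
In Python, the closest analogue of a branch table is a dictionary that maps a key straight to a function, replacing a chain of comparisons with one lookup. The tiny instruction set below is hypothetical, and whether this is actually faster for your case is something to measure.

```python
def op_add(acc, arg):
    return acc + arg

def op_sub(acc, arg):
    return acc - arg

def op_mul(acc, arg):
    return acc * arg

# The table: one lookup replaces a chain of if/elif comparisons.
BRANCH_TABLE = {"add": op_add, "sub": op_sub, "mul": op_mul}

def run(program):
    """Execute (opcode, argument) pairs against an accumulator starting at 0."""
    acc = 0
    table = BRANCH_TABLE              # local name avoids a global lookup per iteration
    for opcode, arg in program:
        acc = table[opcode](acc, arg)
    return acc

result = run([("add", 5), ("mul", 3), ("sub", 4)])
```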

If your deepest loop has many short iterations, try loop unrolling.

It is hard to know the speed of low level operations off-hand. Things that look fast can be surprisingly slow and vice versa. Measure and experiment, then implement.

Step 8: Hybridize

Optimizations are trade-offs based on assumptions about the most often occurring situations. If there are two different situations that deserve a different optimization strategy, there can be heated debate on which one should prevail.

In such cases, hybridize. Write two optimized approaches and use them both, dynamically.

Example: social media timelines are a join of posts that authors with few followers push to their followers' inboxes, and posts queried at read time from the outboxes of celebrities with a million or more followers.
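
A toy Python sketch of such a hybrid timeline. Everything here is hypothetical: the in-memory dictionaries stand in for real storage, and the threshold of 3 followers is chosen only to keep the example small.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3           # hypothetical cut-off, tiny for demonstration

followers = defaultdict(set)      # author -> users following that author
inboxes = defaultdict(list)       # user -> posts pushed at write time
outboxes = defaultdict(list)      # celebrity -> posts pulled at read time

def follow(user, author):
    followers[author].add(user)

def publish(author, timestamp, text):
    post = (timestamp, author, text)
    if len(followers[author]) >= CELEBRITY_THRESHOLD:
        outboxes[author].append(post)         # strategy 1: cheap write, pull on read
    else:
        for user in followers[author]:
            inboxes[user].append(post)        # strategy 2: fan out on write

def timeline(user):
    """Join pushed posts with posts pulled from followed celebrities."""
    posts = list(inboxes[user])
    for author, celebrity_posts in outboxes.items():
        if user in followers[author]:
            posts.extend(celebrity_posts)
    return sorted(posts, reverse=True)        # newest first

follow("ann", "bob")
for fan in ("ann", "cat", "dan"):
    follow(fan, "star")
publish("bob", 1, "hi")
publish("star", 2, "hello fans")
ann_timeline = timeline("ann")
```

Both strategies live side by side, and the follower count decides at publish time which one applies.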

Performance is the only good reason for redundant code.

Step 9: Cache

Caching is only useful if the data is slow to generate and pretty static for a while.

If data is too dynamic, caching merely costs time and memory. Frequent pitfall: Cache validity. Checking it must be very simple and fast, or else caching is useless. Always measure with and without cache.

Consider caching little things on lower levels to make everything faster. You can avoid lots of unnecessary, repetitive code execution.
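
One way to keep the validity check simple and fast is a time-to-live: an entry is valid if it is younger than a fixed age. The Python sketch below assumes a hypothetical `TtlCache` class; the injectable clock exists only so the demonstration is deterministic.

```python
import time

class TtlCache:
    """Cache whose validity check is a single subtraction and comparison."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._store = {}              # key -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key, compute):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()             # the slow part, run only on a miss
        self._store[key] = (now, value)
        return value

# Demonstration with a fake clock, so no real waiting is needed.
fake_now = [0.0]
cache = TtlCache(ttl_s=60, clock=lambda: fake_now[0])
computations = []

def slow_report():
    computations.append(1)            # counts how often the slow code ran
    return "report"

cache.get("daily", slow_report)       # miss: computes
cache.get("daily", slow_report)       # hit: served from cache
fake_now[0] = 61.0
cache.get("daily", slow_report)       # expired: computes again
```

The hit and miss counters make it easy to check whether the cache earns its keep for your data.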

Step 10: Optimize the Perception

If there's something you can't optimize, make it feel less slow with UI tricks. Some suggestions:

  • Show a fancy transition animation to bridge the first second of waiting.
  • Serve expired cache data in a UI widget, but indicate that it's still loading. In the background, do heavy processing to generate new data, and replace it when it is ready; updating an already filled list feels less slow.
  • In portal-like screens, render the most used widget first, so people can read while the less used widgets fill the rest of the screen.
  • In portal-like screens, load the parts that are in the viewport first and only load the rest when a scroll move is detected.
  • Instead of letting a user look at an empty screen, draw a size-correct contour of the upcoming layout, then fill it with content when it arrives.

General Rules of Thumb for Performance

Generic Wisdom

  • Really dramatic gains are achieved by combining multiple techniques to speed up a culprit.
  • Each ten times heavier load brings completely different challenges and requires different techniques; test a ramp-up to ten times the current load to find the breaking point.
  • If you made the target but there are some easy wins left, don't quit. Win big.
  • Performance always comes with some complexity, rigidity, and redundancy.
  • Never condone dirty hacks for performance; dirty hacks always backfire.
  • If you abuse tools/frameworks/techniques against what they were made for, it will perform well with small numbers, but it gets horribly slow when numbers get large.
  • Any storage data model works better when tables are stupid simple with no exceptions, as opposed to shredded with lots of coupling tables.
  • Refactoring from "rows x columns" to "columns x rows", or moving the same work to another layer, will not help performance: the work is the same, it's just done in a different order.
  • Do not expect much from tuning application, runtime, or kernel settings. It is easy to destroy performance completely, and hard to get a tiny improvement.
  • Make performance a standard design consideration. Premature optimization is evil, but poor performance from poor design and careless coding is foreseeable.

Practical Wisdom

  • Sometimes it's quicker to sort lists inside your application, sometimes the backend or the UI is faster. Try and measure. Always choose the fastest sorting function you can find. Still, prevent having to sort at all, if you can.
  • Regexes have high baseline overhead; even a regex that does nothing, like /()/ , can take up to a millisecond. Avoid regexes in iterative code.
  • Few people know mmap() is the fastest, lowest-level file I/O. Regular file I/O is a layer on top. You read and modify shared memory directly, and the kernel writes any changes to disk, as lazily as possible. I've seen 10x speed-ups in file I/O, and I have used it in heavy production applications for high-volume, zero-latency inter-process communication. It's not exotic: Apache httpd serves files through mmap(), and running executables are mmap() instances.
  • With strings, use byte indexes and byte lengths instead of character indexes and character lengths if you can. UTF-8 characters are variable-width, so character positions must be counted from the start, every single time. That is not fast with strings that are megabytes or gigabytes long.
  • In long lists, inserts are slow because you are shifting big blocks of memory a few bytes for each insert. Append-then-sort is much faster than insert-sort.
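
To make the mmap() tip concrete, here is a minimal sketch using Python's standard `mmap` module. The file name and contents are made up for this example, and the temporary directory is only there to keep it self-contained.

```python
import mmap
import os
import tempfile

# Hypothetical data file; mmap cannot map an empty file, so pre-size it.
path = os.path.join(tempfile.mkdtemp(), "counters.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 16)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mem:    # length 0 maps the whole file
        mem[0:5] = b"hello"                  # modify the mapped bytes in place
        first_bytes = bytes(mem[0:5])        # read back straight from memory
        mem.flush()                          # ask the OS to write changes to disk now

with open(path, "rb") as f:
    on_disk = f.read(5)                      # regular file I/O sees the change
```

Without the explicit `flush()`, the operating system would still write the change back eventually; the call is only there to make the timing predictable.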

Conclusion

With this plan, you can bring structure, focus, and predictability to a performance optimization project. There is a lot you can do with simple tools. You could even turn it into a sport!

