Originally written by Marten Hennoch at the Plumbr blog.
Lately our blog has mostly been covering GC tuning and lock contention issues. But our bread and butter is still memory leak detection, as we were very clearly reminded while tracking down a GPU memory leak in a browser.
The story began about a month ago, when we started receiving complaints from end users about Plumbr killing their desktops. When similar reports kept arriving, we escalated the issue and launched a more thorough investigation.
The first weird thing about these complaints was that the crashes were not related to our agent. As you might recall, one important element of Plumbr is a -javaagent, where we do indeed operate dangerously close to OS internals and could conceivably cause such instability.
Instead, the crashes seemed to originate from our browser-based user interface. The UI contains some websocket magic and real-time graph plotting, but overall we are talking about a rather simple application.
Nevertheless, this innocent-looking UI seemed to cause severe performance issues that triggered restarts of either the OS or Google Chrome for some users. A couple of these users were able to reproduce the problem, but not at will. For weeks, we struggled to reproduce the behavior without much to show for it. Even though the behavior looked typical of a memory leak, we just did not manage to make sense of it. Peeking under the hood with Chrome developer tools or monitoring native memory usage from the OS did not reveal any potential suspects.
Only after installing about 10 different Chrome versions on different virtual machines did we seem to be getting somewhere. The problem suddenly revealed itself when we accidentally switched on “GPU memory monitoring” in the Chrome Task Manager on a particular Chrome build. What we saw was a sudden spike in GPU memory consumption while the Plumbr UI was in a background tab: in just 10-15 minutes, GPU memory usage grew by more than 4GB. On office-grade computers this is more than enough to bring your machine to its knees.
So we had found our problem. From there, the cause was easy to find. In order to plot the metrics Plumbr is monitoring, we use the jqPlot library. This library is in turn implemented on top of the HTML5 Canvas to redraw graphics on the fly. As you can do pretty cool things using Canvas, Google has decided to add GPU-level acceleration to Canvas rendering. Using GPU acceleration sounds like a cool idea, unless you couple it with this bug we apparently were facing. It seems the bug was introduced in Chrome 23, fixed in some Chrome 24 build and reintroduced a couple of builds later.
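Since the spike only showed up while the UI sat in a background tab, one way to sidestep a leak of this kind is to stop touching the canvas while the tab is hidden and repaint once when it becomes visible again. The TypeScript sketch below illustrates that idea under stated assumptions; it is not the workaround that was actually deployed, and the names RedrawGate, VisibilitySource, requestRedraw and onVisibilityChange are hypothetical. The only browser feature it relies on is the Page Visibility API (document.hidden and the "visibilitychange" event).

```typescript
// Hypothetical helper for illustration only; not Plumbr's actual patch.
// It skips redraws while the page is hidden and repaints once on return.

interface VisibilitySource {
  // Mirrors document.hidden from the Page Visibility API.
  readonly hidden: boolean;
}

class RedrawGate {
  private source: VisibilitySource;
  private redraw: () => void;
  private pending: boolean = false;

  constructor(source: VisibilitySource, redraw: () => void) {
    this.source = source;
    this.redraw = redraw;
  }

  // Call whenever new data arrives and the chart would normally be redrawn.
  requestRedraw(): void {
    if (this.source.hidden) {
      // Remember that the chart is stale, but leave the canvas alone.
      this.pending = true;
      return;
    }
    this.redraw();
  }

  // Call from a "visibilitychange" event listener.
  onVisibilityChange(): void {
    if (!this.source.hidden && this.pending) {
      this.pending = false;
      this.redraw();
    }
  }
}

// Possible browser wiring (assumes `plot` is an existing jqPlot instance):
//   const gate = new RedrawGate(document, () => plot.replot());
//   document.addEventListener("visibilitychange", () => gate.onVisibilityChange());
//   socket.onmessage = () => { /* update series data */ gate.requestRedraw(); };
```

The gate takes the visibility source as a parameter rather than reading document directly, so the logic can be exercised outside a browser. Any number of updates that arrive while the tab is hidden collapse into a single repaint when the user returns.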
We truly apologize to the users whose machines we ended up crashing due to a memory leak. Honestly, we were not the ones who wrote the leaking code, and we have patched the current production site with a workaround. For us, the case served as a good reminder that we are on the correct path: solving such issues without proper tools like Plumbr is a nightmare, even if you are equipped with years of knowledge in the domain.