It was another Monday in the office, testing the next release of our user experience monitoring software. It had been tailwinds towards the release for weeks – everything was progressing nicely. Until the moment I found myself waiting for our Java Agent startup for minutes instead of seconds.
The problem was reproducible – attaching our Java Agent even to a tiny test application increased its startup from two seconds to 30 seconds. My first guess in the hunt for the root cause of the issue was suspecting any of the recent changes in Agent. This guess was quickly dismissed as deployments of previous Agent versions resulted in the very same behavior, which was definitely not present during the previous weeks.
So it had to be something else. This “something else” started revealing itself when looking at thread dumps taken from the lagging JVM process via kill -3 PID:
"Signal Dispatcher" daemon prio=5 tid=0x00007f97f885b000 nid=0x4103 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Finalizer" daemon prio=5 tid=0x00007f97f8054000 nid=0x3103 in Object.wait() [0x0000700000add000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135) - locked (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209) ... cut for brevity ... "main" prio=5 tid=0x00007f97f8003000 nid=0x1c03 runnable [0x00007000003c7000] java.lang.Thread.State: RUNNABLE at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getLocalHost(InetAddress.java:1469) - locked (a java.lang.Object) at eu.plumbr.agent.ch.qos.logback.core.util.ContextUtil.getLocalHostName(Plumbr:38)
Now I was confused. The Logback library stuck at java.net.InetAddress.getLocalHost() has been performing the lookups for localhost for ages. The confusion increased when discovering the issue was not reproducing in our test matrix – throughout the dozens of machines in the matrix, everything seemed to work well.
And then it struck – over the weekend, the good old Mac OS X in my laptop had upgraded itself to MacOS Sierra. Due to yet unknown reasons, this started influencing DNS lookups for localhost which now started to take 30+ seconds instead of milliseconds as before.
Thanks to the right keywords now in place, the solution was just one google query away. Apparently, we were not the first ones to be hit by the issue and with the help of StackOverflow, the solution was now at our fingertips. Bypassing the issue was as easy as adding one more mapping to my /etc/hosts file for the 127.0.0.1 so it would map for the hostname. As a result of this my /etc/hosts now looks as:
$ cat /etc/hosts 127.0.0.1 localhost MacBook-Pro.local ::1 localhost
I do hope this saves some of our Mac-using readers hours of debugging time down the road. And if anyone out there happens to know anyone from the MacOS Sierra team responsible for filesystem / csrutils, let them know that there are a lot of confused Sierra converts who struggle with simple DNS lookups.