
Taming Unexpected Production Issues


You have unexplained production issues from time to time. The business is impacted and management breathes down your neck while the pieces of the puzzle are collected. There is no way to recreate the problem in the QA setup. The issue has been lurking for quite a while and is becoming more and more frequent. Sound scarily familiar?

I have a peculiar love for such puzzles. I enjoy the adrenaline rush of the hunt and the investigative work of building theories. The latest one I inherited took about six months to figure out, as the legacy product had a number of issues that would surface only under load and with bad luck, a.k.a. timing. Solving a problem of this magnitude takes endurance, iteration, information collection, and data mining.

The information below will point you in the direction of what to collect and how to crunch the collected data. The system I worked with runs on Enterprise Linux Server release 5.8 (Carthage) and a JBoss application server. You will have to adjust the specifics to match your OS and application server.

Things that will help on the hunt

1.  Periodic thread dumps saved to a file. This can be as simple as a cron job that curls the JBoss JMX console to invoke the jboss.system.ServerInfo listThreadDump method.

2.  Periodic thread CPU utilization saved to a file. This can be as simple as a cron job that curls the JBoss JMX console to invoke the jboss.system.ServerInfo listThreadCpuUtilization method.

3.  Periodic results of the top command.

4.  Logging information from the application; there is no such thing as redundant debug info. A JSP page that allows changing the verbosity of a given category is priceless here.

5.  Monitoring software like NewRelic. It is great at pinpointing the slowest-running transactions or SQL statements, the amount of time spent on calls to external systems, performance comparisons, and SLA metrics. Watch the watcher, though! Don’t forget to update its libraries when upgrading third-party packages, the JVM, or the OS on the product or the machine, and analyze its log files as well. How do I know? From somewhat painful experience.

6.  Have a timestamp for your incident and, if possible, alerts for the early warning signs, so you can watch the problem build up with DevOps before they decide to pull the plug.

My DevOps guru Jared set up the JMX curling and top collection for me with the power of cron and scripting, so every 10 minutes there is an update.

This is the cron expression:

*/10 * * * * /tools/scripts/jboss/threaddump.sh > /dev/null 2>&1

And the script itself looks like this:

#!/bin/sh

# timestamped file and directory names for this run
logfile=`date +%H%M%S`
logdir=`date +%Y"/"%m"/"%d`
datestamp=`date +%Y%m%d`

# create the per-day directory on first run
if [ ! -d "/logs/console/$logdir" ]; then
        mkdir -p /logs/console/$logdir
fi

# thread dump and per-thread CPU utilization via the JBoss JMX console, plus a top snapshot
curl "http://localhost:8080/jmx-console/HtmlAdaptor?action=invokeOpByName&name=jboss.system%3Atype%3DServerInfo&methodName=listThreadDump" > /logs/console/$logdir/thread_dump_${datestamp}_${logfile}.html 2>&1
curl "http://localhost:8080/jmx-console/HtmlAdaptor?action=invokeOpByName&name=jboss.system%3Atype%3DServerInfo&methodName=listThreadCpuUtilization" > /logs/console/$logdir/thread_cpu_${logfile}.html 2>&1
top -H -b -n 1 > /logs/console/$logdir/top_${logfile}.log 2>&1
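If the jmx-console is not reachable (it may be locked down in production), a similar dump can be collected with the JDK's jstack utility instead. A minimal alternative sketch, assuming jps and jstack are on the PATH and the JBoss JVM is launched via org.jboss.Main:

#!/bin/sh
# Alternative collection sketch: thread dumps via the JDK's jstack instead of the JMX console.
# Assumes jps/jstack are on the PATH and JBoss is the only org.jboss.Main process on the box.
logfile=`date +%H%M%S`
logdir=`date +%Y"/"%m"/"%d`
datestamp=`date +%Y%m%d`

mkdir -p "/logs/console/$logdir"

# locate the JBoss JVM (adjust the class name to match your launcher)
pid=`jps -l | grep org.jboss.Main | awk '{print $1}'`

if [ -n "$pid" ]; then
        jstack -l "$pid" > "/logs/console/$logdir/jstack_${datestamp}_${logfile}.txt" 2>&1
fi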

Now let’s go hunting.

The first question to ask is: do I have threads that are consuming an unreasonable amount of CPU, and what are they doing? Look at the JMX console manually or use pre-collected artifact 2 to find such threads (and see the sketch after the table). The list will look like this:

Thread Name             CPU (milliseconds)
ajp-0.0.0.0-8009-13     16783407
ajp-0.0.0.0-8009-8      5566706
ajp-0.0.0.0-8009-15     5087530
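If the JMX numbers are inconclusive, the saved top output (artifact 3) can be used to the same end. A minimal sketch, assuming the stock top -H -b layout with a seven-line summary header and %CPU in the ninth field; the date in the path is just an example:

# list the five busiest threads in each saved top snapshot
for f in /logs/console/2014/06/02/top_*.log; do
        echo "== $f =="
        tail -n +8 "$f" | sort -rn -k9 | head -5
done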

Check the log files for the last printout from the suspected thread. Is the thread utilized as expected, or has it been in limbo for hours with no log messages, which can be an indication of an infinite loop? The Perl snippet below finds the latest timestamp of a log message from each thread in the JBoss logs.

Once a suspect is found, its whereabouts can be checked manually from the JMX console or in artifact 1. The last two infinite loops I’ve seen were both related to the use of HashMap by multiple threads, which is succinctly described here.

#!/usr/bin/perl
use strict;
use warnings;

my $dir = '.';
my %threads;

opendir(DIR, $dir) or die $!;

while (my $file = readdir(DIR)) {

	# We only want files
	next unless (-f "$dir/$file");

	# Only look at rotated JBoss log files (server.log.*)
	next unless ($file =~ m/^server\.log\..*$/);

	open(my $fh, '<', "$dir/$file") or die("Could not open file $file.");
	foreach my $line (<$fh>) {
		#2014-02-27 01:00:00,827 25498282 INFO  [com.company.client.package.Class] (ajp-0.0.0.0-8009-22:) 
		if ($line =~ /^(201\d).(\d\d).(\d\d)[ \t]+(\d\d:\d\d:\d\d),\d\d\d \d+[ \t]+[A-Z]+[ \t]+[^ \t]+[ \t]+\(([^\)]+):\).*$/) {
			# keep the most recent timestamp seen for this thread (MM/DD/YYYY HH:MM:SS)
			$threads{$5} = "$2/$3/$1 $4";
		}   
	}
	close($fh);
}

foreach my $thread (sort keys %threads) {
	print "$thread\t$threads{$thread}\n";
}

closedir(DIR);
exit 0;
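One possible way to run it (the script name here is just a placeholder), capturing the output for the next step:

# run from the JBoss log directory where the rotated server.log.* files live
# (last_thread_activity.pl is a placeholder name for the snippet above)
cd /jboss/server/default/log
perl last_thread_activity.pl > threads.txt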

The resulting threads.txt then needs to be sorted chronologically:

sort -k2 -r threads.txt > threads_sorted.txt

The next question should be: does the system experience deadlocks or resource contention? To answer it, the collection of artifact 1 over time needs to be analyzed. This is why the thread_dump files carry date information as part of their names. On Linux/Unix, the simple command below will put the needed information into a single file:

grep -r "lockName:" thread_dump_* > suspects.txt

Now the content can be parsed into a tab-delimited text file consumable by good old Excel and run through a pivot table.

#!/usr/bin/perl -w
use strict;
use warnings;
if (!defined $ARGV[0]) {
    print "No FilePath Specified!\n";
    exit 1;
}

my $LOGFILE = $ARGV[0];

print "Time\tThread\tState\tLock Name\tLock Name Class\tCall Stack\n";

open(my $fh, "<$LOGFILE") or die("Could not open log file.");
foreach my $line (<$fh>) {

	if ($line =~ /^thread_dump_(\d\d\d\d)(\d\d)(\d\d)_(\d\d)(\d\d)\d\d\.html:(.*)$/) {
		my $year = $1;
		my $month = $2;
		my $date = $3;
		my $hour = $4;
		my $min = $5;
		my $td = $6;

		my @matches = ($td =~ m/(\>b\<Thread: [^:]+ : priority:\d+, demon:[a-z]+, threadId:\d+, threadState:[A-Z_]+, lockName:[a-zA-Z0-9\$\.\@]+\>br\<\>blockquote\<[ a-zA-Z_\$\.\(\):0-9\>\<]+\>\/blockquote\<)/g);
		
		
		foreach my $x (@matches) {
			if ($x =~ /^\>b\<Thread: ([^:]+)\>\/b\< : priority:\d+, demon:[a-z]+, threadId:\d+, threadState:([A-Z_]+), lockName:([a-zA-Z0-9\$\.]+)(\@?[0-9a-f]*)\>br\<\>blockquote\<([ a-zA-Z_\$\.\(\):0-9\>\<]+)\>\/blockquote\<$/) {
				print "$month/$date/$year $hour:$min\t$1\t$2\t$3$4\t$3\t$5\n";
			} else {
				print STDERR "No match : $x\n";
			}
		}
	} else {
		print STDERR "No match in line : $line\n";
	}   
}
close($fh);
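A possible invocation (the script name is again a placeholder); stdout becomes the tab-delimited file for Excel, while stderr keeps whatever did not match the pattern:

# parse_thread_dumps.pl is a placeholder name for the snippet above
perl parse_thread_dumps.pl suspects.txt > locks.tsv 2> unparsed.txt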

This will produce content like the example below:

Time	Thread	State	Lock Name	Lock Name Class	Call Stack
6/2/14 0:00	Reference Handler	WAITING	java.lang.ref.Reference$Lock@57db6806	java.lang.ref.Reference$Lock	java.lang.Object.wait(Native Method)<br>java.lang.Object.wait(Object.java:503)<br>java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)<br>
6/2/14 0:00	RMI TCP Accept-1098	RUNNABLE	null	null	java.net.PlainSocketImpl.socketAccept(Native Method)<br>java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)<br>…<br>
6/2/14 0:00	GC Daemon	TIMED_WAITING	sun.misc.GC$LatencyLock@10a8ec79	sun.misc.GC$LatencyLock	java.lang.Object.wait(Native Method)<br>sun.misc.GC$Daemon.run(GC.java:117)<br>
6/2/14 16:00	ajp-0.0.0.0-8009-3	BLOCKED	java.lang.Class@4c6cb02a	java.lang.Class	java.util.TimeZone.getDefaultInAppContext(TimeZone.java:734)<br>java.util.TimeZone.getDefaultRef(TimeZone.java:630)<br>java.util.Date.normalize(Date.java:1206)<br>java.util.Date.toString(Date.java:1027)<br>…<br>

The aggregation will look like this:

Count of Thread (Column Labels: BLOCKED, TIMED_WAITING, WAITING)

Row Labels	Count(s) by state	Grand Total
com.arjuna.ats.arjuna.coordinator.TransactionReaper	97	97
com.enservio.util.ChannelThreadPool$Channel	388	388
com.mchange.v2.async.ThreadPoolAsynchronousRunner	288	288
com.mchange.v2.resourcepool.BasicResourcePool	2	2
EDU.oswego.cs.dl.util.concurrent.Semaphore	97	97
java.lang.Class	1	1
java.lang.Object	1065, 291	1356
java.lang.ref.Reference$Lock	97	97
java.lang.ref.ReferenceQueue$Lock	485	485
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject	580, 485	1065
java.util.concurrent.Semaphore$NonfairSync	97	97
java.util.concurrent.SynchronousQueue$TransferStack	380	380
java.util.HashMap	1	1
java.util.LinkedList	388	388
java.util.TaskQueue	485, 194	679
java.util.TreeSet	97	97
org.apache.tomcat.util.net.JIoEndpoint$Worker	6531	6531
org.apache.tomcat.util.net.JIoEndpoint$WorkerStack	2	2
org.hornetq.core.postoffice.impl.PostOfficeImpl$Reaper	97	97
org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread	97	97
org.hsqldb.lib.HsqlTimer$TaskQueue	96	96
sun.misc.GC$LatencyLock	97	97
Grand Total	2, 3379, 9057	12438

A pivot table like the one above lets you see where the threads are waiting and/or blocked, and the stack traces are readily available to drill down into. In my case, this allowed me to identify the memcached client's default configuration as a bottleneck, and it also pointed to an architectural deficiency created by the use of memcached values throughout the application life cycle. It manifested itself as BLOCKED threads on com.mchange.v2.resourcepool.BasicResourcePool: JDBC connections not released in time for others to reuse.
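If Excel is not at hand, a rough equivalent of the pivot can be produced on the command line. A minimal sketch, assuming locks.tsv is the tab-delimited output of the parser above (thread state in column 3, lock class in column 5):

# count thread occurrences per lock class and state, mimicking the pivot table above
# (NR > 1 skips the header row written by the parser)
awk -F'\t' 'NR > 1 { count[$5 "\t" $3]++ } END { for (k in count) print k "\t" count[k] }' locks.tsv | sort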

Additional clues can be collected from the exceptions in the log files (artifact 4) and the errors reported by artifact 5 (NewRelic). The first can be collected with the following command and then aggregated manually:

grep -r Exception server.* > exceptions.txt
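One way to do the manual aggregation is to count occurrences per exception class. A rough sketch, assuming GNU grep and the usual SomethingException naming convention:

# tally exceptions by class name, most frequent first
grep -o '[A-Za-z0-9_.$]*Exception' exceptions.txt | sort | uniq -c | sort -rn | head -20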

The second will have all the aggregation already done. Of particular interest are timeout-related messages in the logs:

grep -ir timeout server.* > timeout.txt
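To see whether the timeouts build up toward the incident, they can be bucketed per hour. A small sketch, assuming the log lines start with the yyyy-MM-dd HH:mm:ss,SSS timestamp shown in the Perl snippet earlier:

# count timeout messages per hour (first 13 characters = date plus hour)
grep -ih timeout server.* | cut -c1-13 | sort | uniq -c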

Everything else is application-specific and will reflect the logging messages of your particular system. It can be processed in the same way to extract metrics and correlate them over time.

With all the information at hand, start from the timestamp of the incident and go back in time through the artifacts to find the tipping point where it all started collapsing. You may find multiple issues nested one on top of another. Create an action plan for each issue and work to implement the fixes one by one. It can take multiple iterations in production to iron out all of those, so be patient and keep hunting.
