Monitoring Grails Apps Part 2 – Data Collection
Congratulations – you've got your new Grails site live; now you just need to make sure it keeps running!
We've already seen how to expose custom data for a Grails application via JMX – but in this post we'll focus on the data items that we might want to collect and monitor.
If we take a step back, what are we trying to achieve by monitoring our application?
Business Perspective
Your answer may depend upon the maturity of the operational support for the application under scrutiny, but the first big question to answer is usually:
Is my site up?
For some people it may be "Is my site making me money?", but they probably won't make money if the site is down – so let's go with availability.
This is a reactive question – if the answer is no, get the sysadmin out of bed at 3 in the morning to restore the service!
However, this is critically important, so you would normally have an HTTP check that hits the front page of the site – the HTTP response would be checked for a 2xx code, and often you may also check for an expected string in the body.
For the practical side of this post, we'll be using Opsview to demonstrate the theory. Assuming you've got the Opsview VMware appliance installed and have completed the quick start, this check for a generic Grails application called 'monitored' could be:
check_http -H $HOSTADDRESS$ -p 8443 --ssl -w 5 -c 10 -u '/monitored/js/application.js' -s 'spinner'
The TCP port is specified with -p and the --ssl flag tells the check to perform an SSL handshake. The -w and -c options are warning and critical threshold values respectively for response time in seconds. The context path portion of the URL is specified by the -u option, and -s is used to check for an expected string in the content.
Note: Opsview can show plugin help that will list the other options and provide more information on usage.
If you front your application server with a web server, it is common practice to have two checks: one via the web server and one directly against the application server. This helps pinpoint where the problem lies in the event of an outage. If you have load-balanced servers, the checks should be applied across all the nodes.
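As a rough sketch of that pattern (hostnames are illustrative, and it assumes the web server terminates SSL on the default port 443 while Tomcat listens on 8443):
# via the web server, exercising the full stack
check_http -H www.example.com --ssl -w 5 -c 10 -u '/monitored/js/application.js' -s 'spinner'
# directly against the application server
check_http -H app01.internal.example.com -p 8443 --ssl -w 5 -c 10 -u '/monitored/js/application.js' -s 'spinner'
If the first check fails while the second still passes, the web tier is the prime suspect.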
As with all things, the effectiveness of these 'front door' checks depends on the point of view of the monitoring server. If the web pages are served from the same data centre as the monitoring server, you might choose to check outbound connectivity to a well-known site or use a second viewpoint to determine whether your data centre is isolated from the internet.
For example, this could be done by getting http://www.downforeveryoneorjustme.com/ with your desired URL and checking for the response string "It's just you."…
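A minimal sketch of such a second-viewpoint check (the target URL is illustrative):
check_http -H www.downforeveryoneorjustme.com -u '/www.example.com' -w 5 -c 10 -s "It's just you"
A failure here – either no response at all, or the string not matching – suggests that the site really does look down from the outside world, or that outbound connectivity from the data centre is broken.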
A second big question is usually "What are our customers experiencing?" – the realm of user experience monitoring and synthetic transactions is beyond the scope of this post; however, you might be interested in this blog post on using Opsview with Selenium.
Operational Perspective
Going down a level, from an operational point of view, you will want to ensure that:
1. The server is up
This is done with a host check command (e.g. ping / SSH).
2. The server is responding to connections on the appropriate ports (including 8080/8443)
Here we use check_tcp.
3. The application server can connect to the database
This is a check_tcp, but executed by the agent on the app server and invoked by check_nrpe or check_by_ssh (a sketch of items 2 and 3 follows the list).
4. The database server is responding to queries
You can use check_sql_advanced, other Nagios check_sql plugins or a custom check (e.g. https://gist.github.com/994973).
5. The email server is working
You guessed it, check_smtp.
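As a sketch of items 2 and 3 (the port, macros and remote command name are illustrative – check_db_tcp would need to be defined in the agent's NRPE configuration):
# 2. app server answering on its HTTPS port, run from the monitoring server
check_tcp -H $HOSTADDRESS$ -p 8443 -w 2 -c 5
# 3. app server can reach the database, executed on the app server via NRPE
check_nrpe -H $HOSTADDRESS$ -c check_db_tcp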
More Proactive Checks
As this is about Grails applications, we'll focus on Tomcat and the JVM, and assume you're using MySQL and running it all on a Linux server with the Opsview agent package installed (Opsview can also monitor Windows servers with NSClient++) and accessible via NRPE (I strongly recommend using iptables to restrict access to the NRPE port to just the monitoring server).
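For example, a minimal iptables sketch (assuming the default NRPE port 5666 and a monitoring server at 192.0.2.10 – substitute your own address and fold this into your existing firewall policy):
# allow NRPE only from the monitoring server, drop it from everywhere else
iptables -A INPUT -p tcp --dport 5666 -s 192.0.2.10 -j ACCEPT
iptables -A INPUT -p tcp --dport 5666 -j DROP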
Top 5 things to check at the OS level, with examples:
1. CPU utilisation / load average on the server
check_load -w 8,5,2 -c 20,9,4
2. Amount of free memory
check_memory -w 90 -c 98
3. Amount of free disk space
check_disk -w 5% -c 2% -p /
4. That the Tomcat process is running
check_procs -C jsvc -a tomcat -w 3:3 -c 3:3
5. That the MySQL process is running
check_procs -a mysqld --metric=CPU -w 60 -c 90
We'll ignore scanning log files for this post and assume you're using jAlarms to notify Opsview of any problems that the application detects (how-to here).
JMX Checks
These are queried by check_jmx. There are a number of variants, including JNRPE, which has the benefit of not spinning up a JVM for each check; however, for simplicity we'll assume the use of the standard jmxquery.jar check.
1. Heap space (e.g. for -Xmx512m)
check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:1616/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -I HeapMemoryUsage -J used -vvvv -w 400000000 -c 500000000
2. Active database connections
check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:1616/jmxrmi -O Bean:name=dataSourceMBean -A NumActive -vvvv -w 16 -c 19
3. Any key application metrics that you've exposed over JMX (assuming web analytics are handled separately) – the standard check_jmx implementations expect these to be numeric (a sketch follows).
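As a sketch, a check against one of your own metrics might look like the following – the MBean name, attribute and thresholds here are purely illustrative, so substitute whatever you exposed in part 1:
check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:1616/jmxrmi -O Bean:name=orderServiceMBean -A PendingOrders -vvvv -w 100 -c 500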
This will appear in Opsview like so (the observant will notice that there are graphs available – keep reading to find out how):
The Art of Monitoring
Thresholds
These will require tuning once you understand how your application behaves; you need to ensure that 'false positives' are minimised yet alerts are raised if something is going wrong.
Frequency of Checks
Again, this needs to be balanced: if you check too frequently, you will place an additional burden on your system and use computing power that could be serving customers; if you check too infrequently, you might miss an opportunity to avert a disaster or, worse still, be unaware of an outage or SLA breach until the phone rings… Ultimately the choice is yours on this one, though Opsview defaults to a 5-minute check period.
Notifications
Obviously we want our operations team to be notified in the event of a problem. Opsview supports multiple channels such as email and RSS, with SMS and service desk integration in the Enterprise edition. Further information on notification methods is here.
Graphing
Alerts are about the 'here and now', while graphs allow you to view trend information and can assist in correlation or in identifying areas for investigation whilst troubleshooting root causes. For example, if you were monitoring the number of active sessions and site response time, you would see the corresponding peaks of a Slashdot effect.
Opsview will graph performance data but not 'unmapped' status data. The data for the graphs is stored in 'round-robin database' (RRD) format under /usr/local/nagios/var/rrd/<host-name>/<servicecheck-name>.
Mapping JMX Service Check Results
The standard check_jmx only returns status data, therefore we need to map the data so that it can be graphed. This is done in the /usr/local/nagios/etc/map.local file. For example, to match the JMX checks defined above:
# JAVA_HEAPMEMORYUSAGE_USED
# Service type: Java JMX heap space check using check_jmx
# output: JMX OK HeapMemoryUsage.used=454405600{committed=532742144:init=0:max=532742144:used=454405600}
/output:JMX.*HeapMemoryUsage.used=([0-9]+).committed=([0-9]+).init=([0-9]+).max=([0-9]+)/
and push @s, [ "heap_memory",
    [ "usedMB", GAUGE, $1/(1024**2) ],
    [ "committedMB", GAUGE, $2/(1024**2) ],
    [ "initMB", GAUGE, $3/(1024**2) ],
    [ "maxMB", GAUGE, $4/(1024**2) ] ];

# JMX active DB connections
# output: JMX OK NumActive=3
/output:JMX.*NumActive=([0-9]+)/
and push @s, [ "numactive",
    [ "numactive", GAUGE, $1 ] ];
You'll need to ensure your map.local file is valid after you've saved it – this can be done with:
perl -c map.local
Why can't I see my graphs straight away?
The RRDs won't be created until the map.local changes have been picked up – this should happen when the performance data has been processed. However, Opsview may not be aware that graphs are available for a service check until the configuration is reloaded.
So if you're impatient like me, you could always try restarting Opsview, forcing the desired service check to be executed, checking for the RRD under nagios/var/rrd and then performing an admin reload…
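If you want to confirm the RRDs by hand, something along these lines works (the directory layout is as described above, though it can vary between Opsview versions):
# list the RRD files created for a given host and service check
ls /usr/local/nagios/var/rrd/<host-name>/<servicecheck-name>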
Then you should see something like this, e.g. for the heap space:
Lastly, a recent introduction to Opsview was the very useful viewport sparkline graphs on the Performance view, allowing rapid correlation (with filtering):
Summary
We've had a very quick run-through of Grails application monitoring with Opsview – hopefully there's enough there to keep you busy and your application healthy!
From http://leanjavaengineering.wordpress.com/2011/05/29/monitoring-grails-apps-part-2/