More and more Internet users buy from web shops these days. Research shows that the share of European Internet users who buy online has grown from 40% in 2004 to 84% in 2008. Additionally, the large web retailers in my country saw their revenue grow in 2009 and in the first part of 2010 as if the crisis had never materialized.
I also like to shop on the web, to buy electronics, books or tickets. Now and then I enter a web shop where I have to wait before pages appear fully. Most of the time I’ll move away: with just one click I’m off to the competitor. The increased comparison possibilities and freedom of choice offered by the Internet are not only valid for the products, but also for the web shops themselves. Therefore, it has become crucial for the success of the web shop to have a responsive web site.
With only a few concurrent visitors, it is usually not so hard to have a quick website. However, with the growing trend of Internet sales, the increasing integration and complexity of back-end systems and the ever richer user experience demanded by marketing, this often becomes a big challenge for developers and operators. The result may be systems that black out under high load or respond too slowly.
So the question is: how can we prevent these performance and availability problems and how can we assure that a web site is always quick and available?
On the basis of real-life, trial-and-error experience we have come to an approach that can be summarized as: measure, don’t guess; seven steps to performance success.
How performance problems get to you
Frustrations and loss of revenue
When internal applications are slow, this is frustrating for the users. They cannot do their job efficiently anymore and will be de-motivated. Call center agents have to apologize continuously to the customer on the other end of the line for their slow systems. The customer in turn will be frustrated by the long waiting times and long phone calls. When external apps are slow, this will have direct consequences for the revenue of the company. For instance, if I want to buy a book or insure my car, I compare online and choose a shop. If I have to wait when I am on such a site, I simply browse to the competitor to buy there. Since I will not be the only one behaving like that, this has its effect on the company revenue.
Disruption of regular development
Slowness problems usually manifest themselves unexpectedly, for instance after the introduction of a new application or a new release. One cause is that the non-functional aspects of the software usually get too little attention, too late. The difficulties that turn up in production put high pressure on both the operators and the developers to solve problems that are usually hard to find. This disrupts the regular development of new releases: the development team is busy with nothing but firefighting.
Just adding hardware: a cheap solution?
The solution to the slowness is regularly sought in adding more hardware: load balancing over more servers or, in modern fashion, running the app in an elastic cloud. However, if the bottleneck turns out not to be located in the web tier but somewhere else, this investment in more servers is just wasted money. Moreover, yearly recurring licensing and operational costs are often underestimated. So, even when extra hardware is a solution, it may be an easy solution but it is certainly not always a cheap one.
Seven steps to performance success
It can be a valid choice to run the risk of performance problems in production and deal with them in a re-active manner. However, it is usually wiser to be pro-active and prevent them. This approach brings more certainty, peace of mind and also saves money. It consists of the following seven steps.
Step 1: Define performance requirements
Defining the performance requirements well is usually a neglected activity. Most of the time the requirement is formulated as "it just has to be fast" or "at least as fast as the previous platform". With such vague definitions the confusion starts. The goal is unclear and is typically interpreted very differently by the business and the IT department. To prevent this, the goals should be formulated in a SMART way and be prioritized. Speed will be more important for a shop homepage than for a page where a customer can change his profile. By defining priorities, this order of importance is made explicit and clear. In SMART, the A stands for Attainable and the R for Realistic. These aspects are often ignored by the non-technical contributors to these requirements. In that case, a requirement for a short response time will lead to an extended development time or expensive hardware. Half a second slower during peak hours can be acceptable if this saves tons of money. On the other hand, reducing the response time of an important page from 4 to 2 seconds can lead to a substantial growth in revenue. So, a solid analysis of the impact of performance on the business is needed to be able to define the performance requirements clearly, in a SMART way and prioritized, and to balance the costs and benefits.
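One way to keep such requirements specific and measurable is to write them down as data that a test can check later. The following is only a minimal sketch in Java: the class name, the page names, the thresholds and the priorities are all hypothetical and would have to come out of the business analysis described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: one performance requirement written as checkable data.
// The page names, thresholds and priorities below are illustrative assumptions.
final class PerformanceRequirement {
    final String page;         // page or service the requirement applies to
    final int priority;        // 1 = most important to the business
    final int percentile;      // e.g. 95 means "95% of the requests"
    final long maxMillis;      // response time limit at that percentile
    final int concurrentUsers; // load under which the limit must hold

    PerformanceRequirement(String page, int priority, int percentile,
                           long maxMillis, int concurrentUsers) {
        this.page = page;
        this.priority = priority;
        this.percentile = percentile;
        this.maxMillis = maxMillis;
        this.concurrentUsers = concurrentUsers;
    }

    // True when the response time measured at the agreed percentile is within the limit.
    boolean isMetBy(long measuredMillis) {
        return measuredMillis <= maxMillis;
    }

    // Returns a copy of the requirements, most important first.
    static List<PerformanceRequirement> byPriority(List<PerformanceRequirement> all) {
        List<PerformanceRequirement> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingInt((PerformanceRequirement r) -> r.priority));
        return sorted;
    }

    @Override
    public String toString() {
        return page + ": " + percentile + "% of requests within " + maxMillis
                + " ms at " + concurrentUsers + " concurrent users (priority " + priority + ")";
    }

    public static void main(String[] args) {
        List<PerformanceRequirement> requirements = List.of(
                new PerformanceRequirement("change profile", 2, 95, 4000, 50),
                new PerformanceRequirement("shop homepage", 1, 95, 2000, 500));
        for (PerformanceRequirement r : byPriority(requirements)) {
            System.out.println(r);
        }
    }
}
```

Written this way, a requirement names the page, the load and the limit, so business and IT can argue about concrete numbers instead of about what "fast" means.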
Step 2: Execute a PoC for Performance
The IT world is very sensitive to trends. Having been around in the IT industry for 15 years, I’ve seen a few. A technology is hot for a while and then quickly becomes out of fashion and yesterday’s news. It is replaced by something which is supposedly much better and which everyone seems to follow blindly. Such fashionable topics are, to name a few, CORBA, CGI, applets, EJB, Struts, Spring, JavaServer Faces, XML, SOA, OMT, UML and RIA. Often new, bleeding-edge technology is used in a project just for the sake of being fashionable or for getting it on the developers’ resumes. In addition, each technology or framework comes with its own teething troubles and most of the time uses more resources than its predecessor. The goal of such a new technology is generally to improve flexibility, productivity or maintainability; performance usually has no priority or has not been considered at all.
Therefore, it is questionable whether the chosen new technology and architecture will meet the specified performance requirements. In practice, this regularly becomes evident only at a late stage of the project: when it has already slipped beyond the planned production date. Only then may it become clear that the chosen technology or architecture is just not sufficient. And switching to a different technology or architecture usually results in high cost and long delays. Therefore it is essential to execute a Proof of Concept for performance, in which all technology and architecture components are touched, in a vertical slice of the application. It is important that this benchmark is performed in a sufficiently representative manner, which I will elaborate on in my next post. By executing this PoC and by understanding and using its results, the project can be corrected early in the right direction to prevent excessive cost and delay.
Step 3: Test representatively
Slowness of applications in development environments is often neglected with the rationale that faster hardware in the production environment will solve this problem. However, whether this is really true can only be predicted with a test on a representative environment and in a representative way. In such an environment, more than just the hardware needs to be representative. I have experienced multiple times that a database query on the test database with 1,000 customers took less than 10 ms, while on the production database with 100,000 customers it turned out to take tens of seconds. So, if the development team does not test with a full, complete database, going to production may lead to some surprises. It is also important that the number of concurrent users and their behavior are simulated well in the test. Furthermore, care should be taken to take caching effects into account: if the test continuously requests the same product for the same customer, this data will be in the database or query cache the second and following times. This will speed up the request considerably and make it much faster than with many customers and products. Such a test is therefore not representative of the real situation. A suitable performance test tool and performance expertise are necessary to create a valuable test. The most popular open source performance test tool is Apache JMeter.
Figure 1. Screenshot of a run of a performance test in Apache JMeter.
JMeter is a tool made by programmers, for programmers. Test scripts can be created with visual elements like an HTTP request, which can be recorded and configured. Many elements are available and if you need more, you can always fall back on a BeanShell element in which you can manipulate the request, response and various JMeter variables. If even that does not meet your needs, you can extend the JMeter source code and develop your own elements. Because of its for-programmers nature, it is less suited for the average tester. Also, its reporting features and the maintainability of its scripts are not so great. Therefore, commercial tools like HP Mercury LoadRunner, Borland SilkPerformer or Neotys’ NeoLoad may be good alternatives for companies.
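Whatever tool is used, the caching effect described above deserves a check of the test data itself. The sketch below, in plain Java, is only an illustration under assumptions: the class and method names are hypothetical, the customer and product ranges are made up, and uniform random selection is a simplification, since real customers favor some products over others.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: spread test requests over many customers and products
// so that caches are not hit unrealistically often. All names are illustrative.
final class TestDataSpread {

    // Builds request keys "customerId:productId", drawn uniformly from the given ranges.
    static String[] randomRequests(int count, int customers, int products, long seed) {
        Random random = new Random(seed); // fixed seed keeps a test run repeatable
        String[] requests = new String[count];
        for (int i = 0; i < count; i++) {
            requests[i] = random.nextInt(customers) + ":" + random.nextInt(products);
        }
        return requests;
    }

    // Fraction of requests whose key already occurred earlier in the run:
    // a rough indication of how often a cache could have answered.
    static double repeatRatio(String[] requests) {
        if (requests.length == 0) {
            return 0.0;
        }
        Set<String> seen = new HashSet<>();
        int repeats = 0;
        for (String request : requests) {
            if (!seen.add(request)) {
                repeats++;
            }
        }
        return (double) repeats / requests.length;
    }

    public static void main(String[] args) {
        String[] naive = new String[1000];
        Arrays.fill(naive, "42:7"); // same customer, same product every time
        String[] spread = randomRequests(1000, 100_000, 5_000, 1L);
        System.out.println("naive repeat ratio:  " + repeatRatio(naive));
        System.out.println("spread repeat ratio: " + repeatRatio(spread));
    }
}
```

A repeat ratio close to 1 is a warning sign that the test mostly measures the cache rather than the application.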
Performance testing from the cloud
The emergence of cloud computing adds new possibilities for performance testing. An elastic compute cloud like Amazon EC2 provides the ability to quickly scale up the number of application deployments as load increases. For performance testing the cloud can be used the other way around: for temporary use of many load-generating test clients to generate expected and peak loads for your application. This saves you from having to buy many servers to run the load-generating clients, and if you run these performance tests only periodically, this can be an economical solution. Plenty of information is available on how to run various performance tools in the cloud.
Step 4: Test continuously
With a representative test as one of the last steps before going live, we prevent expensive bad-performance surprises from popping up in production. However, the same surprises will still pop up, only earlier and with less impact. To save costs and prevent large architectural refactoring, it is crucial to test for performance as early as possible. This is just like any other software defect in quality assurance: the later in the development process defects are detected, the more costly they are.
At a popular web shop I had the following challenge: we wrote the performance tests only at the end of the six-weekly release period, after functional testing had taken place and functional defects were corrected. In case serious performance defects popped up, a crisis team was gathered and we found ourselves in a stressful situation. There was usually not enough time to fix the defect before the release date, so my recommendation at times was to defer the release date. However, deferring the release date was often simply not possible, because TV or radio ads had been bought to promote the new functionality. So, how to solve this dilemma? We found the solution in applying agile principles: test early and the team is responsible. We included meeting the performance requirements of the new or changed feature in the definition of done. The development process included a common automatic build. Unit tests of a feature were written as usual by the developer. We now added performance tests to the spectrum: the developer writes the performance test script for his feature (service, page) in JMeter, side by side with his unit tests on the classes. When the nightly build with Maven has taken place, the application is deployed on WebSphere and the performance tests are run by the JMeter Ant script. This script generates a report which is emailed to stakeholders. In this way, the IT department gets early insight into the performance of new and changed features, so it can adapt its course more quickly, back off early from an unfortunate architecture or approach, minimize surprises and also lower its costs. An additional benefit is that writing test scripts gets done more quickly than before, because the developer still has all details of the new feature fresh in his memory. These details are for instance the conditions under which the service may be called and with which parameters, variations and special cases.
This way, communication overhead between a performance tester and a developer on these details is drastically reduced, further improving productivity.
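The idea of keeping a performance check next to the unit tests can be sketched in plain Java as well. This is not the JMeter and Ant setup described above, only a minimal illustration under assumptions: the class and method names are hypothetical, the percentile uses the nearest-rank method, and timings taken inside a single JVM say little about behavior under concurrent load.

```java
import java.util.Arrays;

// Hypothetical sketch of a performance check that a developer could keep next
// to the unit tests of a feature. It is plain Java, not a JMeter script.
final class FeaturePerformanceCheck {

    // Nearest-rank percentile: the smallest sample for which at least
    // a fraction p of all samples is less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        rank = Math.min(Math.max(rank, 1), sorted.length); // keep the rank inside the array
        return sorted[rank - 1];
    }

    // Runs the feature unmeasured first (warm-up), then returns the measured times in nanoseconds.
    static long[] measureNanos(Runnable feature, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            feature.run();
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            feature.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    // True when the given percentile of the measured runs stays within maxMillis.
    static boolean meetsRequirement(Runnable feature, int runs, double p, long maxMillis) {
        long[] samples = measureNanos(feature, Math.max(1, runs / 10), runs);
        return percentile(samples, p) <= maxMillis * 1_000_000L;
    }

    public static void main(String[] args) {
        // Stand-in for a real feature call; sorting an array is just an example workload.
        Runnable feature = () -> {
            int[] data = new int[10_000];
            for (int i = 0; i < data.length; i++) {
                data[i] = data.length - i;
            }
            Arrays.sort(data);
        };
        System.out.println("meets 95% within 50 ms: " + meetsRequirement(feature, 100, 0.95, 50));
    }
}
```

A check like this can fail the nightly build when a feature misses its requirement, which is the point of putting performance in the definition of done.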
Step 5: Monitor and Diagnose
When a new version of the software is released into the production environment, the question always is: will it actually perform like we saw in testing and acceptance environments? And we keep our fingers crossed. It is therefore important in such cases to monitor carefully what happens with the performance and availability.
There are all sorts of tools and services available to monitor your web site for availability and response times of web pages, like Uptrends, Site24x7 and Dotcom-monitor. They look at the application as a black box and measure once every few minutes. However, to be able to take the right measures in case of a fatal incident, it is necessary to be able to pinpoint the problem.
It is essential to monitor on multiple levels and on multiple application parts. For levels, think of hardware, OS, app server, web server, database and application. Inside a Java application, this can be achieved with JAMon. JAMon is an open source timing API. It basically works like a stopwatch with a start() and stop() call. Every method which you want to measure gets its own stopwatch (or counter). Each counter maintains statistics like the number of calls, average, maximum, standard deviation, etc., and this information can be queried. The individual calls are not stored. This approach results in low memory usage and a low performance overhead, at the cost of some information loss.
Figure 2. JAMon API start() and stop() calls in a Spring interceptor
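To make the stopwatch idea concrete without depending on any library, here is a minimal self-contained sketch in Java. It is explicitly not the JAMon API: the class and method names are made up for illustration, and it only shows the principle of keeping aggregated statistics per label instead of storing individual calls.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the stopwatch idea: keep aggregates per label, not
// individual calls. This is NOT the JAMon API; all names here are made up.
final class Stopwatches {

    // Aggregated statistics for one label.
    static final class Stats {
        private long hits;
        private double total;
        private double totalOfSquares;
        private double max;

        synchronized void add(double millis) {
            hits++;
            total += millis;
            totalOfSquares += millis * millis;
            max = Math.max(max, millis);
        }

        synchronized long hits() { return hits; }

        synchronized double average() { return hits == 0 ? 0.0 : total / hits; }

        synchronized double max() { return max; }

        // Population standard deviation, derived from the running sums.
        synchronized double stdDev() {
            if (hits == 0) {
                return 0.0;
            }
            double mean = total / hits;
            return Math.sqrt(Math.max(0.0, totalOfSquares / hits - mean * mean));
        }
    }

    private static final Map<String, Stats> STATS = new ConcurrentHashMap<>();

    // Starts a measurement; the returned value must be passed to stop().
    static long start() {
        return System.nanoTime();
    }

    // Stops a measurement and adds the elapsed time to the statistics of the label.
    static void stop(String label, long startNanos) {
        record(label, (System.nanoTime() - startNanos) / 1_000_000.0);
    }

    static void record(String label, double millis) {
        STATS.computeIfAbsent(label, key -> new Stats()).add(millis);
    }

    // Returns the statistics of a label, or null when nothing was recorded for it.
    static Stats stats(String label) {
        return STATS.get(label);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = Stopwatches.start();
        try {
            Thread.sleep(5); // stand-in for the call being measured
        } finally {
            Stopwatches.stop("OrderDao.findOrder", start); // hypothetical label
        }
        Stats stats = Stopwatches.stats("OrderDao.findOrder");
        System.out.println("hits=" + stats.hits() + " avg=" + stats.average() + " ms");
    }
}
```

Calling stop() in a finally block keeps the statistics complete when the measured call throws; an interceptor around service or DAO calls is a natural place for such a pair.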
Recently, a new competitor of JAMon appeared: Simon. It claims to be JAMon’s successor, although it has (had) some issues. Then there is the question: where to measure? It makes most sense to measure all incoming calls, like web requests, and all outgoing calls, for instance to the database. Furthermore, it is useful to measure parts like Spring beans, EJBs and DAOs. Measuring these parts is not only relevant with new releases; trends and usage spikes are also useful to monitor, in order to solve problems quickly or prevent them altogether. The open source tool JARep offers the possibility to store JAMon data from a cluster in a database and to monitor trends and changes graphically.
Figure 3. JARep shows the increasing response time trend starting October 15, on two of the four production JVMs.
We had the following situation at one of my customers. Processing an order gradually took more and more time over a period of several weeks. This happened while no new release was introduced and no other page became slower. This behavior was a complete mystery, until we looked deeper in our JARep monitoring tool. The troublemaker turned out to be a DAO executing a prepared statement with only part of the variables being bind variables. With the help of JARep, we could look back to where the trend of increasing response time started, and so to when the problems started. We could also see that this problem was only present on one of the two machines. With this knowledge and his log book, the operator remembered that on the start date he had experimented with a new JDBC driver to try to solve a memory leak. This did not seem to change anything concerning performance, which was indeed the case in the beginning. Problems only appeared slowly during the following weeks. The new driver had been left in place and turned out to be a time bomb. When we put back the old driver, the problem simply disappeared! This real-life experience shows the usefulness of monitoring and trend analysis on application internals.
Step 6: Tune based on evidence
If an application turns out to be too slow, tuning can provide a solution. Tuning can take place on multiple levels. Adding hardware can seem a cheap solution. However, when hardware is added at a place where the bottleneck is not located, this has little use.
Tuning therefore consists of the following five steps. First, identify which pages or services do not meet the stated requirements. Second, isolate the problem: where is it located, in which layer, in which component? This can be made clear by testing and monitoring the separate parts. The third step is diagnosing. In fact, this comes down to forming a hypothesis about why this component is so slow. This can for instance be a missing or wrong index on a database table or the invocation of too many small queries. Fourth, the component is improved based on this hypothesis. Finally, one needs to verify whether the improvement actually brings the expected speedup. If so, the hypothesis is supported and the speedup is the result. If not, there is something wrong with the hypothesis and we need an alternative one. As soon as the performance of the system meets its requirements, tuning is finished.
Figure 5. Finding evidence
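The verification step can be sketched as a small comparison of measurements taken before and after a change. This is only an illustration in Java under assumptions: the class and method names are hypothetical, the median is used because it is less sensitive to outliers than the average, and the minimum gain that counts as evidence is a choice the team has to make.

```java
import java.util.Arrays;

// Hypothetical sketch of the verification step: accept a tuning change only
// when measurements show a real gain. Names and thresholds are illustrative.
final class TuningEvidence {

    // Median of the given response times; the input array is left untouched.
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no measurements");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Relative gain of the median: 0.25 means the median dropped by 25%.
    // Assumes positive response times, so the median before the change is not zero.
    static double relativeGain(double[] before, double[] after) {
        double medianBefore = median(before);
        return (medianBefore - median(after)) / medianBefore;
    }

    // True when the measured gain is at least the gain the hypothesis predicted.
    static boolean hypothesisConfirmed(double[] before, double[] after, double minGain) {
        return relativeGain(before, after) >= minGain;
    }

    public static void main(String[] args) {
        double[] before = {100, 110, 105};  // response times in ms before the change
        double[] after = {60, 65, 62};      // response times in ms after the change
        System.out.println("gain: " + relativeGain(before, after));
        System.out.println("confirmed: " + hypothesisConfirmed(before, after, 0.2));
    }
}
```

When the check fails, the change is rolled back and an alternative hypothesis is tried, exactly as in the loop described above.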
Right tools for the right job
The right tools are indispensable: performance test tool, enterprise profiler, heap monitor, etc. I have seen several developers work multiple days on assumed performance improvements which turned out not to help at all, or which even slowed down the application and also deteriorated its maintainability and flexibility. This is caused by the fact that developers are used to moulding functionality from source code and therefore also work from source code to improve performance. What is missing here is: measure, don’t guess. This is something developers learn in my performance training. Experience has also taught me to judge every proposed improvement separately and to only implement the improvement when we have proven that it really helps.
There are many tools to choose from. Live monitoring is essential to see the actual performance problems. Being able to do root cause analysis and to find the needed evidence is essential to effectively solve those problems. On the open source front there is VisualVM to the rescue, my favorite open source performance tool. On the commercial APM front there are the big vendors like HP, CA (Wily) and Quest, which can provide an extensive solution including some or all of: end user experience, transaction profiling, infrastructure and database performance. There are also smaller, more specialized vendors like dynaTrace and AppDynamics. I like their products because they are innovative and really effective at finding the root causes.
Step 7: Share the responsibility for the whole chain
When an incident happens in production, this usually means stress. A performance problem in production often leads to finger pointing. The DBA says that he has looked and nothing is wrong with his database. The network operator concludes the same thing about his network. The app server operator about his app server, the software developer about his source code and the back end operator about his back end. It is never them, it is always the other guy.
The application often gets thrown over the wall to the operations department. Responsibilities then hold only on one side of that wall. If software development, maintenance, testing and/or operation are outsourced to external parties, this can lead to tricky situations. Before you know it, contracts and legal procedures are at play and cooperation is far away. Both parties stick to their position, costs rise and precious time gets lost.
Finding out which part of the chain is responsible for the slowness can partly be achieved with proper tools that monitor the whole chain and that are used from early on in the development process. But there is more to it than just tooling. Experience with and knowledge of tooling and technology are indispensable, as is giving priority to the proper use of the tools. Most important is to prevent the formation of separate kingdoms and finger pointing between them, and instead to operate together as a multi-disciplinary performance team that shares the responsibility for the whole chain.
Summary and Conclusions
In this growing online world with demanding customers it has become crucial that services provided on the web are always available and always fast enough. This is often challenging for developers and operators: performance problems manifest themselves in various ways, like frustration, loss of revenue and disruption of development; and just adding hardware is a doubtful solution. The question is: how can we as developers and operators assure that our web site is always available and always fast? My answer is: you need the right approach. The approach is: measure, don’t guess; seven steps to performance success. These seven steps are as follows:
Step 1: Define performance requirements;
Step 2: Execute a proof of concept;
Step 3: Test representatively;
Step 4: Test continuously;
Step 5: Monitor and diagnose;
Step 6: Tune based on evidence;
Step 7: Share the responsibility for the whole chain.
This approach provides a pro-active way of working which my customers find valuable. It can be leveraged to assure high performance, all of the time, not only for web apps, but for any online or offline application.
This article and blog series has been an interesting journey for me. Some time ago we at Xebia presented our EJAPP Top 10 about performance problems. Now we have added this approach of seven steps to help assure your application’s performance. It has worked for us. How does this all work for you in practice? I’d like to hear your feedback.