The Paradox of Software Performance Engineering
How can a performance consultant who doesn't know the product catch things the creators can't? And yet they can. Learn how.
How can you help me solve the problem when you don’t have any experience with the software product?
Prepare to be amazed.
This is a typical challenge for the performance engineer. It is either a production performance issue, or the “we are about to go into production and we have already delayed the release once” scenario.
The team on the ground, which does know the product, cannot get to the root cause of the performance and stability issues. They have found a few of the causes and made changes for the new release. They have assembled the SWAT team to triage and troubleshoot, and they are reporting updates on troubleshooting progress twice a day. Some members of the team have been through this before; for others it’s the first time. A key issue is lining up the full support of the product vendor, as their key technical architect is currently six time zones away. The vendor is frustrated because they have to divert resources to chase problems that are not theirs.
Beyond the core software product, there are a few customizations made by their partner System Integrator. So, suffice it to say, there are many people and groups involved, each with a horse in the race, including the Business Unit CIO.
The system in question is complex, consisting of web servers, Java application servers with custom and off-the-shelf code, messages to third-party corporate systems, and a large database. Next, let’s add a virtual machine or two to the mix. The team continues to look at each component; the network is always the first place they go. The Oracle Automatic Workload Repository (AWR) reports are reviewed repeatedly, and there is the occasional Java heap dump review. There are guest operating systems and VMware hypervisor metrics to review.
Along the way, in the database, the team found some index issues and corrected them. They separated some tables onto different tablespaces. It’s very late in the game to rearrange the database. There is a task underway to confirm the actual infrastructure configuration again because there seems to be some disagreement on the state of the platform. For example, the database server was supposed to have 32 CPUs, but it seems there are 24. They need to upgrade a few components to make them current, because the vendor said the new updates should help. Who owns the configuration management database?
At this point, progress seems to have stalled. Many of the key business transactions are very slow, and a few are intermittently slow. The performance tests are showing that the responses degrade with just the first wave of users. The workload of the first wave consumes all the system resources. The capacity plan showed the system as configured would have capacity to handle the workload of the four waves of users.
Enter the Performance Engineer
The senior performance engineer has experienced this scenario before; they have been in the hot seat. They have technical knowledge from working with many different systems, and a core foundation in development or database administration. They have the skills and techniques required to get to the root cause, using the knowledge of existing key team members. At Collaborative Consulting (now CGI), our performance engineering team has created a proven approach for root-cause analysis (RCA) projects. It is grounded in our Software Performance Engineering Body of Knowledge, with its five knowledge areas: 1) Performance engineering in the SDLC, 2) Performance testing, 3) Capacity planning, 4) Application performance management, and 5) Problem detection and resolution.
They have a wide range of experience with different Application Performance Management (APM) tools. In today’s complex environments, an APM tool is mandatory. The alternative, stitching together disparate logs from all the components, is not an efficient use of resources and time. The goal is to automate as much of the RCA process as possible. The APM tool also lets you take new measurements as soon as the next release is ready, which enables a before-and-after comparison. This compresses the time needed to verify the changes.
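The before-and-after comparison can be sketched in a few lines. This is a minimal illustration and not how any particular APM tool works: the transaction names, the sample timings, and the choice of the 95th percentile as the comparison metric are all assumptions made for the example.

```python
from statistics import quantiles


def p95(samples_ms):
    """Return the 95th percentile of a list of response times (ms)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples_ms, n=20, method="inclusive")[-1]


def compare_releases(before, after):
    """Compare p95 response time per transaction between two releases.

    `before` and `after` map a transaction name to a list of timings in ms.
    Returns {transaction: (p95_before, p95_after, percent_change)}.
    """
    report = {}
    # Only transactions measured in both releases can be compared.
    for name in sorted(before.keys() & after.keys()):
        old, new = p95(before[name]), p95(after[name])
        report[name] = (old, new, (new - old) / old * 100.0)
    return report


# Hypothetical measurements, invented for illustration only.
before = {"create_order": [900, 950, 1000, 1100, 4000],
          "price_lookup": [100, 110, 120, 130, 140]}
after = {"create_order": [300, 320, 350, 360, 400],
         "price_lookup": [100, 115, 120, 135, 150]}

for name, (old, new, change) in compare_releases(before, after).items():
    print(f"{name}: p95 {old:.0f} ms -> {new:.0f} ms ({change:+.1f}%)")
```

A percentile is used here rather than an average because intermittently slow transactions tend to hide inside a mean; whichever metric is chosen, it should be the same one on both sides of the comparison.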
An APM tool, used by an experienced performance engineer or technical architect, can dynamically view the run-time environment and discover the technical architecture. This can greatly accelerate the physical architecture review of the application or system. The application is evaluated from an end-to-end perspective, following the business transactions all the way to the database transactions.
The second key individual necessary for a successful RCA project is a technical subject matter expert in the platform technology. This person must have experience with the APM tool. At Collaborative (now CGI), we have had great success with the Dynatrace tool-set for Java and Microsoft environments. We typically include the use of the APM tool in the RCA project.
No Experience with the Application or Vendor Product
Often in a fast-paced root-cause analysis project, team members become emotionally invested and start to defend their code or technology. The benefit of bringing in an external performance engineer is that they can be the neutral third party who follows the facts. The performance engineer does not have a horse in the race, as they say.
The performance engineer can lead the results analysis, define the plan and priorities, communicate the technical details to the project team, and manage the RCA process. They can also help get a consensus on what the success criteria should be. Often there are no formal response time goals for the critical business transactions, so defining these is very important. How do you know when you’re done if you do not define the end state? The performance engineer will also look at the bigger picture by defining processes to measure performance for every release.
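Defining the end state can be as simple as writing the goals down in a form that can be checked against each test run. The sketch below is illustrative only: the transaction names and goal values are hypothetical, and a real project would agree on them with the business.

```python
def check_goals(goals_ms, measured_ms):
    """Return the transactions that miss their goal or were not measured.

    `goals_ms` maps a transaction name to its response time goal in ms.
    `measured_ms` maps a transaction name to the measured response time in ms.
    """
    failures = {}
    for name, goal in goals_ms.items():
        actual = measured_ms.get(name)
        if actual is None:
            failures[name] = "no measurement"
        elif actual > goal:
            failures[name] = f"{actual} ms exceeds goal of {goal} ms"
    return failures


# Hypothetical goals and measurements, invented for illustration only.
goals = {"create_order": 2000, "price_lookup": 500, "login": 1000}
measured = {"create_order": 3400, "price_lookup": 180}

for name, reason in check_goals(goals, measured).items():
    print(f"{name}: {reason}")
```

An empty result is the "done" condition: every critical transaction has a goal, a measurement, and a measurement within the goal.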
Understanding the technical environment is critical to success. With the Java technical architect and the APM tool in place, the performance engineer can start to follow the transactions to find the bottlenecks. With experience and knowledge of the APM tool, the performance engineer can tell you which Java method is slow and which SQL statements are causing the problem or taking too long. They can find the blocked threads and tell you why they are blocked. They can quickly evaluate Java garbage collection. They can tell you which locations are having network bandwidth problems.
The performance engineer can then parcel out the research to the proper teams: network information to the network team, database observations to the database team, software product findings to the software team, and revised performance test scenarios to the performance test team.
For each business transaction, the performance engineer can map the entire call stack under a given workload and pass the details on to the software vendor's product team. The software vendor will be happy to have the facts and workload conditions that led to the slowdown. Then the software vendor can correct the problem and send in a new build. For example, in one of our recent projects, the product vendor was unaware that they made 200 SQL calls to create an order. They had a product bug that mistakenly called the same procedure ten times, when it should have called it once.
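The kind of finding described above, one procedure called many times inside a single transaction, is easy to surface once the SQL statements for that transaction have been captured. The sketch below assumes the trace is already available as a plain list of statement strings; the statements themselves are hypothetical, and a real APM tool would supply this data in its own format.

```python
from collections import Counter


def summarize_sql_calls(trace):
    """Count SQL statements in one transaction trace.

    `trace` is a list of SQL statement strings in execution order.
    Returns (total_calls, repeated), where `repeated` lists
    (statement, count) pairs executed more than once, most frequent first.
    """
    # Normalize whitespace and case so trivially different copies match.
    counts = Counter(" ".join(sql.split()).lower() for sql in trace)
    repeated = [(sql, n) for sql, n in counts.most_common() if n > 1]
    return len(trace), repeated


# Hypothetical trace for a single "create order" transaction.
trace = ["CALL update_inventory(?)"] * 10 + [
    "INSERT INTO orders VALUES (?, ?)",
    "SELECT price FROM catalog WHERE sku = ?",
]

total, repeated = summarize_sql_calls(trace)
print(f"{total} SQL calls in the transaction")
for sql, n in repeated:
    print(f"{n:>3}x  {sql}")
```

A summary like this gives the vendor a concrete starting point: the statement, the count, and the business transaction that triggered it.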
The experienced performance engineer asks questions such as:
- Why is the application updating all these tables on an order creation?
- Why is it calling the remote pricing call three times?
- Why are you creating a new object for the same customer or product?
- Why is the database connection handler making so many connections for a static number of users?
- Did you expect your users or customers to come from a slow wireless connection? Did you test for that?
- Did you realize the application servers were in one data center and the database was in another?
- Who set the JVM memory configuration?
- Why are the indexes on the same volumes as the files?
- Why was the performance testing database one quarter the size of the production database?
- How many physical CPUs did you really allocate to the database server?
- How was the peak volume determined?
These questions foster productive, in-depth conversations with the product technical team.
So, yes, an experienced performance engineer can isolate performance problems, and recommend solutions, for applications and products they have no prior experience with. However, success requires the proper experience, the proper tools, and a knowledgeable on-the-ground team.
Published at DZone with permission of Walter Kuketz, DZone MVB.