Most enterprise databases today run on shared storage volumes (SAN, NAS, etc…) that are accessed over the network or via Fibre Channel connection. The shared storage concept is great for helping to keep storage infrastructure and management costs relatively low but creates cross silo finger pointing when there are performance issues. In this blog post we will explore a real world example of how to avoid finger pointing and get right down to figuring out how to fix the problem.
One Rotten Apple Can Ruin The Whole Bunch
This story dates back to June of 2012 but I just came across it so it is new to me. One of our customers had an event which impacted the performance of multiple databases. All of these databases were connected to the same NetApp storage array. Often when there is an issue with database performance the DBAs will point the finger at the storage team and the storage team will tell the DBA team that everything looks good on their side. This finger pointing between silo’s is a common occurrence between various groups (network, storage, database, application support, etc…) within enterprise organizations.
In the chart below (screen grab taken from AppDynamics for Databases) you can see that there was a significant increase in I/O activity on dw_logvol. This issue impacted the performance of the entire NetApp storage array.
As it turns out dw_logvol was used as a temporary storage location for web logs. There was a process that would copy log files to this location, decompress them, and insert them into an Oracle data warehouse for long term storage. This process normally would not impact the performance of anything else connected to the same storage array but in this case there happened to be corrupted log files that could not be properly decompressed. This resulted in multiple attempts to retransmit and decompress the same files.
Context and Collaboration to the Rescue
Storage teams normally don’t have access to application context and application teams normally don’t have access to storage metrics. In this case though, both teams were able to collaborate and quickly realize what the problem was as a result of having a monitoring solution that was available to everyone. The fix for this problem was really easy, just remove the corrupted files and replace them with versions without any corruption. You can see activity return to normal in the chart below.
Modern application architectures require collaboration across all silo’s in order to identify and fix issues in a timely manner. One of the key enablers of cross-silo collaboration is intelligent monitoring at each layer of the application and the infrastructure components that provide the underlying resources.