Production issues attract the attention of middle and top-level management. Here are a few things that you must pay attention to as a software developer or architect to prevent future embarrassment. You can use this as a checklist.
#1: Not externalizing configuration values in .properties or XML files. For example, not making the number of threads used in a batch job configurable in a .properties file. You may have a batch job that worked well in the DEV through UAT (user acceptance testing) environments but, once deployed to PROD, threw an IOException when run multi-threaded over larger data sets, due to a difference in the JDBC driver version or the issue discussed in #2. If the number of threads is configured in a .properties file, the job can easily be made single-threaded by changing that file, without having to redeploy and retest the application, until a proper fix is made. The same applies to URLs, server names, port numbers, etc.
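As a minimal sketch of the idea, the thread count can be read from a .properties source with a safe single-threaded default. The property key batch.threads and the class name BatchConfig are assumptions made for illustration, not names from any particular project.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

/** Reads the worker-thread count for a batch job from externalized configuration. */
class BatchConfig {

    // Hypothetical property key; use whatever name your project agrees on.
    static final String THREADS_KEY = "batch.threads";

    /** Loads key=value pairs from any Reader (a FileReader in real use). */
    static Properties load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return props;
    }

    /** Returns the configured thread count, falling back to 1 (single-threaded). */
    static int threadCount(Properties props) {
        String raw = props.getProperty(THREADS_KEY, "1").trim();
        try {
            return Math.max(1, Integer.parseInt(raw));
        } catch (NumberFormatException e) {
            return 1; // a malformed value should not stop the job from starting
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of a batch.properties file on disk.
        Properties props = load(new StringReader("batch.threads=4\n"));
        System.out.println("threads = " + threadCount(props));
    }
}
```

With this in place, switching the job to single-threaded in PROD is a one-line change to the properties file (batch.threads=1) rather than a redeployment.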
#2: Not testing the application with the right volume of data. For example, testing your application with 1 to 3 accounts instead of the 1000 to 2000 accounts that are typical in the production environment. Performance tests need to be conducted with real-life data volumes, not cut-down data. Not adhering to real-life performance test scenarios can cause unexpected performance, scalability, and multi-threading issues. It is imperative that you test your application with larger volumes of data to ensure that it works as expected and meets the SLAs (i.e. Service Level Agreements) in the non-functional specification.
#3: Naively assuming that the external or other internal services invoked from your application are reliable and always available. Not allowing for proper service invocation timeouts and retries can adversely impact the stability and performance of your application. Proper outage testing needs to be carried out. This is crucial because modern applications are distributed and service oriented, with lots of web services. Retrying indefinitely against a service that is not available can tie up threads and connections in your own application. The load balancers also need to be properly tested, by bringing each balanced node down in turn, to ensure that they are functioning as expected.
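The timeout-and-retry point can be sketched with nothing but java.util.concurrent. This is one possible shape, not a definitive implementation: the class name ResilientCall, the method callWithRetry, and the linear backoff policy are all assumptions chosen for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounded retries with a timeout on every attempt. */
class ResilientCall {

    /**
     * Runs the call with a per-attempt timeout, retrying up to maxAttempts times.
     * Throws IllegalStateException (wrapping the last failure) if no attempt succeeds.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts,
                               long timeoutMillis, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ExecutorService executor = Executors.newCachedThreadPool();
        Exception last = null;
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = executor.submit(call);
                try {
                    // Never wait longer than timeoutMillis for a single attempt.
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // interrupt a call that is still running
                    last = e;
                }
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // simple linear backoff
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            last = e;
        } finally {
            executor.shutdownNow();
        }
        throw new IllegalStateException("Service call failed after retries", last);
    }
}
```

A caller would wrap the actual service invocation in the Callable, for example callWithRetry(() -> client.fetchAccount(id), 3, 2000, 500), where client.fetchAccount is a stand-in for your own service client. The important design choice is that both the wait per attempt and the number of attempts are bounded, so an outage in the remote service surfaces as a fast, explicit failure.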
#4: Not adhering to the bare minimum security requirements. As mentioned above, web services are everywhere, and they can be easily exploited by attackers, for example through denial-of-service attacks. So the use of SSL/TLS, basic authentication, and penetration testing with tools like Google's Skipfish is mandatory. Unsecured applications can not only adversely impact the stability of an application, but can also tarnish an organization’s reputation through data breaches such as customer “A” being able to view customer “B’s” data.
#6: Not externalizing business rules that are likely to change often. For example, tax laws, government or industry compliance requirements, classification laws, etc. Use business rules engines like Drools that allow you to externalize rules into database tables and Excel spreadsheets. The business can then take ownership of these rules and react quickly to changes in tax laws or compliance requirements with minimal code changes and testing.
#7: Not having proper documentation in the form of
1. Unit tests with proper code coverage.
2. A Confluence or wiki page listing all the software artifacts, like classes, scripts, and configuration files, that have been modified or newly created.
3. High-level conceptual diagrams depicting all the components, interactions, and structures.
4. Basic documentation for developers on “how to set up the DEV environment”, including data source details.
Points 1 and 2 are the primary forms of documentation in an agile project, in addition to the COS (Conditions Of Satisfaction) created via tools like MindMap.
#8: Not having proper disaster recovery plans, system monitoring, and archival strategies in place. It is easy to overlook these activities in the rush to get the application deployed to meet tight deadlines. Not having proper system monitoring through tools like Nagios and Splunk can not only impact the stability of the application, but can also hinder current diagnostics and future improvements.
#9: Not designing database tables with proper housekeeping columns like created_datetm, update_datetm, created_by, updated_by and timestamp, and without provision to logically delete records using columns like ‘deleted’ with ‘Y’ or ‘N’ values, or record_status with values like ‘Active’ or ‘Inactive’.
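On the application side, the housekeeping columns and the logical delete might be modelled along these lines. This is only a sketch: the class AuditedRecord and its methods are hypothetical, and the fields simply mirror the column names mentioned above.

```java
import java.time.Instant;

/** Hypothetical in-memory view of a row that carries housekeeping columns. */
class AuditedRecord {
    private final Instant createdDatetm; // created_datetm
    private final String createdBy;      // created_by
    private Instant updateDatetm;        // update_datetm
    private String updatedBy;            // updated_by
    private boolean deleted;             // stored as 'Y' or 'N' in the 'deleted' column

    AuditedRecord(String createdBy, Instant now) {
        this.createdBy = createdBy;
        this.createdDatetm = now;
        this.updatedBy = createdBy;
        this.updateDatetm = now;
    }

    /** Records who changed the row and when; call on every update. */
    void touch(String user, Instant now) {
        this.updatedBy = user;
        this.updateDatetm = now;
    }

    /** Logical delete: the row stays in the table but is flagged as deleted. */
    void markDeleted(String user, Instant now) {
        this.deleted = true;
        touch(user, now);
    }

    /** Value to write to the 'deleted' column. */
    String deletedFlag() {
        return deleted ? "Y" : "N";
    }

    String createdBy() { return createdBy; }
    String updatedBy() { return updatedBy; }
    Instant createdDatetm() { return createdDatetm; }
    Instant updateDatetm() { return updateDatetm; }
}
```

Because a logical delete is just another update, the row keeps its full audit trail, and queries for live data filter on the flag (deleted = 'N').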
#10: Not having a proper system backout plan to restore the system to its stable pre-deployment state if anything goes wrong. This plan needs to be properly reviewed and signed off by the relevant teams. It includes backing out to previous versions of software artifacts, removing any data inserted into the database, reverting properties file entries, etc.
#11: Not performing proper capacity planning at the beginning of the project. It’s no longer sufficient to simply say that you “need a Unix box, an Oracle database server and a JBoss application server” when specifying your platform. You need to be really precise about the
specific versions of operating systems, JVMs, etc.
memory (including physical memory, JVM heap size, JVM stack size, and JVM perm gen space)
CPU (number of cores)
load balancer, number of nodes required, node types like active/active or active/passive, and clustering requirements.
file system requirements. For example, your application may generate reports and keep them online for a year before archiving them, so you need to have enough hard disk space. Some applications need to generate data extract files that are temporarily stored until they are picked up by other system processes or by data warehouse systems for multi-dimensional reporting. Some data files are SFTP’ed from other internal or external systems, and need to be kept for a period like 12 to 36 months before being archived.
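The file system requirement in particular lends itself to simple arithmetic. The sketch below assumes a steady daily file count, an average file size, and a flat safety margin; the class DiskEstimate and its parameters are illustrative, and real figures should come from measured production volumes.

```java
/** Back-of-the-envelope disk sizing for retained files (illustrative only). */
class DiskEstimate {

    /**
     * Estimates the bytes needed to retain files for retentionDays.
     *
     * @param filesPerDay   files produced or received per day
     * @param avgFileBytes  average size of one file in bytes
     * @param retentionDays how long files stay on disk before being archived
     * @param headroom      extra fraction to reserve, e.g. 0.25 for 25%
     */
    static long requiredBytes(long filesPerDay, long avgFileBytes,
                              int retentionDays, double headroom) {
        if (filesPerDay < 0 || avgFileBytes < 0 || retentionDays < 0 || headroom < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        double raw = (double) filesPerDay * avgFileBytes * retentionDays;
        return (long) Math.ceil(raw * (1.0 + headroom));
    }

    public static void main(String[] args) {
        // Assumed figures: 200 reports a day, 5 MB each, kept for 365 days, 25% headroom.
        long bytes = requiredBytes(200, 5L * 1024 * 1024, 365, 0.25);
        System.out.printf("Estimated disk space: %.1f GB%n",
                bytes / (1024.0 * 1024 * 1024));
    }
}
```

Running the same calculation for each retained file type (reports, extract files, inbound SFTP files) and summing the results gives a first figure to put in the platform specification.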