Business Continuity Planning (BCP)
A business continuity plan essentially addresses the non-availability of a primary production environment. The non-availability could be the result of a natural disaster (hence the popular term Disaster Recovery, or DR, planning), and BCP and DR are sometimes used interchangeably. However, DR planning is only one part of the larger BCP strategy.
As part of BCP, the following items are addressed:
Documenting scenarios in which the primary production environment is not available, along with the related mitigation plans.
Backup and replication strategies to support the overall BCP strategy.
Building production-quality standby environments or running application environments in multiple geographical regions. The latter configuration makes the application Highly Available (HA).
I've seen production operations teams dragged into a company's or engineering department's drive to roll out the agile process. Though agile has proved to be a useful methodology for product development groups, it can be clumsy and forced in an operations environment, mainly because operations teams don't have full control over their own time. Issues happen in production and priorities change, but keeping the systems up and running remains the primary responsibility. Getting projects done in a fixed time frame may not always be possible.
However, projects both small and big have to be tracked formally and they have to be completed. If an agile methodology has to be adopted, the production operations team has to be realistic and assertive about its involvement:
Be part of development Scrum teams. Engineering projects are not just product development. The infrastructure to run the application and its monitoring requirements have to be planned right from the beginning. Embedding an operations engineer in the application development agile teams is a great idea as opposed to tossing out tasks to the operations team without context.
Roll out a Kanban-like process within the production engineering team to manage projects. Regardless of the adoption of methodologies, managing the backlog of projects and tasks and their prioritization should happen.
Issues happen all the time in production environments. However, if an incident has a considerable negative impact on the end-user experience or causes loss of revenue, a quick fix is needed. Such a fix is normally called a hotfix, and the process followed is different from the standard code deployment procedure, with a focus on resolving the issue as early as possible.
The incident management process should also ensure that both users and stakeholders are informed of an ongoing issue. If end users end up escalating a system-wide issue (don't confuse this with the reporting of product bugs), then the company has a serious problem running its business. The production operations group can avoid such embarrassment by alerting on an issue before users notice it and, later, by taking the lead in analyzing the root cause of the production issue.
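As a rough illustration of alerting on an issue before users notice it, the sketch below runs a set of probes and sends one notification per failure. The names `CheckResult`, `run_checks`, and `alert_on_failures`, along with the probe and notifier callables, are hypothetical and not taken from any particular monitoring tool; a real setup would more likely rely on a dedicated monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    """Outcome of one probe (hypothetical structure for this sketch)."""
    name: str
    healthy: bool
    detail: str = ""


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every probe; a probe that raises counts as unhealthy."""
    results = []
    for name, probe in checks.items():
        try:
            ok = bool(probe())
            detail = "" if ok else "probe returned False"
        except Exception as exc:  # failing to probe is itself a signal
            ok, detail = False, f"probe raised {exc!r}"
        results.append(CheckResult(name, ok, detail))
    return results


def alert_on_failures(results: List[CheckResult],
                      notify: Callable[[str], None]) -> int:
    """Send one message per unhealthy check and return the failure count."""
    failed = [r for r in results if not r.healthy]
    for r in failed:
        notify(f"[ALERT] {r.name} is unhealthy: {r.detail}")
    return len(failed)


if __name__ == "__main__":
    def database_probe() -> bool:
        # Stand-in for a real connectivity check (assumption).
        raise ConnectionError("connection refused")

    checks = {"web": lambda: True, "database": database_probe}
    # print stands in for a pager, chat, or email integration.
    alert_on_failures(run_checks(checks), notify=print)
```

Passing the notifier in as a callable keeps the sketch independent of any one channel, so the same check results could feed both the operations team and the stakeholders who need to be informed.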
In a new organization, the following processes and information have to be in place to deal with production incidents that have a business impact.
Prepare a comprehensive list of contacts and set up a process to maintain it. The contact information should include operations and development points of contact (POCs) for each product. There could be multiple operations contacts, since support may be needed from core infrastructure, network operations, databases, and application support. The list should also identify the product owners, normally product managers, who would manage communication with end users when an issue happens.
Set up the group communication infrastructure. When an incident happens, multiple people could end up triaging the issue. Chat, voice, and desktop sharing are the most common modes of communication used during a crisis. Employees should have access to communication tools such as telephone conferencing, IRC, Webex, etc.
Implement a root cause analysis process to review major outages in production. The focus of such reviews must be on resolving the underlying problems so that the same incident does not recur.
Software applications run on hardware infrastructure and software platforms that need upgrades. Old hardware has to be replaced or upgraded, the OS has to be upgraded to the latest stable version, and third-party software components will also require upgrades, as older versions could go out of official support if you hang on to them for too long.
In environments that are built using open-source products, automatic upgrades are very common. Though such upgrades largely have no impact, changing any component in production without adequate testing is not advisable. The company should have a plan to roll out upgrades in production environments.
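One way to keep untested upgrades out of production is to pin every dependency to an exact version and flag anything that is not pinned. The sketch below assumes a pip-style requirements list, which the text does not specify; `unpinned_requirements` and the sample entries are hypothetical, and the pattern deliberately handles only the simple `name==version` form.

```python
import re
from typing import List

# Matches an exact pin such as "name==1.2.3", with optional extras
# like "name[extra]==1.2.3". Anything else is treated as unpinned.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(lines: List[str]) -> List[str]:
    """Return the entries that are not pinned to one exact version."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank lines and comment-only lines
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = [
        "requests==2.31.0",   # pinned: upgraded only on purpose
        "flask>=2.0",         # range: may change between deployments
        "numpy",              # no version at all
    ]
    for entry in unpinned_requirements(sample):
        print(f"not pinned: {entry}")
```

A check like this could run in the build pipeline so that a version change only reaches production as a deliberate, tested step in the upgrade plan.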
In a data center or private cloud environment, the production operations team has to plan for retiring and replacing old hardware. Such efforts are called rewiring and considerable resources are needed to set up a new computing environment where an application stack will be redeployed so the existing environment can be retired.
Vulnerabilities in the security strategy will put both the business and its customers at risk. New companies rarely recover from a serious security breach, because they lose customer trust and reputation.
The subject of securing cloud-based applications and the platform they run on can be discussed at length. However, the basic precautions listed below have to be taken. Keep in mind that how they are rolled out depends on the specific environment. These efforts are a step toward meeting the requirements for ISO/IEC 27001 certification or SOX compliance, which become necessary as the company grows.
Often, software at the unit-testing stage has neither password protection nor encrypted communications. It is important that applications in production run only with such basic protections enabled. This means implementing SSL/TLS and custom or industry-standard authentication and authorization protocols such as OAuth.
Don't allow code containing user credentials to be checked into the source code management system. Such information must be externalized from the code and moved to config files that can be set up as part of the deployment process.
Roll out a process to manage passwords. Such efforts will be useful later when facing audits for ISO/IEC 27001 certification or SOX compliance.
Run industry-standard tests such as penetration (pen) tests periodically and harden the environments quickly based on the results.
Automate the process of granting and revoking user access at both the OS and application levels. Generate user creation and addition logs.
Have a process in place for the production operations team to be informed of the latest security patches and similar updates from the cloud provider or third-party tool vendors.
Include security review as part of planning major releases.
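To make the point about externalizing credentials concrete, here is a minimal sketch in which the application reads its database credentials from settings supplied at deployment time instead of from the code. It uses environment variables rather than a config file, which is a substitution for illustration; the names `APP_DB_USER`, `APP_DB_PASSWORD`, `load_db_credentials`, and `MissingCredentialError` are all hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


class MissingCredentialError(RuntimeError):
    """Raised when a required setting was not provided at deploy time."""


def load_db_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read credentials from the environment, never from source code.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    required = ("APP_DB_USER", "APP_DB_PASSWORD")  # hypothetical names
    missing = [key for key in required if not env.get(key)]
    if missing:
        # Fail fast, and name only the keys so no secret value is logged.
        raise MissingCredentialError("missing settings: " + ", ".join(missing))
    return {"user": env["APP_DB_USER"], "password": env["APP_DB_PASSWORD"]}
```

The deployment process would set these values on the target host, so the repository holds no secrets. The same approach applies if the values come from a config file written during deployment.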
It is very tempting to roll out popular tools and implement fancy-sounding processes company-wide as part of setting up production engineering infrastructure. Tools and processes are only good in the hands of those who know how to use them effectively. It is very important to build a versatile and competent team first and then empower it to choose or build the right tools of its trade. A new tool or process should be implemented to solve an existing problem or improve productivity; if no such need exists, it is better to wait, as real-life requirements can help define the processes better and help you choose the right supporting tools.