Disaster Recovery
Overview
Sirius uses a pilot light approach to disaster recovery: all of our data is replicated across multiple AWS regions to mitigate risk to the business in the event of a disaster. This document describes the process to follow if such a disaster occurs.
Account Level Infrastructure
The account level infrastructure contains resources such as the underlying network and secrets. It is redeployed after every successful deployment to preproduction and production.
Environment Level Infrastructure
The environment level infrastructure is also deployed into the eu-west-2 region, with some differences:
- All ECS services are scaled to zero.
- OpenSearch is restored from eu-west-1 every morning at 5am.
- RDS is in a replica-only state.
Process
Declare an incident
Should a disaster occur, an incident should be declared. This can be done in Slack by running /opg-incident <description>
Once the incident is declared, use the incident tool to create a dedicated comms channel and page the on-call incident lead using the button the tool provides in the #opg-incident Slack channel. Then invite the relevant parties into the channel and establish a dedicated voice call.
Failover Via Terraform
Prerequisites
Software
- aws cli
- aws-vault
- direnv
- opg-sirius-infrastructure git repo
- terraform
Config
- AWS Vault Profile for Identity
- AWS Vault Profile for Breakglass in the account the environment resides
All commands are executed in the opg-sirius-infrastructure/environment directory.
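Before starting, it can help to confirm that the command-line tools listed above are installed. The following is a minimal sketch rather than part of the documented process: missing_tools is a hypothetical helper name, and it only checks that each binary is on PATH. It does not verify that the git repo is cloned or that the aws-vault profiles exist.

```shell
# Print the name of every tool in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of the Sirius tooling.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Usage (left commented out so that sourcing this file has no side effects):
# missing="$(missing_tools aws aws-vault direnv terraform)"
# [ -z "$missing" ] || { echo "Missing tools: $missing" >&2; exit 1; }
```

An empty result means every listed binary was found.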
Steps
Setup DIRENV
Populate the DIRENV config with values from the last successful deployment.
Enable Maintenance Mode
Put the environment into maintenance mode to prevent users from attempting to access it.
aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true
Make Secondary Region Active
Update terraform.tfvars.json. In the region block for the environment, change active from false to true for the secondary region.
Apply the terraform:
aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true
Make sure all terraform is fully applied. (The only outstanding changes should be the Azure AD App and the outputs.json.)
Switch Over Global DB Primary
Make sure you have an aws-vault profile with the breakglass role for the account the environment resides in. Then switch over the primary database cluster of the global cluster from the primary region to the secondary region.
aws-vault exec <account-breakglass-aws-identity> -- aws rds switchover-global-cluster \
  --global-cluster-identifier <environment-name>-api-global \
  --target-db-cluster-identifier arn:aws:rds:<secondary-region>:<account-id>:cluster:api-<environment-name> \
  --region <secondary-region>
For example:
aws-vault exec sirius-preprod-breakglass -- aws rds switchover-global-cluster \
  --global-cluster-identifier dr-test-api-global \
  --target-db-cluster-identifier arn:aws:rds:eu-west-2:123456789012:cluster:api-dr-test \
  --region eu-west-2
Once executed, wait for the cluster to switch over completely to the secondary region.
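One way to tell when the switchover has finished is to poll the global cluster until the secondary region's cluster is reported as the writer. The sketch below assumes that the IsWriter flag returned by aws rds describe-global-clusters is a sufficient signal of completion. wait_for_writer, current_writer, MAX_ATTEMPTS and POLL_SECONDS are hypothetical names introduced for illustration.

```shell
# Succeeds once the command given after the expected ARN prints that ARN.
# Gives up after MAX_ATTEMPTS polls (default 60), POLL_SECONDS apart (default 10).
wait_for_writer() {
  expected="$1"
  shift
  attempt=0
  while [ "$attempt" -lt "${MAX_ATTEMPTS:-60}" ]; do
    if [ "$("$@")" = "$expected" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${POLL_SECONDS:-10}"
  done
  return 1
}

# Prints the ARN of the global cluster member that is currently the writer.
# Arguments: aws-vault profile, global cluster identifier, region.
current_writer() {
  aws-vault exec "$1" -- aws rds describe-global-clusters \
    --global-cluster-identifier "$2" \
    --region "$3" \
    --query 'GlobalClusters[0].GlobalClusterMembers[?IsWriter].DBClusterArn' \
    --output text
}

# Usage with the example values from above (commented out: it calls AWS):
# wait_for_writer arn:aws:rds:eu-west-2:123456789012:cluster:api-dr-test \
#   current_writer sirius-preprod-breakglass dr-test-api-global eu-west-2
```

A non-zero return means the writer had not moved within the polling window, so the cluster state should be checked by hand before continuing.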
Make Primary Region Inactive in Maintenance Mode
Update terraform.tfvars.json. In the region block for the environment, change active from true to false for the primary region.
Apply the terraform:
aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true
Make sure all terraform is fully applied. This may require multiple applies. (The only outstanding changes should be the Azure AD App and the outputs.json.)
Restore OpenSearch Snapshot
aws-vault exec identity -- terraform output -json > terraform.output.json
aws-vault exec identity -- ecs-runner -task restore-opensearch-snapshot -region <secondary-region> -timeout 3600
Run OpenSearch Catch Up Task for Person Index
aws-vault exec identity -- terraform output -json > terraform.output.json
aws-vault exec identity -- ecs-runner -task reindex-elasticsearch-person-from-date -region <secondary-region> -timeout 3600
Fully Reindex the Firm Index
aws-vault exec identity -- terraform output -json > terraform.output.json
aws-vault exec identity -- ecs-runner -task reindex-elasticsearch-firm-index -region <secondary-region> -timeout 3600
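The three OpenSearch steps above run the same pair of commands with a different task name each time, so they could be chained. This is only a sketch under the assumption that each task should start as soon as the previous one succeeds. restore_search is a hypothetical wrapper, and running the steps individually remains the documented process.

```shell
# Runs the three OpenSearch recovery tasks in the documented order.
# Refreshes terraform.output.json before each task and stops at the first failure.
# restore_search is a hypothetical wrapper; the argument is the secondary region.
restore_search() {
  region="$1"
  for task in restore-opensearch-snapshot \
      reindex-elasticsearch-person-from-date \
      reindex-elasticsearch-firm-index; do
    aws-vault exec identity -- terraform output -json > terraform.output.json || return 1
    aws-vault exec identity -- ecs-runner -task "$task" -region "$region" -timeout 3600 || return 1
  done
}

# Usage (commented out: it runs real tasks):
# restore_search eu-west-2
```

Stopping at the first failure avoids reindexing against a snapshot that did not restore.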
Take Sirius Out of Maintenance Mode
Apply terraform without the maintenance_mode flag:
aws-vault exec identity -- terraform apply --parallelism=200
Make sure all terraform is fully applied.
(The only outstanding changes should be the Azure AD App and the outputs.json.)
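A possible way to check whether anything is still outstanding is terraform plan with the -detailed-exitcode flag, which exits 0 when there are no changes, 2 when changes are pending, and 1 on error. The sketch below assumes aws-vault passes the exit status through unchanged. describe_plan_status is a hypothetical helper. Because the Azure AD App and outputs.json are expected to show as outstanding, a status of 2 still means reading the plan to confirm nothing else is listed.

```shell
# Translates the exit status of `terraform plan -detailed-exitcode` into a message.
# describe_plan_status is a hypothetical helper, not part of the Sirius tooling.
describe_plan_status() {
  case "$1" in
    0) echo "no changes pending" ;;
    2) echo "changes pending: review the plan" ;;
    *) echo "plan failed" ;;
  esac
}

# Usage (commented out: it runs a real plan):
# aws-vault exec identity -- terraform plan -detailed-exitcode
# describe_plan_status "$?"
```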