Disaster Recovery

Overview

Sirius uses a pilot light system: all our data is replicated across multiple AWS regions to mitigate risk to the business in the event of a disaster. This document describes the process to follow if such a disaster were to occur.

Account Level Infrastructure

The account level infrastructure contains resources like the underlying network and secrets. It is redeployed after each successful deployment to preproduction and production.

Environment Level Infrastructure

The environment level infrastructure is also deployed into the eu-west-2 region, with the following differences:

All ECS services are scaled to zero.
OpenSearch is restored from eu-west-1 every morning at 5am.
RDS is in a replica only state.

Process

Declare an incident

Should a disaster occur, declare an incident. This can be done in Slack by running /opg-incident <description>

After this, use the incident tool to create a dedicated comms channel and page the on-call incident lead using the button provided by the tool in the #opg-incident Slack channel. Once this is done, invite relevant parties into the channel and establish a dedicated voice call.

Failover Via Terraform

Prerequisites

Software

aws cli

aws-vault

direnv

opg-sirius-infrastructure git repo

terraform

Config

AWS Vault Profile for Identity

AWS Vault Profile for Breakglass in the account the environment resides

All commands are executed in the opg-sirius-infrastructure/environment directory.
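
Before starting, it can help to confirm that the tools listed above are installed. The following is a minimal sketch rather than part of the official process: `missing_tools` is a hypothetical helper name, and `ecs-runner` appears in the example only because later steps invoke it.

```shell
# Print the name of every tool that is not found on PATH, one per line.
# missing_tools is a hypothetical helper, not part of opg-sirius-infrastructure.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example (run manually):
#   missing_tools aws aws-vault direnv git terraform ecs-runner
# No output means every tool was found.
```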

Steps

Set Up direnv

Populate the direnv config with values from the last successful deployment.
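
This runbook does not list which variables the direnv config needs, so the sketch below only shows one way to check that a chosen set of variables is populated before running terraform. `require_env` is a hypothetical helper, and `TF_WORKSPACE` in the example is an assumed variable name, not one confirmed by this document.

```shell
# Report any of the named environment variables that are unset or empty.
# Returns 0 when all are set, 1 otherwise. require_env is a hypothetical helper.
require_env() {
    missing=0
    for name in "$@"; do
        # Look up the variable whose name is held in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (run manually once direnv has loaded the config):
#   require_env TF_WORKSPACE   # TF_WORKSPACE is an assumed variable name
```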

Enable Maintenance Mode

Put the environment into maintenance mode to prevent users from attempting to access it.

aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true

Make Secondary Region Active

Update the terraform.tfvars.json. In the region block for the environment, change active from false to true for the secondary region.

Apply the terraform
```
aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true
```
Make sure all terraform is fully applied.
(The only outstanding changes should be the Azure AD App and the outputs.json.)
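
One way to check what is still outstanding is `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending. The sketch below turns that exit status into a message; `describe_plan_exit` is a hypothetical helper and not part of the repository.

```shell
# Translate the exit status of `terraform plan -detailed-exitcode` into a message.
# 0 = no changes, 2 = changes pending, anything else = the plan itself failed.
# describe_plan_exit is a hypothetical helper name.
describe_plan_exit() {
    case "$1" in
        0) echo "no changes pending" ;;
        2) echo "changes pending: review the plan" ;;
        *) echo "plan failed" ;;
    esac
}

# Example (run manually):
#   aws-vault exec identity -- terraform plan -detailed-exitcode -var maintenance_mode=true
#   describe_plan_exit $?
```

Because the Azure AD App and outputs.json are expected to show as outstanding, an exit status of 2 is likely here; read the plan to confirm nothing else is listed.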

Switch Over Global DB Primary

Using an aws-vault profile with the breakglass role for the account the environment resides in, switch over the primary database cluster of the global cluster from the primary region to the secondary region.

aws-vault exec <account-breakglass-aws-identity> -- aws rds switchover-global-cluster \
    --global-cluster-identifier <environment-name>-api-global \
    --target-db-cluster-identifier arn:aws:rds:<secondary-region>:<account-id>:cluster:api-<environment-name> \
    --region <secondary-region>

For example:

aws-vault exec sirius-preprod-breakglass -- aws rds switchover-global-cluster \
    --global-cluster-identifier dr-test-api-global \
    --target-db-cluster-identifier arn:aws:rds:eu-west-2:123456789012:cluster:api-dr-test \
    --region eu-west-2

Once executed, wait for the cluster to switch completely to the secondary region.
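
The wait can be scripted by polling `aws rds describe-global-clusters` until the target cluster is reported as the writer. This is a minimal sketch under stated assumptions: `wait_for_switchover`, `POLL_SECONDS` and `MAX_ATTEMPTS` are hypothetical names and defaults, and the function calls `aws` directly, so it needs to run in a shell that already holds the breakglass credentials.

```shell
# Poll the global cluster until the target cluster is reported as the writer.
# wait_for_switchover is a hypothetical helper; POLL_SECONDS and MAX_ATTEMPTS
# are assumed tunables, not values taken from this runbook.
wait_for_switchover() {
    global_id=$1
    target_arn=$2
    region=$3
    attempt=0
    while [ "$attempt" -lt "${MAX_ATTEMPTS:-60}" ]; do
        # Ask RDS which member of the global cluster is currently the writer.
        writer=$(aws rds describe-global-clusters \
            --global-cluster-identifier "$global_id" \
            --region "$region" \
            --query 'GlobalClusters[0].GlobalClusterMembers[?IsWriter==`true`].DBClusterArn' \
            --output text)
        if [ "$writer" = "$target_arn" ]; then
            echo "writer is now $writer"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "${POLL_SECONDS:-30}"
    done
    echo "timed out waiting for $target_arn to become the writer" >&2
    return 1
}

# Example (run manually, using the values from the example above):
#   wait_for_switchover dr-test-api-global \
#       arn:aws:rds:eu-west-2:123456789012:cluster:api-dr-test eu-west-2
```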

Make Primary Region Inactive in Maintenance Mode

Update the terraform.tfvars.json. In the region block for the environment, change active from true to false for the primary region.

Apply the terraform
```
aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true
```
Make sure all terraform is fully applied. This may require multiple applies.
(The only outstanding changes should be the Azure AD App and the outputs.json.)

Restore OpenSearch Snapshot

aws-vault exec identity -- terraform output -json > terraform.output.json
aws-vault exec identity -- ecs-runner -task restore-opensearch-snapshot -region <secondary-region> -timeout 3600

Run OpenSearch Catch Up Task for Person Index

aws-vault exec identity -- terraform output -json > terraform.output.json
aws-vault exec identity -- ecs-runner -task reindex-elasticsearch-person-from-date -region <secondary-region> -timeout 3600

Fully Reindex the Firm Index

aws-vault exec identity -- terraform output -json > terraform.output.json
aws-vault exec identity -- ecs-runner -task reindex-elasticsearch-firm-index -region <secondary-region> -timeout 3600
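
The three OpenSearch tasks above can also be run as one sequence that stops at the first failure. This is a sketch only: `run_opensearch_tasks` and `ECS_RUNNER` are hypothetical names, the task names and flags are copied from the steps above, and it assumes terraform.output.json has already been written. Whether that file needs refreshing between tasks is not stated in this runbook.

```shell
# Run the three OpenSearch recovery tasks in order, stopping at the first failure.
# run_opensearch_tasks and ECS_RUNNER are hypothetical names; the task names and
# flags are copied from the runbook steps.
run_opensearch_tasks() {
    region=$1
    for task in \
        restore-opensearch-snapshot \
        reindex-elasticsearch-person-from-date \
        reindex-elasticsearch-firm-index
    do
        echo "running $task in $region" >&2
        # Left unquoted on purpose so the multi-word command prefix splits into words.
        ${ECS_RUNNER:-aws-vault exec identity -- ecs-runner} \
            -task "$task" -region "$region" -timeout 3600 || {
            echo "$task failed; stopping" >&2
            return 1
        }
    done
}

# Example (run manually):
#   run_opensearch_tasks <secondary-region>
```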

Take Sirius Out of Maintenance Mode

Apply terraform without the maintenance_mode flag
```
aws-vault exec identity -- terraform apply --parallelism=200
```
Make sure all terraform is fully applied.
(The only outstanding changes should be the Azure AD App and the outputs.json.)

This page was last reviewed on 1 December 2023. It needs to be reviewed again on 12 January 2024 by the page owner #opg-sirius-develop.