Disaster Recovery

Overview

Sirius uses a pilot light system: all of our data is replicated across multiple AWS regions to mitigate risk to the business in the event of a disaster. This document describes the process to follow if such a disaster occurs.

Account Level Infrastructure

The account-level infrastructure contains resources such as the underlying network and secrets. It is deployed after each successful deployment to preproduction and production.

Environment Level Infrastructure

The environment-level infrastructure is also deployed into the eu-west-2 region, with some differences:

  1. All ECS services are scaled to zero.
  2. OpenSearch is restored from eu-west-1 every morning at 5am.
  3. RDS is in a replica-only state.
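
The first difference above can be checked from the command line. The following is a minimal sketch, not part of the established process: the helper name services_not_scaled_to_zero is hypothetical, and the ECS cluster name passed to it is an assumption that must be replaced with the real cluster name for the environment. It expects to run in a shell that already has AWS credentials.

```shell
# Hypothetical helper: print the names of ECS services in a cluster whose
# desired count is above zero. Prints nothing if every service is at zero.
services_not_scaled_to_zero() {
    cluster="$1"   # assumption: the ECS cluster name for the environment
    region="$2"    # e.g. eu-west-2

    # list-services returns service ARNs; text output is whitespace separated.
    arns=$(aws ecs list-services --cluster "$cluster" --region "$region" \
        --query 'serviceArns' --output text)

    # Nothing to check if the cluster has no services.
    if [ -z "$arns" ] || [ "$arns" = "None" ]; then
        return 0
    fi

    # describe-services accepts at most 10 services per call, so a larger
    # cluster would need the ARNs split into batches.
    # $arns is deliberately unquoted so each ARN becomes its own argument.
    aws ecs describe-services --cluster "$cluster" --region "$region" \
        --services $arns \
        --query 'services[?desiredCount > `0`].serviceName' --output text
}
```

For example, services_not_scaled_to_zero <cluster-name> eu-west-2 should print nothing while the secondary region is in its pilot light state.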

Process

Declare an incident

If a disaster occurs, declare an incident. This can be done in Slack by running /opg-incident <description>

Once the incident is declared, use the incident tool to create a dedicated comms channel, and page the on-call incident lead using the button the tool provides in the #opg-incident Slack channel. Then invite the relevant parties into the channel and establish a dedicated voice call.

Failover Via Terraform

Prerequisites

Software

  • aws cli
  • aws-vault
  • direnv
  • opg-sirius-infrastructure git repo
  • terraform

Config

  • AWS Vault Profile for Identity
  • AWS Vault Profile for Breakglass in the account the environment resides

All commands are executed in the opg-sirius-infrastructure/environment directory.
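
Before starting, it can help to confirm that the tools listed above are installed. This is a minimal sketch rather than an official check; the function name missing_tools is hypothetical. Note that later steps also invoke ecs-runner, which is not in the software list, so it may be worth including in the check.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, missing_tools aws aws-vault direnv terraform ecs-runner lists anything that still needs installing.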

Steps

  1. Set Up direnv

    Populate the direnv config with the values from the last successful deployment.

  2. Enable Maintenance Mode

    Put the environment into maintenance mode to prevent users from attempting to access it.

    aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true
    
  3. Make Secondary Region Active

    Update the terraform.tfvars.json. In the region block for the environment, change active from false to true for the secondary region.

    Apply the terraform

    aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true
    

    Make sure all terraform is fully applied.
    (The only outstanding changes should be the Azure AD App and the outputs.json.)

  4. Switch Over Global DB Primary

    Using an aws-vault profile with the breakglass role for the relevant account, switch over the primary database cluster of the global cluster from the primary region to the secondary region.

    aws-vault exec <account-breakglass-aws-identity> -- aws rds switchover-global-cluster \
        --global-cluster-identifier <environment-name>-api-global \
        --target-db-cluster-identifier arn:aws:rds:<secondary-region>:<account-id>:cluster:api-<environment-name> \
        --region <secondary-region>
    

    For example:

    aws-vault exec sirius-preprod-breakglass -- aws rds switchover-global-cluster \
        --global-cluster-identifier dr-test-api-global \
        --target-db-cluster-identifier arn:aws:rds:eu-west-2:123456789012:cluster:api-dr-test \
        --region eu-west-2
    

    Once the command has been executed, wait for the cluster to switch over completely to the secondary region.

  5. Make Primary Region Inactive in Maintenance Mode

    Update the terraform.tfvars.json. In the region block for the environment, change active from true to false for the primary region.

    Apply the terraform

    aws-vault exec identity -- terraform apply --parallelism=200 -var maintenance_mode=true
    

    Make sure all terraform is fully applied. This may require multiple applies.
    (The only outstanding changes should be the Azure AD App and the outputs.json.)

  6. Restore OpenSearch Snapshot

    aws-vault exec identity -- terraform output -json > terraform.output.json
    aws-vault exec identity -- ecs-runner -task restore-opensearch-snapshot -region <secondary-region> -timeout 3600
    
  7. Run OpenSearch Catch Up Task for Person Index

    aws-vault exec identity -- terraform output -json > terraform.output.json
    aws-vault exec identity -- ecs-runner -task reindex-elasticsearch-person-from-date -region <secondary-region> -timeout 3600
    
  8. Fully Reindex the Firm Index

    aws-vault exec identity -- terraform output -json > terraform.output.json
    aws-vault exec identity -- ecs-runner -task reindex-elasticsearch-firm-index -region <secondary-region> -timeout 3600
    
  9. Take Sirius Out of Maintenance Mode

    Apply terraform without the maintenance_mode variable

    aws-vault exec identity -- terraform apply --parallelism=200
    

    Make sure all terraform is fully applied.
    (The only outstanding changes should be the Azure AD App and the outputs.json.)
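
The wait at the end of step 4 can be scripted rather than checked by hand. The following is a minimal sketch under stated assumptions: the function names, attempt count, and delay are all hypothetical, and it treats the target cluster becoming the writer as the completion signal, which may not be the only condition worth checking. It uses aws rds describe-global-clusters and expects to run in a shell that already has credentials for the breakglass profile.

```shell
# Hypothetical helper: print the ARN of the current writer cluster
# in a global cluster, as reported by describe-global-clusters.
current_writer_arn() {
    aws rds describe-global-clusters \
        --global-cluster-identifier "$1" \
        --region "$2" \
        --query 'GlobalClusters[0].GlobalClusterMembers[?IsWriter].DBClusterArn | [0]' \
        --output text
}

# Hypothetical helper: poll until the target cluster is the writer,
# or give up after max_attempts checks.
wait_for_switchover() {
    global_id="$1"              # e.g. <environment-name>-api-global
    target_arn="$2"             # ARN passed to --target-db-cluster-identifier
    region="$3"                 # e.g. <secondary-region>
    max_attempts="${4:-60}"     # assumption: 60 checks
    delay="${5:-30}"            # assumption: 30 seconds between checks

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if [ "$(current_writer_arn "$global_id" "$region")" = "$target_arn" ]; then
            echo "writer is now $target_arn"
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done

    echo "timed out waiting for switchover" >&2
    return 1
}
```

Called with the same global cluster identifier, target ARN, and region used in step 4, wait_for_switchover returns 0 once the secondary cluster is the writer, so step 5 can be chained after it.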

This page was last reviewed on 1 December 2023. It needs to be reviewed again on 12 January 2024 by the page owner #opg-sirius-develop.