CloudWatch Log File Redaction
Background
Should some debugging code make it through to Production, there is the potential for personally identifiable information (PII) to leak into the CloudWatch logs. At the time of writing we don’t ship our logs anywhere for additional processing or centralisation, and everyone with access to the logs has access to the raw data, so this wouldn’t constitute a data breach. However, we may end up shipping logs somewhere in future, and it is best practice to never log PII, so we need to clean up the logs.
As CloudWatch doesn’t allow us to remove individual rows from a logStream, nor to publish data more than a week old to a log stream, we need to process the logStreams, remove the PII rows, and then upload the redacted output to an S3 bucket. Persistent Sirius environments have a bucket, redacted-logs.{environment name}.eu-west-1.sirius.opg.justice.gov.uk, that stores logs for the 13 months we retain them.
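To make the shape of that processing concrete, here is a minimal sketch of redacting a single logStream with boto3. The function name, the redaction-string matching, and the S3 key layout are illustrative assumptions rather than the actual script’s behaviour:

```python
import boto3

logs = boto3.client("logs", region_name="eu-west-1")
s3 = boto3.client("s3", region_name="eu-west-1")


def redact_log_stream(log_group, log_stream, search_string, bucket):
    """Sketch: copy a logStream to S3 with matching rows removed."""
    kept = []
    kwargs = {"logGroupName": log_group,
              "logStreamName": log_stream,
              "startFromHead": True}
    while True:
        resp = logs.get_log_events(**kwargs)
        for event in resp["events"]:
            # Drop any row containing the redaction string.
            if search_string not in event["message"]:
                kept.append(f'{event["timestamp"]} {event["message"]}')
        # get_log_events signals the end of the stream by returning
        # the same forward token that was passed in.
        token = resp["nextForwardToken"]
        if kwargs.get("nextToken") == token:
            break
        kwargs["nextToken"] = token
    # Hypothetical key layout: one object per redacted stream.
    s3.put_object(Bucket=bucket,
                  Key=f"{log_group}/{log_stream}",
                  Body="\n".join(kept).encode("utf-8"))
```

The real script may batch or format the output differently; the point is only that matching rows are dropped before the remainder is written to the bucket.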
Prerequisites
We need to identify a string that matches all the rows we need to redact; in the examples here that string is Cell. We’ll use it in the Log Insights query below to identify all the logStreams that contain our redaction string. Depending on how frequently the leak was triggered you’ll need to set your timeframe accordingly (for a fairly small leak I queried two months of logs at a time).
stats count(*) by @logStream
| filter @message like 'Cell'
| sort @timestamp desc
| limit 2000
This will return a list of logStreams which you can then put into a CSV (there’s an option in Log Insights to copy the results to the clipboard as a CSV); prune the count and the trailing comma from the end of each row.
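If you would rather generate logStreams.csv programmatically than via the console, a sketch along these lines should work; the log group name and the two-month window are placeholder assumptions:

```python
import time
import boto3

logs = boto3.client("logs", region_name="eu-west-1")

QUERY = """stats count(*) by @logStream
| filter @message like 'Cell'
| sort @timestamp desc
| limit 2000"""

# Placeholder log group and a two-month window; adjust to your leak.
end = int(time.time())
start = end - 60 * 60 * 24 * 60

query_id = logs.start_query(logGroupName="example-log-group",
                            startTime=start, endTime=end,
                            queryString=QUERY)["queryId"]

# Poll until the query finishes.
while True:
    resp = logs.get_query_results(queryId=query_id)
    if resp["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

# Each result row is a list of {"field": ..., "value": ...} dicts.
with open("logStreams.csv", "w") as f:
    for row in resp["results"]:
        stream = next(c["value"] for c in row if c["field"] == "@logStream")
        f.write(stream + "\n")
```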
Execute & Monitor
The script itself lives here; you can either run it from a Cloud9 instance or package it up with the logStream file to run it on ECS. As it only processes existing logStream data, it can be executed without any impact on system performance.
As you’ll be deleting logStreams, this requires breakglass permissions, so make sure you are running it as the correct role.
The script is set up for Sirius, so if you’re using it against a different product you’ll need to modify the log group matching and the S3 bucket name at the top of the script.
The search string is in the main function at the end of the script. The script expects a list of logStreams in a file called logStreams.csv.
The script expects two environment variables: ENVIRONMENT, which is the Sirius environment name, and DELETE_LOGSTREAMS, which defaults to false so you can test the redaction processing without immediately deleting the source. Set DELETE_LOGSTREAMS to any of T, t, TRUE, true or 1 to enable logStream deletion.
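As a sketch of how the entry point might tie these together (redact_log_stream is the hypothetical helper from the Background sketch, and the log group name is a placeholder assumption):

```python
import csv
import os

import boto3

logs = boto3.client("logs", region_name="eu-west-1")

# Accepted truthy values for DELETE_LOGSTREAMS, as documented above.
TRUTHY = {"T", "t", "TRUE", "true", "1"}


def main():
    environment = os.environ["ENVIRONMENT"]
    delete_streams = os.environ.get("DELETE_LOGSTREAMS", "false") in TRUTHY
    bucket = f"redacted-logs.{environment}.eu-west-1.sirius.opg.justice.gov.uk"
    log_group = f"{environment}/example-log-group"  # placeholder, not the real matching

    with open("logStreams.csv") as f:
        for row in csv.reader(f):
            stream = row[0].strip()
            # redact_log_stream is the hypothetical helper sketched earlier.
            redact_log_stream(log_group, stream, "Cell", bucket)
            if delete_streams:
                # Only remove the source stream once deletion is enabled.
                logs.delete_log_stream(logGroupName=log_group,
                                       logStreamName=stream)


if __name__ == "__main__":
    main()
```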
Once you’ve set up your instance/container with the correct permissions and environment, all you need to do is run python log_redactor.py. It will output logs to stdout/stderr, either on the C9 instance or from the ECS Task.
As it uses single-threaded, serialised processing, the script works through logStreams one at a time, so it is quite slow.