Experiments
Some code rewrites can be risky, as the new code can perform differently in production (typically due to edge cases or the quantity of data). Experiments let us mitigate that risk by rolling out new functionality to a configurable proportion of users.
Once an experiment is running, we can record the outcome of each run (for example, how long it took to execute) and compare the old code against the new to ensure there are no regressions.
Creating an experiment
To create a new experiment, you first need to create an AWS Systems Manager Parameter to control it. To do so, just add it to the experiments list in parameters.tf in opg-sirius-infrastructure. The name should only contain lowercase letters, numbers and hyphens.
Its initial value will be a JSON object containing a property called “threshold” with a value of 0, to indicate that all traffic should use the original route.
locals {
  experiments = toset([
    // Existing experiments...
    "my-new-experiment",
  ])
}
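The parameter’s initial value would then look something like this (assuming nothing beyond the threshold property is set):
{"threshold": 0}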
You then need to update your PHP code to allocate each request to the control group or experiment group, run the corresponding code, and log the outcome.
// Fetch the experiment configuration
$experiment = $this->experimentManager->create('my-experiment');
if ($experiment->isInExperiment()) {
    // Use new code
} else {
    // Use old code
}
$this->experimentManager->log($experiment, [/* Any additional data */]);
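For intuition, you can think of the allocation behind isInExperiment() as comparing a per-request random number against the configured threshold. The sketch below is purely illustrative: it is not the actual ExperimentManager implementation, and the function name is made up.
// Illustrative only: not the real ExperimentManager internals
// (allocateToExperimentGroup is an invented name). Each request is
// allocated independently, by comparing a random number in [0, 1]
// against the configured threshold.
function allocateToExperimentGroup(float $threshold): bool
{
    $threshold = max(0.0, min(1.0, $threshold)); // treat out-of-range thresholds as 0 or 1
    return (mt_rand() / mt_getrandmax()) < $threshold;
}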
You can safely deploy this to production, since the threshold is 0 and no requests will be sent to the new code.
Running an experiment
Once your experiment is deployed, you can edit the parameter through the AWS console. You can change the threshold to a number between 0 (no-one gets the experiment) and 1 (everyone gets the experiment). Due to caching, any changes will take at most 5 minutes to take effect.
A few considerations:
- Any values below 0 or above 1 will be treated as 0 and 1 respectively
- If there’s a problem reading the parameter (e.g. the JSON is invalid), all traffic will be sent to the old code and an error will be logged in CloudWatch
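If you prefer the command line to the console, the parameter can also be updated with the AWS CLI. The parameter name below is an assumption; check parameters.tf for the name or path actually created for your environment.
aws ssm put-parameter \
  --name "my-new-experiment" \
  --value '{"threshold": 0.25}' \
  --overwrite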
You can use AWS CloudWatch’s Log Insights to compare the two experiment groups. The following snippet compares the minimum, maximum and average time of each code option.
filter extra.category = 'Experiment' and extra.experiment = 'my-experiment'
| stats count(), min(extra.time), max(extra.time), avg(extra.time) by extra.inExperiment
The runtime is recorded by default when you call ExperimentManager->log(); any extra data you passed in as the second parameter is available as properties of extra.
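For example, to record how many rows each run processed, you could pass that in as the second parameter (recordCount and $records are made-up names for illustration):
$this->experimentManager->log($experiment, ['recordCount' => count($records)]);
The value would then be available in Log Insights as extra.recordCount, so you could add avg(extra.recordCount) to the stats command above.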
You can now increase the threshold at a rate that you feel appropriate, monitoring Log Insights, alarms and any user feedback.
Limitations of experiments
A new experiment group is selected on every request, rather than persisting with a user’s session. This means a user could run simultaneous requests and get different outcomes. If we ever need persistent groups, we could add a sticky:true property to the parameter and update ExperimentManager to store the group in state if the property is present.
You can only ever have two options in an experiment: control (the old code) and experiment (the new code). Whilst you could support three or more options by running multiple experiments, it would probably be sensible to alter ExperimentManager if this need ever arises.