High Database CPU Usage
Background
Due to its “design”, Sirius can experience severely degraded performance when users run expensive requests on the frontend, such as reports or pulling the new cases list.
These queries then time out at the PHP level, resulting in an error being reported back to the user, who runs the query again, and again. Unfortunately the query does not time out at
the database level, causing a build-up of long-running, poorly performing queries on the DB cluster and impacting performance for everyone.
Investigation & Resolution
You need access to the AWS Console in the Production account
Within the console, navigate to the CloudWatch service. There is a Logs Insights saved query called slow requests; you can adjust its request_time filter to highlight very long-running requests and feed back to the business that they should stop performing certain tasks. Requests to the reporting endpoint, for example, let you identify who might be re-running a report and ask them to stop.
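If you need to adjust the saved query, the filter looks roughly like the sketch below. Only the request_time field comes from the saved query described above; the 30-second threshold and the limit are illustrative, so tune them to the incident at hand.
# Logs Insights sketch: surface the slowest requests first (assumes request_time is a parsed log field)
fields @timestamp, request_time, @message
| filter request_time > 30
| sort request_time desc
| limit 50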
Set up a Cloud9 Instance following the guidance here
Once the C9 environment is configured connect to the Sirius Database.
The following SQL will show you all queries running on the node you’re connected to (we’ll come back to this later) that have been running for longer than 30 seconds.
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE state = 'active' AND (now() - pg_stat_activity.query_start) > interval '30 seconds'
ORDER BY duration;
If there are no queries running then you may need to look on the other nodes, or look for a different root cause entirely.
If you do find queries then you can use the returned pid to ask PostgreSQL to safely stop them by running the query below.
In the example below 123 is the pid of the query you want to kill.
SELECT pg_cancel_backend(123);
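If several runaway queries have built up, it can be quicker to combine the cancel call with the same pg_stat_activity filter used above. This is a sketch rather than part of the original runbook: the five-minute threshold is illustrative, and pg_backend_pid() is excluded so you don’t cancel your own session.
-- Cancel every active query that has been running for more than 5 minutes, except this session
SELECT pid,
       now() - query_start AS duration,
       pg_cancel_backend(pid) AS cancelled
FROM pg_stat_activity
WHERE state = 'active'
  AND (now() - query_start) > interval '5 minutes'
  AND pid <> pg_backend_pid();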
Checking the other Nodes
The production cluster is a three-node cluster made up of one read/write node and two read-only nodes. By default the Cloud9 setup script points you at the read/write instance; if you investigate that node and there are no long-running queries, then you need to check the read-only nodes.
1. Repoint the psql target at the read-only nodes’ load balancer by running psql_api_ro.
2. Run psql and repeat the queries above, killing any long-running queries if they exist.
3. Run SELECT inet_server_addr(); to get the IP address of the read-only node you’re connected to.
4. Close psql, re-open it, and run the command again until you get a different IP address. Because the load balancer round-robins requests, you may need to do this a few times before you make it onto the other read-only node.
5. Once you’re on the other read-only node, check for long-running queries and kill them.
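To double-check which kind of node you have landed on, pg_is_in_recovery() should return true on the read-only nodes and false on the read/write node, so it can be run alongside inet_server_addr():
-- read_only = true means you are on one of the read-only nodes
SELECT inet_server_addr() AS node_ip, pg_is_in_recovery() AS read_only;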
Poor Performance with no obviously long running queries
We have had instances of poor performance where regular OLTP queries that are usually very quick run slowly, or with wildly varying timings. If, while investigating the poorly performing queries, you see a lot of PostgreSQL autovacuum processes running, and running frequently, it might point to an internal cluster issue. Running the query below might highlight some issues.
SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup DESC LIMIT 10;
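It can also help to see whether autovacuum workers are running right now and on which tables. The check below only uses the standard pg_stat_activity view; autovacuum workers report the table they are working on in the query column.
-- Show autovacuum workers that are currently running and how long they have been at it
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%'
ORDER BY duration DESC;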
If you see a table listed that has been very recently vacuumed and yet has a large number of dead tuples (n_dead_tup), it implies something is potentially blocking the autovacuum from completing successfully (i.e. cleaning up all the dead tuples, releasing space, and updating the query stats). This has occurred previously on the public.cases table, where autovacuum was running constantly, adding more load to the system, without being able to fix the poor performance it is designed to address.
You can confirm there is an issue by running a VACUUM VERBOSE on the problematic table (this will add load to an already overloaded system, but it won’t block transactions).
VACUUM VERBOSE public.cases;
There will be multiple lines of output referencing “dead row versions”. If you see a high number of dead row versions that cannot be removed, this is what is constantly triggering the autovacuum; because they can’t be removed, it retriggers again and again. There are a number of possible reasons, some of which are documented here. I found that it was none of those issues and PostgreSQL itself had got into a weird state, which required the writer node of the cluster to be restarted out of hours (this can be done in the AWS Console as Breakglass). This will disrupt the service for a couple of minutes but is safe to do.
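Before concluding the cluster is in that same weird state, one check worth making is whether anything is pinning the transaction horizon that VACUUM cleans up to, such as a very old transaction, a prepared transaction, or a replication slot. This is a sketch using standard system views and is not from the original incident notes.
-- Sessions holding the oldest transaction snapshots (these can stop dead rows being removed)
SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;
-- Prepared transactions and replication slots can hold the horizon back too
SELECT gid, prepared FROM pg_prepared_xacts;
SELECT slot_name, active, xmin FROM pg_replication_slots;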