The RIPEstat incident on March 20 was caused by nodes of a database cluster becoming stuck, which in turn caused the frontend application to run out of Python workers. The problem initially affected only the looking-glass endpoints. To restart the database, we had to stop the application component ("thrift-api") that communicates with this database. Because the thrift-api is also used by other endpoints, many endpoints were affected during that maintenance.
RIPEstat uses a distributed, in-memory database for the looking glass. For reasons still unknown, a query on this database deadlocked with another query. The deadlock caused the number of active queries from the thrift-api to rise sharply, and the database process on one node hit its open file descriptor limit (every connection uses a file descriptor). The stuck node blocked queries directed to it and, in addition, prevented all writes to the cluster.
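To illustrate the limit involved, here is a minimal sketch (not the actual RIPEstat tooling) of how a process can inspect and raise its own open file descriptor limit using Python's standard resource module. Because each open connection consumes one descriptor, a burst of stuck queries can exhaust the soft limit.

```python
import resource

# Query the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file descriptors: soft limit={soft}, hard limit={hard}")

# The soft limit can be raised up to the hard limit without extra privileges;
# raising the hard limit itself requires root (e.g. via limits.conf or a
# systemd unit setting).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```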
Calls from the thrift-api to the in-memory database blocked and were slow to time out. In turn, this left the frontend application's Python processes stuck and at times unable to process requests.
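The sketch below, which assumes a plain TCP connection with placeholder host and port values rather than the real thrift-api client, shows why such calls tie up workers: without an explicit timeout, a read blocks for as long as the database node stays wedged, whereas a short timeout lets the worker fail fast and return to the pool.

```python
import socket

def query_with_timeout(host: str, port: int, payload: bytes, timeout_s: float = 5.0) -> bytes:
    # create_connection applies the timeout to both connect and subsequent I/O.
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(payload)
        # Any read that exceeds timeout_s raises socket.timeout instead of
        # blocking the worker indefinitely.
        return conn.recv(4096)
```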
The stuck workers sometimes caused health-check calls to fail, which resulted in the load balancer disabling those workers. This pushed more load onto the remaining workers, making the situation worse. We changed the load-balancer check and restarted the database node that was in the deadlocked state.
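As an illustration of the trade-off behind that change, here is a minimal sketch of a "shallow" health-check endpoint: it only confirms that the worker process itself is alive, rather than exercising a full backend query that would fail, and get the worker pulled from rotation, whenever the database is slow. The port and path are placeholders, not the real RIPEstat configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer immediately without touching any backend, so a slow or
        # deadlocked database cannot make the load balancer drop this worker.
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```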
The application initially recovered, until a second database node hit a deadlock. At that point, looking-glass API calls were arriving faster than they were timing out, resulting in almost complete downtime until we disabled the looking-glass API at 12:51 UTC.
By then, more nodes were stuck than we could safely restart (there was no quorum of functioning instances). Recovery was further complicated by the fact that hitting the open file descriptor limit prevented the database from writing its snapshot to disk and prevented management commands from cleanly restarting it.
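The quorum rule referred to above can be stated in a few lines: a cluster of N nodes can only keep accepting writes, or be restarted safely one node at a time, while a strict majority of its nodes is healthy. The node names below are illustrative, not the actual cluster members.

```python
def has_quorum(healthy: int, total: int) -> bool:
    # A strict majority of nodes must be functioning.
    return healthy > total // 2

nodes = {"db1": True, "db2": False, "db3": False}  # two of three nodes stuck
print(has_quorum(sum(nodes.values()), len(nodes)))  # False: no safe restart
```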
At 16:00 UTC, we started a node-by-node restart of the in-memory database and raised the open file limit for this database. After the restart, we re-enabled the thrift-api (recovering many RIPEstat endpoints) and the service that inserts data into the looking glass. Maintenance finished by 18:30 UTC, and by 19:30 UTC the looking glass had caught up for most peers.