At one of my previous companies, we had the worst service outage in the company’s history. The unexpected downtime lasted an entire day; the service came back up that night, then went down again for roughly half a day. I was on vacation without any internet access during the entire event, so I only learned afterward that the solution they implemented was adding a rate limiter to the service. Everyone agreed the network wasn’t the culprit; it was something in the database. The rate limiter addressed the immediate crisis, but it masked the problem rather than solving it, and we were never able to pin down the exact root cause, even after the issue was gone.
I had a chance to catch up on the event after I came back, and though I also wasn’t able to definitively identify the root cause by reading the chat history, I came up with a hypothesis that everyone agreed was most likely what had happened: lock contention in the database.
The data models we had were highly relational, with foreign key constraints all over the place. The incident might have unfolded like this:
- A few expensive transactions would hold locks on a bunch of rows across multiple tables.
- Subsequent operations trying to update those rows would wait, tying up database resources the whole time.
- These dependencies would form a chain, worsening the contention more and more and clogging the requests behind them.
- Eventually, in a domino effect, the memory limit would be reached and the database would go down.
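The cascade above can be sketched with a toy simulation. This is a hedged illustration, not how Postgres works internally: Python threads stand in for transactions, `threading.Lock` objects stand in for row locks, and the worker names are mine. One slow "transaction" holds a lock at the end of a dependency chain, and every worker behind it piles up waiting.

```python
import threading
import time

# Toy model: each "row lock" is a threading.Lock; workers have FK-style
# dependencies (worker i needs row i, then row i+1), so one slow
# transaction at the end of the chain stalls everyone behind it.
row_locks = [threading.Lock() for _ in range(4)]
waiting = []  # records which workers ended up blocked

def slow_transaction():
    # Holds the last row lock for a long time, like an expensive UPDATE.
    with row_locks[3]:
        time.sleep(0.5)

def chained_worker(i):
    # Acquires row i's lock, then waits on row i+1's lock.
    start = time.monotonic()
    with row_locks[i]:
        with row_locks[i + 1]:
            pass
    if time.monotonic() - start > 0.1:
        waiting.append(i)  # this worker was stuck behind the chain

threads = [threading.Thread(target=slow_transaction)]
threads += [threading.Thread(target=chained_worker, args=(i,)) for i in (2, 1, 0)]
for t in threads:
    t.start()
    time.sleep(0.05)  # stagger starts so the wait chain forms deterministically
for t in threads:
    t.join()

print(sorted(waiting))  # prints [0, 1, 2]: every worker blocked behind the slow one
```

Note the domino structure: worker 2 only waits on the slow transaction, yet workers 1 and 0, which never touch the contested row directly, end up blocked transitively, exactly the kind of chain that quietly consumes connections and memory.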
This incident inspired me to create plum.
I claim it can be useful if you want to actively investigate lock contention in PostgreSQL. It could also be useful for passive monitoring of lock contention, though that wasn’t the aim.
To run the tool, follow the instructions in the GitHub README. It’s currently in alpha and doesn’t directly support running in a hosted environment.
Click the “Settings” menu on the sidebar.
Under the “Subscribed tables” section, select the Postgres tables you want to monitor. Optionally adjust the “Timeseries interval” (how often the chart updates) and the “Frame cycle” (how often the whole chart refreshes, with new start and end times).
Click the “Monitor” menu.
If you hover over the main screen, the text will change to “Start”. Click it.
You will see a realtime visualization of the current processes on the tables you’re monitoring.
Hover over the chart to see the tooltip and a visual indicator of the mouse position. Click a position, ideally one where the most processes (blue horizontal bars) are running, since that’s where lock contention is likely highest.
This displays a drill-down view of the processes, including the locks they hold. Each lock of tuple type also dumps the actual row (tuple) the lock is held on, making it easier to identify the offending rows. For each process, the query being run is shown, along with the blocking process, if any.
You can investigate this data to identify what kinds of queries on what kinds of rows are causing the most contention, or deadlocks.
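When you have blocking-process information like this, the process worth scrutinizing is usually the *root* blocker, the one at the head of the wait chain. Here is a small sketch of walking wait-for edges back to their root; the pids and the `blocking` mapping are made up for illustration (in real Postgres you could get similar data from `pg_blocking_pids()`).

```python
# pid -> list of pids blocking it (empty list = not blocked).
# Hypothetical sample data shaped like pg_blocking_pids() output.
blocking = {
    101: [],      # root: holds the contested row locks
    102: [101],   # waiting on 101
    103: [102],   # waiting on 102, transitively on 101
    104: [102],
}

def root_blockers(pid, blocking, seen=frozenset()):
    """Follow wait-for edges from pid up to the unblocked root(s)."""
    if pid in seen:
        return set()          # wait-for cycle: that's a deadlock
    blockers = blocking.get(pid, [])
    if not blockers:
        return {pid}          # pid itself is a root blocker
    roots = set()
    for b in blockers:
        roots |= root_blockers(b, blocking, seen | {pid})
    return roots

for pid in sorted(blocking):
    print(pid, "->", sorted(root_blockers(pid, blocking)))
# every pid traces back to 101
```

A cycle in the wait-for graph means a deadlock; Postgres’s deadlock detector would normally break it by aborting one transaction, which is why the sketch just stops the walk instead of recursing forever.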
Record & Replay
Without recording, a session is ephemeral: you have to watch the chart the whole time, and you can’t play it back. For these reasons, there is a record-and-replay feature.
Go to the “Settings” menu and select “Record” under the “Record timeseries data for replay” section before monitoring any activities. Then go to the “Monitor” menu and play a session. Click “Stop” at the point where you want the session to end; a file will be downloaded to your machine.
Now go to the “Replay” menu, click the main screen, and upload the downloaded file. The session will replay!