destroytoday.com

Troubleshooting the database upgrade

In the previous post, I wrote about upgrading Cushion’s production Postgres database from 9.5 to 12. I did this because Heroku notified me that they were deprecating support for 9.5. For the first couple days after the upgrade, everything seemed to be okay, but then Monday happened. I started to seeing a lot of “Out of memory” errors and several other cryptic errors from the database, which is probably the scariest thing you can see after an upgrade.

PG::OutOfMemory: ERROR: could not get shared memory segment: Cannot allocate memory

PG::InternalError: ERROR:  could not load library "/usr/lib/postgresql/12/lib/llvmjit.so": libLLVM-9.so.1: failed to map segment from shared object
CONTEXT:  parallel worker

PG::ConnectionBad: PQconsumeInput() SSL SYSCALL error: EOF detected

PG::UnableToSend: SSL SYSCALL error: EOF detected

PG::OutOfMemory: ERROR:  out of memory
DETAIL:  Failed while creating memory context "ExprContext".
CONTEXT:  parallel worker

I also noticed that many of Cushion’s database queries were now taking up to 10x longer to execute than they did on Postgres 9.5. This was not good.

heroku-postgres-response-times

I reached out to Heroku support, hoping to hear that this is common with upgrades to Postgres 12 and all I’d need to do is A, B, and C, but I didn’t hear back right away, so I continued to troubleshoot on my own. Looking through the Postgres configs between my 9.5 database and the new one, I saw a handful of new properties that referred to parallel workers. This made me wonder if the database’s memory couldn’t handle the number of database connections I had, so I reduced Cushion’s concurrent workers. This immediately helped, and I only saw one or two “Out of memory” errors per day instead of the dozens I was seeing before.

Now that I got the spiking memory under control, I started clicking around the app. Almost immediately, I was able to narrow in on few problematic queries, which were also highlighted by Skylight. None of these queries gave me problems locally, which makes sense because my local database has nowhere near as much data as my production database, so I forked the production database in order to test these queries without affecting the main app.

I ran the queries prefixed with EXPLAIN (FORMAT JSON, ANALYZE) to output the query plan. Then, I used this handy visualizer to narrow in on any problematic areas. As I analyzed queries, I also chopped away and re-analyzed them to see if I could find any quick offenders. While I did find a few missing indexes, the main culprit came from Cushion’s “collaboration” features. Most folks probably don’t know about these features because they were part of a short-lived private beta that we ran years ago and decided not to pursue. After ending the beta, we ended up keeping the code around because several users still rely on these features, and we weren’t in a rush to remove them.

If you’re curious how these collaboration features, which are only used by a few users, could slow down the app for all users, I’ll tell you. When I originally built Cushion, it would query clients, projects, etc., straight from the authenticated user. Those are quick and easy queries. For collaboration, however, Cushion needs to query clients, projects, etc., from the authenticated user, but also from any “collaborators”. Instead of creating a top-level “team” or “organization” model to own the clients, projects, etc., we took a peer-to-peer approach by associating collaborators directly with users. So, when we queried for anything that could be shared, we’d query from a shared dataset instead of filtering directly from a team/organization. I realize now that this is significantly slower.

I’m sure there are ways to make this faster, but this code belongs to a feature that I’m no longer pursuing, so it feels like a waste. Instead, I decided to go through the app and find all the areas that use this shared dataset. I was shocked to discover that many of the views weren’t even accessible to collaborators, so these queries didn’t even need to use the shared dataset. I spent an afternoon ripping them out, and immediately saw up to 10x speed gains. I’m still unsure why these queries performed fine in Postgres 9.5, but that’s what I get for leap frogging all the in-between versions to Postgres 12.

A couple slow queries still exist because of the collaboration features, like the time entry form and workload form, but I was able to cut their speed in half by putting on my SQL hat and fine-tuning as much as I could. Still, it’s not great. While I wish I had the heart to completely remove these features, I absolutely hate taking anything away from anyone—especially when their business relies on it. Eventually, I’ll need to remove collaboration from Cushion, but for now, it’s not doing as much harm as earlier this week.

By now, you probably forgot that I originally reached out to Heroku for help. They eventually responded to my support ticket—five days later—to tell me that they’re redirecting my ticket to a team who could help. This was infuriating. I did end up receiving a thorough response a day later, but by then, I already figured everything out on my own. This really concerns me, and makes me wonder if it’s time to move on from Heroku, so I’ve been actively looking around. Got a good rec for a managed platform? I’m all ears.