destroytoday.com

A dev’s troubleshooting journey

Let me tell you a tale. It all begins when a user reported an issue with Cushion sending them a notification when it shouldn’t. Cushion notifies users when a client views their invoice, but in this case, the user received the notification when they viewed their own invoice. This should only happen when the user isn’t signed in, so I simply told them to make sure they’re signed in before viewing an invoice. I never received a follow-up, so I considered the case closed.

A few weeks later…

On a late Friday night, another user reports an issue—the exact same issue. (lightning strike) Feeling like I should create a canned response by now, I write the exact same explanation about needing to be signed in when viewing your own invoice. This time, however, the user responds. They say that’s strange because they are signed in (lightning strike). Perplexed, I attempt to reproduce the issue. To my dismay, I receive the notification as well.

invoice-viewed-email

The feature works by making an API request upon loading the invoice page. If the session cookie exists for the user, the request returns a 403. If no session cookie exists for that invoice’s user, the endpoint sends the notification, so I start by troubleshooting the endpoint. Despite seeing the cookie in the browser, the authenticated user is nil. Confused, I inspect the authentication serialization code, but it’s never called. Strange indeed.

Hours pass and I feel no closer to a solution than when I first started troubleshooting hours ago. I feel hopeless. I start to wonder if I’m only scratching the surface. I worry that it might consume many more days of my life, if not weeks. My wife will find me curled up in the tub with the shower running, mumbling about session cookies. In that moment, I notice something out of the corner of my eye. A tiny yellow icon, sitting beside my cookie. An alert.

cookie-alert

Hovering the tiny yellow icon, I discover the truth. The cookie exists, but it is not a good cookie (like an oatmeal raisin when you’re expecting chocolate chip). This cookie requires a SameSite=None setting in order to work like it once did. A quick search reveals this relevant blog post that foreshadowed the issue. Looking at the date, I realize exactly why I missed this blog post because I was busy burning out at the time.

Chapter 2

With this clue, I set out to fix the issue, hopeful that my journey is nearing an end. I quickly discover, however, that Cushion’s version of Rack does not support the SameSite setting. In order to support it, I would need to update this dependency and every dependency that depends on this dependency, which is almost all of the dependencies. One by one, I update them until The Bundler tells me it is satisfied. Bracing for the worst, I run my tests—nearly 6,000 of them—and come out the other side with only four failing tests. Merely a flesh wound.

undefined method `gmtime' for "Mon, 21 Dec 2020 16:19:36 GMT":String

I survey the failing tests, and find several similarities between them. The most notable is that two of the tests fail because the gmtime method does not exist on a date string. The date string is immediately recognizable as the session cookie’s expiration date, but why do only these two tests fail when plenty other tests rely on the same code? I search the internet and come up empty. A red herring.

Retracing my steps, I make my way back to the beginning of the fork, and set down a different path—the other failing tests. These ones rely on an id parameter, but it’s nil. Nothing changed in the setup of these tests, so this can’t be, but it be. Taking a step back, I worry this is yet another red herring, but something tells me to persist.

I spin my wheels for several hours before desperation starts setting in. I find myself inspecting the routing of the entire system as a last resort. Expecting to find the id parameter with its mysterious nil value, I don’t find the parameter at all. Instead, I’m confused to find the captures array, which appears when a route path contains variables, like /:slug or /:id. Strangely, the capture includes a single string—“form”. This is strange because the endpoints don’t have any variables in their route… but the namespace block below them does. I immediately know what’s happening because the failing endpoints end in /form, which is the exact same string that appeared in the captures array.

Instead of routing to the correct endpoint, the requests are routing to the /:id endpoint with “form” as the :id. This explains why the model in the test is nil because there’s no model with the id “form”. That makes sense, but why do these tests fail when plenty of other endpoints sit above namespace blocks? Examining the endpoints, I see why they’re the outliers—the endpoints are the only ones where the namespace path is a variable and the routes before them are strings, instead of solely the HTTP method.

Out of curiosity, I remove the namespace block, and instead append the variable to each route within that block. My prayers are answered when the tests pass! In a moment of enlightenment, I revisit the other tests and clearly see that this capture-based namespace block is the similarity that connects them all. Upon updating the routes, the remaining tests pass. For the first time in this journey, I have a passing test suite. Hallelujah.

I spend several minutes searching for why this could be the case, but I come up short. I do notice that the version of Sinatra that I updated to now relies on Mustermann to handle string matching, but because my original journey isn’t over yet, I will revisit this side quest at a later date.

Chapter 3

With tests passing and an end in sight, I add the necessary samesite: :none setting to my session cookie, and deploy the code to staging. While waiting, I notice the pass-through setting in my staging server’s Cloudfront distribution is labeled “legacy”, so I change it to the recommended value, and continue waiting for the deploy. I’m so relieved that after such a deep rabbit hole I was able to come up for air with a solution, or so I thought…

When the deploy finishes, I visit the staging server with a sense of accomplishment. I can’t believe I was able to get to the bottom of the issue. In that moment, I feel at peace, not realizing the horror that awaits me. The browser quickly flashes a dozen times before uttering one of the most terrifying error messages a dev can see—“Too many redirects”. In a panic, I follow the browser’s suggestion of clearing my cookies. Too many redirects. I clear my cache; I try an incognito window; I try other browsers. Too many redirects.

I must’ve blacked out at this point because I don’t remember ending up at the login screen, but I did. The login screen works, but the root path does not. Strange, but I’ll take anything I can get at this point. I attempt to sign in, but I’m met with an even more terrifying error message, which is so terrifying that it doesn’t even have a stylesheet—“Forbidden”. Still panicked, I go to inspect the session cookie for the page only to discover that the cookie no longer exists.

I’m at a loss for words. I revisit my implementation and start to second guess it. I search for anyone else who might’ve experienced this same issue. Nothing. Running out of ideas, I decide to once again retrace my steps by reverting the changes and re-deploy staging to be 1:1 with production. This will let me start from the beginning and inch my way forward in order to find the culprit. Once the deploy finishes, however, I refresh the page, and immediately feel gutted. “Too many redirects.”

My body liquifies. My blood turns hot with anxiety. “This is impossible,” I mumble to myself. After a minute of staring at nothing, I right myself and dive back in. If this issue persists despite reverting all of my code changes, what could that mean? That would mean that the issue isn’t code-related…

In an instant, time slows, my pupils dilate, and the mug of coffee (that I may or may not be holding for dramatic effect) falls from my hand and crashes to the floor (captured from three different angles). A replay montage revisits the entire troubleshooting journey, blurring between specifics points throughout the day, before landing at the moment when I casually updated the staging server’s Cloudfront settings.

I didn’t realize it at the time, but this setting actually prevents cookies from being forwarded to the origin. This is why the staging server no longer showed the session cookie. This is why the root path kept redirecting too many times. This is why signing in displayed “Forbidden”. After reverting the setting and impatiently waiting for Cloudfront to finish propagating, I refresh the staging server, and it works as expected. I re-revert the code, re-deploy it to staging, and it too works as expected. Finally, the original problem I set out to solve is now resolved, and I can consider this case closed, but what a journey it was.

Epilogue

In case you couldn’t tell, I had a lot fun writing this. I wrote this tweet, then decided it deserved the full backstory. Even though the actual “journey” sent me through a roller coaster of emotions and consumed my entire Saturday, I really enjoyed troubleshooting the issue. The investigative side of programming can be frustrating, but the reward is so satisfying. Not the solution, but the story. Every developer has a good one, and even though I’m going on 24 years of coding, these troubleshooting sagas never cease to exist. If you’ve got a story of your own, and I’m sure you do, I’d love to hear it.