Klaviyo Incident Management: Interview with Laura Stone

Eric Silberstein
Klaviyo Engineering
12 min read · Apr 8, 2024


On January 4, 2023, in the early afternoon, Engineering Manager Bob Long was alerted to a Sentry exception, EndpointConnectionError, that was blowing up. Shubham Naik, an engineer on his team, checked our monitoring and saw that at least one list import had taken longer than expected to complete.

So, at 1:05, Bob went into our #dev-incidents slack channel and typed /incident List import problems. This created slack channel #20230104_list_import_problems, created an RCA (Root Cause Analysis) google document, and logged the incident on our internal engineering status page.

Other engineers from the team responsible for our list import feature jumped into the slack channel and into a zoom meeting. They discussed what they were seeing and looked to see if any other imports were slow, posting links to tools, playbooks, grafana dashboards, and the output of diagnostic scripts.

By 1:24, the team was concerned enough that engineer Brandon Novick took the hat to block other teams from deploying code and to prevent a deployment train from going out.

By 1:37, the team figured out there were DNS issues on some of the boxes processing list imports and paged our SRE (Site Reliability Engineering) team to join the channel.

At 1:46, the team concluded that the incident should be upgraded. This resulted in paging Lead SRE II Zac Bentley, our on-call Incident Commander at the time. Zac joined the zoom and began coordinating the investigation.

By 1:52, Lead SRE John Meichle, who’d joined the investigation around ten minutes earlier when SRE was paged, had tracked the issue down to something unusual — the boxes in question were exhausting their UDP connections.

The Incident Commander determined the incident warranted communication outside of engineering. So, at 1:57, Senior Product Manager Madeleine “Mads” Guttuso joined the channel and zoom meeting to serve as Internal Communications Lead. Mads got up to speed and at 2:06 posted the following to our Klaviyo-wide internal status page:

At around 12:40 pm, we noticed an alert that a few customers’ list imports and form subscription double opt-in emails are delayed. We are currently investigating the issue and working towards identifying the root cause.

Investigation continued. By 2:19, it was looking like the incident was related to a system that Zac himself was responsible for, so Engineering Manager Nicholas “Nick” Hoffmann took over as Incident Commander, freeing up Zac to do technical work.

At 2:35, John used our metrics instrumentation to identify exactly when file descriptors started getting exhausted, which gave the team high confidence that the problem did in fact start with a deployment that went out at 12:40.
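The post doesn’t show Klaviyo’s actual instrumentation, so treat the following as a sketch only: one plausible way to emit the kind of host-level gauges that let you pin down when descriptor exhaustion began, assuming the psutil library and a hypothetical emit() helper standing in for a real metrics client.

```python
import time

import psutil  # assumed dependency; used here for process/host introspection


def emit(metric: str, value: float) -> None:
    # Hypothetical stand-in for a real metrics client (e.g. a gauge).
    print(f"{int(time.time())} {metric}={value}")


def sample_descriptor_usage() -> None:
    proc = psutil.Process()
    # Open file descriptors held by this process (Unix only).
    emit("worker.open_fds", proc.num_fds())
    # UDP sockets currently open on the host; a steady climb starting right
    # after a deploy is exactly the signal that dates the regression.
    emit("host.udp_sockets", len(psutil.net_connections(kind="udp")))


if __name__ == "__main__":
    while True:
        sample_descriptor_usage()
        time.sleep(60)  # one sample per minute is enough to see the trend
```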

At 2:42, Nick needed to work on other things unrelated to the incident and so handed the Incident Commander role to Lead SRE Andris Cakuls.

At 2:43, Zac posted a PR for review and test. The PR added caching for a client that was being instantiated so many times within a certain service that it was exhausting UDP connections.
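The PR itself isn’t included in the post, but the shape of that kind of fix is common enough to sketch. Assume a hypothetical UdpClient standing in for the real client: before the change, every call site constructed a fresh instance (and a fresh UDP socket); after it, callers share one cached instance per destination.

```python
import socket
from functools import lru_cache


class UdpClient:
    """Hypothetical client; each instance opens its own UDP socket."""

    def __init__(self, host: str, port: int) -> None:
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, payload: bytes) -> None:
        self._sock.sendto(payload, self._addr)


# Before: a new UdpClient per call on a hot path, so the service slowly
# exhausts the host's UDP sockets and file descriptors.
# After: one cached instance per (host, port), shared by every caller.
@lru_cache(maxsize=None)
def get_client(host: str, port: int) -> UdpClient:
    return UdpClient(host, port)
```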

At 3:06, Andris, in his role as Incident Commander, posted a summary in slack of what was now understood from the discussion taking place over zoom.

At 3:12, Mads, in her role as Internal Communications Lead, updated the incident on our internal status page:

The issue has been identified and we are working on a fix. Customers will still experience a delay in list imports and subscription double opt-in emails and SMS messages.

At 3:28, Senior SRE Morrisa Brenner merged and deployed Zac’s PR.

Around the same time, other engineers involved in the investigation tracked down exactly what happened in the 12:40 deploy that caused the problem in the first place.

At 3:33, Andris summarized the situation again, reducing all the discussion to a few bullet points with status, cause, and plan.

At 3:52, Mads needed to drop for an interview and handed the Internal Communications Lead role to Lead Product Manager Jen Blagg.

At 4:34, Jen updated the incident:

As of 3:40 pm EST, a fix has been implemented and we are monitoring the issue. All delayed customer list double opt-in messages (email and SMS), subscriptions via single opt-in, and SMS messages have been processed and are now sending in real-time. List imports are operational and no longer delayed.

Introduction

I share the example above because watching our team handle an incident is a thing of beauty. A problem happened. The right people got pulled in. Other people got pulled in. Nobody panicked. The problem was chased down to a weird and confusing root cause. The problem was fixed. No data was lost. Nothing was delayed beyond a few minutes. We didn’t need to escalate to an external incident. And along the way the baton for various roles was passed from person to person without ever being dropped. It’s what you expect from a team or troupe where the people, the processes, and the relationships are such that from the outside, you see the fluid, coordinated movement of a well-oiled machine.

It’s taken us years to develop our incident management skills to their current level. For a window into how we do things, I spoke with Director of Engineering Laura Stone. Laura joined Klaviyo seven years ago and today leads our thirty-five person SRE team. Laura kicked off a big leap in how we handled incidents at the start of 2019. Here’s an excerpt from the RFC Laura wrote at the time:

Excerpt from 2019 RFC

Problem Statement

Klaviyo experiences outages and incidents, like many tech companies. Historically, these times of crisis have been approached in an ad-hoc manner, leaving coordination of efforts and communication to chance. This has resulted in sub-optimal communication strategies and a lack of role clarity (with roles being implicitly defined and executed in tandem with remediation efforts). Let’s change that.

End Goals

  • Establishment of Incident Commander role and rotation
  • Clear understanding from all on-call engineers of the incident response process
  • Repeatable process for handling post mortems and communicating learnings
  • Make it easier for engineers to ramp up to on-call rotations

Interview with Laura

We used PagerDuty and had people on call way before 2019. What was wrong?

At that time, there really wasn’t any coordination beyond someone getting a page and then trying to figure it out, maybe starting a zoom, maybe not. We were all in the office. So it was also like, oh, there’s a problem, people just gather together. It became pretty clear that that wasn’t going to work long term, because we were having a lot of problems around visibility and communication, outside of engineering and even within it. So mostly as a labor of love, and wanting to not have this problem happen anymore, I started looking at resources on how other companies do incident management and started writing up an RFC.

Did we adopt the PagerDuty roles and process?

I used a lot of their process. I simplified it a bit: the way that PagerDuty suggests you do it is you have an Incident Commander, which we have, and then potentially a bunch of other roles as well. PagerDuty suggests a scribe. We don’t do that; our Incident Commander handles it. We have an Internal Communication Lead who gets involved when an incident is upgraded beyond what we call a pure engineering incident. We also have an Incident Analyst, but that was a later addition.

From our internal wiki

What does our incident slackbot do? Why did we create it?

It’s a lot of overhead for the Incident Commander to have to manually make sure that the incident is set up for success. So do you have a space where people are congregating to talk about it? That’s a slack channel and a zoom meeting. Do you have somewhere where you can write down what’s happening? That’s an RCA google document. Do you have some mechanism for communicating outward that this is going on? That’s Statuspage.

Our slackbot automates that. It has a few slash commands you can use to manage an incident. You can create an incident, which creates the slack channel and the google doc. You can upgrade the incident, which pages the Incident Commander. You can resolve incidents, and so on.
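The bot itself is internal and the post doesn’t describe its implementation, so the following is only an illustrative sketch of what a /incident slash command could look like, assuming Slack’s Bolt for Python framework; the channel-name format mirrors the #20230104_list_import_problems example above, and create_rca_doc and log_to_status_page are hypothetical stubs.

```python
import re
from datetime import date

from slack_bolt import App  # assumes Bolt for Python; the real bot may differ

app = App()  # reads SLACK_BOT_TOKEN and SLACK_SIGNING_SECRET from the environment


def create_rca_doc(name: str) -> str:
    """Hypothetical stub: the real bot creates an RCA google doc and returns its URL."""
    return f"https://docs.google.com/document/d/placeholder-{name}"


def log_to_status_page(name: str, doc_url: str) -> None:
    """Hypothetical stub: the real bot logs the incident on the internal status page."""


@app.command("/incident")
def create_incident(ack, command, client):
    ack()  # Slack expects an acknowledgement within three seconds
    # "/incident List import problems" -> "#20230104_list_import_problems"
    slug = re.sub(r"\W+", "_", command["text"].strip().lower()).strip("_")
    name = f"{date.today():%Y%m%d}_{slug}"
    channel = client.conversations_create(name=name)["channel"]
    doc_url = create_rca_doc(name)
    log_to_status_page(name, doc_url)
    client.chat_postMessage(
        channel=channel["id"],
        text=f"Incident declared: {command['text']}\nRCA doc: {doc_url}",
    )


if __name__ == "__main__":
    app.start(port=3000)  # or wire this up over Socket Mode
```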

What’s an engineering incident vs. an internal incident vs. a public incident?

An engineering incident is just the result of us wanting engineers to be really proactive about communicating what’s going on. When you declare an engineering incident, no one knows about it unless someone happens to be looking at the #dev-incidents slack channel.

Then, if and when you upgrade it, it becomes an internal incident. The Incident Commander finds out about it because they get paged, and they’re instructed to page the Internal Communication Lead if they think one is needed.

Then the Internal Communication Lead determines whether the incident should be public, and we have an extensive playbook around that with guidance on scope, impact, and duration.

Each incident tier has its own status page and the Internal Communication Lead is responsible for updating the internal and public status pages.

A message I hear us tell new people, and emphasize in the lead-up to Black Friday / Cyber Monday each year, is don’t be shy about opening incidents. What’s the idea there?

A lot of things at Klaviyo are interrelated. We want to make sure that we’re being super communicative among ourselves about what’s going on, so people know. If UCG [a subsystem of our segmentation engine] is down or backed up, and that impacts flow sending, and people are reaching out to the flows team, and flows doesn’t know there’s an issue with UCG, then potentially they’re going off and doing their own investigation, which isn’t needed. So we instituted this and really try to socialize that it’s not bad to communicate about an incident. We had instances where the product or some underlying system was impacted. A team was working on it, but they didn’t communicate that outside of their own team. And then a bunch of other teams got pinged from external sources and started doing their own investigations, and it just got really messy, and we pulled a bunch of people into something that they didn’t need to be pulled into. So it’s about being protective of people’s time, so that we can have only the people that are really needed on the incident, and everybody else can continue their regular business. Incidents can be really, really disruptive if they’re not managed right.

What makes a good Incident Commander?

The criteria for being eligible to be an Incident Commander are having been at Klaviyo for six or more months and having been involved in two or more incidents that involve multiple teams. The reason we say incidents that involve multiple teams is that that tends to be when an Incident Commander gets pulled in, because that’s when coordination efforts need to occur. You’ve got one person or people from one team investigating one thing, a person or a subset of a different team investigating something else, and the Incident Commander is responsible for being like, okay, how are these things related? Along which axes are we making progress, and which ones should we pursue further?

How technical?

It’s kind of implicit. It’s helpful when Incident Commanders know about the organization and the technical systems. Especially the people organization. The thing that’s important is knowing who to call when something’s going wrong. So like if UCG is backed up, what team do I page for that? A lot of the commander’s responsibility is coordination. So, being a strong facilitator and communicator is important. It’s also about keeping track of threads of work that are going on and trying to continually drive toward resolution on those things.

Staying calm under pressure?

Most of the engineers who’ve been around for a while are very good at that. Part of that was we used to be under pressure a lot and we got used to it. But yeah, I definitely think staying calm under pressure is helpful. Being organized is also helpful.

How does the Incident Commander structure things?

There’s this kind of circle that an Incident Commander facilitates, which is choosing actions like various investigations, coming to consensus on action items, and then following up on them. So it’ll be like: what is going on with the systems? What should we do about it? Based on what we should do, assign action items to the people we call responders, the engineers actually doing the technical work to respond to the incident. Part of assigning out action items is like, okay, you’re gonna go off and investigate what’s going wrong with ZooKeeper, and I’ll check back in with you in fifteen minutes on what you found out. Then the follow-up is, hey, it’s been fifteen minutes, is your investigation done? Do you need more time? And then, based on what we learn, starting the investigation again. They do that circle. It can be messier than that, certainly, but that’s the theory behind it.

It’s facilitating: how can we take actions that start by mitigating whatever impact this might be having on customers? And then, after that, coming up with a larger plan. Is there something we can do in the moment to resolve this? Or do we need action items after the fact, after we would consider the incident to be over, where people still need to do work in some way? The Incident Commander is responsible for cataloging that, distributing it, and making sure everybody is committed to those things.

The Incident Commander’s job can be tough because they’re trying to facilitate that cycle of forward progress. And then they’re also writing everything down in what we call the post-mortem doc. They’re documenting decisions as they’re being made as well. They record a timeline of actions that have been taken, and then also at a high level, what is broken and what the customer impact of that is.

That’s during the incident. After resolution, the Incident Commander either delegates the scheduling of a post-mortem, or schedules and leads it themself. The post-mortem focuses on typical retro stuff. What went well? What didn’t? What can we improve on? Are there any things that we got lucky about? What are the things that we learned from this? What should we do to prevent this from happening again?

From our internal wiki

Something I see we have discipline around — maybe you can confirm or deny — is even in the heat of the moment, people pasting everything they’re doing into slack and asking for confirmation before running commands.

Screenshot from slack

Yeah, it’s super important to make sure that happens. For example, being careful to post the command you’re gonna run before you run it. It helps the Incident Commander too.

Number one, it’s good ops practice. You should get someone to look over your work before you do it, like two sets of eyes on anything. Like getting code review.

And then number two, the Incident Commander is supposed to document everything that’s happening, but that can be really hard to do if you don’t actually know what the engineer typed. Let’s say you task someone with running a query on a database, and they say they ran the query, and then they’re like, oh, crap, data is being returned differently. If you don’t actually see what that query was, then you can’t validate what was run. So having the actual command that was run is useful for that investigation cycle. Also, the commander will often revisit the slack chat after the incident is done to fill in the timeline with those specific commands.

It’s super important to make sure that happens. Because again, you want more than one set of eyes on whatever you’re doing, just in case. And you’re also documenting what you’re doing, so that when you look back at what happened, you have the direct source rather than someone’s word for it.

How did things change with Covid?

To me, remote is kind of the easiest mode to do incident management in, because everybody’s looking at slack. Everybody can see the zoom, everybody can join the zoom, and then everybody knows where everybody is, because they’re just on the zoom. The only thing is, if people aren’t following the process, it’s hard to know: if people don’t declare an incident and are only having conversations in private channels, you don’t know what’s happening. In the past you would at least have a clue, because you might see people congregated in the office. Although we’re so much bigger now that a team could be on a different floor, so the same thing could happen.

Some in a conference room and some over zoom is hard?

Yeah, understanding and distributing the conversation that’s happening in the conference room, particularly when things are kind of chaotic.

Thanks!
