Data Dictionary: How I Learned to Stop Worrying and Love Reporting Standardization

Travis Hansen
Klaviyo Engineering
Feb 1, 2024 · 14 min read


Background

Klaviyo processes several billion events per day representing everything from email deliveries to in-store visits to loyalty point redemptions. Each of these events can contain any number of arbitrary dimensions, ranging from standard ones like country_code to user-specified ones like gift_card_occasion. With a tremendous amount of data like this comes the need for a large number of reports to allow our customers to understand that data.

The Klaviyo app has many reports and pages that display the same data; one example is campaign click rates for November. This calculated metric appears on dozens of different surfaces in our app, at various levels of aggregation (aggregate, grouped by campaign, grouped by country, etc.). These reports and views have been built over the life of Klaviyo by many different teams. Some are new, launched just a few short weeks ago. The oldest date back to the early days, and ownership has been transferred from team to team and engineer to engineer. These derived metrics often contain quite a bit of nuance that is hard to understand unless you dig into the code.

With all of these different calculations scattered across different parts of the app there are bound to be discrepancies. One report may calculate click rates based on total clicks while most might use uniques. Some may use successful deliveries while others might consider total recipients. And what about selecting which campaigns to include in November? Some pages may include any campaigns that had at least one send in November, while others only include campaigns that started sending during the month. There are arguments to be made for these and many other decisions about how to calculate and display data. But if each report is treated as a standalone feature then discrepancies are inevitable. The numbers may be “close enough,” but ultimately it will lead to a loss of customer trust. If your customers can’t trust you to be correct on something as “easy” as a click rate, why should they believe anything else you display?

Why does this problem happen? At most companies, these analytical views are owned by many disparate teams, often ones that don’t consider reporting their primary function. For example, a team that mainly concerns themselves with creating campaigns might also own pages that display campaign performance. These teams (and their product owners) want to ship new features to enable better campaign creation experiences, not focus on making all of the numbers match.

As an added bonus, these teams will spend just enough time making those analytic pages “performant enough.” Why spend weeks trying to optimize a page that loads in 1.2 seconds when time-strapped engineers could focus on building the next thing that will make campaigns easier to use or more successful? None of this is assigning blame; time is a limited resource and those teams shouldn’t be spending time making things faster than necessary.

What Data is Correct?

Let’s take a deeper look at the campaign click rates for November example. Before we start thinking about the right way to calculate our rate metrics, let’s start with something simple: What campaigns do we even want to include in the report? To decide that, we need to decide what timestamp to filter by:

  • When the campaign was first created? No.
  • When the campaign was scheduled (i.e. the user pressed “schedule” in the UI)? No.
  • When the first batch of messages was scheduled to start sending (taking into account timezone and possible timing-based A/B testing)? Reasonable.
  • When the first batch of messages actually started sending? Reasonable.
  • When the first message delivered event entered our event pipeline? Reasonable.
  • When the entire sending process completed? Reasonable.

These are just a few of the options. There are many different types of campaigns, and some have very long sending cycles with several timestamps that could make sense. In the past, different aggregate reports used different approaches, due either to the subtleties of what that report was expressing or to arbitrary decisions by engineers.

What time period does a campaign fall into?

Let’s assume that we settle on one of those timestamps. Problem solved, right? But what if a campaign starts sending at 11:59 pm on November 30 but doesn’t finish until the clock has flipped over to December? This is fairly common for relative local send time campaigns, which take the recipient’s timezone into account for sending. Should all data for that campaign be allocated to November or December? Should we split the campaign in two and allocate each part to the appropriate month?

And while we’re on the subject of time boundaries, which timezone should we use in our logic? UTC? Account timezone? Browser timezone? Don’t get me started on timezones.

Let’s assume that we figure all that out and can consistently identify which campaigns should be included. Now let’s do some rate calculations!

Which formula is “correct”?

Click Rate = Clicks / Deliveries

Click Rate = Unique Clicks / Deliveries

Click Rate = Clicks / Recipients

Click Rate = Clicks (excluding bots) / Recipients

There are cases to be made for each of these formulas (and likely more as well). And what does Clicks even mean? Are we counting all clicks from November? Or just the ones from campaigns that were sent in November? Fun side note: prior to standardizing our reporting, we had a bug in one report that occasionally displayed click rates over 100%. There were clicks in the report month but the vast majority of deliveries were from a prior month. This was the source of numerous bug tickets over the years and led to lots of customer confusion. Needless to say, when calculating a rate you want the numerator and denominator to line up, but that’s often easier said than done!
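
As a toy illustration with made-up numbers (not the actual figures from that bug), here is how a click rate can exceed 100% when the numerator and denominator are scoped differently:

```python
# Made-up numbers for illustration: the campaign delivered almost entirely in
# October, but recipients kept clicking into November.
november_clicks = 500          # clicks recorded during November
november_deliveries = 400      # the few deliveries that landed in November
campaign_deliveries = 10_000   # all deliveries for the campaign (mostly October)

# Misaligned: both values are scoped to the calendar month, but most of the
# campaign's deliveries fall outside it -- the rate blows past 100%.
misaligned_rate = november_clicks / november_deliveries    # 1.25 -> "125%"

# Aligned: numerator and denominator are scoped to the same set of campaigns.
aligned_rate = november_clicks / campaign_deliveries       # 0.05 -> "5%"

print(f"misaligned: {misaligned_rate:.0%}, aligned: {aligned_rate:.1%}")
```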

So what are our options?

Now that we’ve identified why we have discrepancies, what are we going to do about it?

Option 1: Live with the discrepancies

Yes, this is the easiest path. It literally requires changing nothing. But you might get a few dozen bug tickets a year that look something like this:

Title: Inconsistent Click Rate

Description: I’m seeing 3 different click rates for the same flow. How do I know which one is accurate?

Figure 1: Flow click rate discrepancy example

But keep in mind that for each customer that files a bug ticket, there are many more that just assume the data is wrong and internalize that they can’t trust your numbers. This loss of trust is pervasive and very hard to rebuild. If they think you can’t get something as easy as click rate correct, why should they trust anything else? You might think documentation will help, but nobody wants to read long articles in the help center explaining nuanced differences about how rates are calculated in one report vs. another.

Option 2: Audit all reports and fix them, one by one

Also known as the whack-a-mole approach. Yes, you could settle on some standards and then mandate to your teams that they update their code.

But this is easier said than done. Some of these reports might be years old and the teams that own them may not have worked on the initial development. It could take a non-trivial amount of time for them to fully understand how the report is implemented before they can start making changes.

And even if you fix all the existing reports, what happens when the company wants to update the definition of a calculated metric? Or perhaps your data platform team launches a new service layer that everyone needs to migrate to? (Both have happened here in the last 18 months.) Each time one of these situations arises, updating reports one at a time will introduce discrepancies again (until all reports have been updated). That could take an exceedingly long time and lead right back to losing customer trust.

Option 3: Centralize reporting standards

The best long term solution to eliminating discrepancies is a centralized system that can provide the aggregated data for all of your reporting use cases. It can start relatively simply. First, gather stakeholders and decide on definitions for your calculated metrics. Spend time discussing how you’re going to handle campaign fetching, filtering, and anything else that needs a standardized approach. The goal is to reach consensus on what reporting standards your company will adopt.

Then just start building. The proof of concept can be relatively simple. Pick an existing use case that doesn’t need a lot of bells and whistles and build the system to support it. Start adding functionality (focusing on closing the biggest feature gaps first) and migrating over use cases. Eventually there will be a positive feedback loop: the more use cases that have adopted the centralized system, the easier it will be to justify adding more functionality. The more functionality that is added, the more use cases will migrate.

The best part is that you only need to perform this migration once. As requirements, metric definitions, or backend implementations change, you can make the update in one place and all reporting surfaces will instantly be in alignment. If you find a bug, you can squash it once and be sure it rolls out to all use cases at the same time.

As you might have guessed, this is the path that Klaviyo chose. We named our centralized system Data Dictionary, as it is intended to be the centralized source for all analytics data within the Klaviyo app. The term data dictionary is not new — in fact it is used throughout the industry to describe attributes and metrics that are available in a data product. Our version of Data Dictionary goes one step further by allowing users to actually build result sets with any supported attributes and metrics. In other words, instead of only telling you the standard definition of click rate, it will actually call into the data systems and calculate it for you.

The Data Dictionary proof of concept began development in October 2022, and we started by building the features needed to support our Audience Breakdown report. The migration was completed in January 2023. At the time of this writing (January 2024), we have over 50 use cases that have migrated to Data Dictionary, with more onboarding almost every week.

The Vision

At the outset of the project we wrote down our vision. Here are three of the seven points we listed in that original document:

  • Eliminate data discrepancies in the app by being the single source of reporting data and applying Klaviyo-approved standards.
  • Reduce development time by abstracting away complexities related to domain-specific query patterns. Furthermore, we will publish an API that integrates natively with our front-end components to reduce developer friction and simplify infrastructure.
  • Improve performance for all use cases by including optimizations and eliminating redundant processing.

I’ll be addressing each of these below.

Eliminate data discrepancies

As we have explored, there are many ways that aggregated data can be calculated and presented. The most important goal of Data Dictionary is to eliminate discrepancies to ensure that customers trust our data. Without tackling this challenge, none of the rest of the goals even matter.

The heart of Data Dictionary is the Datasource. As an example, here’s how we process a request for campaign-centric data:

  • Request is sent to the Data Dictionary Datasource.
  • Since it is a campaign request, it is routed to the Campaign Datasource.
  • The Campaign Datasource calls out to the Campaign Fetcher to determine which campaigns should be included in the results (based on timeframe, filters, etc).
  • Similar calls are made to other fetchers to figure out which metrics need to be fetched from Metric Service (the Klaviyo system that allows querying counts, uniques, and values for various metrics).
  • The requests to Metric Service are constructed and executed in an optimal way (more details below).
  • The results are aggregated using the groupings that were specified in the request (e.g. one row per month, campaign, and/or country).
  • Derived metrics are calculated based on standardized formulas (e.g. Click Rate = Unique Email Clicks / Total Delivered Emails).

Figure 2: Data Dictionary high level architecture
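
To make this flow concrete, here is a heavily simplified sketch of what a Datasource pipeline like this could look like. Every class, field, and method name below is illustrative; none of it is the actual Klaviyo implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field


# All names below are illustrative, not Klaviyo's actual implementation.
@dataclass
class ReportRequest:
    object_type: str                               # e.g. "campaign"
    timeframe: tuple                               # e.g. ("2023-11-01", "2023-11-30")
    groupings: list = field(default_factory=list)  # e.g. ["month", "country"]
    filters: dict = field(default_factory=dict)    # e.g. {"tags": ["holiday"]}
    metrics: list = field(default_factory=list)    # e.g. ["click_rate"]


class CampaignDatasource:
    """Campaign pipeline: fetch campaigns, query raw counts, aggregate, derive."""

    def __init__(self, campaign_fetcher, metric_service):
        self.campaign_fetcher = campaign_fetcher   # decides which campaigns to include
        self.metric_service = metric_service       # returns raw measurement rows

    def run(self, request):
        # 1. Which campaigns belong in the report (timeframe, filters, ...)?
        campaigns = self.campaign_fetcher(request.timeframe, request.filters)
        # 2. Pull raw measurements for those campaigns from Metric Service.
        raw_rows = self.metric_service(campaigns)  # rows with unique_clicks, deliveries, ...
        # 3. Aggregate the raw counts by the requested groupings.
        grouped = defaultdict(lambda: {"unique_clicks": 0, "deliveries": 0})
        for row in raw_rows:
            key = tuple(row[g] for g in request.groupings)
            grouped[key]["unique_clicks"] += row["unique_clicks"]
            grouped[key]["deliveries"] += row["deliveries"]
        # 4. Apply the standardized formula exactly once, in one place.
        return {
            key: {"click_rate": c["unique_clicks"] / c["deliveries"] if c["deliveries"] else 0.0}
            for key, c in grouped.items()
        }


class DataDictionaryDatasource:
    """Front door: routes each request to the object-specific datasource."""

    def __init__(self, datasources):
        self.datasources = datasources  # e.g. {"campaign": CampaignDatasource(...)}

    def run(self, request):
        return self.datasources[request.object_type].run(request)
```

The property that matters is that the routing, the object fetching, and the standardized formulas all live in this one pipeline, so every report that calls into it gets identical answers.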

Reduce development time

Prior to Data Dictionary, each report was responsible for determining what data to include, how to define derived metrics, how to aggregate data, and so on. One of our more straightforward reports is the Campaign Performance Report (part of our Custom Reports feature). The report allows users to specify a list of metrics, a single level of aggregation, tag-based filtering, and a timeframe, and it computes campaign performance accordingly.

Originally this report consisted of over four thousand lines of code, included dozens of bespoke functions not used anywhere else, and had several branching paths of complicated logic depending on which attributes/metrics were selected. It took a senior member of the team almost a full week to add support for our Apple Privacy Opens metric because it required adding new properties to several types/functions and wiring them through each layer of code.

The code itself had been refactored several times over its lifetime: migrating to new backends, changing to batch endpoints, etc. Each migration took weeks, touched over a thousand lines of code, and required many hours of in-depth regression testing to ensure that the new report output matched the previous version.

Now that the Campaign Performance Report has been migrated to Data Dictionary, the complete implementation requires fewer than 700 lines of code. The logic is also far simpler:

  • Instantiate a DataDictionaryDatasource object for the company.
  • Build a request with the user-specified timeframe, groupings, filters, and metrics.
  • Map the results into the expected output format.
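
In code, the migrated report boils down to something like the following, reusing the hypothetical ReportRequest and Datasource names from the earlier architecture sketch (again, not the real Data Dictionary API):

```python
# Illustrative only; the real Data Dictionary interface may look quite different.
def campaign_performance_report(datasource, timeframe, groupings, filters, metrics):
    """The migrated report, reduced to: build a request, run it, map the output."""
    request = ReportRequest(
        object_type="campaign",
        timeframe=timeframe,   # e.g. ("2023-11-01", "2023-11-30")
        groupings=groupings,   # e.g. ["campaign"]
        filters=filters,       # e.g. {"tags": ["holiday"]}
        metrics=metrics,       # e.g. ["click_rate"]
    )
    results = datasource.run(request)
    # Map each result row into the columns the report UI expects.
    return [{"group": key, **row} for key, row in results.items()]
```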

Because this is all done via the centralized system, regression testing is much simpler and faster. Bespoke code that was custom-written for this one use case was removed, and the team has much less cognitive load to deal with. Additionally, adding new metrics or changing the backend implementation now requires no changes to the report code at all.

Improve performance

One of the advantages of Data Dictionary (that might not be obvious) is that there is now a team with the code ownership and motivation to make things really fast. Most feature-based teams don’t have the ability to spend time on this; time spent on optimization is less time available for new features.

We tackled the low-hanging high-impact fruit first: optimizing requests into Metric Service, the interface for our OLAP datastore.

  • Parallelize the queries. These requests are I/O bound, so this results in a massive speed increase.
  • Combine requests for different measurements into a single query (i.e. rather than querying for counts, uniques, and values separately, we fetch them at the same time).
  • Only query for each metric once. Many calculated metrics use Deliveries as the denominator, so instead of fetching it for each different rate, just fetch it once and use it multiple times.
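
As a rough sketch of the first and third points, the fetch layer might collect every (metric, measurement) pair required by the requested formulas, deduplicate them, and then fan the queries out over a thread pool. The FORMULA_INPUTS table and the Metric Service client signature below are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical formula definitions: each derived metric names the raw
# measurements it needs. ("delivered", "count") appears in both formulas
# but is only fetched once.
FORMULA_INPUTS = {
    "click_rate": [("clicked_email", "unique"), ("delivered", "count")],
    "open_rate": [("opened_email", "unique"), ("delivered", "count")],
}


def fetch_raw_measurements(metric_service, requested_metrics, campaigns):
    """Collect every (metric, measurement) pair once, then fetch them in parallel."""
    needed = {
        pair
        for metric in requested_metrics
        for pair in FORMULA_INPUTS[metric]
    }  # the set() deduplicates shared inputs like ("delivered", "count")

    # The queries are I/O bound, so a thread pool gives a near-linear speedup.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {
            pair: pool.submit(metric_service.query, campaigns, *pair)
            for pair in needed
        }
        return {pair: future.result() for pair, future in futures.items()}
```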

We also spent a lot of time using profiling tools. One example of what we learned: some of the services that we called in our object fetchers relied on Django’s ORM functionality. This works fine for pages that load one object at a time, but when trying to build an aggregated report, the performance becomes unacceptable. A Flow Performance Report (grouped by variation) for some of our largest customers used to require 80,000 queries. After discovering this, we built bulk methods and were able to retrieve the same data in nine database calls, which sped the report up from 2 minutes to 4 seconds.
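
The underlying issue is the classic N+1 query pattern. The model and function names below are hypothetical, but the shape of the fix we applied is the familiar one: replace per-object lookups inside a loop with a single bulk query and group in memory.

```python
from collections import defaultdict

# Hypothetical Django model, for illustration only:
# class FlowVariation(models.Model):
#     flow_id = models.IntegerField()
#     name = models.CharField(max_length=255)


def variation_names_slow(flow_ids):
    """One query per flow -- tens of thousands of round trips for big reports."""
    names = {}
    for flow_id in flow_ids:
        names[flow_id] = list(
            FlowVariation.objects.filter(flow_id=flow_id).values_list("name", flat=True)
        )
    return names


def variation_names_bulk(flow_ids):
    """A single query fetches every variation, then we group in memory."""
    names = defaultdict(list)
    rows = FlowVariation.objects.filter(flow_id__in=flow_ids).values_list("flow_id", "name")
    for flow_id, name in rows:
        names[flow_id].append(name)
    return names
```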

Finally, there are a few optimizations that apply to advanced reports; one such optimization is called grouping rollups. Let’s say that a team wants to build a deliverability report that shows data broken down by send channel, but also by message type and day of week. And they want to report on overall deliverability as well. Normally this would take four different requests into Data Dictionary. With grouping rollups, the desired list of aggregations is passed in a single call and the system calculates the results for each separately. This aggregation is done at the very end of the request, so it is incredibly efficient and doesn’t require any additional queries. In practice, we are able to build these four result sets with only 10% overhead relative to a single request.
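
A sketch of the idea: query once at the finest grain, then compute every requested grouping in memory from the same set of rows. The function and column names below are illustrative only.

```python
from collections import defaultdict


def grouping_rollups(rows, grouping_sets, measure="delivered"):
    """Aggregate one set of raw rows into several groupings without re-querying.

    rows: dicts at the finest grain (channel, message_type, day_of_week).
    grouping_sets: e.g. [("channel",), ("message_type",), ("day_of_week",), ()]
        where () is the overall (ungrouped) rollup.
    """
    results = {grouping: defaultdict(int) for grouping in grouping_sets}
    for row in rows:
        for grouping in grouping_sets:
            key = tuple(row[col] for col in grouping)
            results[grouping][key] += row[measure]
    return {grouping: dict(totals) for grouping, totals in results.items()}


rows = [
    {"channel": "email", "message_type": "campaign", "day_of_week": "Mon", "delivered": 120},
    {"channel": "sms", "message_type": "flow", "day_of_week": "Tue", "delivered": 80},
    {"channel": "email", "message_type": "flow", "day_of_week": "Mon", "delivered": 50},
]
print(grouping_rollups(rows, [("channel",), ("message_type",), ("day_of_week",), ()]))
# {('channel',): {('email',): 170, ('sms',): 80}, ..., (): {(): 250}}
```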

For nearly every use case that has migrated to Data Dictionary, we’ve seen a 3–5x performance improvement at p50. The Campaign Performance Report (detailed above) was sped up by an order of magnitude. And the most extreme case (Flow analytics) saw an improvement of 100x for p99 requests. Below are Grafana dashboards showing a ~20x speedup for the Message Statistics page.

Figure 3: p95–p999 to render report pre-Data Dictionary
Figure 4: p95–p999 to render report post-Data Dictionary

How to be successful

The biggest concern during the early days was adoption. If we spent time building out a proof of concept that no one wanted to use, the project would fail. Additionally we didn’t want to have to migrate all the reports ourselves (our team is small — it would have taken all of our focus for over a year). So what did it take for our project to be successful?

  • First and foremost, you need a sponsor. Someone in leadership who sees the vision and agrees that reporting consistency is important. Without that, you’ll never be given the chance to build.
  • Be your own first customer. Pick a use case to migrate and build the system just enough to make it work. Figure out where the rough edges are and smooth them. Do it again. You’ll be surprised at how much you’ll discover by using it. Once you think it’s super easy to use and provides obvious benefits, only then should you try to find someone outside the team to be your first “real user.”
  • Make it better than the alternative. It needs to be significantly easier to use than what teams could build on their own, and it should be faster as well (or at least not slower). Engineers like to make things fast, and seeing their performance metrics improve several times over makes them excited to use what you’ve built.
  • Evangelize what you’ve built. You need to communicate your wins. Share your new features loudly and often. Publish a roadmap and stick to it as much as possible. Even if you think you’ve communicated enough, you probably haven’t. So keep going!
  • Find your champions and empower them. Your system needs advocates in both product and engineering. The majority of migrations should be performed by your users, and if you make it easy for them they will let you (and others) know. Eventually you may even find that other teams want to help you build it. Our attribution team needed to add support for custom conversion metrics and decided to build support directly into Data Dictionary.
  • Publish clear documentation. This includes solid explanations and examples for end users, accurate and deep technical documentation for engineers, and high-level capabilities for product managers/designers.
  • Save some capacity. A roadmap is crucial, but there will always be things that come up as users start to migrate. One of our earliest users migrated the Campaign List to Data Dictionary and found an edge case that caused performance to be incredibly slow for some customers. Since we had capacity, we were able to find two quick fixes that sped things up by an order of magnitude. If you don’t have time to address user needs in real time, they will find an alternative (often without you) and may be hesitant to try your system again.

Conclusion

It may seem like this sort of project is too focused on nitty gritty details, spending company resources to get the data exactly correct. Detractors have highlighted that a percent or two discrepancy isn’t that big of a deal and we’d be better served building new features instead. But our customers disagree; they have expressed frustration and skepticism when presented with discrepancies.

Modern apps display an ever-increasing number of metrics. Our customers expect those numbers to be consistent and correct, and that doesn’t happen by chance. It takes a concerted effort, which is best achieved by building a standardized reporting system. As we’ve now seen, there are many significant benefits that you can achieve once you’ve done so (even some you may never have considered). So what are you waiting for? Go standardize all your reporting data!
