Domain Decomposition In Practice

Alex Narayanan
Klaviyo Engineering
9 min read · Mar 18, 2024


In a previous post, Fernando explained the process through which we identified and organized all the domains within the Channel Infrastructure pillar at Klaviyo. Engineers across the pillar collaborated and grouped all our system capabilities into discrete domains — send coordination, rendering, compliance, deliverability management — to name a few.

Over the past several months, my team — which owns the rendering domain — has started transforming our services to fulfill the vision conceived in the original domain decomp exercise. So far, we’ve defined and adopted a new discrete rendering interface, as well as decoupled our systems from some key external data dependencies. In this post I will share a little bit about our experience so far, including strategies and learnings that can hopefully assist others undertaking similar projects.

Background

First…a bit more background about our rendering service, also known internally as the Artifact Pipeline. It is a high-volume, business-critical service responsible for injecting personalized data into message templates (e.g. email templates). This happens millions of times every minute, so performance is vital to ensuring recipients receive their messages in a timely manner. If we zoom in and examine a single render, the process looks a little bit like the following diagram: for a given (1) Message and Recipient, we (2) fetch the relevant personalized render context, and then (3) merge the context together with the message template to build an artifact.

A visual representation of the steps involved in a single render
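
To make the diagram concrete, here is a minimal sketch of those three steps in Python. The model shapes, the fetch_render_context helper, and the canned context are illustrative stand-ins rather than the actual Artifact Pipeline code.

from dataclasses import dataclass
from string import Template


@dataclass
class Artifact:
    recipient_id: int
    content: str


def fetch_render_context(message_id: int, recipient_id: int) -> dict:
    # (2) In the real pipeline this gathers personalized data for the recipient;
    # here it just returns a canned context.
    return {"first_name": "Ada", "subject": "Welcome!"}


def render(message_id: int, recipient_id: int, template: Template) -> Artifact:
    # (1) Given a Message and a Recipient...
    context = fetch_render_context(message_id, recipient_id)
    # (3) ...merge the context into the message template to build the artifact.
    return Artifact(recipient_id=recipient_id, content=template.substitute(context))


artifact = render(message_id=1, recipient_id=42, template=Template("Hi $first_name: $subject"))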

The Artifact Pipeline itself sits in the middle of the much larger Klaviyo Sending Pipeline. Systems across the full Sending Pipeline leverage a couple of key data models (Messages and Recipients) to know what content to send to whom. These models are owned and managed by the teams that initiate sends, yet every system across the Sending Pipeline relies heavily on them. The Artifact Pipeline used to read these top-level data models directly from our client teams’ DBs, since this data was needed to derive all remaining render context as part of step 2 in the diagram above. Our interface for requesting an Artifact Pipeline render at the time looked like:

def schedule_render_request(
    self,
    company_id,
    message_id,
    recipient_ids,
    priority=RenderPriority.HIGH,
    requester=RenderRequester.UNKNOWN,
    send_time=None,
    response_metadata=None,
):
    ...
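
Because the contract carried only identifiers, the first thing the pipeline had to do for every request was turn those IDs back into full records. The snippet below is a hedged illustration of that pattern; the in-memory dictionaries stand in for the real central datastores and models.

# Illustrative only: these dicts stand in for the shared Message/Recipient DBs.
MESSAGES = {7: {"subject": "Welcome!", "template": "Hi {first_name}"}}
RECIPIENTS = {42: {"first_name": "Ada", "email": "ada@example.com"}}


def handle_render_request(company_id, message_id, recipient_ids):
    # Every render request began with reads that, in production, hit the
    # central datastores owned by other teams.
    message = MESSAGES[message_id]
    recipients = [RECIPIENTS[rid] for rid in recipient_ids]
    return message, recipients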

The code within the Artifact Pipeline has been developed and iterated upon by many different engineers over time. Lots of edge case handling was baked in, and many bespoke features were introduced into the rendering logic. In addition, the amount of traffic that the service needed to support grew substantially year over year, and over time the original service contract started presenting challenges:

  • Our service needed to query for Message and Recipient rows for every render request, which put a lot of load on these central datastores. We were able to mitigate the load by querying against read replicas instead, but in doing so we needed to maintain a lot of custom logic to detect and retry on replica lag (sketched after this list), since data freshness was extremely important to our service. This also limited the effectiveness of using something like a centralized cache to absorb reads.
  • We were tightly coupled to our client teams’ data models. New feature ideas were often shelved because any update to the Message or Recipient schema required significant changes across all stages of the Sending Pipeline, including rendering. There was also a lot of conditional logic spread across the code that branched for every specific subclass of Message. This approach wasn’t scalable, given that we knew we needed to support even more types of Messages in the future.
  • It was very challenging to test just the Artifact Pipeline system in isolation. Because our system was so heavily integrated into the lifecycle of Messages and Recipients, we were required to exercise the entire Sending Pipeline in order to perform test renders.
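
The replica-lag handling called out in the first bullet looked roughly like the following. This is a simplified sketch under assumptions about how freshness was checked (comparing an updated_at timestamp the caller already knows), not our exact implementation.

import time


class ReplicaLagError(Exception):
    pass


def read_with_freshness_check(fetch_from_replica, min_updated_at, max_attempts=3):
    """Retry a replica read until the row is at least as fresh as expected.

    fetch_from_replica: callable that returns a row with an updated_at field.
    min_updated_at: a timestamp the caller knows the row must have reached.
    """
    for attempt in range(max_attempts):
        row = fetch_from_replica()
        if row is not None and row.updated_at >= min_updated_at:
            return row
        # The replica has not caught up yet; back off and try again.
        time.sleep(0.05 * 2 ** attempt)
    raise ReplicaLagError("replica still stale after retries")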

In the pillar-wide domain decomposition exercise, my team devised a plan to address these pain points. We would need to update our code to interact with immutable Data Transfer Objects (DTOs) instead of directly with the Message and Recipient DB models. To achieve this, we needed to adopt a new render request interface. This interface would include all Message/Recipient data attributes that were accessed by the rendering code, in place of just the entity IDs. We were able to prioritize this work relatively early, compared to the rest of the organization, because it aligned so closely with addressing operational pain points we had been planning to fix.

Research and Interface Definition

We dove into this project with a wish list. We knew we wanted the rendering interface to accept a serialized version of the Message and Recipient models. But we also needed to be mindful of payload sizes and abide by the constraints of our messaging system, so we ultimately decided to include only the minimum set of data attributes required by the internal rendering code. In addition, we wanted the rendering interface to be more practically organized, rather than tied 1:1 to the structure of these DB models, which had become something of “God objects” in the codebase.

First, we needed to pick out the essential attributes from Message and Recipient to include in our new render interface. For instance, the Artifact Pipeline needs to read message.subject to render email subject lines, but a different attribute like message.is_ignoring_throttling, which controls Klaviyo’s Smart Sending feature, is not relevant for rendering. A naive solution would be to use IDE code inspection tools to search for all accessors of Message and Recipient in the rendering package, but this has limited effectiveness because of the monolithic nature of the codebase. Code in package A could call package B, which could call package C, and a global search for usages of message.foo would make it difficult to isolate only the usages linked to the Artifact Pipeline.

The approach we came up with was to exercise the rendering code and log whenever it performed an attribute access on a Message or Recipient. We used a clever trick built on Python’s __getattribute__ to log the name of every accessed attribute along with a pointer to the calling function. In a branch, we added this method to the models we wanted to audit:

class BaseMessage(BaseModel):
    def __getattribute__(self, attr):
        """
        Log all attr access with the calling function name while the __DEBUG flag is turned on.
        Additionally keep a running collection of unique attr access under the __LOGSET namespace.
        """
        if attr not in ["__DEBUG", "__CONTEXT", "__LOGSET"] and getattr(
            self, "__DEBUG", None
        ):
            # get_caller() returns the identity of the calling function one frame up
            caller = get_caller()
            context = getattr(self, "__CONTEXT", "Unknown")
            logger.info(f",MESSAGE MODEL ACCESS,{attr},{caller},{context}")

            # Keep tabs on all *unique* pairs of (attr, caller)
            logset = getattr(self, "__LOGSET", set())
            if not logset:
                setattr(self, "__LOGSET", logset)
            logset.add(f"{attr}:{caller}")
        return super().__getattribute__(attr)

We then ran a script that performs a render in a controlled environment, and turned on the debug logging around the pieces of rendering logic that we were interested in:

def render_for_email_message(self):
    ## START DEBUG LOGGING
    setattr(message, "__DEBUG", True)
    setattr(message, "__CONTEXT", "create_resolver_overrides")
    resolver_overrides = self.create_resolver_overrides(message, customer)

    setattr(
        message,
        "__CONTEXT",
        "SyncRenderService.generate_assets_for_message_with_recipients",
    )
    assets = SyncRenderService().generate_assets_for_message_with_recipients(
        company, message, [recipient], resolver_overrides=resolver_overrides
    )

    setattr(message, "__DEBUG", False)
    ## END DEBUG LOGGING
    logger.info(f"UNIQUE LOGSET DUMP: {list(getattr(message, '__LOGSET'))}")

From the logging output we were able to see exactly which model attributes were being accessed and from where. Many of the attribute names and callers were ones we expected, but quite a few loglines were surprising. Either way, we now had a solid minimum set of data attributes to ingest in a render request.
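
As an example of how we consumed this output, a few lines of Python were enough to aggregate the log into the unique attribute/caller pairs we cared about. This is a hedged sketch that assumes we are operating on the raw, comma-delimited message text emitted by the hook above.

from collections import defaultdict


def summarize_access_log(log_lines):
    """Group logged attribute accesses by attribute name.

    Assumes lines shaped like ",MESSAGE MODEL ACCESS,<attr>,<caller>,<context>"
    as emitted by the __getattribute__ hook.
    """
    callers_by_attr = defaultdict(set)
    for line in log_lines:
        if "MESSAGE MODEL ACCESS" not in line:
            continue
        _, _, attr, caller, _context = line.split(",", 4)
        callers_by_attr[attr].add(caller)
    return callers_by_attr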

We came up with a new scheme for render request definition and RFC’d our findings. Our new render request payload looked like:

@attr.define(frozen=True)
class ScheduleRenderRequestDTO:
    company_id: str
    # Client ID used to identify requests
    client_id: RenderClientId
    message_id: str
    response_metadata_dict: dict
    send_time: datetime
    channel_settings: ChannelRenderSettingsType
    # each render recipient will have a CustomerProfile ID
    render_recipients: list[RenderRecipientDTO]
    # tracking_options contains things like UTM param settings, and open/click tracking config
    tracking_options: TrackingOptions

The most notable changes from our previous implementation were:

  • Attributes from Message were split between channel_settings and tracking_options, depending on their purpose.
  • render_recipients contains recipient IDs and their corresponding customer_profile identifiers, eliminating the need for what used to be our most voluminous DB query.
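
The render_recipients entries are their own small DTO. Its exact fields are not shown here, so the shape below is an assumption based on the description above: a recipient ID paired with its CustomerProfile ID.

import attr


@attr.define(frozen=True)
class RenderRecipientDTO:
    # Field names are assumed for illustration. Carrying customer_profile_id
    # in the request is what removes the old Recipient lookup.
    recipient_id: str
    customer_profile_id: str


render_recipients = [
    RenderRecipientDTO(recipient_id="r-1", customer_profile_id="cp-1"),
    RenderRecipientDTO(recipient_id="r-2", customer_profile_id="cp-2"),
]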

Transitioning to the new interface

We initially updated all client code to use the new render request interface, but we kept all the internals of our system the same (still reading Messages and Recipients from the DB). We slowly chipped away at updating our internal code by taking advantage of duck typing to create our own dataclass replacements for the DB models. For example, we created a RenderMessageDTO that behaved just like a BaseMessage to all the rendering code. Instead of the attributes being DB columns, they were simply accessors into the data within our new render request object.

@attr.define(frozen=True)
class EmailRenderMessageDTO(RenderMessageDTO[EmailChannelRenderSettings]):
    """
    Our team's duck-typed representation of an email message
    """

    @property
    def name(self) -> str:
        return self.channel_settings.context.message_name

    @property
    def from_email(self) -> str:
        return self.channel_settings.context.from_email

    @property
    def from_label(self) -> str:
        return self.channel_settings.context.from_label

    @property
    def reply_to_email(self) -> str:
        return self.channel_settings.context.reply_to_email

    ...

We plumbed these DTOs through the internals of rendering — functions that used to only accept a BaseMessage would now accept a Union[BaseMessage, RenderMessageDTO], for instance. We controlled which type of message (DB model or DTO) was passed through our code using a rollout flag which we could control on a percentage basis:

def get_message_from_interface(self) -> BaseMessage | RenderMessageDTO:
    if self.use_render_message_dto:
        message = self.request.render_message
    else:
        message = self.interface.get_message().model
    return message
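
The exact flagging mechanism matters less than the property it gives you: a deterministic, percentage-based split so a given company stays on one code path. The sketch below is an assumption about how the use_render_message_dto flag could be backed, not Klaviyo’s actual feature-flag system.

import zlib


def should_use_render_message_dto(company_id: str, rollout_percentage: int) -> bool:
    # Hash the company ID into one of 100 buckets; companies below the
    # rollout percentage take the new DTO path, everyone else stays on the
    # DB-model path. Deterministic, so a company never flip-flops mid-rollout.
    bucket = zlib.crc32(str(company_id).encode()) % 100
    return bucket < rollout_percentage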

Testing and release

As we started to gradually cut traffic over to RenderMessageDTO, project progress slowed down. We ran into a steady stream of edge-case bugs that stifled our cutover. Because we had introduced strong typing in our model-to-DTO converters, we were bitten by all sorts of combinations of input datatypes that we didn’t realize were possible in the previous, dynamically typed interface. We didn’t have a good way to detect these kinds of edge cases before rolling out the new code and introducing customer-impacting bugs.
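
To give a flavor of these failures: the new converters validate types at construction time, so inputs the old dynamically typed path quietly tolerated now raise immediately. A simplified, hypothetical example using attrs validators:

from datetime import datetime

import attr
from attr.validators import instance_of


@attr.define(frozen=True)
class StrictSendTime:
    # Hypothetical field for illustration: a send_time that arrives as a
    # string or epoch int (which the old code path happily passed along)
    # now fails loudly when the DTO is built.
    send_time: datetime = attr.field(validator=instance_of(datetime))


try:
    StrictSendTime(send_time="2024-03-18")
except TypeError as exc:
    print(f"caught unexpected input type: {exc}")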

To solve this, our team created a “canary” render queue. The idea was to copy a configurable slice of production traffic into a parallel processing lane where we could exercise a feature branch. We would save the canary render results alongside the real production renders, then run a verification script to compare the two and alert us on any diffs.

A component diagram of the canary render system
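
The verification step is conceptually simple: join each canary result to its production counterpart and flag any mismatch. A stripped-down sketch of that comparison follows; the keying and shape of the artifacts shown here are assumptions, since the real system reads both sets from stored render results.

def verify_canary_renders(production_artifacts: dict, canary_artifacts: dict) -> list:
    """Compare canary render results against their production counterparts.

    Both inputs are assumed to map a (message_id, recipient_id) key to the
    rendered content for that recipient; any mismatch is returned as a diff.
    """
    diffs = []
    for key, canary_content in canary_artifacts.items():
        prod_content = production_artifacts.get(key)
        if prod_content != canary_content:
            diffs.append((key, prod_content, canary_content))
    # In the real system, a non-empty diff list triggers an alert to the team.
    return diffs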

Implementing the canary system boosted our confidence in new changes, especially when we were modifying bits of core rendering code that hadn’t been touched in years. We finally had an efficient and reliable way to test feature branches before merging them. Not long after, we pushed the project over the finish line: the new render interface was fully rolled out, and all Message and Recipient reads were eliminated from the Artifact Pipeline.

Conclusion

After completing this project, the most immediate benefit we gained was the ability to exercise the Artifact Pipeline in isolation. We previously needed to create many actual Message and Recipient records in the database in order to perform test renders, but now we are able to generate a large amount of synthetic render traffic with a single command. This also reduced the effort needed to load test the Artifact Pipeline by roughly 75%. Instead of needing to manually scale and monitor infrastructure across several teams to run load tests, we now only need to manage our own team’s infrastructure.

We also gained the benefit of better isolation from our client teams’ databases. The internals of rendering are free to change without affecting the structure of Message and Recipient, and likewise our client teams can now evolve those models without impacting the Artifact Pipeline. We reduced load on the Message and Recipient datastores, since in many cases these models were already available upstream at the point of requesting renders. And our code became more maintainable once we removed all the complex race condition protections around distributed DB reads.

The biggest takeaway for our team from this project is that test coverage is necessary but not sufficient. Even though we have solid 92%+ test coverage across the Artifact Pipeline, we still ran into a myriad of edge-case bugs when rolling out the new render DTOs. Ultimately, there was no real substitute for production traffic itself; the canary renders enabled us to test against production and detect bugs without impacting customers.

Looking ahead, our team aims to continue making ease-of-use improvements to the canary render system, which has been a huge benefit to our testing and release process well beyond this project. We also plan to reuse the strategy of executing a targeted slice of code with verbose logging on future projects where the technical requirements need to be derived from the current state of the code.

As an engineering organization there is still much more to come from the domain decomposition initiative at Klaviyo, but our team is off to a strong start. We’ve already unlocked tremendous operational efficiency and maintainability gains from going through this exercise for our primary system interface. We expect to see many other teams at the company follow suit in the coming months, and we are excited to see what they accomplish.
