Intro to Code Profiling

Harsha Reddy
Klaviyo Engineering
11 min read · Apr 1, 2024


Code profiling is an essential skill in an engineer’s toolbox. In this post, intended for engineers new to optimizing code, I describe the tools and techniques we frequently use here at Klaviyo.

Good system performance contributes to better user experience, increased user retention, resource efficiency, and lower bounce rates.

At the core of improving system performance is the ability to accurately identify and measure the problem. The best practice is to use observability frameworks to achieve this, including production time series metrics, OpenTelemetry tracing, and structured logging. These tools enable engineers to formulate hypotheses and pinpoint inefficiencies within the system. Profiling tools and query instrumentation then become invaluable in identifying the root cause of performance issues. Approaching optimization logically allows for quick identification of problem areas. Lack of discipline can lead to wasted time on unnecessary code modifications that fail to enhance the customer experience.

Let’s dive into the preferred tools and methodologies for code optimization, and how you can take a structured and measured approach to diagnosing and resolving performance bottlenecks.

OpenTelemetry

OpenTelemetry is a broad standard to improve observability in systems. OpenTelemetry frameworks provide tools and APIs designed to collect, process, and export telemetry data like metrics, logs, and traces from software applications. It aims to provide a standardized way to gather insights into software performance and behavior, helping developers and operators better understand and optimize their systems.

OpenTelemetry is an observability framework and toolkit designed to create and manage telemetry data such as traces, metrics, and logs. Crucially, OpenTelemetry is vendor- and tool-agnostic.

OpenTelemetry is an excellent precursor to code profiling: it reduces the need for an engineer to guess about performance. Use it to generate hypotheses before diving into profiling code.

At Klaviyo, we instrument all the Django view endpoints with OpenTelemetry. We also provide Python decorators for methods to record spans for them in OpenTelemetry traces.

@start_span("_validate")  # OpenTelemetry span decorator
def _validate(self, thing_to_validate: str) -> bool:
    ...
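
The start_span decorator shown above is a Klaviyo-internal helper. As a rough sketch of how such a decorator can be built on the standard opentelemetry-api package (our internal implementation differs in the details), it might look like this:

import functools

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def start_span(span_name: str):
    """Record each call to the wrapped function as a span (sketch only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # start_as_current_span attaches the span to the active trace
            # and ends it automatically when the block exits.
            with tracer.start_as_current_span(span_name):
                return func(*args, **kwargs)
        return wrapper
    return decorator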

The data collected:

  • Traces: In OpenTelemetry, a trace represents a single logical operation, like a transaction or a request. It’s composed of multiple spans that together provide a detailed understanding of the operation. The operation can span multiple software components (e.g. code, database) and services (internal and external). Although traces do resemble the output of code profiling, there are important differences, which I’ll get into below.
  • Metrics: Quantitative measurements, such as response times or error rates. This can be fed into a tool like Grafana for further analysis and creating dashboards.
  • Logs: Structured logs are tied to each span (the leaf nodes in a trace), providing insight into the runtime behavior and status of the request.

OpenTelemetry provides real performance data across systems for millions of user requests. If you see major performance issues in these traces, you can be confident that fixing them will improve the user experience. You can capture the before and after state of the operation to confirm your fix. (And bonus: these before and after plots make great visuals.)

Given the sheer amount of data being produced, OpenTelemetry instrumentation is not intended to capture every single user request or every single line of code, because of the performance implications and infrastructure costs.

Here’s a screen capture of an OpenTelemetry trace from a single request to a service at Klaviyo. A service, app-flows (in red), executes a task to perform batch operations. The task starts by running a number of queries against a Redis data source and then makes a couple of requests to the app-event-query service (in blue) to gather the data needed for completion.

Example of an OpenTelemetry trace, viewed in Grafana

Here’s an example of a Grafana dashboard showing metrics captured by our OpenTelemetry framework.

Example of OpenTelemetry performance metrics

Database query performance instrumentation

All applications utilize some data backend like a database, cache, or files. It’s essential to have a good understanding of how to analyze the performance of your data sources and the impact your app has on them.

At Klaviyo we like Django! Django’s ORM system provides a Pythonic way to run SQL using models. This abstraction sometimes leads to non-performant SQL queries sneaking their way into production.

We forked Django Query Inspector internally to help engineers quickly identify SQL queries that could result in performance regressions.

This tooling helps them:

  • Identify slow SQL queries during the development phase
  • Identify N+1 problems with SQL queries (a common example of this pattern is sketched after this list)
  • Identify any repeated SQL queries across different modules
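
To make the N+1 bullet concrete, here is a minimal sketch in the Django ORM using hypothetical Order and Customer models: the first loop issues one query for the orders plus one query per order, while select_related collapses it into a single JOIN.

from django.db import models

class Customer(models.Model):
    name = models.CharField(max_length=255)

class Order(models.Model):
    customer = models.ForeignKey(Customer, on_delete=models.CASCADE)

# N+1 pattern: one query for the orders, plus one query per order
# to fetch its related customer.
for order in Order.objects.all():
    print(order.customer.name)

# Fix: select_related fetches the customer alongside the order in one query.
for order in Order.objects.select_related("customer").all():
    print(order.customer.name)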

The internal fork adds a few more features on top that are helpful for engineers at Klaviyo:

  • DB replica support, with total SQL time per replica as well as global SQL time for a Django request
  • Console colors (if the terminal supports them)
  • Optimizations to the module when capturing hundreds of SQL statements
  • Scrubbing of sensitive data before it is rendered to the console

It is enabled in Django’s middleware settings with a single line of code:

MIDDLEWARE = (
    "app.tracing.middleware.TracingMiddleware",
    ...
    "middleware.DBQueryPerfMiddleware",
    ...
)

Sample output (with some redaction) from the tool is shown below.
The first line is a summary of the SQL queries for the request: 5 queries totaling 37 ms of SQL time, with 1 duplicate. The detailed output shows the offending query, which was repeated 3 times and could be optimized further.

[SQL] 5 queries (1 duplicates), 37 ms SQL time
- - -

Repeated query (3x): [DB] CoreDB: [SQL] SELECT `django_session`.`session_key`,
`django_session`.`session_data`, `django_session`.`expire_date`
FROM `django_session`
WHERE (`django_session`.`expire_date` > '?'
AND `django_session`.`session_key` = ?)
LIMIT 20

Repeated query (1x): [DB] CoreDB: [SQL] SELECT `auth_user`.`id`,
`auth_user`.`password`, `auth_user`.`last_login`, `auth_user`.`is_superuser`,
`auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`,
`auth_user`.`email`, `auth_user`.`is_staff`,`auth_user`.`is_active`,
`auth_user`.`date_joined`
FROM `auth_user`
WHERE `auth_user`.`id` = ?
LIMIT 20

Repeated query (1x): [DB] AuditDB: [SQL] INSERT INTO `audit`
(`created`, `auth_id`, `view_name`, `view_resource_kwarg`, `request_method`,
`client_ip`, `url`, `role`)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)

Profiling tools

Now we get into code profiling tools. These are essential instruments for developers, offering deeper insight into performance by identifying bottlenecks, resource usage, and execution hotspots. Armed with your hypotheses, you can rely on them to dig deep.

Here’s sample output from a code profiler:

A flamechart style visualization of a code profile
Source: Snakeviz

Before we showcase a profiling tool, there is one item to be wary of: the choice of profiler. There are two kinds — statistical and deterministic.

Statistical profilers capture the state of your application at a preset sampling interval (e.g. every 1 ms). At Klaviyo, we prefer statistical profilers for most optimization work. They have a few advantages:

  1. They vastly reduce the amount of noise in the traces, since they only capture the state of the program at the sampling interval.
  2. They do not massively impact the program’s performance. Statistical profilers like py-spy cause far less slowdown during a profiling session than deterministic options like pprofile. (See chart below.)
  3. We rarely need extreme precision when assessing the performance of a program.
Source: PyCon US 2021 Scalene talk

On the flip side, deterministic profilers systematically record every function call, the time spent in each function, and other events that occur during the execution of a program. This is incredibly useful when you need to understand the specific code paths that resulted in your system’s poor performance. Just don’t expect wall-clock accuracy from a deterministic profiler: the overhead of recording every call distorts the absolute timings.
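
For a concrete point of reference, Python’s standard library ships a deterministic profiler, cProfile (the pprofile mentioned above is a third-party alternative). A minimal sketch of deterministically profiling a stand-in function:

import cProfile
import pstats

def slow_function():
    # Stand-in for the code path under investigation.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# Every function call was recorded; sort by cumulative time and show the top 10.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)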

As mentioned earlier, for most scenarios at Klaviyo, we prefer statistical (sampling) profilers: py-spy and pyinstrument.

Py-spy is great for attaching to running processes. For example, it can be attached to a runaway process to quickly figure out what’s going on. Attaching py-spy to a Django process goes as follows:

# Find the process ID
ps aux | grep django

# Monitor
sudo /home/$USER/.local/bin/py-spy top --pid {pid}

# Bonus: you can record a Speedscope visualization
sudo /home/$USER/.local/bin/py-spy record --pid {pid} \
    -f speedscope -o profile.json
Source: py-spy README

We also make pyinstrument available, which has excellent integration with the Django framework. Enabling it is as straightforward as adding the following line to the middleware stack:

MIDDLEWARE = (
    "app.tracing.middleware.TracingMiddleware",
    "middleware.DBQueryPerfMiddleware",
    ...
    "pyinstrument.middleware.ProfilerMiddleware",
    ...
)

Appending the ?profile query param to a web request generates a code profile on the running Django server.

Source: pyinstrument documentation
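
Outside of the Django middleware, pyinstrument can also be driven programmatically, which is handy when you want to profile just one block of code. A minimal sketch based on the pyinstrument API:

from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# The code under investigation; this sum is just a stand-in.
total = sum(i * i for i in range(1_000_000))

profiler.stop()

# Print a call tree showing the time spent in each frame.
print(profiler.output_text(unicode=True, color=True))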

The ability to profile in any environment, from local to production (with safeguards), is incredibly clutch. At Klaviyo we pride ourselves on making code profiling available everywhere: local development, Docker images, and development and production environments. We do this to improve our engineers’ confidence in optimizing and triaging issues.

Things to ask yourself before optimizing the code

At this stage, let’s assume you have identified something slow in your codebase, or someone asked you to improve the performance of a component. Do a few sanity checks before you proceed:

Will it matter?

  • Make sure it has customer impact. (There’s an art to dissecting the customer feedback to get at the root of what’s being reported. Sometimes speed is the culprit. Sometimes it isn’t.)
  • If you have a long list of items that can be made faster, start with the ones hit more often by more users.
  • Seek customer feedback when major optimizations are rolled out. (Read about perception management for some examples.)

Is your code improvement resulting in performance (or correctness) regressions elsewhere?

  • Use OpenTelemetry to get a cohesive view of all systems upstream and downstream of your service/code after the optimization rolls out.
  • Avoid adding to tech debt with substandard changes. These solve the problem short term but increase maintenance headaches for your team in the future.
  • Ensure you add good functional and non-functional tests in your system to identify regressions.

Are you staying focused and avoiding yak shaving?

  • Don’t get drawn into improving unrelated aspects of your system. Stay focused on the problem at hand.
  • Limit yourself to your domain; delegate any external changes to the responsible teams.

Are you collaborating with your management?

  • Keep visibility high so that everyone is aware of the outcomes.
  • Any deviation from expected outcomes must be communicated clearly to assist with customer outreach.

Are you able to measure performance before and after the optimization?

  • Add a counter, timer, or gauge metric (e.g. StatsD, InfluxDB) around the non-performant code to produce mean, median, 95th, and 99th percentile measurements. (A minimal sketch follows below.)
  • Perform as many code profiles as needed to reproduce the performance numbers before and after the improvement. We recommend capturing and comparing mean, median, 95th, and 99th percentiles here too. And again, remember that the CPU and I/O on your local machine are different from production systems.
  • If you cannot show clear numbers from the above, consider it a risk to roll the improvement out to your customers.
Funny but serious rules from Programming Pearls
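
As a minimal sketch of the first bullet above, assuming the statsd Python client and placeholder host, port, metric name, and function (none of which come from our codebase), wrapping the suspect code in a timer might look like this:

import time

import statsd

# Placeholder client pointed at a local StatsD agent.
client = statsd.StatsClient("localhost", 8125)

def slow_lookup():
    time.sleep(0.05)  # stand-in for the non-performant code path

# Each call records a timing sample; the metrics backend can then
# aggregate mean, median, 95th, and 99th percentiles across traffic.
with client.timer("accounts.slow_lookup"):
    slow_lookup()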

Code profiling does not measure accurate performance

Never rely on code profiling to produce accurate performance measurements. Tools like py-spy can get close, but you should rely on other instrumentation for accuracy rather than on code profiling.

In production, we recommend using OpenTelemetry trace metrics, StatsD, InfluxDB, or any similar time series ecosystem to record counters, timers, and gauges. These record such data with much lower overhead.

For local code profiling sessions, if you want to avoid overhead, we recommend starting with a statistical profiler (e.g. py-spy) to find the unoptimized code. Then wrap that code with a simpler module like timeit to capture accurate measurements across thousands of iterations.

$ python -m timeit "'-'.join(str(n) for n in range(100))"
10000 loops, best of 5: 30.2 usec per loop

$ python -m timeit "'-'.join([str(n) for n in range(100)])"
10000 loops, best of 5: 27.5 usec per loop

$ python -m timeit "'-'.join(map(str, range(100)))"
10000 loops, best of 5: 23.2 usec per loop

Fix the identified performance issue

Profiling is very effective for deep diving into specific targets and testing your hypotheses. Delving into the finer nuances of the code lets us pinpoint the specific segments that require optimization, and this granular approach is key to efficiently targeting and resolving performance issues. Only code profiling gives you this level of detail.

The next challenge is to find out why the code in question is slow.

Code profiling will, in most cases, not show you the exact changes to make to optimize.

For example, running a code profile might produce a trace in which one of the SQL queries stands out as slow. You can eyeball the trace to find the offending query, but the profile does not tell you how to optimize that SQL query.

In such a case, with SQL, you can use tools like SQL EXPLAIN to understand how the database executes the query.
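
If you are working in Django, you can get at EXPLAIN without leaving the ORM: QuerySet.explain() asks the database for its execution plan. A minimal sketch, using a hypothetical lookup on Django’s built-in User model (the filter value is a placeholder):

from django.contrib.auth.models import User

# Returns the database's execution plan as text; a full table scan here
# is a strong hint that the filtered column is not indexed.
plan = User.objects.filter(email="someone@example.com").explain()
print(plan)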

As you run more code profiles, you gain experience choosing and working with various domain-specific tools that make it easier to answer the “Why is the code slow?” question.

🎉 Case study: Optimizing SSO logins

I want to share an example of using the tools and techniques described above to identify the cause of slowness and implement a simple fix that resulted in a big improvement in user experience.

Our customers shared that a subset of user logins was much slower when they used single sign-on (SSO) vs traditional username and password. Our metrics (in OpenTelemetry and StatsD) confirmed this.

Next, we generated a hypothesis as to why SSO was slower. We spun up the pyinstrument profiler and saw a significant slowdown in the user lookup method for SSO logins, with most of the extra time spent in SQL queries. We locked in on this hypothesis.

The traces below are simplified for demonstration purposes.

SSO login trace (simplified)

Compare the above to the traditional non-SSO login.

Non-SSO login trace (simplified)

The user lookup from the code profile, among other things, ran a select query on the User table. To find the query being run, we spun up DB QueryInspector and replayed the requests. This confirmed that the slowness was due to a specific SQL query. We were getting closer.

We then ran a SQL EXPLAIN on the query and found that the Email field used to look up the user was not indexed on the table, whereas the Username field was indexed and unique.

A simple one-line change to use the indexed Username column instead of Email resulted in a massive performance boost.

# Removed (simplified code)
- return User.objects.get(email=email)

# Added (simplified code)
+ return User.objects.get(username=convert_email_to_username(email))

Subsequently, we used similar profiling techniques to get the non-SSO login to match SSO for performance with further code and database improvements.
