A Deep Dive into High Performance HTTP Requests for Python Engineers

Charlie Steele
Klaviyo Engineering
16 min read · Apr 23, 2024


This post was co-authored by Collin Crowell and Charlie Steele.

At Klaviyo, we make a lot of HTTP requests, especially from our pre-built integration code (billions per day). However, the inherent inefficiencies of I/O bound operations like these can create bottlenecks within our application if not managed carefully. By leveraging Python’s concurrency models, we’ve been able to significantly improve the performance of our I/O bound code. In this blog post, we equip you with knowledge and techniques to improve the speed and efficiency of your HTTP request code. We’ll examine different concepts, look at results from benchmark tests, and draw conclusions about Python’s capabilities. But before we dive into the details, let’s brush up on the basics!

Background

In computing, Input/Output (I/O) refers to the exchange of data between computers and external entities, such as users or other computers. I/O enables computers to both receive input from external sources and deliver output, facilitating communication. Network I/O refers to the transfer of data between systems over a computer network. Systems communicate using network protocols, which are standardized sets of rules that dictate how devices communicate and exchange data. In this blog post, we’ll focus on the HTTP protocol, which is commonly used to transfer data over the web. HTTP, or Hypertext Transfer Protocol, is an application layer protocol within the TCP/IP family. Exploring the details of network protocols and network architecture is beyond the scope of this post, so we’ll stop here.

Now that we understand the basics of network I/O and HTTP, let’s dive into the practical aspects of making HTTP requests in Python. We’ll start at the lowest level with the http Python package. This built-in package contains different low-level modules for working with HTTP, such as an HTTP protocol client and basic HTTP server classes. At Klaviyo, we use the http modules for things like HTTP status code constants and handling common network exceptions. However, unless you’re operating in the trenches of low-level development, you might not find yourself reaching for this library frequently. In fact, the official Python documentation tells readers to use the urllib.request module for high-level URL opening.
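
For a feel of what this level looks like, here’s a minimal sketch that opens a connection by hand with http.client and checks the response against an http.HTTPStatus constant (example.com is just a stand-in host):

import http.client
from http import HTTPStatus

# Open a connection and issue a request at the protocol level.
conn = http.client.HTTPSConnection("example.com", timeout=10)
conn.request("GET", "/")
resp = conn.getresponse()

# http.HTTPStatus provides the status code constants mentioned above.
if resp.status == HTTPStatus.OK:
    print(f"{len(resp.read())} bytes received")

conn.close()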

Moving up the abstraction ladder, we encounter the urllib package, which contains several modules for working with URLs and offers a more user-friendly layer on top of the http modules. This package enriches your dev toolkit with functionality like authentication, redirects, cookies, and more. While urllib is an improvement in terms of abstraction, the official Python docs recommend a different package for those seeking a higher-level HTTP client interface.
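
Here’s a comparable sketch at the urllib level; note how much of the request we still assemble ourselves (the httpbin.org URL is only a placeholder test endpoint):

import json
import urllib.request

# Build a POST request by hand; urllib handles headers, redirects,
# and error translation on top of the http modules.
req = urllib.request.Request(
    "https://httpbin.org/post",
    data=json.dumps({"hello": "world"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req, timeout=10) as resp:
    print(resp.status, json.loads(resp.read()))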

Enter the Requests package: the go-to solution for many Python developers when it comes to making HTTP requests. It acts as a powerful abstraction layer atop urllib, offering a human-readable interface that expedites the process of building and executing HTTP requests in Python. Requests is one of the most downloaded Python packages today, pulling in around 80M downloads per week. According to GitHub, Requests is currently depended upon by 2.7M+ repositories. For most use cases, the Requests package is the preferred choice due to its simplicity and ease of use.
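
With Requests, the same POST collapses to a couple of lines (again, httpbin.org is only a placeholder endpoint):

import requests

# requests serializes the JSON body, sets the headers, and manages
# the connection for us.
resp = requests.post("https://httpbin.org/post", json={"hello": "world"}, timeout=10)
resp.raise_for_status()
print(resp.status_code, resp.json())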

However, as we get deeper into our exploration, we’ll see that while the Requests library provides an excellent starting point, it’s not the most efficient or performant option available. Hold onto that thought; we’ll talk more about optimization as we progress through the upcoming sections.

Abstraction is great, but how can we improve performance?

As many readers already know, network I/O is inherently slow. Developers frequently encounter the bottleneck of network operations, which can significantly slow down their applications. When Python code makes an HTTP request, it essentially enters a waiting game. Whether it’s waiting for a response or a timeout, our program sits idle during this period. To illustrate this concept, let’s consider a simple example. We can measure network latency through round-trip time (RTT). Imagine a scenario where the RTT for a request averages half a second. In this case, your synchronous Python code will not be able to handle more than two requests per second. In real-world applications, this prolonged inactivity could severely impact the efficiency of your program, especially when you need to make thousands of HTTP requests every second.

But fear not, for there are strategies to reduce this sluggishness and improve performance. One of the simplest yet most effective approaches is to make multiple HTTP requests concurrently. By doing so, we’re no longer waiting for each response to trickle in before making the next. Python equips us with a variety of techniques to achieve concurrency, including threading, multiprocessing, and asyncio. Let’s touch on each of these techniques:

Threading in Python allows your program to execute multiple threads concurrently, each handling different tasks. In the context of network requests, you can initiate multiple threads to make several requests simultaneously.

Multiprocessing in Python allows true parallelism by creating separate processes, each with its own Python interpreter and memory space. While creating processes incurs more overhead compared to threads, it can still significantly improve performance for network requests, especially on multi-core systems.

Asyncio is a library for writing asynchronous code using coroutines, added to the standard library in Python 3.4 (with the async/await syntax following in Python 3.5). Asyncio enables cooperative multitasking, where a single thread can switch between different tasks. Although asyncio requires a different programming paradigm compared to threading and multiprocessing, it offers high performance and scalability, particularly for I/O bound applications.

Each of these techniques has its advantages and trade-offs, and the choice depends on various factors such as the nature of your application, scalability requirements, and developer familiarity. Later in this post, we’ll delve deeper into each approach and explore their implementation and performance considerations to help you optimize network requests in your Python applications.

Parallelism vs Concurrency

Now let’s take a quick look at the nuances between concurrency and parallelism. These terms often get interchanged, but they carry distinct meanings that are crucial for high-performance networking. To put it simply, concurrency refers to the ability of a system to make progress on multiple tasks during overlapping time periods; the tasks may be interleaved rather than running at the same instant. Parallelism takes this a step further: tasks literally execute at the same time, for example on separate CPU cores.

In Python, the Global Interpreter Lock (GIL) plays a pivotal role in understanding the limitations of parallelism. Essentially, the GIL acts as a gatekeeper, allowing only one thread to access the Python interpreter at a time. This means that even on a multi-core architecture, Python can only execute a single thread at any given moment. Consequently, for CPU bound Python applications, attempting to leverage multithreading for parallelism won’t yield performance improvements. In such scenarios, multiprocessing emerges as the preferred approach for harnessing the full potential of your CPU.

However, when it comes to I/O bound applications, we do not need to turn to multiprocessing in order to make improvements. In Python, I/O operations release the GIL, enabling other threads to execute while waiting for the I/O operation to complete. This opens up the possibility of achieving parallelism in I/O bound code through multithreading. Unlike CPU bound tasks, where the bottleneck lies in computational processing power, I/O bound tasks often spend a significant portion of their time waiting for input or output operations to complete. Therefore, leveraging multithreading can effectively utilize idle CPU cycles during I/O operations, leading to noticeable performance enhancements. Understanding the distinction between CPU bound and I/O bound tasks is crucial for selecting the appropriate concurrency model and optimizing network requests effectively.
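
To make this concrete, here’s a toy demonstration, with time.sleep standing in for a blocking network call; like a socket read awaiting a response, sleep releases the GIL while it waits:

import time
from concurrent.futures import ThreadPoolExecutor

def blocking_io(i: int) -> int:
    # Like a socket read awaiting a response, time.sleep releases
    # the GIL, so other threads can run while this one waits.
    time.sleep(0.5)
    return i

start = time.time()
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(blocking_io, range(10)))
print(f"10 blocking calls took {time.time() - start:.2f}s")  # ~0.5s, not ~5s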

What about asyncio?

Asynchronous I/O (aka asyncio) is one of the most powerful tools at our disposal. It’s a library designed for writing concurrent code using the async/await syntax, and it’s particularly well-suited for handling I/O bound network code efficiently.

Let’s cover a couple of core concepts of asyncio before we take a look at the benchmark tests. First, coroutines. A traditional Python function executes sequentially and returns to its caller once, at the end. Coroutines differ in that they can yield control back to the caller at multiple points during their execution. This behavior was originally built on the yield statement, which you may already be familiar with from Python generators; in fact, asyncio’s coroutines were implemented on top of generators under the hood.

The second concept we’ll cover is the event loop. Simplistically, the event loop serves as an environment for executing coroutines within a single thread. This concept of an event loop isn’t unique to Python; many other programming languages also leverage this architecture for handling asynchronous operations efficiently.
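
To tie both concepts together before we move on, here’s a minimal sketch of two coroutines interleaving on a single event loop:

import asyncio

async def fetch(name: str, delay: float) -> str:
    # 'await' yields control back to the event loop, which can then
    # run other coroutines while this one waits.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # Both coroutines make progress on a single thread, so this
    # takes ~1 second rather than ~2.
    print(await asyncio.gather(fetch("a", 1.0), fetch("b", 1.0)))

asyncio.run(main())

Now, let’s run benchmarks and see how asyncio can significantly improve the performance of your network requests.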

Benchmark Tests

Throughout these benchmarks, we explored various concurrency models, including threading, multiprocessing, and asyncio, and we ran tests with both a synchronous and an asynchronous HTTP client.

We explored six different strategies:

  1. Synchronous HTTP client with a single thread
  2. Synchronous HTTP client with threading
  3. Synchronous HTTP client with multiprocessing
  4. Synchronous HTTP client with asyncio
  5. Asynchronous HTTP client with asyncio
  6. Asynchronous HTTP client with multiprocessing

For each strategy, we sent POST requests to an AWS API gateway equipped with a mock integration. All requests used HTTP/1.1, rather than HTTP/2 (which is not supported by the HTTP clients we tested). Our client code transmitted a 5 kilobyte payload in the message body of each request. All tests were conducted on the same test machine, featuring an eight-core Intel CPU. The test harness recorded wall clock time to measure performance.

Note: These test results are non-deterministic, and can be impacted by several different factors including the machine type, the applications/processes running on the machine, and variable response times due to networks being complex and shared.

Benchmark Test Harness

import click
import random
import string
import time
from constants import BENCHMARK_URL, TOTAL_REQUESTS, PAYLOAD_LENGTH
import seaborn as sns
import matplotlib.pyplot as plt
from benchmark_tests import *


PAYLOAD = {
    "data": ''.join(
        random.choices(string.ascii_uppercase + string.digits, k=PAYLOAD_LENGTH)
    )
}

benchmark_test_map = {
    "sync": BenchmarkSynchronous,
    "sync_threads": BenchmarkSynchronousThreads,
    "sync_multiprocess": BenchmarkSynchronousMultiprocess,
    "async_threads": BenchmarkAsynchronousThreads,
    "async_with_async_client": BenchmarkAsynchronousAsyncClient,
    "async_multiprocess_with_async_client": BenchmarkMultiprocessAsyncClient,
}


def generate_graph(names, values):
    ax = sns.barplot(x=names, y=values)
    ax.set(ylabel='seconds')
    bar_container = ax.containers[0]
    ax.bar_label(
        bar_container, fmt=lambda x: f"{round(x, 2)}s"
    )
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()


@click.command()
@click.option("--tests", "-t", required=True, type=click.Choice(list(benchmark_test_map.keys())), multiple=True)
@click.option("--num-requests", "-n", default=TOTAL_REQUESTS, type=int)
def run_benchmark_tests(tests, num_requests):
    results = {}
    for test_name in tests:
        test = benchmark_test_map[test_name]

        start_time = time.time()
        test.run(BENCHMARK_URL, num_requests, PAYLOAD)
        total_time = time.time() - start_time

        results[test_name] = total_time

    names = []
    values = []
    for k, v in results.items():
        names.append(k)
        values.append(v)
        click.echo(f"{k}: {round(v, 2)}s")

    # Generate a graph
    generate_graph(names, values)


if __name__ == "__main__":
    run_benchmark_tests()
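
The harness also imports a couple of supporting modules that aren’t shown in this post. Here’s a rough sketch of what they might contain; the URL and values are placeholder assumptions, not our actual test configuration:

# constants.py -- all values here are illustrative assumptions
BENCHMARK_URL = "https://example.execute-api.us-east-1.amazonaws.com/benchmark"  # hypothetical mock endpoint
TOTAL_REQUESTS = 100
PAYLOAD_LENGTH = 5 * 1024  # 5 KB message body, per the setup above

# benchmark_protocol.py -- the interface each benchmark class implements
from typing import Protocol


class SupportsBenchmarking(Protocol):
    @classmethod
    def run(cls, url: str, num_requests: int, payload: dict) -> None:
        ...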

Testing Each Strategy

Synchronous HTTP client with a single thread

The first test we ran used the synchronous requests HTTP client and ran the HTTP requests serially in a single thread.

Code

import requests

from .benchmark_protocol import SupportsBenchmarking


class BenchmarkSynchronous(SupportsBenchmarking):
    @classmethod
    def run(cls, url: str, num_requests: int, payload: dict):
        # Each request blocks until its response arrives, so the
        # requests run strictly one after another.
        for _ in range(num_requests):
            resp = requests.post(url, json=payload)
            if resp.status_code != 200:
                print(resp.content)
                print("ERROR: ", resp.status_code)

Results

We ran the command with 100 requests:

python benchmark_runner.py -t sync -n 100

This yielded the following chart:

So running 100 HTTP requests in serial using a synchronous HTTP client took around 14 seconds. That’s pretty slow.

Synchronous HTTP client with threading

The second test we ran used the synchronous requests HTTP client, but leveraged Python threads (via a concurrent.futures thread pool) to run the requests concurrently instead of serially.

Code

import concurrent.futures

import requests

from .benchmark_protocol import SupportsBenchmarking


class BenchmarkSynchronousThreads(SupportsBenchmarking):
    @staticmethod
    def _make_request(url: str, payload: dict):
        resp = requests.post(url, json=payload)
        if resp.status_code != 200:
            print(resp.content)
            print("ERROR: ", resp.status_code)

    @classmethod
    def run(cls, url: str, num_requests: int, payload: dict):
        # Up to 100 requests in flight at once, one per worker thread.
        with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
            futures = [executor.submit(cls._make_request, url, payload) for _ in range(num_requests)]
            concurrent.futures.wait(futures)

Results

Again, we ran the command with 100 requests:

python benchmark_runner.py -t sync_threads -n 100

This yielded the following results:

Adding threads to the mix sped things up dramatically. However, one thing to be aware of is that spinning up new threads comes with non-trivial overhead. This means that at some point, adding more threads will start to degrade overall performance. It’s important to spend time tuning your workloads to find the optimal number of threads for the number of requests you want to make.

Synchronous HTTP client with multiprocessing

The third test we ran used the synchronous requests HTTP client, and leveraged multiprocessing (via a concurrent.futures process pool) for concurrency instead of threading.

As shown in the code below, we use a process pool with 10 processes and submit each request to that pool. Each child process gets its own Python interpreter and its own GIL, and each process executes only a single request at a time.

Code

import concurrent.futures

import requests

from .benchmark_protocol import SupportsBenchmarking


class BenchmarkSynchronousMultiprocess(SupportsBenchmarking):
    @staticmethod
    def _make_request(url: str, payload: dict):
        resp = requests.post(url, json=payload)
        if resp.status_code != 200:
            print(resp.content)
            print("ERROR: ", resp.status_code)

    @classmethod
    def run(cls, url: str, num_requests: int, payload: dict):
        # 10 worker processes, each with its own interpreter and GIL.
        with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
            futures = [executor.submit(cls._make_request, url, payload) for _ in range(num_requests)]
            concurrent.futures.wait(futures)

Results

We ran the command with 1000 requests:

python benchmark_runner.py -t sync_multiprocess -n 1000

This yielded the following results:

As shown here, the threaded test outperforms the multiprocessing test. There are a couple reasons for this:

  1. Processes have a higher startup and memory overhead than threads. As part of our tuning process, we tried using a pool of 100 processes and this caused the test to take significantly longer.
  2. We can only make 10 (process pool size) API calls in parallel. With the threaded version, we were able to make 100 at the same time, all while using less memory.

Synchronous HTTP client with asyncio

For the fourth test, we used asyncio with the synchronous requests HTTP client. We create a coroutine for each request, and run each of these coroutines using asyncio.gather, which will return when all coroutines have finished.

Looking at our coroutine (_make_request), we need to wrap our calls to requests.post in a helper called sync_to_async. The reason is that running a synchronous network request directly would block the event loop, preventing us from taking advantage of the concurrency provided by asyncio. sync_to_async takes our synchronous function, wraps it in an awaitable, and runs it on a separate thread or thread pool. This means that when the code reaches the point where it makes the network request and gives up the GIL, we won’t block the event loop, and other requests can run at the same time.

Code

import asyncio

import requests

from asgiref.sync import sync_to_async
from concurrent.futures import ThreadPoolExecutor

from .benchmark_protocol import SupportsBenchmarking


thread_pool_executor = ThreadPoolExecutor(max_workers=100)


class BenchmarkAsynchronousThreads(SupportsBenchmarking):
    @staticmethod
    async def _make_request(url: str, payload: dict):
        # Run the blocking requests.post on the shared thread pool so
        # it doesn't block the event loop.
        resp = await sync_to_async(
            requests.post, thread_sensitive=False, executor=thread_pool_executor
        )(url, json=payload)

        if resp.status_code != 200:
            print(resp.content)
            print("ERROR: ", resp.status_code)

    @classmethod
    async def _run(cls, url: str, num_requests: int, payload: dict):
        tasks = [cls._make_request(url, payload) for _ in range(num_requests)]
        await asyncio.gather(*tasks)

    @classmethod
    def run(cls, url: str, num_requests: int, payload: dict):
        asyncio.run(cls._run(url, num_requests, payload))

Results

We ran the command with 1000 requests:

python benchmark_runner.py -t sync_threads -t async_threads -n 1000

This yielded the following results:

As shown, the asyncio version of this actually took longer than the threaded version. The reason for this is that the asyncio version has the additional overhead of the event loop, on top of spinning up new threads.

Asynchronous HTTP client with asyncio

For the fifth test, we used asyncio with the asynchronous aiohttp HTTP client. Again, we create a coroutine for each request, and run each of these coroutines using asyncio.gather, which will return when all coroutines have finished.

The main difference is that instead of running each request in a thread pool using sync_to_async, we leverage the asynchronous HTTP client to make the request without blocking the event loop. This allows us to make all of these network requests concurrently, without the overhead of spinning up new threads. For more information, check out the aiohttp documentation on the request lifecycle.

Code

import asyncio

import aiohttp

from .benchmark_protocol import SupportsBenchmarking


class BenchmarkAsynchronousAsyncClient(SupportsBenchmarking):
    @staticmethod
    async def _make_request(session, url: str, payload: dict):
        # aiohttp awaits the network I/O directly; no extra threads needed.
        async with session.post(url, json=payload) as resp:
            if resp.status != 200:
                print(await resp.text())
                print("ERROR: ", resp.status)

    @classmethod
    async def _run(cls, url: str, num_requests: int, payload: dict):
        # A single shared ClientSession provides connection pooling.
        async with aiohttp.ClientSession() as session:
            tasks = [cls._make_request(session, url, payload) for _ in range(num_requests)]
            await asyncio.gather(*tasks)

    @classmethod
    def run(cls, url: str, num_requests: int, payload: dict):
        asyncio.run(cls._run(url, num_requests, payload))

Results

We ran the command with 1000 requests:

python benchmark_runner.py -t sync_threads -t async_threads -t async_with_async_client -n 1000

This yielded the following results:

As shown, the fully asynchronous version outperforms both the synchronous threaded version, and the asynchronous threaded version.

Asynchronous HTTP client with multiprocessing

For the sixth test, we used asyncio with the asynchronous aiohttp HTTP client, but also added multiprocessing to the mix. We split the requests into six equal chunks (plus a seventh chunk for any remainder) and create a process pool with seven workers. Within each process, we run the assigned chunk of requests concurrently on an event loop; each process gets its own event loop.

Code

import asyncio
import concurrent.futures

import aiohttp

from .benchmark_protocol import SupportsBenchmarking


class BenchmarkMultiprocessAsyncClient(SupportsBenchmarking):
    @staticmethod
    async def _make_request(session, url: str, payload: dict):
        async with session.post(url, json=payload) as resp:
            if resp.status != 200:
                print(await resp.text())
                print("ERROR: ", resp.status)

    @classmethod
    async def _run(cls, url: str, num_requests: int, payload: dict):
        async with aiohttp.ClientSession() as session:
            tasks = [cls._make_request(session, url, payload) for _ in range(num_requests)]
            await asyncio.gather(*tasks)

    @classmethod
    def _run_job(cls, url, payload, num_requests):
        # Each child process runs its chunk on its own event loop.
        asyncio.run(cls._run(url, num_requests, payload))

    @classmethod
    def run(cls, url: str, num_requests: int, payload: dict):
        num_jobs = 6  # Number of callables (jobs) we want to submit to the ProcessPoolExecutor
        num_requests_per_job = num_requests // num_jobs
        remainder_requests = num_requests % num_jobs

        # num_jobs + 1 workers leaves room for the remainder job below.
        with concurrent.futures.ProcessPoolExecutor(max_workers=num_jobs + 1) as executor:
            futures = [executor.submit(cls._run_job, url, payload, num_requests_per_job) for _ in range(num_jobs)]

            if remainder_requests:
                futures.append(executor.submit(cls._run_job, url, payload, remainder_requests))

            concurrent.futures.wait(futures)

Results

We ran the command with 1000 requests:

python benchmark_runner.py -t async_with_async_client -t async_multiprocess_with_async_client -n 1000

This yielded the following results:

As shown, for 1000 requests the single-process version outperforms the multi-process version. This was expected: with only 1000 requests, the overhead of spinning up new processes dominates the overall latency. However, we ran this test because we had a hypothesis: as the number of requests increased, the overhead of spinning up new processes would become less significant, and the multi-process version would start to outperform the single-process version.

We ran the command with 30,000 requests:

python benchmark_runner.py -t async_with_async_client -t async_multiprocess_with_async_client -n 30000

This yielded the following results:

As expected, the multi-process version outperforms the single-process version. The reason is that even with an async HTTP library, there is still some code that blocks the event loop (although the overall latency is dominated by the network I/O). With the single-process version, we only have one event loop, so we’re more likely to run into a situation where two coroutines need to run this blocking code at the same time and only one is able to. With multiple processes, we have multiple event loops, so this happens less frequently because the asyncio tasks are distributed evenly across the processes. With 1000 requests, the overhead of spinning up new processes outweighs the overhead of tasks briefly blocking the single event loop. However, once we get to 10,000+ requests, the trade-off flips, and the multi-process, multi-event-loop version wins out in terms of performance.

Trade-offs

In the previous section, we dove into the results of our benchmark tests. We demonstrated the performance gains achieved through the use of asyncio with an async HTTP client, and how further improvement was achieved by incorporating multiple processes. Now, let’s discuss the trade-offs that come with these optimizations.

1. Complexity of Application Code

When transitioning from synchronous to concurrent code, complexity inevitably creeps in. Whether you opt for threading, multiprocessing, or asyncio, your application code will inherently become more intricate. Each concurrency model brings its own set of challenges and nuances, necessitating a deeper understanding of their respective APIs and behaviors.

Drawing from our own experience of deploying asyncio in production environments, we’ve learned the value of isolating complexity and employing abstractions. By encapsulating low-level asyncio APIs, we shield higher-level application logic from the intricacies of asynchronous programming. This separation of concerns not only enhances maintainability but also facilitates code comprehension and debugging.
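
As one illustration of that idea (a sketch, not our actual internal abstraction), a thin wrapper can keep asyncio and aiohttp details out of call sites entirely:

import asyncio

import aiohttp


class ConcurrentHTTPClient:
    """Wraps the event loop and session management so application
    code can stay synchronous and asyncio-free."""

    async def _fetch_all(self, urls: list[str]) -> list[int]:
        async with aiohttp.ClientSession() as session:
            async def fetch(url: str) -> int:
                async with session.get(url) as resp:
                    return resp.status

            return await asyncio.gather(*(fetch(url) for url in urls))

    def get_many(self, urls: list[str]) -> list[int]:
        # Callers never touch asyncio APIs directly.
        return asyncio.run(self._fetch_all(urls))


# Usage: a plain synchronous call site.
# statuses = ConcurrentHTTPClient().get_many(["https://example.com"] * 10)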

2. Thread Safety

In multi-threaded environments, thread safety is always top of mind! Ensuring that your code and its dependencies can gracefully handle concurrent access from multiple threads is crucial to preventing race conditions and data corruption.

Asyncio offers a mechanism to execute blocking non-thread-safe code safely by running code on the main thread. While this approach ensures safety, it comes with performance considerations. Relying solely on the main thread to handle all non-thread-safe operations can lead to bottlenecks, especially in scenarios where numerous asyncio tasks are awaiting access to this thread. The performance impact arises from the sequential execution of tasks within the main thread, which can hinder the concurrent processing power that asyncio aims to leverage. As a result, the benefits of asynchronous execution may be compromised, and the overall efficiency of the application may suffer.
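
For example, the asgiref helper we used earlier exposes this choice through its thread_sensitive flag. A small sketch of the two modes, where update_shared_state is a hypothetical function touching a non-thread-safe resource:

import asyncio

from asgiref.sync import sync_to_async


def update_shared_state():
    # Hypothetical function touching a non-thread-safe resource.
    ...


async def main():
    # thread_sensitive=True (the default) serializes all such calls
    # onto a single thread: safe for non-thread-safe code, but tasks
    # queue up behind one another.
    await sync_to_async(update_shared_state, thread_sensitive=True)()

    # thread_sensitive=False dispatches to a thread pool: more
    # concurrency, but only appropriate for thread-safe code.
    await sync_to_async(update_shared_state, thread_sensitive=False)()


asyncio.run(main())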

3. Visibility and Debugging

Introducing concurrency adds a new dimension to visibility and debugging. With multiple threads or asynchronous tasks executing concurrently, tracing execution and diagnosing issues becomes more challenging.

For asyncio, identifying operations that could block the event loop is critical to maintaining responsiveness. Blocking operations, if left unchecked, can disrupt the entire application flow. Similarly, in threaded environments, monitoring latency metrics becomes imperative to gauge the efficiency of thread utilization and preemptively address contention issues within thread pools.

As developers, we should weigh trade-offs carefully. Complexity, thread safety, and debugging overhead should all be considered before migrating synchronous Python code to concurrent code.

Conclusions

While we took a comparative look at the performance of asynchronous and synchronous HTTP clients, it’s important to recognize that our testing was not exhaustive. For example, we didn’t utilize session objects when testing the synchronous HTTP client, which can improve performance when making repeated requests (see the sketch below). Additionally, we used the default configurations for both HTTP clients throughout our tests. Fine-tuning parameters such as connection pool behavior and DNS cache timeouts could yield even more performance gains.

While we compared two prominent HTTP clients, namely requests and aiohttp, the Python ecosystem offers several alternatives. Libraries like httplib2, httpx, and GRequests provide different approaches to managing requests, each with its own strengths and trade-offs. Exploring and benchmarking these alternatives against your specific use case can offer insights into selecting the most suitable solution for your project.

This investigation serves as a starting point for deeper exploration and experimentation. We hope you’ve found this post informative and that you consider asyncio in your next Python project!
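
As a parting sketch, here’s the session reuse idea mentioned above; the actual speedup depends on your endpoint and connection setup costs:

import requests

# Without a session, each request may pay TCP and TLS setup costs.
for _ in range(10):
    requests.get("https://example.com")

# With a session, the underlying connection is pooled and reused.
with requests.Session() as session:
    for _ in range(10):
        session.get("https://example.com")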
