Zig -> allocators -> Rust ergonomics
Show of hands, who has run into this problem when writing Rust:
You have a hot function where you need some sort of temporary buffer to store values, but for performance reasons you can’t just allocate that buffer then and there. Any sort of heap allocation is potentially very expensive, because your program may have to ask the operating system to allocate and hand out more memory.
Usually each memory request allocates more memory than initially required, whether because the OS can only hand out full memory pages, or because the allocator deliberately requests more than it needs in order to fulfill smaller future requests.
Whenever the process does need to request additional memory from the OS, that’s a fairly large performance hit!
Requesting memory from the OS involves a syscall and with it a context switch: your program saves the state it is in right now, hands control over to the operating system kernel to do its thing, and then restores that previous state. During that time, your program is effectively halted, and the whole round trip can take a while.
The most common solution to prevent this sort of performance hit is to pre-allocate this buffer. For example, in a simple loop you would maybe do something like this:
// from this...
loop {
    // possibly reallocated in every iteration
    let mut ids: Vec<usize> = vec![];
    fetch_ids(&mut ids);
    filter_invalid_ids(&mut ids);
    process_ids(ids);
}
// ... to this
// Allocated once
let mut ids: Vec<usize> = vec![];
loop {
    // reused in every iteration
    ids.clear();
    fetch_ids(&mut ids);
    filter_invalid_ids(&mut ids);
    process_ids(&ids);
}
Note that in this simple example, the compiler might recognize what you’re doing in the first loop. When a variable declaration like this sits directly in the loop body, the state of the ids buffer at the start of the loop is effectively a loop invariant, meaning that this buffer HAS to be empty at the start of each iteration. The compiler may be able to use that knowledge to essentially optimize your code into the second variant in the example. This isn’t always guaranteed though, and it only works with these sorts of declarations directly in the loop body. Not only that, I’m not sure that it even works with heap allocated data.
A motivating example
Let’s look at a slightly more complex example. You are writing a video game where you manage a grocery store, and in that game you have NPC customers walking around your store and buying things.
fn load_shopping_cart(
    customer: &mut Customer,
    store: &mut Store,
) {
    let mut items: Vec<Item> = Vec::new();
    customer.fetch_shopping_list(&mut items);
    for item in items.iter_mut() {
        match store.fetch_item(item) {
            Response::Available => {
                mark_as_available(item);
                continue;
            }
            Response::NotAvailable => continue,
            // this early return is why we are
            // splitting looking up the items
            // and actually loading them into
            // the cart.
            Response::FireAlarm => return,
        }
    }
    // take the marked items out of the
    // store's inventory, and put them
    // into the customer's shopping cart
    customer.load_into_cart(items, store);
}
// this function is called in some hot loop
loop { // the game loop
    while !customer_simulation.converged() {
        for customer in customers.iter_mut() {
            load_shopping_cart(
                customer,
                &mut store,
            );
            // ...
        }
    }
}
Let’s imagine this function was called in a hot loop; you would very soon feel the impact of this allocation, because every time this function is called you might trigger an expensive syscall. And, unlike the simple example from earlier, the compiler cannot help you reuse the memory of items, because this allocation is “hidden” inside the load_shopping_cart function call.
Solution 1: Pass a pre-allocated buffer into the function
In order to avoid the allocation, you could manually pre-allocate this buffer outside of the outer loop, and then pass it into the function to be mutated.
fn load_shopping_cart(
    customer: &mut Customer,
    store: &mut Store,
    item_buffer: &mut Vec<Item>,
) {
    // make sure we have an empty buffer
    item_buffer.clear();
    customer.fetch_shopping_list(item_buffer);
    for item in item_buffer.iter_mut() {
        match store.fetch_item(item) {
            // ...
        }
    }
    // ...
}
let mut item_buffer: Vec<Item> = Vec::new();
loop {
    while !customer_simulation.converged() {
        for customer in customers.iter_mut() {
            load_shopping_cart(
                customer,
                &mut store,
                &mut item_buffer,
            );
            // ...
        }
    }
}
This certainly works! And in this loop structure it wouldn’t be too hard to maintain either…
… but that’s a very favourable situation. If the load_shopping_cart function was nested a dozen functions deep into the loop, you would have to pass the item_buffer through all of those functions just to use it in this one function. For example, what if we need to put the simulation and the customer loop into separate functions?
let mut item_buffer: Vec<Item> = Vec::new();
loop {
    run_simulation(
        &mut customers,
        &mut store,
        &mut item_buffer,
    );
    // ...
}
fn run_simulation(
    customers: &mut Vec<Customer>,
    store: &mut Store,
    item_buffer: &mut Vec<Item>,
) {
    while !customer_simulation.converged() {
        sim_customers(
            customers,
            store,
            item_buffer,
        );
        // ...
    }
}
fn sim_customers(
    customers: &mut Vec<Customer>,
    store: &mut Store,
    item_buffer: &mut Vec<Item>,
) {
    for customer in customers.iter_mut() {
        load_shopping_cart(
            customer,
            store,
            item_buffer,
        );
        // ...
    }
}
Now you need to drag this buffer through all of these functions, only to use it in possibly just this one place.
Where this gets extra hairy is when you have many more of those buffers. Suddenly, your function signatures contain various types of buffers that you need to pass through the call stack just to avoid a couple of allocations.
fn foo(
    // ...
    item_buffer: &mut Vec<Item>,
    customer_buffer: &mut Vec<Customer>,
    id_buffer: &mut Vec<usize>,
    float_buffer: &mut Vec<f32>,
) { /*...*/ }
… and those might only be used many more function calls deep into your program.
Solution 2: Attach the buffer to another struct
Well, we are passing mutable objects to these functions anyway, so why not just attach the buffer to one of those structs? The most logical home for our item_buffer would probably be the Store struct.
struct Store {
    // ...
    pub item_buffer: Vec<Item>,
}

fn load_shopping_cart(
    customer: &mut Customer,
    store: &mut Store,
) {
    store.item_buffer.clear();
    customer.fetch_shopping_list(
        &mut store.item_buffer,
    );
    for item in store.item_buffer.iter_mut() {
        // compile error! cannot borrow `store`
        // again, because we are already holding
        // a mutable reference into the store's
        // buffer.
        match store.fetch_item(item) {
            // ...
        }
    }
    // ...
}
So maybe instead we attach it to the Customer?
struct Customer {
    // ...
    pub item_buffer: Vec<Item>,
}

fn load_shopping_cart(
    customer: &mut Customer,
    store: &mut Store,
) {
    customer.item_buffer.clear();
    customer.fetch_shopping_list();
    for item in customer.item_buffer.iter_mut() {
        // the borrow checker is happy!
        match store.fetch_item(item) {
            // ...
        }
    }
    // ...
}
This should work, but we are still restricted in what we can do within that for loop. Not only that, we are also making our data structures bigger. Maybe the Customer was 24 bytes in size, and now, because of this one buffer (a Vec is itself 24 bytes on a 64-bit system: a pointer, a length, and a capacity), it doubles to 48 bytes. Every element of the Vec<Customer> in the loop we saw earlier just grew by a factor of two.
The concern here may not even be the memory footprint of the program: suddenly it is far less likely for multiple Customer objects to fit into the cache, never mind a single cache line. Compared to allocating the vector inside the function, this is still favourable performance-wise, but it’s far from ideal.
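If you want to sanity-check those numbers yourself, a one-liner does it (sizes shown are for a typical 64-bit target):

fn main() {
    // a Vec is a pointer, a length, and a capacity: three 8-byte words
    assert_eq!(std::mem::size_of::<Vec<u8>>(), 24);
}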
In order to avoid that problem, another solution could be to have a struct of buffers:
struct BufferContainer {
pub id_buffer: Vec<usize>,
pub customer_buffer: Vec<Customer>,
pub item_buffer: Vec<Item>
// ... other buffers
}
At least now we only pass this one object through our call stack, but we now need to maintain the fields in this container. Every time we need a new buffer, we have to make sure the constructors are updated (unless we can just default them), and we still need to remember to clear these buffers before or after we use them. Or we write a BufferContainer::clear method that resets all of its buffers, but then we need to remember to add a clear call to that method for any new buffer we add to the struct, and so on, and so on …
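To make that last point concrete, here is a sketch of what such a clear method would look like for the BufferContainer above; every new field needs a matching line, and nothing reminds you if you forget one:

impl BufferContainer {
    pub fn clear(&mut self) {
        // clear() keeps the allocations around,
        // it only resets the lengths to zero
        self.id_buffer.clear();
        self.customer_buffer.clear();
        self.item_buffer.clear();
        // ... and one line for every buffer you add later
    }
}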
I don’t like this
No matter which solution you end up using, you can likely agree that there’s a whole lot of ceremony and brain power involved in landing on the solution that fits your problem best. Maybe it’s not even obvious that this BufferContainer is the least hairy option. And even that can become very painful to deal with if one of your buffers needs to hold references to other objects.
My personal list of pain points goes something like this:
- I need to make sure I avoid issues with the borrow checker (in terms of mutable references, as well as lifetimes)
- I need to make sure my structs don’t keep growing in size
- Whenever I use a buffer, I need to remember to clear it (that’s a nasty class of bugs, let me tell you…)
Solving situations like this can be infuriating… but there’s an easier way, and I personally picked it up while recently learning a different programming language.
Short detour to Zig
If you’re somewhat interested in modern programming languages, you have likely heard of Zig. It’s a low-level systems language trying to compete with C. One particularly interesting design decision in Zig was to make allocations more explicit, and, more importantly, to make the allocation strategy explicit.
A typical Zig application starts with declaring your allocator similar to this:
const std = @import("std");

pub fn main() void {
    const allocator = std.heap.page_allocator;
    // think of the `ArrayList` as something
    // like Rust's `Vec`, a dynamically
    // resizable heap allocated array. here we
    // use it to dynamically allocate memory:
    var x = std.ArrayList(u8).init(allocator);
    // free the memory at the end of the scope
    defer x.deinit();
    var y = std.ArrayList(i16).init(allocator);
    defer y.deinit();
    var z = std.ArrayList(f32).init(allocator);
    defer z.deinit();
    // ...
}
This might sound and look complicated, but there’s a lot of power hidden here. Zig’s standard library exposes a couple of different allocation strategies, and one I want to focus on here is the arena allocator.
const std = @import("std");

pub fn main() void {
    var arena = std.heap.ArenaAllocator.init(
        std.heap.page_allocator,
    );
    defer arena.deinit();
    const allocator = arena.allocator();
    // Notice we are only calling `.deinit` on
    // the arena, NOT on the allocated buffers
    // x, y, and z!
    var x = std.ArrayList(u8).init(allocator);
    var y = std.ArrayList(i16).init(allocator);
    var z = std.ArrayList(f32).init(allocator);
    // ... do some work with the buffers ...
}
This might already be selling you on arenas: it’s one big chunk of memory that you can hand out to different pieces of data, and when you’re done you just free the whole arena, without having to remember to free each individual piece of memory the allocator handed to you. Let’s take a look at how arenas conceptually work.
What are memory arenas?
An arena can be thought of as a linked list of big chunks of pre-allocated memory, together with a pointer to where the free space starts. For simplicity’s sake, let’s just look at the arena as one single node; one single chunk of memory:
A node in an arena's linked list:
`[ ]` a free "piece" of memory
`[x]` used memory
[ ]
[ ]
[ ]
[ ]
[ ] <- pointer to the start of free memory.
[x]
[x]
[x]
When we want to allocate new memory from the arena, we take the length of the requested chunk, move the pointer up past that many free “boxes”, and return where the pointer previously pointed to, because that’s where the requested chunk of memory starts:
`a` is the newly requested chunk of memory,
and it needs to occupy 3 boxes.
[ ]
[ ] <- pointer to the start of free memory.
[a]
[a]
[a] <- return a pointer pointing to this
[x]
[x]
[x]
What happens when we fill up this chunk? As I said earlier, an arena is a linked list of memory chunks, and whenever a request would overflow the node our pointer points into, we allocate a new chunk that can at least fulfill the request, and future allocations will allocate into that new chunk.
`b` is the newly requested chunk of memory,
and it needs to occupy 4 boxes, but our
current node only has space left for 2.
[ ] [ ]
[ ] [ ]
[a] [ ]
[a] [ ] <- free memory starts here
[a] [b]
[x] [b]
[x] [b]
[x] [b] <- return this pointer
If your memory request fits into the current node, this allocation is VERY fast. You just bump the pointer up and return a reference to where the free pointer was pointing previously. That’s it. No syscall, no free-list management, just one addition and returning a reference. The only time a syscall is involved in this process is when you need to allocate a new node, and if your nodes are large enough, that should happen only very rarely.
But here’s the most powerful bit: an arena can be trivially reset by just moving the free pointer back to the beginning!
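To make the mechanics concrete, here is a deliberately simplified single-chunk bump “arena” in Rust. This is a sketch of the idea only: no alignment handling, no chaining of new nodes, and it hands out raw byte slices instead of typed allocations:

struct Bump {
    chunk: Vec<u8>, // one big pre-allocated chunk
    offset: usize,  // points at the start of free memory
}

impl Bump {
    fn with_capacity(bytes: usize) -> Self {
        Bump { chunk: vec![0; bytes], offset: 0 }
    }

    // Hand out `len` bytes by bumping the offset.
    fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        if self.offset + len > self.chunk.len() {
            return None; // a real arena would chain a new node here
        }
        let start = self.offset;
        self.offset += len; // the "bump": one addition, no syscall
        Some(&mut self.chunk[start..start + len])
    }

    // Reset: move the free pointer back; the memory stays allocated.
    fn reset(&mut self) {
        self.offset = 0;
    }
}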
Let’s say we are using this arena in a loop to allocate two different buffers.
const std = @import("std");

pub fn main() void {
    var arena = std.heap.ArenaAllocator.init(
        std.heap.page_allocator,
    );
    defer arena.deinit();
    const allocator = arena.allocator();
    while (true) {
        var x = std.ArrayList(u8).init(allocator);
        var y = std.ArrayList(i16).init(allocator);
        var z = std.ArrayList(f32).init(allocator);
        // ... do some work with the buffers ...
        // reset the arena pointer without
        // deallocating the memory
        _ = arena.reset(.retain_capacity);
    }
}
In our arena diagram what would happen here is something like this:
We reset by just moving the free pointer back
to the beginning of the first node's memory
chunk.
[ ] [ ]
[ ] [ ]
[a] [ ]
[a] [ ]
[a] [b]
[x] [b]
[x] [b]
[x] <- this is free memory now [b]
Obviously, our old data is still in the arena, but the arena is the only thing holding any sort of reference into those chunks of memory, so the old data left in there is essentially unreachable.
The important effect here is that, even if the initial arena was not big enough for all the buffers we need per loop iteration, we likely only grow it in the very first iteration. Each following iteration just reuses the memory we already have allocated previously and only very rarely do we have to request more memory from the OS.
Even if your particular application does not have any sort of loops, you likely can still use an arena and tune it to make the nodes pretty large in order to minimize new memory requests.
Also note that we allocated buffers of arbitrary types into the arena. One buffer held u8 elements, and the others held i16 and f32 elements. The arena itself is untyped. It just needs to know how many bytes you are requesting, and that’s it.
This makes arenas behave kind of like a dynamically allocated stack (the memory region, not the data structure) where you control when the stack pointer is reset, or like a manual garbage collector where you control when the memory is “freed”, all in one go.
Personally, prior to learning Zig, I had heard that these different allocation strategies exist, but I never played around with them. Turns out, these things are insanely powerful and make your life way easier! So…
Can we use this in our Rust example?
Yes, we can! The bumpalo crate implements a simple arena/bump allocator that trivializes our example. Here’s what the initial version of the code looked like:
fn load_shopping_cart(
    customer: &mut Customer,
    store: &mut Store,
) {
    let mut items: Vec<Item> = Vec::new();
    customer.fetch_shopping_list(&mut items);
    for item in items.iter_mut() {
        match store.fetch_item(item) {
            Response::Available => {
                mark_as_available(item);
                continue;
            }
            Response::NotAvailable => continue,
            Response::FireAlarm => return,
        }
    }
    customer.load_into_cart(items, store);
}
loop {
    while !customer_simulation.converged() {
        for customer in customers.iter_mut() {
            load_shopping_cart(
                customer,
                &mut store,
            );
            // ...
        }
    }
}
With bumpalo we could do this instead:
// bumpalo's Vec lives behind the "collections" feature
use bumpalo::collections;

fn load_shopping_cart(
    customer: &mut Customer,
    store: &mut Store,
    allocator: &bumpalo::Bump,
) {
    let mut items = collections::Vec::<Item>::new_in(
        allocator,
    );
    customer.fetch_shopping_list(&mut items);
    for item in items.iter_mut() {
        match store.fetch_item(item) {
            Response::Available => {
                mark_as_available(item);
                continue;
            }
            Response::NotAvailable => continue,
            Response::FireAlarm => return,
        }
    }
    customer.load_into_cart(items, store);
}
let mut allocator = bumpalo::Bump::new();
loop {
    while !customer_simulation.converged() {
        for customer in customers.iter_mut() {
            load_shopping_cart(
                customer,
                &mut store,
                &allocator,
            );
            // ...
        }
    }
    allocator.reset();
}
This looks pretty simple, doesn’t it? Most importantly though, I didn’t have to think about performance or the borrow checker. I just allocate the buffer where I need it, use it, and at the end of the function the only reference to that piece of memory is dropped.
Pros and cons of using an arena
In my mind, using an arena allocator like this gives me the ease of use of allocating memory where I need it, but without the performance hit, while being more flexible than the BufferContainer I showed earlier.
struct BufferContainer {
    pub id_buffer: Vec<usize>,
    pub customer_buffer: Vec<Customer>,
    pub item_buffer: Vec<Item>,
    // ... other buffers
}
You can think of an arena as sort of a buffer container like this, but completely untyped and way easier to maintain. You don’t need to bother with resetting the individual buffers; you only reset the arena.
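For example, bumpalo’s Bump happily hands out memory for completely different types from the same arena; it only ever sees sizes and alignments:

use bumpalo::Bump;

fn main() {
    let bump = Bump::new();
    // three allocations of three different types, one arena
    let a: &mut u8 = bump.alloc(42);
    let b: &mut i16 = bump.alloc(-7);
    let c: &mut f32 = bump.alloc(1.5);
    println!("{a} {b} {c}");
}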
This does mean that you need to think about when and where to reset the arena, because when you do, all references into the arena become invalid. In fact, the compiler will yell at you if you try to reset the arena while still holding a reference into its memory, because resetting requires a mutable reference to the arena. Rust’s borrow checker helps keep this operation safe.
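A tiny sketch of what that looks like with bumpalo; this is exactly the kind of code the compiler rejects:

use bumpalo::Bump;

fn main() {
    let mut bump = Bump::new();
    let x = bump.alloc(123u32);
    // compile error: cannot borrow `bump` as mutable,
    // because `x` still borrows from it
    bump.reset();
    println!("{x}");
}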
Another con is that arenas usually have a higher overall memory footprint than other solutions. For example, if we had a generic Vec<usize> buffer, there are probably many places inside of our loop where we could reuse it, but with an arena you would probably allocate a new buffer wherever you need one, because otherwise you end up passing the buffer around again.
There is one situation where using this sort of arena can get complicated, and that’s when you rely on your data structures being dropped. When an arena is reset, because it has no clue what sort of data you dumped into it, it cannot call drop on anything inside the arena. There are ways around that, like calling drop manually, but that obviously hurts ergonomics.
I think those cons are still fairly minor compared to how much easier arenas make thinking about buffers like this. They turn allocating memory into a “fire and forget” kind of deal, similar to how you would look at stack allocations or a garbage collector (but without the GC pauses).
Should you be using arenas?
I don’t know whether the idea originated there, but game development is probably where you will hear most people talk about using arenas. In a video game you have a lot of calculations to perform between each frame displayed on the screen, and you want neither to make syscalls, nor to blow up your data structures and harm cache friendliness.
Using an arena allocator trivializes most of these problems when you set up your arena such that you reset it once you have finished calculating a new frame.
Another easy example is web servers: When a request comes in, you spin up an arena to fulfill that request, and throw away the arena afterwards (or you could reuse an arena from a previous request; that’s a bit tricky in async/threaded environments, but not impossible).
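As a sketch (Request and Response here are hypothetical stand-ins for whatever your framework provides):

use bumpalo::Bump;

// hypothetical framework types, just for illustration
struct Request;
struct Response;

fn handle_request(_req: &Request) -> Response {
    // one arena per request...
    let bump = Bump::new();
    // ...all request-scoped scratch data goes in here...
    let _scratch = bump.alloc_str("request-scoped scratch");
    // ...and the whole arena is freed when `bump` drops
    Response
}

fn main() {
    let _response = handle_request(&Request);
}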
The last example I’d like to bring up is one-shot CLI tools: those sorts of programs that you invoke in your terminal with some arguments, maybe some minor user interaction while they are running, and then they just finish; tools more like a compiler, less like htop. CLI tools, more often than not, have to deal with a lot of strings that need to be heap allocated, but in a program like that, you can just define an arena at the start of your program, dump everything in there, and that’s it.
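For instance, bumpalo’s alloc_str copies a string into the arena and hands back a slice, with no per-string frees to keep track of:

use bumpalo::Bump;

fn main() {
    let bump = Bump::new();
    // copy input strings into the arena as they come in
    let name: &str = bump.alloc_str("some parsed argument");
    println!("{name}");
    // everything is freed in one go when `bump` drops at exit
}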
Although, now that I think about it, a TUI tool like htop, because it has a render step, can probably benefit from an arena in the same way a video game does.
Basically, whenever you have a clear scope where you just need fresh untyped buffers each time you enter that scope, arenas are your friend. For me, they made my life in Rust so much easier, especially because I’m kind of circumventing problems I would otherwise have with the borrow checker. In other languages, arenas solve a lot of the lifetime issues that the language cannot help you with. In Rust, the compiler helps you with lifetimes, but then you have to deal with the borrow checker. An arena handles those lifetimes in a very simple way, which increases ergonomics and productivity immensely, simply because I have to think about these things WAY less.
Maybe the most important caveat is libraries. At the very least, do not make allocators part of your public API, if only because you would be forcing your library’s consumers to also add bumpalo as a dependency.
Notable resources, quotes
To quote Casey Muratori, in his series “Handmade Hero”:
People don’t usually believe me when I say I don’t spend time thinking about memory management, and I really don’t. Most of the time, it seems like it just handles itself.
I got this quote from a talk by Ryan Fleury titled “Enter the arena”, which he gave at Handmade Con in 2023 (here’s a VOD of that talk). It’s a great talk, and you should probably give it a watch.
I also highly recommend Ryan’s interview on Backend Banter, as well as Benjamin Feng’s presentation What’s a memory allocator anyways?. The latter shows a lot of Zig examples, but it’s still very general and walks through the iterative design of different memory allocators, starting from fixed size buffers, through arenas, and on to more and more sophisticated solutions.
Check out the bumpalo crate. It’s an amazing project, and as long as we don’t have the allocator_api in stable Rust, it and the allocator_api2 crate can help you immensely with making Rust way more ergonomic to use.
Lastly, check out Zig. It’s an amazing project, and learning a different programming language is always a good thing, because it can really open up your mind to completely new ideas you weren’t even aware of previously. For me, these allocation strategies were lingering at the edges of my mind, but having to use them when using Zig made me realise that they aren’t some sort of optimization black magic that you would only use when building video games, but rather that it just makes writing applications in any of these languages so much easier.