Mimalloc Cigarette: Losing one week of my life catching a memory leak

21.08.2024

#memory #mimalloc #prices #rust

One of applications at my work has always been RAM-bound - it's a pricing engine that basically loads tons of hotels, builds some in-memory indices and then allows you to issue queries like find me the cheapest hotel in berlin, pronto.

A pricing engine's main purpose is to price ~~engines~~ hotels, but in order to do that effectively, there's a lot of "meta work" involved, like:

Where do we load the data from, how do we do it?
Do we load the entire dataset or just some parts of it?
Should we precalculate prices in order to speed-up the most popular queries?

Such an engine poses an interesting technical challenge, even greater so when one day it starts OOMing on production, even though the entire dataset should fit in the memory multiple times...

Hello, World!

Simplifying ~~a bit~~ a lot, what we're dealing with is:

github:Patryk27/mimalloc-cigarette

struct State {
    hotels: Vec<Hotel>,
}

struct Hotel {
    prices: Vec<f32>,
}

// ---

impl State {
    fn new() -> Self {
        Self {
            hotels: load_hotels().collect(),
        }
    }

    fn refresh(&mut self) {
        for (id, hotel) in load_hotels().enumerate() {
            self.hotels[id] = hotel;
        }
    }
}

fn load_hotels() -> impl Iterator<Item = Hotel> {
    (0..1_000).map(|_| Hotel {
        prices: (0..1_000_000).map(|n| n as f32).collect(),
    })
}

// ---

fn main() {
    let state = Arc::new(RwLock::new(State::new()));

    // Spawn a separate thread responsible for checking which hotels have
    // changed and need to be refreshed etc.
    thread::spawn({
        let state = Arc::clone(&state);

        move || loop {
            thread::sleep(Duration::from_secs(5));
            state.write().unwrap().refresh();
        }
    });

    // Now in reality we start a Rocket server here, but for practical
    // purposes stalling the main thread will be sufficient.
    loop {
        thread::sleep(Duration::from_secs(1));
    }
}

Of course IRL there's ArcSwap instead of RwLock, every hotel contains much more information (like taxes, discounts or supplements) etc., but we've got a reasonably good approximation here.

Now, what if I told you that this code has totally different memory characteristics depending on which allocator you use?

Practical Reasons

Memory allocator is the piece of software invoked whenever your program needs to get hands on extra memory, like when you're calling Box::new(). And memory allocation is quite a complex topic, with different solutions offering different trade-offs.

Say, when implementing a firmware you might pick an allocator that works slower, because its implementation is just simpler and doesn't occupy much space in the final binary.

(some would argue that embeddeed programs shouldn't allocate, yadda yadda, but you get the idea -- replace firmware with wasm and you end up with the same problem.)

mimalloc is an allocator that fights tooth and nail for performance - and while most applications don't have to worry about allocation performance, in our case internal benchmarks have proven it gives us extra 10% for free; when your response times have to be within milliseconds, this matters.

But there's a catch.

Désenchantée

That program from before, on my x86_64 Linux machine it allocates around 4 GB of memory and it remains on this level through the refreshing.

This makes sense, right? We're using lazy iterators, replacing stuff in place one-by-one, there's no reason we'd need more RAM.

But if you use mimalloc:

use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

... the program will first allocate 4 GB and then allocate extra 4 GB during the refreshing, oh noes!

Getting to this point already took three days of my life - believe me or not, when faced with 200k lines of Arc-ridden Rust code that seems to generate a memory leak, one's first thought is not "let's try with different allocator", but rather "probably something's holding onto an Arc for too long".

And so I've valgrind-ed. I've perf-ed. I've analyzed assembly. I've headbanged and I've cried.

No more.

From now on I'm always assuming it's someone else's fault - it's the allocator, it's the compiler, it's that crate Mike pulled last night. WHY DO YOU HATE ME MIKE, WHY ARE YOU PULLING RANDOM CRATES TO MY PURE ~~~tv noise~~~

Remède

Allocators have different characteristics for a reason - they do some things differently between each other. What do you think mimalloc does that could account for this behavior?

Let me give you two hints, two pieces of code that solve the problem, but feel cursed:

// Approach 1:
fn main() {
    let state = thread::spawn(|| {
        Arc::new(RwLock::new(State::new()))
    }).join().unwrap();

    /* ... */
}

// Approach 2:
fn main() {
    /* ... */

    loop {
        Box::new(1234);
    }
}

Any ideas? Last chance to win a plushie polar bear!

Alright then, the issue is that mimalloc assumes that every thread allocates every now and then.

Every now and then during malloc() mimalloc performs some internal bookkeeping, so when a thread goes to sleep (say, because it delegates handling HTTP requests into a separate thread pool...), this bookkeeping doesn't happen (for that particular thread).

The most nasty edge case that can happen here, and the one that we've stumbled upon, is when your thread allocates a lot of data, then launches other threads to work on that data, and then goes to sleep. As other threads work on memory and override stuff, Rust destructors are launched properly, but the underlying memory blocks simply get marked as "to be released".

Under normal conditions, these blocks get processed a moment later, during a call to malloc() on the thread that created them - but if that thread is asleep, those blocks never become available again (unless the thread dies, of course).

To be sure, the problem is not that the memory is not returned to the kernel - that's alright. It's that unless this bookkeeping happens, mimalloc can't even use the memory for itself - all this free estate just lays there, dormant:

Anyway, the solution we went with was to keep all refreshing on the same thread - when program starts, we spawn a dedicated refreshing-thread and use channels to let it know to do its thing.

So yeah, that was fun; and health-wise probably more like seven cigarettes.