Memory for Nothing: Why Vec<usize> is (probably) a bad idea

15.10.2024

Every now and then one has to index something - in my case it was prices. Lots of them. At my previous job we've had this custom engine that basically loaded tons of data points:

|  city  |          dates         |  room  | occupancy |     price      |
|--------|------------------------|--------|-----------|----------------|
| warsaw | 2018-01-01..2018-01-14 | double | 1 adult   | 100.00 per day |
| warsaw | 2018-01-15..2018-01-31 | double | 1 adult   | 125.00 per day |

| warsaw | 2018-01-01..2018-01-14 | double | 2 adults  | 175.00 per day |
| warsaw | 2018-01-15..2018-01-14 | double | 2 adults  | 200.00 per day |

| berlin | 2018-01-01..2018-01-14 | double | 1 adult   | 160.00 per day |
| berlin | 2018-01-15..2018-01-31 | double | 1 adult   | 185.00 per day |

| berlin | 2018-01-01..2018-01-14 | double | 2 adults  | 235.00 per day |
| berlin | 2018-01-15..2018-01-14 | double | 2 adults  | 260.00 per day |

... and then it allowed you to issue queries, like:

find me the cheapest stay in berlin for 2 adults, from 2018-01-03 to 2018-01-04, plzz

Calculating prices was plenty difficult on its own, since you had to take into account supplements, discounts, taxes, cancellation policies etc., lots of things that don't really reduce the problem to a simple hashmap lookup. But the actually interesting thing happened earlier.

After all, before we're able to calculate prices, we have to find them.

♪ memory for nothing and bits for free... ♪

> Indexing
> Unwasting
> Summary

Indexing

Let's ground ourselves with a specific example - we're given all of those:

use std::collections::HashMap;
use std::ops::RangeInclusive;

type City = String;
type CityRef<'a> = &'a str;
type RoomRef<'a> = &'a str;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct Date {
    day: u32, // e.g. days since 1970-01-01
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Occupancy {
    adults: u8,
}

#[derive(Clone, Debug, PartialEq)]
struct Price<'a> {
    city: CityRef<'a>,
    dates: RangeInclusive<Date>,
    room: RoomRef<'a>,
    occ: Occupancy,
    price: f32, // ofc. one wouldn't actually use float for money
}

#[derive(Clone, Debug, PartialEq)]
struct Query<'a> {
    city: CityRef<'a>,
    dates: RangeInclusive<Date>,
    occ: Occupancy,
}

... and our problem is to efficiently implement this function:

fn query(prices: &[Price], query: &Query) -> Option<f32>

Naively, we're looking for:

fn query(prices: &[Price], query: &Query) -> Option<f32> {
    prices
        .iter()
        .filter(|price| {
            price.city == query.city
                && price.dates.start() <= query.dates.start()
                && price.dates.end() >= query.dates.end()
                && price.occ == query.occ
        })
        .map(|price| price.price)
        .next()
}

... but, of course, that won't fly when we're given millions of prices.

(also, ideally we'd find the cheapest price, but for the sake of example let's say that finding any price will do)

Faced with this problem, one might take a deep S3 sleep first, then wake up, drink a cup of coffee laced with kava kava and finally spit:

/* ... */

struct Index {
    indexed_prices: HashMap<City, HashMap<Occupancy, Vec<usize>>>,
}

impl Index {
    fn new(prices: &[Price]) -> Self {
        let mut indexed_prices: HashMap<_, HashMap<_, Vec<_>>> = HashMap::new();

        for (price_idx, price) in prices.iter().enumerate() {
            indexed_prices
                .entry(price.city.to_owned())
                .or_default()
                .entry(price.occ)
                .or_default()
                .push(price_idx);

            // N.B. calling `.city.to_owned()` here is inefficient, because it
            // will allocate a new string even if the hashmap already contains
            // the entry; this can be avoided with `.raw_entry_mut()`, but
            // let's not go crazy
        }

        Self { indexed_prices }
    }

    /// Returns indices of prices matching given city and occupancy.
    fn find(
        &self,
        city: &str,
        occ: Occupancy,
    ) -> impl Iterator<Item = usize> + '_ {
        self.indexed_prices
            .get(city)
            .into_iter()
            .filter_map(move |occs| occs.get(&occ))
            .flatten()
            .copied()
    }
}

fn query(prices: &[Price], index: &Index, query: &Query) -> Option<f32> {
    index
        .find(query.city, query.occ)
        .map(|price_idx| &prices[price_idx])
        .filter(|price| {
            // yay, simpler check!

            price.dates.start() <= query.dates.start()
                && price.dates.end() >= query.dates.end()
        })
        .map(|price| price.price)
        .next()
}

... and that's a perfectly valid approach! Although...

Note that I'm proposing Vec<usize> instead of Vec<&Price> - why is that?

If we were to implement the rest of the application, we'd probably end up with something like:

struct Engine {
    prices: Vec<Price>,
    index: Index,
}

Now, if our Index was based around borrows:

struct Index<'a> {
    indexed_prices: HashMap<City, HashMap<Occupancy, Vec<&'a Price>>>,
}

... then we'd have to implement a self-referential Engine:

struct Engine {
    prices: Vec<Price>,
    index: Index<'whoopsie>,
}

... and self-referential types are not really supported in Rust - the language is actively hostile to you if you try to use them, and so I'm proposing we go with Vec<usize> from day zero.

Also, it'll just come handy later, pinky promise, just trust me.

Unwasting

Alright, look at them Vec<usize>. Look. At. Them. Squeeze your eyes. Unsqueeze them. Do a squat. Do you see what I see?

Those vectors are mostly filled with zeros!

On a 64-bit machine, one usize weighs 8 bytes and allows to encode numbers from 0 up to 2^64-1.

18446744073709551615 - man that's a lot of prices! A lot. It's so many, in fact, that we would run out of memory (let alone time) before we're able to index half of them. One tenth. One percent. We're given all this space that we cannot reasonably use.

This is something that I noticed back then as well, pretty much accidentally by just looking around the code looking for links to memes.

"But, surely it's not mu...", you might say before I gag you. Gigabytes. Praires full of zeros -- and, in this economy!

To fix this, I've come up with a vector specialized for handling lists of indices - it starts as Vec<u8> and upgrades itself to a larger variant (Vec<u16> or Vec<u32>) when you push a number large enough.

It goes sorta like:

use std::{iter, mem};

struct IndexVec {
    repr: IndexVecRepr,
}

enum IndexVecRepr {
    U8(Vec<u8>),
    U16(Vec<u16>),
    U32(Vec<u32>),
}

impl IndexVec {
    pub fn push(&mut self, val: usize) {
        if let Ok(val) = u8::try_from(val) {
            self.push_u8(val);
            return;
        }

        if let Ok(val) = u16::try_from(val) {
            self.push_u16(val);
            return;
        }

        if let Ok(val) = u32::try_from(val) {
            self.push_u32(val);
            return;
        }

        panic!("index too high: {val}");
    }

    fn push_u8(&mut self, val: u8) {
        match &mut self.repr {
            IndexVecRepr::U8(vals) => {
                // Ok, `Vec<u8>` can hold `u8`
                vals.push(val);
            }

            IndexVecRepr::U16(vals) => {
                // Ok, `Vec<u16>` can hold `u8`
                vals.push(val as u16);
            }

            IndexVecRepr::U32(vals) => {
                // Ok, `Vec<u32>` can hold `u8`
                vals.push(val as u32);
            }
        }
    }

    fn push_u16(&mut self, val: u16) {
        match &mut self.repr {
            IndexVecRepr::U8(vals) => {
                // Err, `Vec<u8>` can't hold `u16` - bump it to a larger type
                self.repr = IndexVecRepr::U16(
                    mem::take(vals)
                        .into_iter()
                        .map(|val| val as u16)
                        .chain(iter::once(val))
                        .collect(),
                );
            }

            IndexVecRepr::U16(vals) => {
                // Ok, `Vec<u16>` can hold `u16`
                vals.push(val);
            }

            IndexVecRepr::U32(vals) => {
                // Ok, `Vec<u16>` can hold `u16`
                vals.push(val as u32);
            }
        }
    }

    fn push_u32(&mut self, val: u32) {
        match &mut self.repr {
            IndexVecRepr::U8(_) => {
                // upgrade to `Vec<u32>`, similarly as in `push_u16()`
            }

            IndexVecRepr::U16(_) => {
                // upgrade to `Vec<u32>`, similarly as in `push_u16()`
            }

            IndexVecRepr::U32(vals) => {
                vals.push(val);
            }
        }
    }

    pub fn iter(&self) -> impl Iterator<Item = usize> {
        /* ...  */
    }
}

Summary

Should you (continue to) use Vec<usize>? Most likely yes!

This IndexVec has a couple of performance/memory trade-offs - especially around building it - that don't warrant using it as a generic replacement for Vec<usize>. But it does come handy sometimes. But there's nothing wrong with usizes. But they can be too much. You know, life. Choices. Stuff.

It would be nice to probably publish this vector as a crate - I haven't found anything like this on crates.io yet; rustc has a similarly named struct, but it does something else.

Anyway, ♪ memory for nothing and bits for free... ♪