Look Ma: My computer is talking! - Markov chains and N-grams

In this article I'll show you a fun & simple algorithm (< 200 lines of Rust code, no cheating) that can be trained on some text and later used to generate plausible-looking new text from it.

Here's an example of that, having trained it on Shakespeare:

> ACT IV
ACT IV. Scene I.
The coast of Kent

Alarum. Enter Falstaff solus.

  Gent. Ay, gentle Thurio; for you are senseless.
  CLOTEN. I cannot forget
    But thou art not here.
  BARDOLPH. Sir John, as you may:
    Will tie you to this vault to die,
    Or with the Emperor in his banishment.

... a few Wikipedia articles:

> Salmon thinks
Salmon thinks, again contrary to Kant, that it is he whom Wanda will marry,
and she placed fourth behind Mia Farrow, Jane Fonda, Peter Fonda, Oliver Reed,
Ashton met the Australian dental surgeon Phil Rasmussen, who gave him the name
Valor.

... and, as is the custom, on The Kernel itself:

> switch (
switch (skdev->state) {
        case MT6323_CHIP_ID:
                if (m->gpio[MAX1600_GPIO_0VPP]) {
                tsi_err(&pdev->dev, DMA_BIT_MASK(32);

My hope is to spark your interest in the topic of language models - and then, if you'd like to learn more, at the end of this post I'll leave a link to my favorite article by Andrej Karpathy on recurrent neural networks, which can approximate languages crazily well, of course with the caveat that they require somewhat more advanced math than what I'm going to show here :-)

Anyway, let's buckle up and start! -- no particular knowledge is required; the main audience for this article is Rust beginners and/or people simply interested in statistics.

(as always, source code is available at my GitHub.)

Naïveté

So, what we want to achieve is an application that can be trained on a dataset and later used to generate something new that feels like it originates from that original dataset without actually being copied verbatim.

That is to say, an overview of our program is:

struct Brain {
    /* ... */
}

impl Brain {
    pub fn train(&mut self, text: &str) {
        /* ... */
    }

    pub fn prompt(&self, prompt: &str, length: usize) -> String {
        /* ... */
    }
}

fn main() {
    /* ... */
}

... and a very naive implementation could fit on a larger handkerchief:

use rand::seq::SliceRandom;

#[derive(Default)]
struct Brain {
    words: Vec<String>,
}

impl Brain {
    pub fn train(&mut self, text: &str) {
        // Split the input text into words
        self.words = text.split(' ').map(|word| word.to_string()).collect();
    }

    pub fn prompt(&self, prompt: &str, length: usize) -> String {
        let mut out: Vec<_> = prompt
            .split(' ')
            .map(|word| word.to_string())
            .collect();

        let mut rng = rand::thread_rng();

        while out.len() < length {
            // Complete prompt by choosing random words from the input text
            // until the user is fed
            out.push(self.words.choose(&mut rng).unwrap().clone());
        }

        out.join(" ")
    }
}

Before we start ranting about what makes that approach bad, let's at least see it in action:

$ cargo new chatgpt-at-home
$ cd chatgpt-at-home
$ cargo add rand
# fill-in src/main.rs

Let's train it on, say, Orwell's Nineteen Eighty-Four:

$ wget http://gutenberg.net.au/ebooks01/0100021.txt -O 1984.txt

... and adjust main():

fn main() {
    let mut brain = Brain::default();

    brain.train(include_str!("../1984.txt"));

    println!("{}", brain.prompt("It was a", 64));
}

... aaaand:

$ cargo run
It was a the
bullet, sensuality, tiny--helpless!
How warm as Americans will earliest bluebells. the and
resentment vague five-cent another. It on
the they crashed of I it
was motive English darkness,' had and or time it he varicose designed stormed. the war tormented, it ran:


 as that idea Even
the the was An the no the 'It's he to with go white. the to pleasure Socialist he This on

Oof, owie (surprised pikachu face), that's pretty bad - what's going on?

Context

Arguably, generating random words yields bad prose because usually words in a sentence are somewhat related to each other (grand discovery, send me a nobel prize, i know) -- and our implementation doesn't care about that.

A low-hanging fruit would be to redesign the application so that instead of keeping words as a plain list, it keeps track of their context as well:

#[derive(Default)]
struct Brain {
    words: HashMap<String, HashMap<String, usize>>,
}

In this design, the outer map contains all of the words and the inner map contains the words that follow a given word at least once (with the usize counting how many times that follow-up occurred).

That is to say, given this sentence:

Hello, World! Hello, Joe! Hello, World!

... we'd expect the map to look like this:

{
  "Hello,": {
      "World!": 2,
      "Joe!": 1,
  },
  "World!": {
      "Hello,": 1,
  },
  "Joe!": {
      "Hello,": 1,
  },
}

As compared to a pure list, a map is pretty neat because it tells us the relationships between words.

For instance, we can see that directly after Hello, our dataset follows up twice with World! and once with Joe!, but never with another Hello, - and we can use this knowledge to fine-tune the prompter so that it chooses words based on their frequency:

use rand::seq::SliceRandom;
use std::collections::HashMap;

#[derive(Default)]
struct Brain {
    words: HashMap<String, HashMap<String, usize>>,
}

impl Brain {
    pub fn train(&mut self, text: &str) {
        let mut prev_word = None;

        for word in text.split(' ') {
            if let Some(prev_word) = prev_word.replace(word) {
                *self
                    .words
                    .entry(prev_word.to_string())
                    .or_default()
                    .entry(word.to_string())
                    .or_default() += 1;
            }
        }
    }

    pub fn prompt(&self, prompt: &str, length: usize) -> String {
        let mut out: Vec<_> = prompt
            .split(' ')
            .map(|str| str.to_string())
            .collect();

        let mut rng = rand::thread_rng();

        while out.len() < length {
            // Here comes the contextuality - let's look at the latest word
            // and try to find the most probable follower:
            let last_word = out.last().unwrap();

            if let Some(next_words) = self.words.get(last_word) {
                // rand's `.choose_weighted()` works on a `Vec`, so we have
                // to collect the follower-candidates first:
                let next_words: Vec<_> = next_words.iter().collect();

                // And finally, we can choose the next word based on its
                // frequency:
                let next_word = next_words
                    .choose_weighted(&mut rng, |(_word, frequency)| *frequency)
                    .unwrap();

                out.push(next_word.0.to_string());
            } else {
                // If this branch is taken, then it means we've stumbled
                // upon an unknown word - e.g. someone has given us a prompt
                // with a word that wasn't seen during the training phase.
                //
                // To recover from this situation, we could either generate
                // a completely random word (to push *something* to `out` so
                // that over the next iteration we have a different word to
                // lookup) or we can just simply bail out; let's do the
                // latter.
                break;
            }
        }

        out.join(" ")
    }
}

fn main() {
    let mut brain = Brain::default();

    brain.train("Hello, World! Hello, Joe! Hello, World!");

    println!("{}", brain.prompt("Hello,", 16));
}
$ cargo run
Hello, World! Hello, World! Hello, World! Hello, Joe! Hello, World! Hello, World! Hello, Joe! Hello, World!

Hey, that's some progress! Let's get back to Orwell and see what it thinks now:

brain.train(include_str!("../1984.txt"));

println!("{}", brain.prompt("It was a", 64));
$ cargo run
It was a different person.

'You see through the Party's sexual desire, they could. I collaborated in Newspeak
it was ambivalent in books, he had not know that
Newspeak is no words of thing into his mother said, speaking to grow not
less but when you remember the beginning there
had been issued yet. The face against all hours. He picked herself was awake he had racked my
brains. There

Now we're talking! -- the first sentence is correct English, feels like it could come from 1984, and it doesn't come verbatim from the source text; if that isn't a huge success, I don't know what is!

(👉👈 so circling back to that nobel prize thingie 👉👈)

Would you believe it - the thing we've just invented actually exists and has its own name: n-grams; in particular, what we've created is a 2-gram (bigram) model of that book.

In general, n-grams are just "sliding windows" of the input data:

fn main() {
    let input = "Onions have layers. Ogres have layers. Onions have layers.";
    let words: Vec<_> = input.split(' ').collect();
    let bigram: Vec<_> = words.windows(2).collect();
    let trigram: Vec<_> = words.windows(3).collect();

    println!("{:?}", bigram);
    println!("{:?}", trigram);
}
[["Onions", "have"], ["have", "layers."], ["layers.", "Ogres"], ...]
[["Onions", "have", "layers."], ["have", "layers.", "Ogres"], ["layers.", "Ogres", "have"], ...]

... and they come in handy pretty often when working with text, e.g. when designing search engines; or, you know, just having fun.

Terminology aside - what can we do to make the text generation better?

At the moment our brain "sees" only within the context of one word (out.last()) - this means that in any given sentence:

Mary   had    a      little  lamb
word0  word1  word2  word3   word4

... word1 is only influenced by word0, word2 is only influenced by word1 etc., without any "long distance" relationships happening.

Among other things, this means that when we train our algorithm on a larger dataset, it will most likely start generating more garbled text instead of less, because the probabilities will get "blurred" across more and more possible followers.
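
If you'd like to see that "blurring" with your own eyes, here's a throw-away debugging helper (a hypothetical addition, not part of the final program) that prints how many distinct followers a given word has accumulated - the more of them, the less a one-word context actually constrains the next choice:

impl Brain {
    /// Prints how many distinct followers a word has seen during training;
    /// common words like "the" end up with lots of them, which is exactly
    /// why the generated text drifts into nonsense.
    fn debug_followers(&self, word: &str) {
        if let Some(followers) = self.words.get(word) {
            println!("`{}` has {} distinct followers", word, followers.len());
        } else {
            println!("`{}` was never seen during training", word);
        }
    }
}

(e.g. brain.debug_followers("the") right after training.)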

So, there are (at least) two (easy) things we can do to improve it.

All My Tokens Gone

n-grams (and lots of other text algorithms, for that matter) require the input data to be split into smaller, "reasonable" chunks - for instance, it wouldn't make sense to do self.words["some very long sentence"]; it needs to be split into words first.

The process of splitting input data into "smaller chunks" is called tokenization and the "smaller chunks" themselves are called tokens.

So far our tokenization has been pretty simple:

text.split(' ')

... and while simplicity is generally desirable, in this case it leaves us with a subpar approach.

For instance, it leaves Hello, as a single token (with the comma glued to it), while if we split it into ["Hello", ","], the algorithm could "generalize" better over the input dataset, since it could "connect" the Hellos inside Hello, World! and Hello! World!.
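
You can see the problem in isolation with a tiny experiment (nothing fancy, just split(' ') on a toy sentence):

fn main() {
    let text = "Hello, World! Hello! World!";

    // The comma and the exclamation marks stay glued to the words, so
    // "Hello," and "Hello!" end up as two completely unrelated tokens:
    let tokens: Vec<_> = text.split(' ').collect();

    println!("{:?}", tokens); // ["Hello,", "World!", "Hello!", "World!"]
}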

As you might imagine, there are ~million ways to tokenize a sentence (by words, by lowercased words, by letters, with punctuation, without punctuation, including spaces, excluding spaces etc.) - for the purposes of this article, I propose a tokenizer which, upon finding a letter or a digit, simply reads the entire word (or number), passing other characters through:

/* ... */
use std::{collections::HashMap, iter};

impl Brain {
    /* ... */

    fn tokenize(s: &str) -> impl Iterator<Item = &str> {
        let mut chars = s.char_indices().peekable();

        iter::from_fn(move || loop {
            let (idx, ch) = chars.next()?;

            if ch.is_alphanumeric() {
                // If the current character is a letter or a number, let's
                // read the entire word (i.e. let's keep reading as long as
                // we've still got a letter or a number ahead).

                let idx_from = idx;
                let mut idx_to = idx + ch.len_utf8(); // `+ len_utf8()` allows us
                                                      // to handle UTF-8 chars

                while let Some((idx, ch)) = chars.peek() {
                    if ch.is_alphanumeric() {
                        idx_to = idx + ch.len_utf8();
                        chars.next();
                    } else {
                        break;
                    }
                }

                return Some(&s[idx_from..idx_to]);
            } else {
                // If the current character is not a letter nor a number,
                // let's emit it verbatim.
                //
                // This handles punctuation, spaces etc.

                let idx_from = idx;
                let idx_to = idx + ch.len_utf8();

                return Some(&s[idx_from..idx_to]);
            }
        })
    }
}

/* ... */

#[cfg(test)]
mod tests {
    use super::*;

    // cargo add test-case --dev
    use test_case::test_case;

    #[test_case(
        "Hello, World!",
        &["Hello", ",", " ", "World", "!"]
    )]
    #[test_case(
        "#include <coffee.h>",
        &["#", "include", " ", "<", "coffee", ".", "h", ">"]
    )]
    #[test_case(
        "123 + 234 = 0xCAFEBABE",
        &["123", " ", "+", " ", "234", " ", "=", " ", "0xCAFEBABE"]
    )]
    fn tokenize(given: &str, expected: &[&str]) {
        let actual: Vec<_> = Brain::tokenize(given).collect();

        let expected: Vec<_> = expected
            .iter()
            .map(|token| token.to_string())
            .collect();

        assert_eq!(expected, actual);
    }
}

... which we can now call:

impl Brain {
    pub fn train(&mut self, text: &str) {
        /* ... */

        for word in Self::tokenize(text) {
            /* ... */
        }
    }

    pub fn prompt(&self, prompt: &str, length: usize) -> String {
        let mut out: Vec<_> = Self::tokenize(prompt)
            .map(|str| str.to_string())
            .collect();

        let mut rng = rand::thread_rng();

        /* ... */

        // Spaces are now just tokens, so there's no need to join by them
        // anymore:
        out.join("")
    }
}

... and:

$ cargo run
It was a stronger are the

existed Ministry with arm. endless had world Then was 1930, game booklets was at who day children, bulletin!"'s vast the a

these

Huh, it's... somewhat worse?

Don't get me wrong, it wasn't exactly a candidate for a Pulitzer before either - but this so-called "better" tokenizer downright destroyed the output. This, of course, makes sense if you notice that the space is now a separate token - it means that what before was:

{
  "Hello,": {
      "World!": 2,
      "Joe!": 1,
  },
  "World!": {
      "Hello,": 1,
  },
  "Joe!": {
      "Hello,": 1,
  },
}

... has now become:

{
    "World": {
        "!": 2,
    },
    "!": {
        " ": 2,
    },
    "Joe": {
        "!": 1,
    },
    "Hello": {
        ",": 3,
    },
    ",": {
        " ": 3,
    },
    " ": {
        "Hello": 2,
        "World": 2,
        "Joe": 1,
    },
}

... and so after emitting e.g. Hello, the only possible next token is the comma, after which the only possible next token is the space, after which we're back to randomly choosing between Hello, World and Joe - with just a single-token lookbehind, we no longer remember that e.g. Hello, Hello isn't a "valid" sentence!

Fortunately, in this case what was easily broken can be (relatively) easily fixed.

More context, More Context, MORE. CONTEXT.

If we're currently generating bigrams, then maybe generating longer-grams would help? Let's try:

#[derive(Default)]
struct Brain {
    tokens: HashMap<[String; 4], HashMap<String, usize>>,
}

impl Brain {
    pub fn train(&mut self, text: &str) {
        let mut context: Vec<&str> = Vec::new();

        for token in Self::tokenize(text) {
            if let &[c4, c3, c2, c1] = context.as_slice() {
                *self
                    .tokens
                    .entry([
                        c4.to_string(),
                        c3.to_string(),
                        c2.to_string(),
                        c1.to_string(),
                    ])
                    .or_default()
                    .entry(token.to_string())
                    .or_default() += 1;
            }

            context.push(token);

            if context.len() > 4 {
                context.remove(0);
            }
        }
    }

    pub fn prompt(&self, prompt: &str, length: usize) -> String {
        let mut out: Vec<_> = Self::tokenize(prompt).collect();
        let mut rng = rand::thread_rng();

        while out.len() < length {
            let context = [
                out[out.len() - 4].to_string(),
                out[out.len() - 3].to_string(),
                out[out.len() - 2].to_string(),
                out[out.len() - 1].to_string(),
            ];

            if let Some(next_tokens) = self.tokens.get(&context) {
                let next_tokens: Vec<_> = next_tokens.iter().collect();

                let next_token = next_tokens
                    .choose_weighted(&mut rng, |(_token, frequency)| *frequency)
                    .unwrap();

                out.push(&next_token.0);
            } else {
                break;
            }
        }

        out.join("")
    }
}

fn main() {
    /* ... */

    println!("{}", brain.prompt("It was a failure.", 64));
}

... and:

It was a failure. It had faded but before
O'Brien smiled faintly. 'You are thinking,' he said.
It was a failure. It was gin that revived him every morning.
It was a failure. It was somehow slightly frightening, like the beams of a dental plate fell out of him under the banner of equality

Ha, ha!

Unfortunately, not everything's so great - because we now require the context to contain exactly four tokens, prompts with fewer tokens:

fn main() {
    /* ... */

    println!("{}", brain.prompt("It was", 64));
}

... cause panic! at the brain:

thread 'main' panicked at 'slice index starts at 18446744073709551614 but ends at 3'
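
We could slap a quick band-aid on prompt() - a hypothetical guard, shown here just to illustrate where it hurts, not the real fix - and simply bail out when there's not enough context yet:

while out.len() < length {
    // With fewer than four tokens in the buffer we can't build a full
    // context, so let's give up instead of panicking:
    if out.len() < 4 {
        break;
    }

    /* ... */
}

... but that merely dodges the crash - short prompts would then generate nothing at all, which isn't much better.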

Can even more statistics save us?

N-grams Everywhere All at Once

Using bigrams allows us to start prompts with just one word, but it generates pretty poor, basically contextless text.

Using bigger-grams allows us to generate better text (usually (up to a point)), but it requires bigger prompts.

Is there a middle ground? Sorta!

This might sound surprising, but it's actually totally legal to simply build multiple n-grams and sample from many of them at will - scientists hate this one simple trick:

#[derive(Default)]
struct Brain {
    // Instead of hardcoding it to `[String; n]`, let's use a vector so that
    // we can have `vec![one]`, `vec![one, two]` etc.:
    tokens: HashMap<Vec<String>, HashMap<String, usize>>,
}

impl Brain {
    /// Maximum number of tokens to look behind; 1 creates bigrams, 2 creates
    /// trigrams etc.
    const MAX_CONTEXT_SIZE: usize = 5;

    pub fn train(&mut self, text: &str) {
        let mut context: Vec<&str> = Vec::new();

        for token in Self::tokenize(text) {
            // Using the context, generate bigrams, trigrams etc.
            for cs in 1..=context.len() {
                let context = context[(context.len() - cs)..context.len()]
                    .iter()
                    .map(|token| token.to_string())
                    .collect();

                *self
                    .tokens
                    .entry(context)
                    .or_default()
                    .entry(token.to_string())
                    .or_default() += 1;
            }

            context.push(token);

            // Since we're only interested in at most the `MAX_CONTEXT_SIZE`
            // last tokens, there's no point in holding on to the older ones
            if context.len() > Self::MAX_CONTEXT_SIZE {
                context.remove(0);
            }
        }
    }

    pub fn prompt(&self, prompt: &str, length: usize) -> String {
        let mut out: Vec<_> = Self::tokenize(prompt).collect();
        let mut rng = rand::thread_rng();

        while out.len() < length {
            let mut next_token = None;

            // Look up starting from the biggest n-grams and, if no match is
            // found, proceed down.
            //
            // That is, we first look up 6-grams - if no match is found, we
            // look up 5-grams etc.; this allows us to find a better match
            // when more context is known, but also to generate something
            // reasonable when only a single word is around.
            //
            // This is a (very) simplified form of back-off smoothing (think
            // Katz's back-off model or Kneser–Ney smoothing).
            for cs in (1..=Self::MAX_CONTEXT_SIZE).rev() {
                if cs > out.len() {
                    continue;
                }

                let context: Vec<_> = out[(out.len() - cs)..out.len()]
                    .iter()
                    .map(|token| token.to_string())
                    .collect();

                if let Some(next_tokens) = self.tokens.get(&context) {
                    let next_tokens: Vec<_> = next_tokens.iter().collect();

                    next_token = Some(
                        next_tokens
                            .choose_weighted(&mut rng, |(_token, frequency)| {
                                *frequency
                            })
                            .unwrap()
                            .0,
                    );

                    break;
                }
            }

            if let Some(next_token) = next_token {
                out.push(next_token);
            } else {
                break;
            }
        }

        out.join("")
    }

    /* ... */
}

fn main() {
    /* ... */

    println!("{}", brain.prompt("It was", 256));
}

... which gives us:

$ cargo run
It was as inscrutable as everybody else's.

'Three thousand,' he went on and on until the original message and any notes
that he expected Winston to speak. She
walked obliquely away across the table looking at him. The
guard was laughing at his wrist-watch again.

Ampleforth marched clumsily out between the three dead men had become quasi-instinctive throughout almost
the whole world, and
everyone else was that bird singing? No mate,
no rival was watching at the table, dipped his pen, and wrote:


GOD IS POWER

Heyo! (ka ching (nobel prize))

Summary

So, what we created here is an application that can be trained on a dataset and later used to generate relatively fresh portions of text sampled from that initial dataset.

Is it ChatGPT?

I mean, no; I mean, kinda.

The technology is obviously much different - n-grams and Markov chains are not that great at following context, so there's no way to talk with them the same way you talk with ChatGPT.

Though, all said, I think that for the trade-off they make, they are a pretty nice piece of math!

I mean, all of this here can be coded entirely from scratch in like half an hour, then trained on all of Shakespeare's works in literally fifteen seconds, and then used to generate something relatively readable and fun. So yeah, I'm a fan of Markov chains!

Also, we are not that far away from ChatGPT either - some people would certainly politely argue with this stance, but both ChatGPT and Markov chains (as presented here) are just statistical text generation methods; ChatGPT yields better results, of course, but underneath there are just matrices of numbers derived from the probabilities in its dataset, no magic.

I mean, a bit of ✨ magic ✨ - which brings me to my last point, that article I mentioned:

The Unreasonable Effectiveness of Recurrent Neural Networks

Andrej Karpathy presents there an approach based on recurrent neural networks, which - in principle - can keep track of arbitrarily long context; they are also able to learn that e.g. ( should eventually be closed or that { should (probably) be preceded by a few spaces, to match the surroundings. And they learn all of that "on their own".

To me, the most captivating part of that article is somewhere in the middle, when Andrej inspects the state of a working network and sees that some neuron's activity correlates with the position of a word in the sentence, some other neuron responds to brackets etc., all working together to pretty-print a pretty text.

And they all learn it from scratch, on their own; that is to say, you don't program neuron #123 to respond to brackets and neuron #2137 to respond to punctuation - they all "figure it out" during the training process.

And that, I must admit, is magical.