
Porting Llama2.ha: Machine Learning Notes

June 23, 2024

#project #ai #llm

Earlier this year I finished llama2.ha, which is a port of Andrej Karpathy’s llama2.c project to the Hare programming language.

In casual terms, both of these programs can “run” a large language model (LLM) to generate text that continues a user’s input prompt.

More formally, these programs are both small inference engines capable of doing text generation when provided with a set of LLM weights, and as the names suggest they are built for models based on Meta’s Llama 2 family.

This blog post is part 1 of 2, reflecting on the machine learning nuts and bolts I learned about and found interesting. Part 2 will outline my experiences with Hare, as this was my first time using the language.

As an aside, what follows is a fairly informal collection of notes about LLMs, and consequently jargon is unavoidable. If what you are looking for is more structured educational material on LLMs feel free to skip to the resources appendix at the end of this page!

Motivation

Before starting this project I had already tinkered with self-hosting large language models on my own hardware, but I had not taken further steps to peek behind the curtain at the inner workings of the models themselves. Coming across Karpathy’s llama2.c on GitHub I found the inference code in run.c easy enough to read and follow. Other people had already made ports to languages like Rust, Go, and Zig. Seeing this gave me the idea to try porting the project to Hare, a new C-like language that I had been wanting to try out.

As I set out to write this port it seemed like a great 2-for-1 deal to learn about both LLMs and a new programming language assuming that I could overcome the double learning curve. Now that llama2.ha has reached what I consider a good enough “done” state (and seeing how I’m writing this blog post) I can confidently report that I survived this self-imposed trial and harvested a bit of knowledge along the way.

LLM Autoregression Pipeline

My most important takeaway from this project is the high level “pipeline” structure that large language models follow. I’m using the term pipeline very liberally here because the flow of data is not exactly linear through all three stages (more on that later). Anyways, let’s jump in.

Stage 1 is tokenization. This is a scheme for representing user input (in our case text) as a series of integers appropriate for feeding into the mess of linear algebra downstream. The scheme could be something as simple as ASCII but is typically more complex (Llama 2 family models use what is called Byte Pair Encoding).

Additionally, different models might employ different vocabularies and vocabulary sizes. A vocabulary is simply the complete set of valid integer values and their associated text representations. As an example, an ASCII-based vocabulary would have a vocabulary size of 128, while a vocabulary with one token per raw byte value would have a size of 256.
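
To make stage 1 concrete, here is a toy byte-level tokenizer sketched in Python. It is not what Llama 2 actually uses, and the encode/decode names are just for illustration, but it shows the basic text-to-integers interface that every scheme provides:

# Toy byte-level tokenizer: every byte value is its own token,
# so the vocabulary size is 256.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(tokens: list[int]) -> str:
    return bytes(tokens).decode("utf-8", errors="replace")

assert decode(encode("It was a dark and stormy night")) == "It was a dark and stormy night"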

Stage 2 is the transformer itself. This is where the comically large amounts of linear algebra happen. The transformer takes a single token as input and outputs a probability distribution for what the next token might be. There is internal state within the transformer, so passing in the same token at different times will yield different outputs.

As an aside, transformers contain numerous multilayer perceptron layers (a.k.a. feed forward networks). The perceptron itself is a design that dates back to the late 1950s, building on artificial neuron ideas from the 1940s. It’s neat to see past ideas return and become timeless, and it seems to happen quite a bit in machine learning.
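
For the curious, here is a rough Python sketch of what one of these feed forward blocks computes. Llama 2 uses a gated (SwiGLU-style) variant along these lines, but treat the weight names and shapes here as illustrative rather than a faithful copy of the real implementation:

import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))  # SiLU activation: z * sigmoid(z)

def feed_forward(x, w1, w2, w3):
    # Project up with w1 and w3, gate with SiLU, project back down with w2.
    return w2 @ (silu(w1 @ x) * (w3 @ x))

# Example with made-up dimensions: model dim 8, hidden dim 32.
dim, hidden = 8, 32
x = np.random.randn(dim)
w1, w3 = np.random.randn(hidden, dim), np.random.randn(hidden, dim)
w2 = np.random.randn(dim, hidden)
y = feed_forward(x, w1, w2, w3)  # same shape as x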

Stage 3 is sampling. This is simply the process of picking the next token from the probability distribution output by the transformer. I’ll keep this explanation brief for now, but this stage is open-ended and powerful.

Next is a look at how these stages fit together.

Feeding in the User’s Prompt

Feeding the prompt into the LLM is about setting the initial state of the transformer so that it has context. A neat realization that dawned on me during implementation is that stage 3 (sampling) isn’t necessary during this phase because the next token is already known!

Below is some extremely high level pseudocode to further illustrate this part:

user_prompt = "It was a dark and stormy night, and "

tokens = tokenize(user_prompt)  # stage 1
for t in tokens:
  transformer(t)                # stage 2

Generating Text

Once the prompt has been fed in generation of new text can begin, and it’s here that the term autoregression comes into play. It’s a word borrowed from statistics that in this context simply means that LLM outputs recursively get fed back into the input to produce a stream of output. As an aside it is for this reason that LLMs often get referred to as autocorrect on steroids.

Because the output from the sampler is already a token, it does not need tokenization and can be fed directly back into the transformer. Only the transformer (stage 2) and sampling (stage 3) are strictly necessary here, but each new token is usually also de-tokenized (the inverse of stage 1) so the output can be displayed as text; otherwise the whole exercise would be boring and pointless.

Here is some more pseudocode:

t = t_initial  # typically the last token of the prompt
i = 0
while i < MAX_TOKENS:
  token_probs = transformer(t)  # stage 2
  t = sampler(token_probs)      # stage 3
  print_token(t)                # inverse stage 1
  i += 1

Vocabulary: I just think it’s neat.

[As a warning, this section is mostly a brain dump, simply because I find the topic interesting!]

As mentioned in the previous section, the Llama 2 family uses a tokenization scheme called Byte Pair Encoding (BPE). The vocabulary size is 32,000 and contains tokens for representing many different natural languages.

Under BPE a token might represent a run of multiple characters or just a single character, and naturally some token representations are substrings of other token representations. In other words, there exist multiple possible tokenizations for a given string: a word like “stormy” might become a single token, a pair like “storm” + “y”, or six single-character tokens.

The BPE tokenization process is not random, however, because each token in the vocabulary has an associated likelihood score. This score lets the tokenizer algorithmically favor certain token representations over others.
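
Loosely modeled on the encoder loop in llama2.c, the greedy merge process looks something like the sketch below. Here vocab (string to token id) and scores (token id to likelihood score) are illustrative stand-ins for the real tokenizer data, which ships in a file alongside the model weights:

def bpe_encode(text, vocab, scores):
    # Start with one token per character (assumes every character is in
    # the vocabulary), then repeatedly merge the adjacent pair whose
    # combined string is in the vocabulary and has the highest score.
    tokens = [vocab[c] for c in text]
    id_to_str = {i: s for s, i in vocab.items()}
    while True:
        best = None
        for i in range(len(tokens) - 1):
            merged = id_to_str[tokens[i]] + id_to_str[tokens[i + 1]]
            if merged in vocab and (best is None or scores[vocab[merged]] > best[1]):
                best = (i, scores[vocab[merged]], vocab[merged])
        if best is None:
            return tokens  # no merge possible; tokenization is done
        i, _, merged_id = best
        tokens = tokens[:i] + [merged_id] + tokens[i + 2:]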

Additionally the vocabulary includes a set of 256 tokens, each representing a different raw byte value. These tokens appear to exist as a fallback for when a piece of UTF-8 text does not map onto any other token in the vocabulary.

How Does One Select a Vocabulary?

This is still a big open question that I have. As mentioned in the last section, for this project the Llama 2 family vocabulary was provided as-is. But how would engineers working on a new foundation model settle on a specific vocabulary? And furthermore, how big should the vocabulary be? I am sure that character and word statistics taken over the training data inform the decision, but how much of it is science versus art?

Attention Heads Accumulate State

As touched on earlier, the transformer model is stateful. This makes intuitive sense considering that passing in an input token for a letter like “F” at different times should yield different probability outputs. The original transformer paper presented a definition for its version of the attention mechanism, but the section was brief and consisted mostly of a simple but dense equation. It did not dawn on me at the time just how much state this attention mechanism accumulates as more tokens pass through the model.
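
For reference, the equation in question is the scaled dot-product attention definition, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

Attention(Q, K, V) = softmax(Q · Kᵀ / sqrt(d_k)) · V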

Working through this inference implementation in code made it clearer that the keys and values of the attention mechanism get accumulated with each pass through the transformer. In particular the source implementation from llama2.c uses a key_cache and value_cache that are responsible for persisting state from one transformer pass to the next.

If I’m not mistaken, this KV cache is the only global state that exists in the entire transformer. The rest of the operations consist mostly of multilayer perceptron layers and normalization, both of which are pure functions of their inputs.
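
To illustrate, here is a single-head, single-token Python sketch of that accumulation. The cache names loosely follow llama2.c’s key_cache and value_cache, but everything else (shapes, lack of batching, plain NumPy) is simplified for illustration:

import numpy as np

def attend(x, wq, wk, wv, key_cache, value_cache):
    # Project the current token and append its key and value to the caches.
    # These caches are the state that persists from one pass to the next.
    q, k, v = wq @ x, wk @ x, wv @ x
    key_cache.append(k)
    value_cache.append(v)
    # Score the current query against every cached key, softmax the scores,
    # and return the weighted sum of all cached values.
    scores = np.array([q @ past_k for past_k in key_cache]) / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return sum(w * past_v for w, past_v in zip(weights, value_cache))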

For a more visual representation of this concept I recommend the 3Blue1Brown video on GPT attention. Additionally this blog post has a nice explanation of KV cache as an implementation optimization that came about after the original transformers paper was published.

Most of the LLM is Deterministic

While porting this code from C to Hare, one small task I gave myself was to figure out how much randomness is actually in these language model systems. As it turns out, I spent a long time looking because I did not find much!

The only randomness I found was in the sampler (stage 3 from the pipeline section). People familiar with LLM APIs will recognize the idea that temperature essentially controls how “spicy” the responses of the LLM are. A temperature of zero produces deterministic output because next token selection becomes greedy and always picks the most probable token from the distribution. In contrast, a temperature greater than zero introduces randomness into next token selection, which results in livelier, less predictable output.
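
Here is a small Python sketch of how temperature typically gets applied; it mirrors the general shape of the llama2.c sampler rather than reproducing it exactly:

import numpy as np

def sample_next_token(logits, temperature, rng=None):
    rng = rng or np.random.default_rng()
    if temperature == 0.0:
        # Greedy and therefore deterministic: always take the most probable token.
        return int(np.argmax(logits))
    # Scale the logits, softmax into probabilities, then sample.
    # Higher temperature flattens the distribution and adds variety.
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))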

In fact, setting temperature to zero is how I verified that my tokenization and transformer implementations were correct. After a round of bug fixing I could get llama2.ha to generate output identical to llama2.c’s for a given model checkpoint and prompt.

Sampling is Powerful

This is another section about sampling because I want to emphasize how much influence it has on a large language model’s final output. Generating text with temperature set to zero is good for testing, but more advanced sampling techniques are used in real-world deployments.

Karpathy’s llama2.c implementation contains three basic sampling methods:

  • take the most probable token
  • sample a token from the probability distribution
  • sample only from the most likely tokens whose cumulative probability fits within a set threshold (a.k.a. top-p sampling; sketched below)
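
To make the last of those three concrete, here is a rough Python sketch of top-p (nucleus) sampling; it is not a line-for-line port of the llama2.c version:

import numpy as np

def sample_top_p(probs, p, rng=None):
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs)
    # Sort tokens from most to least probable.
    order = np.argsort(probs)[::-1]
    sorted_probs = probs[order]
    # Keep the smallest prefix of tokens whose cumulative probability reaches p.
    cutoff = min(int(np.searchsorted(np.cumsum(sorted_probs), p)) + 1, len(probs))
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    # Sample among the kept tokens only, then map back to the original token id.
    return int(order[rng.choice(cutoff, p=kept)])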

But what if some assumptions can be made about the structure of the output text? Perhaps the output is needed in a format like JSON or YAML. In that case many choices for the next token can be ruled out based on grammatical rules, and in fact this technique exists today! The excellent open source inference project llama.cpp supports enforcing a user-supplied grammar during text generation, and I assume this is how OpenAI’s JSON mode is implemented too.
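
Conceptually the trick is to mask out any token the grammar cannot accept before sampling, as in the sketch below. The allowed_tokens set is a hypothetical stand-in for a real incremental grammar state, which is where all of the actual complexity lives:

import numpy as np

def constrained_sample(probs, allowed_tokens, rng=None):
    # Zero out the probability of every token the grammar disallows,
    # renormalize over what remains, then sample as usual.
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    masked = np.zeros_like(probs)
    ids = list(allowed_tokens)
    masked[ids] = probs[ids]
    masked /= masked.sum()
    return int(rng.choice(len(masked), p=masked))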

“So would you do it again?”

Looking at my git commit history, this project took about 3 months of my spare time to complete. I’m a bit surprised because it felt longer during the last month or so. I don’t have any hesitation saying that this time was well spent, considering everything I brushed up on: machine learning, a new programming language, and a return to manual memory management.

If doing this project over again I don’t think I would do anything substantially differently either. All the advice to my past self would be glib stuff like “you should not have stopped writing C all those years ago”.

In the future I would like to return to my roots a bit and do similar implementation projects for diffusion models and vision language models. It would be fun to build some intuition for how those types of models function.

Appendix: Resources

3Blue1Brown Neural Networks Series

The GPT episode (chapter 5) of this series did not exist yet when I was writing llama2.ha, but I wish it had! As usual 3Blue1Brown includes some great animated visualizations to explain concepts. The ones about text embeddings are particularly illuminating. Overall this is a great series to ease into the technical side of LLMs or deep learning in general.

Andrej Karpathy’s Neural Networks: Zero to Hero Series

If you enjoy watching streams of people writing code, this is the series for you. Karpathy walks you through code for many fundamental components and techniques of deep learning, building up to training your own small models. He even recently added a new video on training a GPT-2 style model from scratch. I have not watched that latest video yet but definitely need to!

Attention Is All You Need

When people mention “the transformer paper” this is what they are referring to. I don’t know how helpful it is for beginners, but people getting into machine learning should know of it since it is a landmark paper.

Llama.cpp

This is what a big bad inference engine looks like when it’s all grown up. Both llama2.c and llama2.ha are tinier and simpler educational versions of what llama.cpp is. It’s written in C++ and now supports many different model families. Other projects like Ollama are built on top of it, and playing around with llama.cpp is a good gateway to familiarizing oneself with what it takes to self-host LLMs.