“Today I Learned” are instruction-focused posts about things I’ve learned.
Llamafiles are these cool little files that have llama.cpp and model weights embedded, and can run on Mac/Linux/Windows. Cool, let’s make some!
In a previous post on quantizing Llama LLMs, I described how to download some Llama models and quantize them to a format that can run on your machine with llama.cpp. That was pretty easy, but it can get even easier!
I took the release of the tiny 1B and 3B Llama 3.2 models as an excuse to try creating my own llamafiles. Llamafiles are basically runnable zip files with llama.cpp, the model in GGUF format, and some cross-platform trickery code embedded, so they can run on the big three computing platforms with just a single file.
And they’re quite easy to build. Basically, you need to:
1. Download the llamafile zip from GitHub,
2. create an .args file, and
3. use a special zipalign tool to create your llamafile.
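If you want to do those steps by hand, a minimal sketch looks roughly like this, assuming the release zip from the Mozilla-Ocho/llamafile repo on GitHub has been unzipped into a llamafile/ directory, and using placeholder file names:

# Start from the llamafile binary itself; the model gets zipped into it.
cp llamafile/bin/llamafile my-model.llamafile
# One llama.cpp argument per line; more on .args below.
printf -- '-m\nmy-model.gguf\n-c\n0\n...\n' >.args
# Embed the model weights and the .args file into the llamafile.
./llamafile/bin/zipalign -j0 my-model.llamafile my-model.gguf .args
chmod a+x my-model.llamafile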
Because I like to make computers do what I want, I’ve scripted this. You can find the result in the maragudk/llamafile repo. Have a look at the Makefile. I even uploaded some GGUF models for you to play with; see the model list at the top of the Makefile.
The most important part is this:
.PHONY: build
build: llamafile/bin/llamafile clean
	mkdir -p build
	# Start from the llamafile binary itself; the model and args are zipped into it below.
	cp llamafile/bin/llamafile build/$(model).llamafile
	echo "-m\n$(model).gguf\n-c\n0\n..." >build/.args
	# Embed the GGUF model, the .args file, and the licenses into the llamafile.
	./llamafile/bin/zipalign -j0 build/$(model).llamafile models/$(model).gguf build/.args LICENSE-Llama-3.1 LICENSE-Llama-3.2
	chmod a+x build/$(model).llamafile
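With the repo cloned, building and running one of these then looks roughly like this. The model name here is a guess based on the Docker image tags further down; check the model list at the top of the Makefile for the real ones:

make build model=Llama-3.2-1B-Instruct-Q5_K_M
# Run it; the trailing ... in .args means extra llama.cpp flags can be added here too.
./build/Llama-3.2-1B-Instruct-Q5_K_M.llamafile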
Note that you have to embed the llamafile binary from the downloaded zip. I missed that step at first and it didn’t work; I thought zipalign would do it for me.
The .args
are passed directly to llama.cpp
. Here, it’s telling it where the model is, loading the context size from the model itself, and making sure we can pass additional parameters to the llamafile if we need to.
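Concretely, the generated build/.args ends up with one argument per line, something like this (the model name is a placeholder):

-m
Llama-3.2-1B-Instruct-Q5_K_M.gguf
-c
0
...

The ... on the last line marks where any extra command-line arguments get inserted at runtime.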
For fun, and because these models are actually small enough to run in CI, I’ve uploaded some of them to the Docker Hub.
The Dockerfile currently looks like this:
FROM debian:stable-slim AS runner
WORKDIR /bin
COPY LICENSE* ./
COPY build/*.llamafile ./model
EXPOSE 8080
# Run through sh; the cross-platform (APE) llamafile binary is also a valid shell script.
ENTRYPOINT ["/bin/sh", "./model"]
CMD ["--host", "0.0.0.0"]
Then it’s a one-liner to build the image:
.PHONY: build-docker
build-docker: build
	docker build --platform linux/amd64,linux/arm64 -t maragudk/`echo $(model) | tr A-Z a-z`:latest .
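Running and talking to one of the images locally could then look roughly like this. The llamafile serves llama.cpp’s HTTP API, including an OpenAI-compatible chat completions endpoint; the prompt is just an example:

docker run --rm -p 8080:8080 maragudk/llama-3.2-1b-instruct-q5_k_m
# In another terminal:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a haiku about llamas."}]}'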
You could also add them easily to your docker compose file:
services:
  llama32-1b:
    image: maragudk/llama-3.2-1b-instruct-q5_k_m
    ports:
      - "8090:8080"
  llama32-3b:
    image: maragudk/llama-3.2-3b-instruct-q5_k_m
    ports:
      - "8091:8080"
Or your GitHub workflow:
jobs:
  test:
    name: Test
    runs-on: ubuntu-latest

    services:
      llama32-1b:
        image: "maragudk/llama-3.2-1b-instruct-q4_k_m"
        ports:
          - "8090:8080"

    steps:
      - name: Checkout
        uses: actions/checkout@v4
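A later run step in the job can then hit the model server on localhost:8090. This is just a sketch, assuming the llama.cpp server’s /health endpoint is available for waiting on startup:

# Wait for the model server to come up, then ask it for something.
curl --retry 10 --retry-connrefused --retry-delay 2 --silent http://localhost:8090/health
curl http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a short poem about CI."}]}'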
And voila, you can auto-generate poems in CI! 😁
I’m Markus, an independent software consultant and developer. 🤓✨ Reach me at markus@maragu.dk.