
llamafile: Streamlining Access to Large Language Models with Single-File Executables for Local Deployment

llamafile lets you distribute and run LLMs with a single file. (announcement blog post)

Our goal is to make open source large language models much more accessible to both developers and end users. We’re doing that by combining llama.cpp with Cosmopolitan Libc into one framework that collapses all the complexity of LLMs down to a single-file executable (called a “llamafile”) that runs locally on most computers, with no installation.

Quickstart

The easiest way to try it for yourself is to download our example llamafile for the LLaVA model (license: LLaMA 2, OpenAI). LLaVA is a new LLM that can do more than just chat; you can also upload images and ask it questions about them. With llamafile, this all happens locally; no data ever leaves your computer.

  1. Download llava-v1.5-7b-q4-server.llamafile (3.97 GB).
  2. Open your computer’s terminal.
  3. If you’re using macOS, Linux, or BSD, you’ll need to grant permission for your computer to execute this new file. (You only need to do this once.)
chmod +x llava-v1.5-7b-q4-server.llamafile
  4. If you’re on Windows, rename the file by adding “.exe” to the end.
  5. Run the llamafile, e.g.:
./llava-v1.5-7b-q4-server.llamafile
  6. Your browser should open automatically and display a chat interface. (If it doesn’t, just open your browser and point it at http://localhost:8080; see the note after this list if you need a different port.)
  7. When you’re done chatting, return to your terminal and hit Control-C to shut down llamafile.
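If port 8080 is already taken on your machine, you can usually ask the built-in server to listen elsewhere. A minimal sketch, assuming the embedded llama.cpp server’s standard --port option is available in your build (run the llamafile with --help to confirm):

# Serve the chat UI on port 8081 instead of the default 8080 (assumes the standard llama.cpp server flag).
./llava-v1.5-7b-q4-server.llamafile --port 8081

Then point your browser at http://localhost:8081 instead.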

Having trouble? See the “Gotchas” section below.

Other example llamafiles

We also provide example llamafiles for two other models, so you can easily try out llamafile with different kinds of LLMs.

Model                  | License    | Command-line llamafile                                              | Server llamafile
Mistral-7B-Instruct    | Apache 2.0 | mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile (4.07 GB)            | mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile (4.07 GB)
LLaVA 1.5              | LLaMA 2    | (Not provided because this model’s features are best utilized via the web UI) | llava-v1.5-7b-q4-server.llamafile (3.97 GB)
WizardCoder-Python-13B | LLaMA 2    | wizardcoder-python-13b-main.llamafile (7.33 GB)                     | wizardcoder-python-13b-server.llamafile (7.33 GB)

“Server llamafiles” work just like the LLaVA example above: you simply run them from your terminal and then access the chat UI in your web browser at http://localhost:8080.

“Command-line llamafiles” run entirely inside your terminal and operate just like llama.cpp’s “main” function. This means you have to provide some command-line parameters, just like with llama.cpp.

Here is an example for the Mistral command-line llamafile:

./mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile --temp 0.7 -r '\n' -p '### Instruction: Write a story about llamas\n### Response:\n'

And here is an example for WizardCoder-Python command-line llamafile:

./wizardcoder-python-13b-main.llamafile --temp 0 -r '\n' -p '\nvoid *memcpy_sse2(char *dst, const char *src, size_t size) {\n'
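In both commands, --temp sets the sampling temperature (0.7 for more creative story text, 0 for deterministic code completion), -r sets a reverse-prompt (stop) string, here a newline, so generation halts at the end of a line, and -p supplies the prompt itself in the instruction format each model expects. These are the same flags llama.cpp’s “main” program uses.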

As before, macOS, Linux, and BSD users will need to use the “chmod” command to grant execution permissions to the file before running these llamafiles for the first time.

Unfortunately, Windows users cannot make use of these example llamafiles because Windows has a maximum executable file size of 4GB, and all of these examples exceed that size. (The LLaVA llamafile works on Windows because it is 30MB shy of the size limit.) But don’t lose heart: llamafile allows you to use external weights; this is described later in this document.

Having trouble? See the “Gotchas” section below.

How llamafile works

A llamafile is an executable LLM that you can run on your own computer. It contains the weights for a given open source LLM, as well as everything needed to actually run that model on your computer. There’s nothing to install or configure (with a few caveats, discussed in subsequent sections of this document).

This is all accomplished by combining llama.cpp with Cosmopolitan Libc, which provides some useful capabilities:

  1. llamafiles can run on multiple CPU microarchitectures. We added runtime dispatching to llama.cpp that lets new Intel systems use modern CPU features without trading away support for older computers.
  2. llamafiles can run on multiple CPU architectures. We do that by concatenating AMD64 and ARM64 builds with a shell script that launches the appropriate one. Our file format is compatible with WIN32 and most UNIX shells. It’s also able to be easily converted (by either you or your users) to the platform-native format, whenever required.
  3. llamafiles can run on six OSes (macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD). If you make your own llamafiles, you’ll only need to build your code once, using a Linux-style toolchain. The GCC-based compiler we provide is itself an Actually Portable Executable, so you can build your software for all six OSes from the comfort of whichever one you prefer most for development.
  4. The weights for an LLM can be embedded within the llamafile. We added support for PKZIP to the GGML library. This lets uncompressed weights be mapped directly into memory, similar to a self-extracting archive. It enables quantized weights distributed online to be prefixed with a compatible version of the llama.cpp software, thereby ensuring its originally observed behaviors can be reproduced indefinitely.
  5. Finally, with the tools included in this project you can create your own llamafiles, using any compatible model weights you want; a brief sketch of how this works follows this list. You can then distribute these llamafiles to other people, who can easily make use of them regardless of what kind of computer they have.
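To make point 4 concrete: a llamafile is itself a valid ZIP archive with the runtime prepended, so standard tools can peek inside it, and building your own amounts to appending the weights as an uncompressed entry. The sketch below uses plain Info-ZIP commands and placeholder file names (llamafile-server for a weight-free release binary, mymodel.gguf for your weights); the project also ships a zipalign utility that does the same job while page-aligning the stored weights, which is the tool to prefer in practice.

# A llamafile is also a ZIP archive; list the weights embedded in the LLaVA example.
unzip -l llava-v1.5-7b-q4-server.llamafile

# Rough idea of building your own (placeholder names; prefer the project's zipalign tool).
cp llamafile-server mymodel.llamafile   # start from a weight-free runtime binary
cat > .args <<'EOF'
-m
mymodel.gguf
EOF
zip -0 -j mymodel.llamafile mymodel.gguf .args   # -0 = store uncompressed, -j = drop paths
chmod +x mymodel.llamafile
./mymodel.llamafile   # default arguments come from the embedded .args entry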

Using llamafile with external weights

Even though our example llamafiles have the weights built-in, you don’t have to use llamafile that way. Instead, you can download just the llamafile software (without any weights included) from our releases page. You can then use it alongside any external weights you may have on hand. External weights are particularly useful for Windows users because they enable you to work around Windows’ 4GB executable file size limit.

For Windows users, here’s an example for the Mistral LLM:

curl -L -o llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2.1/llamafile-server-0.2.1
curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
.\llamafile.exe -m mistral.gguf

Here’s the same example, but for macOS, Linux, and BSD users:

curl -L https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2.1/llamafile-server-0.2.1 >llamafile
curl -L https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf >mistral.gguf
chmod +x llamafile
./llamafile -m mistral.gguf
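The external-weights invocation accepts the usual llama.cpp runtime flags as well. As a sketch, assuming the -ngl (GPU layer count) option is available in the release you downloaded and your machine has working GPU support:

# Offload 35 layers to the GPU, if available; otherwise the model runs on the CPU.
./llamafile -m mistral.gguf -ngl 35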

Gotchas

On macOS with Apple Silicon you need to have Xcode installed for llamafile to be able to bootstrap itself.

If you use zsh and have trouble running llamafile, try saying sh -c ./llamafile. This is due to a bug that was fixed in zsh 5.9+. The same applies to Python’s subprocess module, older versions of Fish, and so on.

On some Linux systems, you might get errors relating to run-detectors or WINE. This is due to binfmt_misc registrations. You can fix that by adding an additional registration for the APE file format llamafile uses:

sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"

As mentioned above, on Windows you may need to rename your llamafile by adding .exe to the filename.

Also as mentioned above, Windows has a maximum file size of 4GB for executables. The LLaVA server executable above is just 30MB shy of that limit, so it’ll work on Windows, but with larger models like WizardCoder 13B, you need to store the weights in a separate file. An example is provided above; see “Using llamafile with external weights.”

On WSL, it’s recommended that the WIN32 interop feature be disabled:

sudo sh -c "echo -1 > /proc/sys/fs/binfmt_misc/WSLInterop"

On any platform, if your llamafile process is immediately killed, check if you have CrowdStrike and then ask to be whitelisted.
