Module 28 · Inference · 8 min

Offline inference
engines.

Ollama, llama.cpp, vLLM, SGLang.

Reading time8 min Audio- Prerequisites08 SourceTrack A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track Click tabs to navigate · hover cards for details
Ecosystem Course File

Offline Engines & GGUF

How Ollama and LM Studio intercept models, compress them mathematically, and bypass API paywalls natively.

The Inference execution pipeline
1. HuggingFace Raw Float16 Weights

The original Llama-3-8B model is completely uncompressed. It mathematically weighs around ~16 Gigabytes, requiring a massive dedicated GPU.

2. Quantization .GGUF Int4 Formatting

Programs like Ollama automatically pull heavily compressed GGUF files. High-precision curves are heavily truncated down to 4-bit integers, shrinking the model violently from 16GB down to roughly 4.7GB.

3. Llama.cpp Engine Local RAM Bridging

LM Studio and Ollama utilize the core C++ `llama.cpp` engine to physically execute those 4.7GB calculations natively on your Macbook's Unified Memory or your PC's RTX GPU.

Ollama

Built natively for the Terminal. Ollama runs silently in the background of your operating system, acting as an invisible API router. It allows local python scripts or frameworks like LangChain to seamlessly ping `localhost:11434` exactly as if they were heavily paying for the OpenAI API.

LM Studio

Built natively for the Desktop GUI. LM Studio gives you an incredibly sleek visual interface. It seamlessly integrates a search bar to instantly query and download GGUF frameworks from Hugging Face into your C: drive, and visually displays exactly how heavily your VRAM is loaded during generation.

§ NEXT

What to read next.

Use the pager below to move sequentially through the curriculum, or jump to any module from the course index. Each track has a "Prereq: ↑ foundation" callout so you can backfill anything that wasn't clear.