What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section.
The lesson itself.
Offline Engines & GGUF
How Ollama and LM Studio download quantized models and run them entirely on your own machine, with no paid API in the loop.
The original Llama-3-8B weights are stored unquantized in 16-bit floating point. At 2 bytes per parameter, roughly 8 billion parameters come to about 16 GB, which calls for a dedicated GPU with a lot of memory.
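The 16 GB figure follows from simple arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and a rounded count of 8 billion parameters; the helper name `model_size_gb` is made up for illustration.

```python
def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Rounded parameter count for an "8B" model (an assumption for illustration).
PARAMS_8B = 8e9

fp16_gb = model_size_gb(PARAMS_8B, 16)
print(f"16-bit weights: {fp16_gb:.1f} GB")  # prints "16-bit weights: 16.0 GB"
```

This counts the weights only. Memory used at runtime for the context and activations comes on top of it.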
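Programs like Ollama automatically pull quantized GGUF files. In the default 4-bit variants, the 16-bit floating-point weights are rounded to 4-bit integers that share a scale factor per small block of weights, shrinking the model from about 16 GB to roughly 4.7 GB at the cost of a small loss in precision.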
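To make the idea of squeezing weights into 4-bit integers concrete, here is a minimal sketch of symmetric block quantization: a block of floats is stored as small integers plus one shared scale. This is a simplified illustration, not the actual GGUF on-disk layout or the exact scheme `llama.cpp` uses, and the function names and sample values are hypothetical.

```python
from typing import List, Tuple


def quantize_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Choose the scale so the largest magnitude lands on +/-7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quantized


def dequantize_block(scale: float, quantized: List[int]) -> List[float]:
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * q for q in quantized]


# Made-up sample weights for illustration.
block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.28]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("integers:", q)
print("max rounding error:", max_error)
```

Each weight now needs only 4 bits plus its share of the block's scale, which is why a quantized file ends up slightly larger than a pure 4-bits-per-weight estimate would suggest.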
LM Studio and Ollama both build on the C++ `llama.cpp` engine, which runs inference over those 4.7 GB of weights directly on your hardware, using a MacBook's unified memory or a PC's RTX GPU.
Ollama is built for the terminal. It runs as a background service on your operating system and exposes a local HTTP API. Local Python scripts or frameworks like LangChain can send requests to `localhost:11434` much as they would to the OpenAI API, without paying per request.
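As a rough sketch of what pinging `localhost:11434` looks like from a Python script, the example below uses only the standard library and Ollama's `/api/generate` endpoint. It assumes an Ollama server is running locally and that a model tagged `llama3` has already been pulled; the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local address; the model tag used below is an assumption.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a POST request asking for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


# Example call (needs a running Ollama server with the model available):
# print(generate("Explain GGUF in one sentence."))
```

Nothing leaves your machine in this exchange: the request goes to the loopback address and the model answers from local memory.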
LM Studio is built for the desktop. It gives you a polished graphical interface with a built-in search bar for finding GGUF models on Hugging Face and downloading them to your local drive, and it shows how much of your VRAM is in use during generation.
What to read next.
Use the pager below to move sequentially through the curriculum, or jump to any module from the course index. Each track has a "Prereq: ↑ foundation" callout so you can backfill anything that wasn't clear.