AI Learning Course — Macalinao Studio

01

Words become math.

Tokens, embeddings, and the loss function — the raw material every model is made of.

5 lessons
1A, 1B

21 · TOKENS

Tokenization, in detail.

BPE, byte-level BPE, SentencePiece. The cost trade-offs that show up at scale.

10 min · audio · play

05 · EMBEDDINGS

What an embedding actually is.

Token bytes, positional encoding, the semantic word map, and how scale changes what the vectors mean.

15-20 min · audio · play

06 · CBOW + SKIP-GRAM

Pre-transformer embeddings.

The methods worth knowing - the intuition still applies, and they show why context matters.

8-10 min · audio · play

10 · SOFTMAX + CE

Softmax meets cross-entropy.

The loss function for language modeling, slowly. Where the temperature knob actually lives.

10 min · audio · play

11 · NEXT-TOKEN

Why next-token works.

Why predicting the next token produces emergent reasoning, in-context learning, and arithmetic.

10 min · audio

02

Inside the transformer.

Attention from first principles, Q/K/V, and the block that stacks into everything.

6 lessons
1C

07 · ARCHITECTURES

Encoder, decoder, encoder-decoder.

Why GPT looks one way, BERT another, T5 a third.

10 min · audio · play

12 · ATTENTION

Attention from first principles.

What Q, K, V actually are. Why scaled dot-product, why softmax, why heads.

20 min · audio · play

22 · Q / K / V

Q, K, V - the projections.

The matrices that turn embeddings into queries, keys, values.

8 min · audio · play

12a · ADDITIVE ATTN

Additive attention, briefly.

The Bahdanau-style variant that scaled dot-product replaced.

5 min · audio

13 · TRANSFORMER BLOCK

One transformer block, end to end.

Attention, residuals, normalization, FFN.

12-15 min · audio · play

19 · ENCODER-DECODER

Encoder-decoder, revisited.

A return to the encoder-decoder pattern.

8 min · audio

03

How models learn.

Pretraining at scale, RLHF and the alignment toolbox, LoRA fine-tuning.

7 lessons
1D

02 · PRETRAINING

The pretraining pipeline.

Terabytes of raw text become tokens become a loss curve.

12-15 min · audio

15 · TRAINING TRICKS

Mixed precision and gradient tricks.

The optimizations that turn theoretical training into practical training.

10 min · audio

03 · RLHF + REASONING

RLHF and the reasoning turn.

The post-training pass that turns a parrot into an assistant.

10-12 min · audio

09 · TRL + GRPO

TRL and the GRPO algorithm.

What changes when reward signal comes from groups of completions.

10 min · audio

16 · ALIGNMENT

DPO, KTO, ORPO - the post-PPO landscape.

Why preference learning ate the alignment world.

10 min · audio · play

30 · REASONING + RLVR

Models that think first.

Test-time compute, verifiable rewards, and how o1/R1-class reasoning models are trained.

10 min

18 · PEFT + LORA

Parameter-efficient fine-tuning.

Why LoRA is cheap, what QLoRA adds.

10 min · audio · play

04

Models in production.

The decoding loop, KV caches, local engines, RAG, and the chat interfaces on top.

7 lessons
1E, 2

08 · INFERENCE

The decoding loop, up close.

KV caches, batching, speculative decoding, paged attention.

12-15 min · audio · play

17 · THE HIGHWAY

The full pipeline as one highway.

Where every previous module fits.

8 min · audio

28 · LOCAL INFERENCE

Offline inference engines.

Ollama, llama.cpp, vLLM, SGLang.

8 min

25 · HF + GIT LFS

Hugging Face and Git LFS.

Cloning a 40 GB model. Model cards, safetensors.

5 min

20 · RAG + PROMPTING

Retrieval-augmented generation.

Semantic search, hybrid retrieval, GraphRAG.

12 min · audio · play

27 · CHAT UI

Consumer chat interfaces.

Chat UI patterns. Streaming. Tool-call surfacing.

5 min · play

29 · AGENTS + MCP

The agent loop.

Tool use, MCP, and the loop that turned chatbots into coworkers.

12 min

T1

The image track.

ViT, CLIP, diffusion, ControlNet, FLUX — pixels as tokens.

9 lessons
3

04 · VISION TRANSFORMERS

Vision Transformers are tokens too.

ViT - 196 patches per photo, 2D positional embeddings, CNN-vs-transformer comparison.

10-12 min · audio · play

23 · CROSS-ATTENTION

Text and vision, connected.

How text and vision streams talk - CLIP, BLIP, any-to-any native models.

10 min · audio · play

14 · VAE + DIFFUSION

VAEs and the diffusion process.

VAE compression, Stable Diffusion denoising, CLIP text conditioning.

10 min · audio · play

24 · LATENT + CONTROLNET

Latent space and control.

Walking the latent space. ControlNet conditioning.

8 min · audio · play

f-diff-math · DIFFUSION MATH

Diffusion math, slowly.

Forward + reverse process, score matching, why noise schedules matter.

8 min

f-modern · MODERN IMAGE MODELS

FLUX, ERNIE, HiDream.

Modern image models - U-Net to DiT, diffusion to flow matching, CLIP to LLM text encoders.

8 min

f-ipadapter · IP-ADAPTER

IP-Adapter and personalization.

Subject-driven generation, identity preservation across new scenes.

8 min

f-comfyui · COMFYUI AS CODE

ComfyUI workflow as code.

Treating .json workflows as version-controlled artifacts.

8 min · play

f-stylelora · STYLE-LORA

Style-LoRA training.

Locking a visual identity across hundreds of generations.

8 min

T2

The video track.

Temporal diffusion, LTX, persona persistence, audio sync.

5 lessons
4

26v · AI FILM STUDIOS

AI Film Studios and video pioneers.

Sora, Runway Gen-3, Luma, Kling, Higgsfield.

5 min

f-video-diff · VIDEO DIFFUSION

Video diffusion intuition.

Temporal coherence, frame consistency, why static-image tricks don't transfer.

10 min

f-ltx · LTX ARCHITECTURE

LTX architecture and shot composition.

What's inside a modern video diffusion model - shot composition, motion control.

8 min

f-persona · PERSONA PERSISTENCE

Persona persistence across frames.

Keeping a character's face, costume, and lighting consistent across hundreds of frames.

8 min

f-audio-sync · AUDIO SYNC

Audio sync in video.

Lip-sync, beat-cut, the patterns underneath the AI Music Idols pipeline.

8 min

T3

The voice track.

Whisper STT, codec-LM TTS, voice cloning, live attendants.

5 lessons
5

26s · SPEECH SYNTHESIS

Speech synthesis, cloning a voice.

ElevenLabs, OpenAI Whisper.

5 min

f-stt · WHISPER STT

STT pipelines (Whisper).

Whisper architecture, alignment timestamps, language detection.

8 min

f-tts · TTS PIPELINES

TTS pipelines, tier by tier.

edge-tts, ElevenLabs, voice cloning - what each tier costs in latency and quality.

8 min

f-voice-clone · VOICE CLONING

Voice cloning, ethically.

Single-shot vs few-shot cloning, where artifacts come from, ethical guardrails.

8 min

f-attendant · VOICE ATTENDANT

Turn-taking and attendants.

Sub-300ms first-syllable latency, the patterns underneath the AI Attendant project.

10 min · play
