§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section.
Interactive lesson · ported from Gemini track
TWO PROBLEMS, ONE TIMELINE
Pre-existing audio drives the video vs vice versa
Two distinct workflows. (A) Audio-driven: you already have music/dialogue and need video synchronized to it. Lock the audio's beat-grid or speech-phoneme-grid, generate visuals on that timeline. (B) Video-driven: you have visuals first, need a soundtrack/dialogue laid over them. Generate or compose audio against the cut. The AI Music Idols pipeline uses (A) almost exclusively — audio is the immovable spine.
LIP-SYNC
Phoneme alignment to mouth movements
Lip-sync requires two things: (1) speech-to-phoneme alignment with timestamps, and (2) a mouth-region generator (Wav2Lip and successors) that takes a face image or video plus audio and produces synchronized mouth animation. Modern systems (SadTalker, MuseTalk) extend this to head pose and expression. For full character animation, the audio drives a 3D rig or a 2D facial landmark sequence.
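To make step (1) concrete, here is a minimal sketch of how timestamped phonemes might be turned into one mouth shape per video frame. It assumes a forced aligner has already produced the phoneme intervals; the `Phoneme` type, the tiny `VISEMES` table, and `visemes_per_frame` are hypothetical names for illustration and are not part of Wav2Lip, SadTalker, or MuseTalk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phoneme:
    label: str    # ARPAbet-style symbol, e.g. "M" or "AA"
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


# Hypothetical, deliberately tiny phoneme-to-viseme table (illustration only).
VISEMES = {
    "M": "closed", "B": "closed", "P": "closed",
    "AA": "open", "AE": "open",
    "UW": "round", "OW": "round",
}


def visemes_per_frame(phonemes, fps, duration):
    """Return one viseme label per video frame, sampled at each frame's midpoint."""
    n_frames = round(duration * fps)
    frames = []
    for i in range(n_frames):
        t = (i + 0.5) / fps  # midpoint of frame i, in seconds
        label = "rest"       # no phoneme active: mouth at rest
        for p in phonemes:
            if p.start <= t < p.end:
                label = VISEMES.get(p.label, "neutral")
                break
        frames.append(label)
    return frames


# Assumed aligner output for a short "ma" sound, followed by silence.
alignment = [Phoneme("M", 0.00, 0.10), Phoneme("AA", 0.10, 0.30)]
frames = visemes_per_frame(alignment, fps=10, duration=0.5)
print(frames)
```

A learned generator replaces the lookup table in practice, but the per-frame sampling against audio timestamps is the part that keeps mouth and sound on the same timeline.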
BEAT-CUT
Music informs the editing pace
Detect beats and energy peaks in the audio (librosa's beat_track or Essentia). Use the beat positions as candidate cut points. At 120 BPM a beat lands every 0.5 s (60 / 120), so that is a potential cut every 0.5 s. Don't cut on every beat; cutting every 4-8 beats (every 2-4 s at 120 BPM) is more usual. Energy peaks (drops, builds) become longer holds. The Macalinao Studio idol pipeline uses this to drive 90% of the cut decisions automatically.
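The beat arithmetic can be sketched as follows. This is an illustrative outline, not the studio's actual pipeline: `beat_period`, `cuts_from_beats`, and `detect_beats` are hypothetical helper names, and the energy-peak "longer hold" logic is left out. `detect_beats` assumes librosa is installed and imports it lazily so the rest of the sketch runs without it.

```python
def beat_period(bpm):
    """Seconds between consecutive beats at a given tempo."""
    return 60.0 / bpm


def cuts_from_beats(beat_times, every=4):
    """Keep every `every`-th beat as a candidate cut point (first beat included)."""
    if every < 1:
        raise ValueError("every must be >= 1")
    return list(beat_times[::every])


def detect_beats(path):
    """Beat times in seconds from an audio file, using librosa's beat tracker."""
    import librosa  # lazy import: only needed when analysing a real file

    y, sr = librosa.load(path)
    _tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.frames_to_time(beat_frames, sr=sr).tolist()


# Synthetic 120 BPM grid standing in for detect_beats("song.wav").
grid = [i * beat_period(120) for i in range(17)]
cuts = cuts_from_beats(grid, every=4)
print(cuts)
```

Real beat trackers return slightly uneven spacing, so slicing the detected beat list (rather than multiplying a fixed period) keeps cuts on the beats that were actually found.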
SYNC AS DELIVERABLE
Treat the locked audio + video edit as the primary artifact
Ship the timeline (a JSON of "audio at 0:00, cut at 0:08.3, ...") alongside the rendered MP4. This makes regeneration easy: if a single shot needs reshooting, you don't lose the audio sync. It also makes the workflow inspectable — you can see why a cut landed where it did.
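One possible shape for such a timeline file is sketched below. The source does not specify a schema, so the field names (`version`, `events`, `type`, `t`, `file`) and the helpers `build_timeline` and `shot_containing` are assumptions made for illustration.

```python
import json


def build_timeline(audio_file, cut_times):
    """Build a timeline dict: the audio anchored at 0.0 s, then one event per cut."""
    events = [{"type": "audio", "t": 0.0, "file": audio_file}]
    events += [{"type": "cut", "t": t} for t in sorted(cut_times)]
    return {"version": 1, "events": events}


def shot_containing(timeline, t):
    """Return (start, end) of the shot that contains time t; end is None for the last shot."""
    cuts = [e["t"] for e in timeline["events"] if e["type"] == "cut"]
    start, end = 0.0, None
    for c in cuts:
        if c <= t:
            start = c
        else:
            end = c
            break
    return start, end


# "song.wav" and the cut times are placeholder values.
timeline = build_timeline("song.wav", [8.3, 12.5])
restored = json.loads(json.dumps(timeline))  # what a collaborator would load from disk
print(shot_containing(restored, 9.0))
```

Looking up the shot boundaries from the file is what lets a single shot be regenerated to an exact duration without touching the audio or the neighbouring cuts.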