Reproducible SadTalker Pipeline in Google Colab for Single-Image, Single-Audio Talking-Head Generation

Animate still photos into talking avatars in minutes. This guide gives developers a clean, zero-setup SadTalker pipeline in Colab.

Ryan Banze

Dec. 03, 25 · Tutorial

If you’ve ever wanted to bring a still photo to life using nothing more than an audio clip, SadTalker makes it surprisingly easy once it's set up correctly. Running it locally can be tricky because of GPU drivers, missing dependencies, and environment mismatches, so this guide walks you through a clean, reliable setup in Google Colab instead.

The goal is simple: a fully reproducible, copy-and-paste workflow that lets you upload a single image and a single audio file, then generate a talking-head video without spending hours troubleshooting your system.

Step 1: Create a Clean Environment

Setting up SadTalker in Google Colab becomes much easier when its dependencies are isolated inside a dedicated virtual environment. Instead of wrestling with conflicting libraries or GPU driver issues, we’ll start clean by installing virtualenv and creating a new environment called sadtalk_env. This keeps all SadTalker-related packages neatly contained and prevents them from interfering with Colab’s base environment.

    Shell
   
!pip install virtualenv
!virtualenv sadtalk_env --clear
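
Before moving on, it can help to confirm that the environment was actually created. The following is a minimal sketch, assuming the standard Linux virtualenv layout (an activation script and an interpreter under `bin/`); the helper name `venv_ready` is hypothetical and not part of virtualenv or SadTalker.

```python
from pathlib import Path


def venv_ready(env_dir: str) -> bool:
    """Return True if env_dir looks like a usable virtualenv on Linux."""
    env = Path(env_dir)
    # Assumption: virtualenv on Linux places both files under bin/.
    has_activate = (env / "bin" / "activate").is_file()
    has_python = (env / "bin" / "python").exists()
    return has_activate and has_python


# Run from the directory where sadtalk_env was created.
print("sadtalk_env ready:", venv_ready("sadtalk_env"))
```

If this prints `False`, re-running the two commands above is usually quicker than debugging a half-created environment.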

Step 2: Activate the Environment and Install Dependencies

Once the environment is activated, we can install all of SadTalker's required dependencies in a single step. The command below uses pinned versions for PyTorch and NumPy to avoid compatibility issues, then pulls in the rest of the core libraries, ranging from face enhancement (facexlib, gfpgan) to audio and video handling (moviepy, opencv, pydub, librosa). Installing everything at once gives SadTalker a stable, compatible setup right from the start. Because each Colab shell cell runs in a fresh shell, the activation line is repeated at the top of every %%bash cell in this guide.

    Shell
   
%%bash
source sadtalk_env/bin/activate

pip install numpy==1.23.5 torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 \
    facexlib==0.3.0 gfpgan insightface onnxruntime moviepy \
    opencv-python-headless imageio[ffmpeg] yacs kornia gtts \
    safetensors pydub librosa
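
To check that the pinned versions actually landed in the environment, one option is to compare `pip freeze` output against the pins. This is a rough sketch, not part of the original notebook: the function names are hypothetical, and the default `pip_path` assumes the environment from Step 1 sits in the current working directory.

```python
import subprocess

# Pins copied from the install command above.
PINS = {
    "numpy": "1.23.5",
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "torchaudio": "2.0.2",
    "facexlib": "0.3.0",
}


def find_pin_mismatches(freeze_text: str, pins: dict) -> dict:
    """Map each mismatched pin to the installed version (None if absent)."""
    installed = {}
    for line in freeze_text.splitlines():
        if "==" in line:
            name, version = line.strip().split("==", 1)
            # Drop local build tags such as "+cu118" before comparing.
            installed[name.lower()] = version.split("+", 1)[0]
    return {
        name: installed.get(name)
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


def check_env(pip_path: str = "sadtalk_env/bin/pip") -> dict:
    """Run pip freeze inside the virtualenv and report mismatched pins."""
    out = subprocess.run(
        [pip_path, "freeze"], capture_output=True, text=True, check=True
    )
    return find_pin_mismatches(out.stdout, PINS)
```

Calling `check_env()` in a Python cell returns an empty dict when every pin matches; any entry it does return names a package worth reinstalling.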

Step 3: Activate the Environment, Clone SadTalker, Download Models, and Prepare Test Assets

With the environment and dependencies in place, the next step is to bring in SadTalker itself. The snippet below clones the official repository, downloads the pretrained model weights, and sets up both a source image and a sample audio file. The additional wget commands manually fetch checkpoints that the official download script doesn’t retrieve by default. Finally, a short gTTS snippet generates a simple demo voice clip so you can test the entire pipeline end-to-end without needing to upload your own audio.

    Shell
   
 

%%bash
source sadtalk_env/bin/activate

# Clone repo
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker
# Download models (official script)
bash scripts/download_models.sh

# Additional checkpoints fetched manually from the v0.0.2 release
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/epoch_20.pth -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/auido2pose_00140-model.pth -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/auido2exp_00300-model.pth -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/facevid2vid_00189-model.pth.tar -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/mapping_00229-model.pth.tar -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/mapping_00109-model.pth.tar -P ./checkpoints

# Prepare source image
mkdir -p examples/source_image
wget https://thispersondoesnotexist.com/ -O examples/source_image/art_0.jpg

# Prepare driven audio
# Note: gTTS writes MP3-encoded data, so this file is MP3 content under a .wav name.
python -c "
from gtts import gTTS
text = 'Hello, I am your virtual presenter. Let us explore the world of AI together.'
gTTS(text, lang='en').save('english_sample.wav')
"

  

Step 4: Verify Image, Audio, and Checkpoints

Before running the model, it’s wise to confirm that all required assets are in place. The commands below list the contents of the checkpoints directory (where the weights are stored) and verify both the downloaded sample image and the generated audio file. This quick check ensures everything is set before you launch your first animation.

    Shell
   
!ls -lh SadTalker/checkpoints
!ls -lh SadTalker/examples/source_image/art_0.jpg
!ls -lh SadTalker/english_sample.wav
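
The same check can be done programmatically, which is handy when the listing is long. The sketch below is an illustration rather than part of the original notebook: the helper name `missing_assets` is hypothetical, the file names are taken from the commands in Step 3, and files placed by the official download script are not covered.

```python
from pathlib import Path

# File names taken from the wget and gTTS commands in Step 3.
EXPECTED = [
    "checkpoints/epoch_20.pth",
    "checkpoints/auido2pose_00140-model.pth",
    "checkpoints/auido2exp_00300-model.pth",
    "checkpoints/facevid2vid_00189-model.pth.tar",
    "checkpoints/mapping_00229-model.pth.tar",
    "checkpoints/mapping_00109-model.pth.tar",
    "examples/source_image/art_0.jpg",
    "english_sample.wav",
]


def missing_assets(root: str, expected=EXPECTED) -> list:
    """Return the expected paths that are absent or zero bytes under root."""
    base = Path(root)
    problems = []
    for rel in expected:
        path = base / rel
        # A failed "wget -O" can leave an empty file behind, so check size too.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(rel)
    return problems


# Run from the directory that contains the SadTalker clone.
print("Missing or empty:", missing_assets("SadTalker"))
```

An empty list means every file from Step 3 is present and non-empty.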

Step 5: Run SadTalker Inference

With the environment set up and all assets verified, you're ready to generate your first talking-head video. The command below runs SadTalker’s inference.py, using the sample audio and source image as inputs. The output is saved to the results folder, with face enhancement handled by Generative Facial Prior Generative Adversarial Network (GFPGAN) and the --still flag keeping the head stable during speech.

    Shell
   
 

%%bash
source sadtalk_env/bin/activate
cd SadTalker

python inference.py \
  --driven_audio english_sample.wav \
  --source_image examples/source_image/art_0.jpg \
  --result_dir results \
  --enhancer gfpgan \
  --still
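
If you would rather drive inference from Python (for example, to change inputs without editing a shell cell), the same command can be assembled as an argument list. This is a sketch under stated assumptions: the function names are hypothetical, only the flags shown above are used, and the default paths assume the environment and the clone both live under /content, as the results path in Step 6 suggests.

```python
import subprocess


def build_inference_cmd(audio, image, result_dir="results", enhancer="gfpgan",
                        still=True,
                        python_bin="/content/sadtalk_env/bin/python"):
    """Assemble the inference.py argument list using only the flags shown above."""
    cmd = [
        python_bin, "inference.py",
        "--driven_audio", audio,
        "--source_image", image,
        "--result_dir", result_dir,
    ]
    if enhancer:
        cmd += ["--enhancer", enhancer]
    if still:
        cmd.append("--still")
    return cmd


def run_inference(cmd, repo_dir="/content/SadTalker"):
    """Run the command from the repository root and raise if it fails."""
    return subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling the virtualenv's interpreter directly avoids the need to source the activation script, and `check=True` turns a failed run into a Python exception instead of a silently missing video.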

  

Step 6: Locate the Output Video

After the inference step finishes, you’ll want to quickly find the generated video file. The snippet below scans the results folder for all .mp4 outputs, sorts them by modification time, and prints the most recent one. This ensures you always grab the latest animation without digging through the directory manually.

    Python
   
import glob
import os

results_dir = '/content/SadTalker/results'
# Use glob to find all .mp4 files in the directory
mp4_files = glob.glob(os.path.join(results_dir, '*.mp4'))

# Sort files by modification time (latest first)
mp4_files.sort(key=os.path.getmtime, reverse=True)

latest_mp4_file = None
if mp4_files:
  latest_mp4_file = mp4_files[0]
  print(f"Latest MP4 file found: {latest_mp4_file}")
else:
  print(f"No MP4 files found in {results_dir}")

Step 7: Display the Final Video in Notebook

Once you've identified the most recent output file, you can preview it directly in Colab. The snippet below uses IPython’s built-in Video display class to embed and play the generated .mp4 inline, allowing you to watch your talking avatar immediately without leaving the notebook. If Step 6 found no file, latest_mp4_file is None and this cell will fail, so rerun Step 5 first.

    Python
   
from IPython.display import Video

Video(latest_mp4_file, embed=True)

At this point, you've successfully built a complete SadTalker workflow in Google Colab. The provided notebook — also available on GitHub as colab-talking-avatar — takes you from zero to a fully generated talking-head video with minimal friction.

Now you're free to experiment: swap in your own voice clips, try different images, batch-generate multiple avatars, or integrate SadTalker into a larger content pipeline. The hardest parts — dependencies, environment setup, and model weights — are already taken care of.
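
As a starting point for batch generation, the sketch below pairs every image in one folder with every audio clip in another and assigns each pair its own output folder. It is an illustration only: the function name `plan_batch`, the accepted file extensions, and the folder naming scheme are all assumptions, and the dictionary keys simply mirror the inference flags used in Step 5.

```python
from itertools import product
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed set of image types
AUDIO_EXTS = {".wav", ".mp3"}           # assumed set of audio types


def plan_batch(image_dir, audio_dir, result_root="results"):
    """Pair every image with every audio clip, one output folder per pair."""
    images = sorted(
        p for p in Path(image_dir).iterdir() if p.suffix.lower() in IMAGE_EXTS
    )
    audios = sorted(
        p for p in Path(audio_dir).iterdir() if p.suffix.lower() in AUDIO_EXTS
    )
    jobs = []
    for image, audio in product(images, audios):
        jobs.append({
            "source_image": str(image),
            "driven_audio": str(audio),
            # Hypothetical naming scheme: <image stem>__<audio stem>
            "result_dir": str(Path(result_root) / f"{image.stem}__{audio.stem}"),
        })
    return jobs
```

Each job maps onto one inference run. Because every pair writes to its own folder, the lookup in Step 6 would need to point at that folder rather than at the top-level results directory.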

https://github.com/ryanboscobanze/colab-talking-avatar
