Reproducible SadTalker Pipeline in Google Colab for Single-Image, Single-Audio Talking-Head Generation
Animate still photos into talking avatars in minutes. This guide gives developers a clean, zero-setup SadTalker pipeline in Colab.
If you've ever wanted to bring a still photo to life using nothing more than an audio clip, SadTalker makes it surprisingly easy once it's set up correctly. Running it locally can be tricky because of GPU drivers, missing dependencies, and environment mismatches, so this guide walks you through a clean, reliable setup in Google Colab instead.
The goal is simple: a fully reproducible, copy-and-paste workflow that lets you upload a single image and a single audio file, then generate a talking-head video without spending hours troubleshooting your system.
Step 1: Create a Clean Environment
Setting up SadTalker in Google Colab becomes much easier when its dependencies are isolated inside a dedicated virtual environment. Instead of wrestling with conflicting libraries or GPU driver issues, we’ll start clean by installing virtualenv and creating a new environment called sadtalk_env. This keeps all SadTalker-related packages neatly contained and prevents them from interfering with Colab’s base environment.
!pip install virtualenv
!virtualenv sadtalk_env --clear
Step 2: Activate the Environment and Install Dependencies
With the environment created, we can install all of SadTalker's required dependencies in a single step. Because each %%bash cell starts a fresh shell, the environment is activated at the top of every cell that needs it. The command below pins the versions of NumPy, PyTorch (with torchvision and torchaudio), and facexlib to avoid compatibility issues, then pulls in the rest of the core libraries, from face enhancement (facexlib, gfpgan) to audio and video handling (moviepy, opencv, pydub, librosa). Installing everything at once gives SadTalker a stable, compatible setup right from the start.
%%bash
source sadtalk_env/bin/activate
pip install numpy==1.23.5 torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 \
facexlib==0.3.0 gfpgan insightface onnxruntime moviepy \
opencv-python-headless imageio[ffmpeg] yacs kornia gtts \
safetensors pydub librosa
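To confirm that the pins actually took effect, a small check like the one below may help. It is a minimal sketch rather than part of SadTalker: the helper name find_mismatches is hypothetical, and the pinned versions are simply copied from the install command above. Note that it has to be run with the virtual environment's interpreter (for example, saved to a file and executed from a %%bash cell after the source line), because the notebook kernel itself uses Colab's base environment.

```python
from importlib import metadata

# Pins copied from the install command above.
PINNED = {
    "numpy": "1.23.5",
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "torchaudio": "2.0.2",
    "facexlib": "0.3.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this interpreter
        # Local build tags such as "+cu118" are ignored in the comparison.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


problems = find_mismatches(PINNED)
for name, (expected, found) in problems.items():
    print(f"{name}: expected {expected}, found {found}")
if not problems:
    print("All pinned packages match.")
```

An empty result means every pinned package is installed at the expected version in the interpreter that ran the check.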
Step 3: Activate the Environment, Clone SadTalker, Download Models, and Prepare Test Assets
With the environment and dependencies in place, the next step is to bring in SadTalker itself. The snippet below clones the official repository, downloads the pretrained model weights, and sets up both a source image and a sample audio file. The additional wget commands fetch the manual checkpoints that the base script doesn’t retrieve by default. Finally, a quick gTTS one-liner generates a simple demo voice clip so you can test the entire pipeline end-to-end without needing to upload your own audio.
%%bash
source sadtalk_env/bin/activate
# Clone repo
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker
# Download models (official script)
bash scripts/download_models.sh
# Additional manual weights not fetched by the official script
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/epoch_20.pth -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/auido2pose_00140-model.pth -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/auido2exp_00300-model.pth -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/facevid2vid_00189-model.pth.tar -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/mapping_00229-model.pth.tar -P ./checkpoints
wget https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2/mapping_00109-model.pth.tar -P ./checkpoints
# Prepare source image
mkdir -p examples/source_image
wget https://thispersondoesnotexist.com/ -O examples/source_image/art_0.jpg
# Prepare driven audio
python -c "
from gtts import gTTS
text = 'Hello, I am your virtual presenter. Let us explore the world of AI together.'
# Note: gTTS writes MP3-encoded data, whatever extension the file name carries
gTTS(text, lang='en').save('english_sample.wav')
"
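Since gTTS produces MP3-encoded audio regardless of the file name it is given, it can be useful to know what is really inside english_sample.wav if an audio loader ever complains. The sketch below guesses the format from the first few bytes of a file. The function name sniff_audio_format is hypothetical, and the check is a rough heuristic, not a full parser.

```python
def sniff_audio_format(path):
    """Guess the audio format from the file header: 'wav', 'mp3', or 'unknown'."""
    with open(path, "rb") as fh:
        head = fh.read(12)
    # A WAV file is a RIFF container whose form type is "WAVE".
    if len(head) >= 12 and head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "wav"
    # Many MP3 files start with an ID3v2 tag.
    if head[:3] == b"ID3":
        return "mp3"
    # Otherwise look for an MPEG audio frame sync (11 set bits); most likely MP3.
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"
```

In the notebook, calling sniff_audio_format('SadTalker/english_sample.wav') after Step 3 would show which of the two formats the sample clip uses.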
Step 4: Verify Image, Audio, and Checkpoints
Before running the model, it’s wise to confirm that all required assets are in place. The commands below list the contents of the checkpoints directory (where the weights are stored) and verify both the downloaded sample image and the generated audio file. This quick check ensures everything is set before you launch your first animation.
!ls -lh SadTalker/checkpoints
!ls -lh SadTalker/examples/source_image/art_0.jpg
!ls -lh SadTalker/english_sample.wav
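Scanning a directory listing by eye is easy to get wrong, so the same check can be sketched programmatically. The example below assumes the notebook's working directory is the folder that contains SadTalker/, and it only covers the six files fetched manually in Step 3; files downloaded by the official script are not listed here. The helper name missing_or_empty is hypothetical.

```python
from pathlib import Path

# File names taken from the manual download commands in Step 3.
EXPECTED_CHECKPOINTS = [
    "epoch_20.pth",
    "auido2pose_00140-model.pth",
    "auido2exp_00300-model.pth",
    "facevid2vid_00189-model.pth.tar",
    "mapping_00229-model.pth.tar",
    "mapping_00109-model.pth.tar",
]


def missing_or_empty(directory, names):
    """Return the names that are absent from `directory` or have zero size."""
    root = Path(directory)
    problems = []
    for name in names:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems


# Assumption: the current directory is the one that contains SadTalker/.
problems = missing_or_empty("SadTalker/checkpoints", EXPECTED_CHECKPOINTS)
print("Missing or empty:", problems or "none")
```

A zero-size check is included because an interrupted or failed download can leave an empty file behind that a plain existence test would accept.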
Step 5: Run SadTalker Inference
With the environment set up and all assets verified, you're ready to generate your first talking-head video. The command below runs SadTalker's inference.py with the sample audio and source image as inputs. The output is saved to the results folder, the --enhancer gfpgan option applies GFPGAN (Generative Facial Prior GAN) face enhancement, and the --still flag keeps the head stable during speech.
%%bash
source sadtalk_env/bin/activate
cd SadTalker
python inference.py \
--driven_audio english_sample.wav \
--source_image examples/source_image/art_0.jpg \
--result_dir results \
--enhancer gfpgan \
--still
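If you later want to drive the same command from Python, for instance to loop over several image and audio pairs, it can be assembled as an argument list. This is a sketch under stated assumptions: the helper names build_inference_cmd and run_inference are hypothetical, only the flags shown in the command above are used, and the default interpreter path assumes the virtual environment was created in Colab's /content directory.

```python
import subprocess


def build_inference_cmd(audio, image, result_dir="results", enhancer="gfpgan",
                        still=True,
                        python_bin="/content/sadtalk_env/bin/python"):
    """Build the argument list for inference.py using the flags from Step 5."""
    cmd = [
        python_bin, "inference.py",
        "--driven_audio", audio,
        "--source_image", image,
        "--result_dir", result_dir,
    ]
    if enhancer:
        cmd += ["--enhancer", enhancer]
    if still:
        cmd.append("--still")
    return cmd


def run_inference(cmd, repo_dir="SadTalker"):
    """Run the command from inside the repository and raise if it fails."""
    return subprocess.run(cmd, cwd=repo_dir, check=True)


cmd = build_inference_cmd("english_sample.wav", "examples/source_image/art_0.jpg")
print(" ".join(cmd))
# run_inference(cmd)  # uncomment in Colab once Steps 1 to 4 have completed
```

Calling the environment's interpreter directly takes the place of the source line in the %%bash cell, and check=True surfaces a failed run as an exception instead of letting later cells pick up a stale video.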
Step 6: Locate the Output Video
After the inference step finishes, you'll want to quickly find the generated video file. The snippet below scans the results folder for all .mp4 outputs, sorts them by modification time (newest first), and prints the most recent one. This way you always grab the latest animation without digging through the directory manually.
import glob
import os
results_dir = '/content/SadTalker/results'
# Use glob to find all .mp4 files in the directory
mp4_files = glob.glob(os.path.join(results_dir, '*.mp4'))
# Sort files by modification time (latest first)
mp4_files.sort(key=os.path.getmtime, reverse=True)
latest_mp4_file = None
if mp4_files:
    latest_mp4_file = mp4_files[0]
    print(f"Latest MP4 file found: {latest_mp4_file}")
else:
    print(f"No MP4 files found in {results_dir}")
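One caveat with picking the newest file: if an inference run fails, the newest .mp4 may be left over from an earlier run. A possible variation, sketched below with a hypothetical helper named latest_mp4, wraps the lookup in a function and optionally ignores files older than a timestamp recorded before inference started.

```python
import glob
import os


def latest_mp4(results_dir, newer_than=None):
    """Return the most recently modified .mp4 in results_dir, or None.

    If newer_than (seconds since the epoch) is given, older files are ignored.
    """
    candidates = glob.glob(os.path.join(results_dir, "*.mp4"))
    if newer_than is not None:
        candidates = [p for p in candidates if os.path.getmtime(p) >= newer_than]
    # max() with default=None handles an empty or missing directory.
    return max(candidates, key=os.path.getmtime, default=None)
```

To use the filter, store time.time() in a variable just before launching inference and pass it as newer_than afterwards; a None result then indicates that the run produced no new video.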
Step 7: Display the Final Video in Notebook
Once you've identified the most recent output file, you can preview it directly in Colab. The snippet below uses IPython's built-in Video display class to embed and play the generated .mp4 inline, so you can watch your talking avatar immediately without leaving the notebook. It relies on the latest_mp4_file variable set in Step 6.
from IPython.display import Video
Video(latest_mp4_file, embed=True)
At this point, you've successfully built a complete SadTalker workflow in Google Colab. The provided notebook — also available on GitHub as colab-talking-avatar — takes you from zero to a fully generated talking-head video with minimal friction.
Now you're free to experiment: swap in your own voice clips, try different images, batch-generate multiple avatars, or integrate SadTalker into a larger content pipeline. The hardest parts — dependencies, environment setup, and model weights — are already taken care of.