LipDub (IC-LoRA) Beta
The LipDub IC-LoRA re-generates speech in video, producing lip-synced output with new dialogue while preserving the speaker’s visual appearance and vocal identity. It works with live-action and animated subjects alike.
Given a source video and a text prompt containing the new dialogue, the LipDub IC-LoRA:
- Preserves the full video except the lip region
- Generates new lip movements synced to the prompt dialogue
- Matches the original speaker’s tone of voice
- Attempts to match the delivery and emotion of the original speech
Unlike other IC-LoRA adapters that provide structural control during generation, the LipDub IC-LoRA is a video-to-video tool focused on speech replacement. Beyond dubbing into other languages, it can also be used for rephrasing or altering dialogue in the original language.
Languages currently validated: English, French, Spanish, German, Russian.
The LipDub IC-LoRA can be accessed via a ComfyUI workflow or a standalone Python script.
What You’ll Need
Model: LipDub IC-LoRA: https://huggingface.co/Lightricks/LTX-2.3-22b-IC-LoRA-LipDub
ComfyUI Workflow
Setup
Files for ComfyUI:
All files are available in our HuggingFace collection: https://huggingface.co/collections/Lightricks/ltx-23
Steps:
- Install ComfyUI-LTXVideo custom nodes (see ComfyUI installation)
- Download the files listed above and place them in the indicated directories
- Optional: Get a free LTX API key from console.ltx.video if you want to use the Gemma API for text encoding instead of running Gemma locally
- Load the LipDub workflow in ComfyUI and enter your API key in the LTX API KEY node
Two-Stage Pipeline
The LipDub workflow uses a two-stage generation pipeline:
- Stage 1 — Low resolution: Generates the dubbed video at reduced resolution. The source video frames and reference audio are fed as conditioning, and the IC-LoRA guides lip-sync generation.
- Upsample: The video latent is spatially upscaled while the audio latent is frozen (carried forward unchanged).
- Stage 2 — High resolution: Re-generates the video at full resolution using the upscaled latent, producing the final output with sharper detail.
Both stages use the same IC-LoRA checkpoint, text conditioning, and reference audio. The spatial upscaler bridges the two stages.
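The two-stage flow can be sketched in plain Python, using lists as stand-in latents. This is a minimal sketch of the data flow only; function and variable names are illustrative assumptions, not the actual LTX-Video API.

```python
# Stand-in latents: a "video latent" is a list of frames, each a list of
# pixels; the "audio latent" is a single opaque value.

def stage1_generate(source_frames, prompt, audio_ref):
    # Stage 1: dub at low resolution; the audio latent is produced once here.
    video_latent = [[f"lo({p})" for p in frame] for frame in source_frames]
    audio_latent = f"voice_like({audio_ref})"
    return video_latent, audio_latent

def upsample(video_latent):
    # Spatial upscale of the video latent only; pixel duplication stands in
    # for the learned latent upsampler.
    return [[p for p in frame for _ in range(2)] for frame in video_latent]

def stage2_generate(video_latent, prompt, audio_latent):
    # Stage 2: re-generate at full resolution. The audio latent is frozen
    # (noise_mask=0) and passes through unchanged.
    return [[f"hi({p})" for p in frame] for frame in video_latent], audio_latent

frames = [["a", "b"], ["c", "d"]]
v1, a1 = stage1_generate(frames, "Hello", "speaker.wav")
v2, a2 = stage2_generate(upsample(v1), "Hello", a1)
assert a2 == a1  # frozen audio carried through Stage 2 unchanged
```

The key invariant is that only the video latent changes between stages; the audio latent generated in Stage 1 is carried forward untouched.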
The LipDub IC-LoRA is mask-free and robust to visual occlusions. To save compute, or to generate at a higher effective resolution, you can crop the input to the face region and composite the result back in post-production.
Key Nodes
IC-LoRA loading and conditioning:
- LTXICLoRALoaderModelOnly — Loads the LipDub IC-LoRA checkpoint and applies it to the model.
- LTXAddVideoICLoRAGuide — Applies the source video frames as the IC-LoRA conditioning signal. Used once per stage, with the source frames resized to match each stage’s resolution.
Audio identity:
- LTXVSetAudioRefTokens — Attaches an audio latent as ref_audio tokens on the conditioning for speaker identity transfer. Also outputs a frozen_audio copy with noise_mask=0, ensuring Stage 1 audio passes through Stage 2 unchanged without needing a mask-by-time node. Used once per stage.
Text encoding:
- LTXAVTextEncoderLoader — Loads Gemma locally for text encoding. The default path in the workflow.
- GemmaAPITextEncode — Encodes prompts via the free LTX API, replacing local Gemma to reduce VRAM usage. Available as an alternative in the workflow.
Latent operations:
- LTXVConcatAVLatent — Combines separate video and audio latents into a single audio-video latent for sampling.
- LTXVSeparateAVLatent — Splits a combined audio-video latent back into separate video and audio components.
- LTXVLatentUpsampler — Spatially upscales the video latent between Stage 1 and Stage 2.
- LTXVCropGuides — Strips IC-LoRA guide tokens from the latent after sampling. Used once per stage.
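The concat/separate pair can be modeled with ordinary sequence operations. This is a toy illustration of what combining and splitting audio-video latents means, assuming the combined latent is a flat token sequence plus a recorded split point; the real nodes operate on tensors.

```python
# Toy model of LTXVConcatAVLatent / LTXVSeparateAVLatent: concatenation
# keeps video and audio tokens in one sequence for joint sampling, and
# separation splits them back out. Token counts are arbitrary.

def concat_av(video_tokens, audio_tokens):
    # Return the combined latent and the split point needed to separate later.
    return video_tokens + audio_tokens, len(video_tokens)

def separate_av(av_tokens, split):
    return av_tokens[:split], av_tokens[split:]

video = ["v0", "v1", "v2"]
audio = ["a0", "a1"]
av, split = concat_av(video, audio)
v, a = separate_av(av, split)
assert (v, a) == (video, audio)  # round-trip is lossless
```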
Workflow Details
Python Script
Setup
Files for the Python script:
Usage
The --reference-video argument provides both the source video frames (for IC-LoRA conditioning) and the audio track (for speaker identity).
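A hypothetical invocation, assembled as an argument list. Only --reference-video is documented here; the script name (inference.py), the --prompt flag, and the prompt wording are assumptions for illustration.

```python
import shlex

# Hypothetical command line; only --reference-video is documented above,
# the script name and --prompt flag are assumptions.
cmd = [
    "python", "inference.py",
    "--reference-video", "source.mp4",  # frames + audio for speaker identity
    "--prompt", 'The man says: "Hola, ¿cómo estás?"',
]
print(shlex.join(cmd))
```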
Options
Prompting
The LipDub IC-LoRA follows the dialogue text in your prompt — it does not translate automatically. You must provide the translated or replacement dialogue directly.
Prompt template:
Example:
You can add details about emotion or delivery style to the prompt.
Requirements
- Provide the full dialogue text — the model will follow the content of the prompt. It does not translate dialogue for you.
- Use native script — write dialogue in the alphabet of the target language (e.g., Cyrillic for Russian, Chinese characters for Mandarin).
- Single speaker — the beta IC-LoRA does not distinguish between multiple speakers.
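The requirements above can be illustrated with a couple of prompts. The dialogue wording and framing here are assumptions, not the official prompt template; the point is full replacement dialogue, native script, and a single speaker.

```python
# Illustrative replacement-dialogue prompts: full dialogue text, written
# in the target language's native script, one speaker per prompt.
prompts = {
    "russian": 'Мужчина говорит: "Добрый день, как ваши дела?"',
    "german": 'Die Frau sagt: "Guten Tag, wie geht es Ihnen?"',
}

def uses_cyrillic(text):
    # Native-script check: any character in the Cyrillic block U+0400-U+04FF.
    return any("\u0400" <= ch <= "\u04ff" for ch in text)

assert uses_cyrillic(prompts["russian"])
```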
Best Practices
- Match audio length — For best results, keep your prompt close to the timing and syllable count of the original dialogue. Slightly longer is better than too short.
- Prompt too long: The model might skip words.
- Prompt too short: The output might sound slow and unnatural.
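One rough way to sanity-check prompt length against the original dialogue is to compare vowel-group counts as a crude syllable estimate. This heuristic is our own suggestion for Latin-script text, not part of the LipDub tooling.

```python
import re

def rough_syllables(text):
    # Count runs of vowels as approximate syllables (Latin scripts only).
    return len(re.findall(r"[aeiouyáéíóúàèìòù]+", text.lower()))

original = "How are you doing today?"
replacement = "Como estas hoy?"
ratio = rough_syllables(replacement) / rough_syllables(original)
# Aim for a ratio near 1.0; slightly above is safer than below.
print(round(ratio, 2))
```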