The LipDub IC-LoRA re-generates speech in video, producing lip-synced output with new dialogue while preserving the speaker’s visual appearance and vocal identity. It works with live-action and animated subjects alike.
Given a source video and a text prompt containing the new dialogue, the LipDub IC-LoRA:
Preserves the full video except the lip region
Generates new lip movements synced to the prompt dialogue
Matches the original speaker’s tone of voice
Attempts to match the delivery and emotion of the original speech
Unlike other IC-LoRA adapters that provide structural control during generation, the LipDub IC-LoRA is a video-to-video tool focused on speech replacement. Beyond dubbing into other languages, it can also be used for rephrasing or altering dialogue in the original language.
Languages currently validated: English, French, Spanish, German, Russian.
Download the files listed above and place them in the indicated directories.
Optional: Get a free LTX API key from console.ltx.video if you want to encode text through the Gemma API instead of loading Gemma locally.
Load the LipDub workflow in ComfyUI and, if you are using the API, enter your key in the LTX API KEY node.
Two-Stage Pipeline
The LipDub workflow uses a two-stage generation pipeline:
Stage 1 — Low resolution: Generates the dubbed video at reduced resolution. The source video frames and reference audio are fed as conditioning, and the IC-LoRA guides lip-sync generation.
Upsample: The video latent is spatially upscaled while the audio latent is frozen (carried forward unchanged).
Stage 2 — High resolution: Re-generates the video at full resolution using the upscaled latent, producing the final output with sharper detail.
Both stages use the same IC-LoRA checkpoint, text conditioning, and reference audio. The spatial upscaler bridges the two stages.
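The data flow through the two stages can be sketched with a toy model that tracks only the video latent's spatial size and an audio label. This is an illustration of the ordering described above, not the workflow's implementation: the class and function names, the 2x scale factor, and the example resolution are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AVLatent:
    """Toy stand-in for a combined audio-video latent (sizes only)."""
    video_hw: Tuple[int, int]  # spatial size of the video part
    audio_id: str              # label standing in for the audio part
    stage: str


def stage1_low_res(full_hw, scale):
    """Stage 1: generate the dubbed video at reduced resolution."""
    h, w = full_hw
    return AVLatent((h // scale, w // scale), "audio-from-stage1", "stage1")


def upsample(latent, scale):
    """Spatially upscale the video part; carry the audio forward unchanged."""
    h, w = latent.video_hw
    return AVLatent((h * scale, w * scale), latent.audio_id, "upsampled")


def stage2_high_res(latent):
    """Stage 2: re-generate video at full resolution; audio stays frozen."""
    return AVLatent(latent.video_hw, latent.audio_id, "stage2")


def run_pipeline(full_hw=(1024, 1536), scale=2):
    # scale=2 and the default size are hypothetical values for illustration.
    s1 = stage1_low_res(full_hw, scale)
    up = upsample(s1, scale)
    s2 = stage2_high_res(up)
    return s1, up, s2
```

The point of the sketch is that only the video part changes size between stages, while the audio produced in Stage 1 is the audio that reaches the final output.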
The LipDub IC-LoRA is mask-free and robust to visual occlusions. To save compute or generate at higher resolution, you may want to crop to the face region and composite back in post-production.
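One way to approach the crop-and-composite idea is to compute a padded box around the face, process only that region, and paste the result back into the full frame. The sketch below uses plain Python lists as frames so it runs without extra libraries. The helper names, the padding fraction, and the snap-to-32 sizing are assumptions for illustration, not requirements stated by this workflow, and the face box is assumed to come from a detector of your choice.

```python
def padded_crop_box(face_box, frame_w, frame_h, pad=0.5, multiple=32):
    """Expand a face box by `pad` (a fraction of its size on each side),
    round the crop size up to a multiple of `multiple` (assumed constraint),
    and keep the box inside the frame. Returns (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = face_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2

    def snap(size, limit):
        size = -(-int(size) // multiple) * multiple      # round up
        return min(size, limit // multiple * multiple)   # cap at the frame

    cw = snap((x1 - x0) * (1 + 2 * pad), frame_w)
    ch = snap((y1 - y0) * (1 + 2 * pad), frame_h)
    # Center on the face, then shift back inside the frame if needed.
    nx0 = min(max(int(cx - cw / 2), 0), frame_w - cw)
    ny0 = min(max(int(cy - ch / 2), 0), frame_h - ch)
    return nx0, ny0, nx0 + cw, ny0 + ch


def composite(frame, crop, box):
    """Paste `crop` (rows of pixels) into a copy of `frame` at `box`."""
    x0, y0, x1, _ = box
    out = [row[:] for row in frame]
    for dy, crop_row in enumerate(crop):
        out[y0 + dy][x0:x1] = crop_row
    return out
```

Use the same box for every frame of a clip so the composited region does not jitter. A hard paste like this can leave a visible seam, so a feathered blend at the border is usually worth adding in a real compositing tool.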
Key Nodes
IC-LoRA loading and conditioning:
LTXICLoRALoaderModelOnly — Loads the LipDub IC-LoRA checkpoint and applies it to the model.
LTXAddVideoICLoRAGuide — Applies the source video frames as the IC-LoRA conditioning signal. Used once per stage, with the source frames resized to match each stage’s resolution.
Audio identity:
LTXVSetAudioRefTokens — Attaches an audio latent as ref_audio tokens on conditioning for speaker identity transfer. Also outputs a frozen_audio copy with noise_mask=0, ensuring Stage 1 audio passes through Stage 2 unchanged without needing a mask-by-time node. Used once per stage.
Text encoding:
LTXAVTextEncoderLoader — Loads Gemma locally for text encoding. The default path in the workflow.
GemmaAPITextEncode — Encodes prompts via the free LTX API, replacing local Gemma to reduce VRAM usage. Available as an alternative in the workflow.
Latent operations:
LTXVConcatAVLatent — Combines separate video and audio latents into a single audio-video latent for sampling.
LTXVSeparateAVLatent — Splits a combined audio-video latent back into separate video and audio components.
LTXVLatentUpsampler — Spatially upscales the video latent between Stage 1 and Stage 2.
LTXVCropGuides — Strips IC-LoRA guide tokens from the latent after sampling. Used once per stage.
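The frozen_audio output of LTXVSetAudioRefTokens relies on a noise mask of 0. As a conceptual illustration only (this is not the node's code, and `masked_update` is a hypothetical helper), a per-element noise mask can be read as a blend weight: where the mask is 1 the sampler's new value is used, and where it is 0 the existing latent value is kept.

```python
def masked_update(original, proposed, noise_mask):
    """Blend per element: mask 1 takes the proposed value, mask 0 keeps the original."""
    return [m * p + (1 - m) * o for o, p, m in zip(original, proposed, noise_mask)]


audio_latent = [0.2, -0.5, 0.9]   # toy values standing in for Stage 1 audio
proposal = [0.7, 0.1, -0.3]       # what a sampler step would otherwise write

frozen = masked_update(audio_latent, proposal, [0, 0, 0])  # mask 0 everywhere
free = masked_update(audio_latent, proposal, [1, 1, 1])    # mask 1 everywhere
```

With the mask at 0 for every element, `frozen` equals `audio_latent`, which is the behavior the node description attributes to the frozen audio in Stage 2.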
Workflow Details
| Setting | Value |
| --- | --- |
| Sampler | euler |
| CFG | 1 |
| Stage 1 steps | 8 |
| Stage 2 steps | 3 |
| Negative prompt | pc game, console game, video game, cartoon, childish, ugly |
The --reference-video flag provides both the source video frames (for IC-LoRA conditioning) and the audio track (for speaker identity).
Options
| Flag | Default | Description |
| --- | --- | --- |
| --reference-video | required | Source video file that provides video frames and audio for identity |
| --reference-strength | 1.0 | IC-LoRA video reference conditioning strength |
| --seed N | - | Random seed |
| --height N | - | Output video height |
| --width N | - | Output video width |
| --num-frames N | - | Number of output frames |
| --frame-rate N | - | Output frame rate |
| --enhance-prompt | off | Use prompt enhancement |
| --quantization POLICY | - | Quantization policy |
| --compile | off | Enable torch.compile for faster inference |
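When scripting batch runs, it can help to assemble these flags programmatically. The helper below is a minimal sketch that emits only the flags listed in the Options table. The function name and its keyword names are hypothetical, and because this section does not name the inference script or show how the prompt is passed, both are left for you to supply.

```python
def build_lipdub_args(reference_video, *, reference_strength=None, seed=None,
                      height=None, width=None, num_frames=None, frame_rate=None,
                      enhance_prompt=False, quantization=None, compile_model=False):
    """Return a list of CLI arguments using only flags from the Options table."""
    args = ["--reference-video", str(reference_video)]  # the only required flag
    valued = {
        "--reference-strength": reference_strength,
        "--seed": seed,
        "--height": height,
        "--width": width,
        "--num-frames": num_frames,
        "--frame-rate": frame_rate,
        "--quantization": quantization,
    }
    for flag, value in valued.items():
        if value is not None:        # omit flags the caller did not set
            args += [flag, str(value)]
    if enhance_prompt:               # boolean switches take no value
        args.append("--enhance-prompt")
    if compile_model:
        args.append("--compile")
    return args
```

The returned list can be appended to whatever command launches the pipeline in your setup, for example through `subprocess.run`, once you have added the script path and the prompt argument.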
Prompting
The LipDub IC-LoRA follows the dialogue text in your prompt — it does not translate automatically. You must provide the translated or replacement dialogue directly.
Prompt template:
[Speaker] is speaking [Language/Accent], saying: "[Dialogue]"
Example:
A woman speaking in Russian saying: "Сегодня отличный день, чтобы протестировать рабочие процессы ComfyUI для дубляжа с использованием LTX."
You can add details about emotion or delivery style to the prompt.
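If you generate prompts for many clips, a small helper can keep them consistent with the template above. This is a sketch: `build_lipdub_prompt` is a hypothetical name, and appending the delivery note after the quoted dialogue is an assumption, since the template does not say where such details should go.

```python
def build_lipdub_prompt(speaker, language, dialogue, delivery=""):
    """Fill the template: [Speaker] is speaking [Language/Accent], saying: "[Dialogue]"."""
    dialogue = dialogue.strip()
    if not dialogue:
        # The model follows the prompt text and does not translate for you,
        # so an empty dialogue is almost certainly a mistake.
        raise ValueError("dialogue must contain the full replacement text")
    prompt = f'{speaker} is speaking {language}, saying: "{dialogue}"'
    if delivery.strip():
        prompt += f" {delivery.strip()}"  # assumed placement of delivery notes
    return prompt


example = build_lipdub_prompt(
    "A woman", "Russian", "Сегодня отличный день.", delivery="She sounds cheerful."
)
```

The dialogue is passed through unchanged, so it should already be written in the target language's native script.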
Requirements
Provide the full dialogue text — the model will follow the content of the prompt. It does not translate dialogue for you.
Use native script — write dialogue in the alphabet of the target language (e.g., Cyrillic for Russian, Chinese characters for Mandarin).
Single speaker — the beta IC-LoRA does not distinguish between multiple speakers.
Best Practices
Match audio length: For best results, keep your prompt close to the timing and syllable count of the original dialogue. Slightly longer is better than too short.
Prompt too long: The model might skip words.
Prompt too short: The output might sound slow and unnatural.
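A rough pre-flight check can flag replacement dialogue that is far shorter or longer than the original. The sketch below estimates syllables by counting vowel groups, which is a crude heuristic that works only approximately for Latin and Cyrillic scripts and not at all for scripts such as Chinese. The function names and the 0.9 and 1.2 thresholds are assumptions chosen to reflect the "slightly longer is better than too short" guidance, not values from the model's documentation.

```python
import re

# Latin vowels (with common accents) plus Cyrillic vowels; a heuristic, not a rule.
_VOWELS = "aeiouyáéíóúàèìòùâêîôûäëïöüаеёиоуыэюя"
_VOWEL_GROUP = re.compile(f"[{_VOWELS}]+")


def estimate_syllables(text):
    """Approximate syllable count as the number of vowel groups."""
    return len(_VOWEL_GROUP.findall(text.lower()))


def length_check(original, replacement, low=0.9, high=1.2):
    """Compare estimated syllable counts; returns (ratio, verdict)."""
    ratio = estimate_syllables(replacement) / max(estimate_syllables(original), 1)
    if ratio < low:
        return ratio, "too short"   # output may sound slow and unnatural
    if ratio > high:
        return ratio, "too long"    # the model might skip words
    return ratio, "ok"
```

Treat the verdict as a hint for which lines to review by ear; spoken duration also depends on pacing and pauses that a syllable count cannot capture.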