LipDub (IC-LoRA) Beta

The LipDub IC-LoRA re-generates speech in video, producing lip-synced output with new dialogue while preserving the speaker’s visual appearance and vocal identity. It works with live-action and animated subjects alike.

Given a source video and a text prompt containing the new dialogue, the LipDub IC-LoRA:

  • Preserves the full video except the lip region
  • Generates new lip movements synced to the prompt dialogue
  • Matches the original speaker’s tone of voice
  • Attempts to match the delivery and emotion of the original speech

Unlike other IC-LoRA adapters that provide structural control during generation, the LipDub IC-LoRA is a video-to-video tool focused on speech replacement. Beyond dubbing into other languages, it can also be used for rephrasing or altering dialogue in the original language.

Languages currently validated: English, French, Spanish, German, Russian.

The LipDub IC-LoRA can be accessed via a ComfyUI workflow or a standalone Python script.

What You’ll Need

Model: LipDub IC-LoRA: https://huggingface.co/Lightricks/LTX-2.3-22b-IC-LoRA-LipDub

ComfyUI Workflow

Setup

Files for ComfyUI:

All files are available in our HuggingFace collection: https://huggingface.co/collections/Lightricks/ltx-23

File | Description | Placement
ltx-2.3-22b-dev.safetensors | Dev model checkpoint | ComfyUI/models/checkpoints/
ltx-2.3-spatial-upscaler-x2-1.1.safetensors | Spatial upscaler (v1.1) | ComfyUI/models/latent_upscale_models/
ltx-2.3-22b-distilled-lora-384-1.1.safetensors | Distilled LoRA | ComfyUI/models/loras/
ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors | LipDub IC-LoRA weights | ComfyUI/models/loras/

Steps:

  1. Install ComfyUI-LTXVideo custom nodes (see ComfyUI installation)
  2. Download the files listed above and place them in the indicated directories (a download sketch follows these steps)
  3. Optional: get a free LTX API key from console.ltx.video if you want to encode text with the Gemma API instead of running Gemma locally
  4. Load the LipDub workflow in ComfyUI and, if you are using the API, enter your key in the LTX API KEY node
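
If you prefer to script step 2, the snippet below is a minimal sketch using huggingface_hub. Only the LipDub repo ID is given on this page (via the model link above); the other repo IDs are placeholders, so confirm the actual repos in the Lightricks collection before running.

from pathlib import Path
from huggingface_hub import hf_hub_download

COMFYUI_MODELS = Path("ComfyUI/models")

# (repo_id, filename, target subdirectory). Only the LipDub repo ID is
# confirmed by this page; the others are placeholders -- check the collection.
FILES = [
    ("Lightricks/LTX-2.3", "ltx-2.3-22b-dev.safetensors", "checkpoints"),
    ("Lightricks/LTX-2.3", "ltx-2.3-spatial-upscaler-x2-1.1.safetensors", "latent_upscale_models"),
    ("Lightricks/LTX-2.3", "ltx-2.3-22b-distilled-lora-384-1.1.safetensors", "loras"),
    ("Lightricks/LTX-2.3-22b-IC-LoRA-LipDub", "ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors", "loras"),
]

for repo_id, filename, subdir in FILES:
    # Each file lands directly in the matching ComfyUI model directory.
    hf_hub_download(repo_id=repo_id, filename=filename,
                    local_dir=COMFYUI_MODELS / subdir)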

Two-Stage Pipeline

The LipDub workflow uses a two-stage generation pipeline:

  1. Stage 1 — Low resolution: Generates the dubbed video at reduced resolution. The source video frames and reference audio are fed as conditioning, and the IC-LoRA guides lip-sync generation.
  2. Upsample: The video latent is spatially upscaled while the audio latent is frozen (carried forward unchanged).
  3. Stage 2 — High resolution: Re-generates the video at full resolution using the upscaled latent, producing the final output with sharper detail.

Both stages use the same IC-LoRA checkpoint, text conditioning, and reference audio. The spatial upscaler bridges the two stages.
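
To make the hand-off concrete, here is a minimal sketch of the latent flow between stages in plain PyTorch. The shapes are illustrative only, and F.interpolate merely stands in for the learned spatial upscaler; this is not the workflow's actual code.

import torch
import torch.nn.functional as F

# Stage 1 output, split into modalities (shapes are illustrative only).
video_latent = torch.randn(1, 128, 20, 16, 28)  # (batch, ch, frames, h, w) at low resolution
audio_latent = torch.randn(1, 64, 80)           # (batch, ch, tokens) carrying the dubbed speech

# Upsample step: double the video latent's spatial dims; the audio latent
# is frozen and carried into Stage 2 unchanged.
video_latent_hr = F.interpolate(video_latent, scale_factor=(1, 2, 2), mode="nearest")
frozen_audio = audio_latent

# Stage 2 re-generates video detail from (video_latent_hr, frozen_audio),
# reusing the same IC-LoRA, text conditioning, and reference audio.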

The LipDub IC-LoRA is mask-free and robust to visual occlusions. To save compute or generate at higher resolution, you may want to crop to the face region and composite back in post-production.

Key Nodes

IC-LoRA loading and conditioning:

  • LTXICLoRALoaderModelOnly — Loads the LipDub IC-LoRA checkpoint and applies it to the model.
  • LTXAddVideoICLoRAGuide — Applies the source video frames as the IC-LoRA conditioning signal. Used once per stage, with the source frames resized to match each stage’s resolution.

Audio identity:

  • LTXVSetAudioRefTokens — Attaches an audio latent as ref_audio tokens on conditioning for speaker identity transfer. Also outputs a frozen_audio copy with noise_mask=0, ensuring Stage 1 audio passes through Stage 2 unchanged without needing a mask-by-time node. Used once per stage.
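
The frozen_audio behavior above is the standard masking trick: tokens with a zero noise mask are copied through each denoising step instead of being re-sampled. A minimal illustration of the semantics (not the node's actual code):

import torch

def apply_noise_mask(denoised, frozen, mask):
    # mask = 1: take the freshly denoised value; mask = 0: keep the frozen value.
    return mask * denoised + (1 - mask) * frozen

denoised_audio = torch.randn(1, 64, 80)  # what the sampler would produce
frozen_audio = torch.randn(1, 64, 80)    # Stage 1 audio latent
mask = torch.zeros(1, 1, 80)             # noise_mask = 0 across all audio tokens

out = apply_noise_mask(denoised_audio, frozen_audio, mask)
assert torch.equal(out, frozen_audio)  # audio passes through unchanged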

Text encoding:

  • LTXAVTextEncoderLoader — Loads Gemma locally for text encoding. The default path in the workflow.
  • GemmaAPITextEncode — Encodes prompts via the free LTX API, replacing local Gemma to reduce VRAM usage. Available as an alternative in the workflow.

Latent operations:

  • LTXVConcatAVLatent — Combines separate video and audio latents into a single audio-video latent for sampling.
  • LTXVSeparateAVLatent — Splits a combined audio-video latent back into separate video and audio components.
  • LTXVLatentUpsampler — Spatially upscales the video latent between Stage 1 and Stage 2.
  • LTXVCropGuides — Strips IC-LoRA guide tokens from the latent after sampling. Used once per stage.
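
As a rough mental model, the concat/separate pair amounts to joining and splitting token sequences. The sketch below assumes a flattened token layout and omits the shape metadata the real nodes track.

import torch

video_tokens = torch.randn(1, 560, 2048)  # (batch, n_video_tokens, dim), illustrative
audio_tokens = torch.randn(1, 80, 2048)   # (batch, n_audio_tokens, dim), illustrative

# Combine into one AV latent for sampling (LTXVConcatAVLatent, roughly).
av = torch.cat([video_tokens, audio_tokens], dim=1)

# Split back by the recorded video token count (LTXVSeparateAVLatent, roughly).
n_video = video_tokens.shape[1]
video_out, audio_out = av[:, :n_video], av[:, n_video:]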

Workflow Details

Setting | Value
Sampler | euler
CFG | 1
Stage 1 steps | 8
Stage 2 steps | 3
Negative prompt | pc game, console game, video game, cartoon, childish, ugly

Python Script

Setup

# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install packages
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e packages/ltx-core -e packages/ltx-pipelines -e packages/ltx-trainer

Files for the Python script:

File | Description
ltx-2.3-22b-distilled-1.1.safetensors | Distilled model checkpoint (v1.1)
ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors | LipDub IC-LoRA weights
ltx-2.3-spatial-upscaler-x2-1.1.safetensors | Spatial upscaler (v1.1)
google/gemma-3-12b-it | Text encoder; download the model directory and pass its path via --gemma-root
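
One way to fetch the Gemma directory from the last row above is huggingface_hub's snapshot_download. Note that google/gemma-3-12b-it is a gated repo, so you must accept its license on Hugging Face and be logged in first.

from huggingface_hub import snapshot_download

# Downloads the full model directory and returns its local path,
# which you then pass via --gemma-root.
gemma_root = snapshot_download(repo_id="google/gemma-3-12b-it")
print(gemma_root)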

Usage

python -m ltx_pipelines.lipdub \
  --reference-video ./source_video.mp4 \
  --prompt "A woman speaking in French saying: \"Bonjour, je teste les workflows de doublage avec LTX\"" \
  --distilled-checkpoint /path/to/ltx-2.3-22b-distilled-1.1.safetensors \
  --spatial-upsampler-path /path/to/ltx-2.3-spatial-upscaler-x2-1.1.safetensors \
  --lora /path/to/ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors \
  --gemma-root /path/to/gemma/ \
  --height 720 \
  --width 1280 \
  --num-frames 161 \
  --seed 42

The --reference-video provides both the source video frames (for IC-LoRA conditioning) and the audio track (for speaker identity).
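
To dub several clips in one go, a thin wrapper over the module entry point works; the clip paths and prompts below are placeholders.

import subprocess

JOBS = [
    ("./clip_01.mp4", 'A man speaking in German saying: "Guten Tag, das ist ein Test."'),
    ("./clip_02.mp4", 'A woman speaking in Spanish saying: "Hola, esto es una prueba."'),
]

for video, prompt in JOBS:
    # One full two-stage run per clip, reusing the same checkpoints.
    subprocess.run([
        "python", "-m", "ltx_pipelines.lipdub",
        "--reference-video", video,
        "--prompt", prompt,
        "--distilled-checkpoint", "/path/to/ltx-2.3-22b-distilled-1.1.safetensors",
        "--spatial-upsampler-path", "/path/to/ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
        "--lora", "/path/to/ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors",
        "--gemma-root", "/path/to/gemma/",
        "--height", "720", "--width", "1280",
        "--num-frames", "161", "--seed", "42",
    ], check=True)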

Options

Flag | Default | Description
--reference-video | required | Source video file; provides video frames and audio for identity
--reference-strength | 1.0 | IC-LoRA video reference conditioning strength
--seed N | - | Random seed
--height N | - | Output video height
--width N | - | Output video width
--num-frames N | - | Number of output frames
--frame-rate N | - | Output frame rate
--enhance-prompt | off | Use prompt enhancement
--quantization POLICY | - | Quantization policy
--compile | off | Enable torch.compile for faster inference

Prompting

The LipDub IC-LoRA follows the dialogue text in your prompt — it does not translate automatically. You must provide the translated or replacement dialogue directly.

Prompt template:

[Speaker] is speaking [Language/Accent], saying: "[Dialogue]"

Example:

A woman speaking in Russian saying: "Сегодня отличный день, чтобы протестировать рабочие процессы ComfyUI для дубляжа с использованием LTX."

You can add details about emotion or delivery style to the prompt.
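
If you generate prompts programmatically, a trivial helper keeps them on the template; the function name here is ours, not part of the package.

def build_lipdub_prompt(speaker: str, language: str, dialogue: str,
                        delivery: str = "") -> str:
    """Format: [Speaker] is speaking [Language], saying: "[Dialogue]", plus optional delivery notes."""
    prompt = f'{speaker} is speaking {language}, saying: "{dialogue}"'
    return f"{prompt} {delivery}".strip()

print(build_lipdub_prompt(
    "A woman", "Russian",
    "Сегодня отличный день, чтобы протестировать рабочие процессы ComfyUI.",
    delivery="She sounds cheerful and speaks at a relaxed pace.",
))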

Requirements

  • Provide the full dialogue text — the model will follow the content of the prompt. It does not translate dialogue for you.
  • Use native script — write dialogue in the alphabet of the target language (e.g., Cyrillic for Russian, Chinese characters for Mandarin).
  • Single speaker — the beta IC-LoRA does not distinguish between multiple speakers.

Best Practices

  • Match audio length — For best results, keep the new dialogue close to the timing and syllable count of the original. Slightly longer is better than too short. (A rough length check is sketched after this list.)
    • Prompt too long: The model might skip words.
    • Prompt too short: The output might sound slow and unnatural.
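
Since timing matters, a crude vowel-group count can sanity-check that replacement dialogue is roughly as long as the original. This heuristic is ours and only approximate (it covers Latin and Russian Cyrillic vowels); treat it as a hint, not a rule.

import re

def rough_syllables(text: str) -> int:
    """Cheap syllable proxy: count maximal vowel runs (Latin + Cyrillic vowels)."""
    return len(re.findall(r"[aeiouyàâäéèêëîïôöùûüаеёиоуыэюя]+", text.lower()))

original = "It's a great day to test dubbing workflows."
replacement = "Сегодня отличный день, чтобы протестировать дубляж."
print(rough_syllables(original), rough_syllables(replacement))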