LipDub (IC-LoRA) Beta

The LipDub IC-LoRA re-generates speech in video, producing lip-synced output with new dialogue while preserving the speaker’s visual appearance and vocal identity. It works with live-action and animated subjects alike.

Given a source video and a text prompt containing the new dialogue, the LipDub IC-LoRA:

  • Preserves the full video except the lip region
  • Generates new lip movements synced to the prompt dialogue
  • Matches the original speaker’s tone of voice
  • Attempts to match the delivery and emotion of the original speech

Unlike other IC-LoRA adapters that provide structural control during generation, the LipDub IC-LoRA is a video-to-video tool focused on speech replacement. Beyond dubbing into other languages, it can also be used for rephrasing or altering dialogue in the original language.

Languages currently validated: English, French, Spanish, German, Russian.

The LipDub IC-LoRA can be accessed via a ComfyUI workflow or a standalone Python script.

What You’ll Need

Model: LipDub IC-LoRA: https://huggingface.co/Lightricks/LTX-2.3-22b-IC-LoRA-LipDub

ComfyUI Workflow

Setup

Files for ComfyUI:

All files are available in our HuggingFace collection: https://huggingface.co/collections/Lightricks/ltx-23

File | Description | Placement
ltx-2.3-22b-dev.safetensors | Dev model checkpoint | ComfyUI/models/checkpoints/
ltx-2.3-spatial-upscaler-x2-1.1.safetensors | Spatial upscaler (v1.1) | ComfyUI/models/latent_upscale_models/
ltx-2.3-22b-distilled-lora-384-1.1.safetensors | Distilled LoRA | ComfyUI/models/loras/
ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors | LipDub IC-LoRA weights | ComfyUI/models/loras/

Steps:

  1. Install ComfyUI-LTXVideo custom nodes (see ComfyUI installation)
  2. Download the files listed above and place them in the indicated directories (a download sketch follows these steps)
  3. Optional: get a free LTX API key from console.ltx.video if you want to encode text with the Gemma API instead of running Gemma locally
  4. Load the LipDub workflow in ComfyUI and, if you are using the API, enter your key in the LTX API KEY node
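
If you prefer to script step 2, the snippet below is a minimal sketch using huggingface_hub. Only the LipDub repo ID is given on this page (via the model link above); the other repo IDs are placeholders, so confirm the actual repos in the Lightricks collection before running.

from pathlib import Path
from huggingface_hub import hf_hub_download

COMFYUI_MODELS = Path("ComfyUI/models")

# (repo_id, filename, target subdirectory). Only the LipDub repo ID is
# confirmed by this page; the others are placeholders -- check the collection.
FILES = [
    ("Lightricks/LTX-2.3", "ltx-2.3-22b-dev.safetensors", "checkpoints"),
    ("Lightricks/LTX-2.3", "ltx-2.3-spatial-upscaler-x2-1.1.safetensors", "latent_upscale_models"),
    ("Lightricks/LTX-2.3", "ltx-2.3-22b-distilled-lora-384-1.1.safetensors", "loras"),
    ("Lightricks/LTX-2.3-22b-IC-LoRA-LipDub", "ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors", "loras"),
]

for repo_id, filename, subdir in FILES:
    # Each file lands directly in the matching ComfyUI model directory.
    hf_hub_download(repo_id=repo_id, filename=filename,
                    local_dir=COMFYUI_MODELS / subdir)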

Two-Stage Pipeline

The LipDub workflow uses a two-stage generation pipeline:

  1. Stage 1 — Low resolution: Generates the dubbed video at reduced resolution. The source video frames and reference audio are fed as conditioning, and the IC-LoRA guides lip-sync generation.
  2. Upsample: The video latent is spatially upscaled while the audio latent is frozen (carried forward unchanged).
  3. Stage 2 — High resolution: Re-generates the video at full resolution using the upscaled latent, producing the final output with sharper detail.

Both stages use the same IC-LoRA checkpoint, text conditioning, and reference audio. The spatial upscaler bridges the two stages.
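
To make the hand-off concrete, here is a minimal sketch of the latent flow between stages in plain PyTorch. The shapes are illustrative only, and F.interpolate merely stands in for the learned spatial upscaler; this is not the workflow's actual code.

import torch
import torch.nn.functional as F

# Stage 1 output, split into modalities (shapes are illustrative only).
video_latent = torch.randn(1, 128, 20, 16, 28)  # (batch, ch, frames, h, w) at low resolution
audio_latent = torch.randn(1, 64, 80)           # (batch, ch, tokens) carrying the dubbed speech

# Upsample step: double the video latent's spatial dims; the audio latent
# is frozen and carried into Stage 2 unchanged.
video_latent_hr = F.interpolate(video_latent, scale_factor=(1, 2, 2), mode="nearest")
frozen_audio = audio_latent

# Stage 2 re-generates video detail from (video_latent_hr, frozen_audio),
# reusing the same IC-LoRA, text conditioning, and reference audio.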

The LipDub IC-LoRA is mask-free and robust to visual occlusions. To save compute or generate at higher resolution, you may want to crop to the face region and composite back in post-production.

Key Nodes

IC-LoRA loading and conditioning:

  • LTXICLoRALoaderModelOnly — Loads the LipDub IC-LoRA checkpoint and applies it to the model.
  • LTXAddVideoICLoRAGuide — Applies the source video frames as the IC-LoRA conditioning signal. Used once per stage, with the source frames resized to match each stage’s resolution.

Audio identity:

  • LTXVSetAudioRefTokens — Attaches an audio latent as ref_audio tokens on conditioning for speaker identity transfer. Also outputs a frozen_audio copy with noise_mask=0, ensuring Stage 1 audio passes through Stage 2 unchanged without needing a mask-by-time node. Used once per stage.
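
The frozen_audio behavior above is the standard masking trick: tokens with a zero noise mask are copied through each denoising step instead of being re-sampled. A minimal illustration of the semantics (not the node's actual code):

import torch

def apply_noise_mask(denoised, frozen, mask):
    # mask = 1: take the freshly denoised value; mask = 0: keep the frozen value.
    return mask * denoised + (1 - mask) * frozen

denoised_audio = torch.randn(1, 64, 80)  # what the sampler would produce
frozen_audio = torch.randn(1, 64, 80)    # Stage 1 audio latent
mask = torch.zeros(1, 1, 80)             # noise_mask = 0 across all audio tokens

out = apply_noise_mask(denoised_audio, frozen_audio, mask)
assert torch.equal(out, frozen_audio)  # audio passes through unchanged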

Text encoding:

  • LTXAVTextEncoderLoader — Loads Gemma locally for text encoding. The default path in the workflow.
  • GemmaAPITextEncode — Encodes prompts via the free LTX API, replacing local Gemma to reduce VRAM usage. Available as an alternative in the workflow.

Latent operations:

  • LTXVConcatAVLatent — Combines separate video and audio latents into a single audio-video latent for sampling.
  • LTXVSeparateAVLatent — Splits a combined audio-video latent back into separate video and audio components.
  • LTXVLatentUpsampler — Spatially upscales the video latent between Stage 1 and Stage 2.
  • LTXVCropGuides — Strips IC-LoRA guide tokens from the latent after sampling. Used once per stage.
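
As a rough mental model, the concat/separate pair amounts to joining and splitting token sequences. The sketch below assumes a flattened token layout and omits the shape metadata the real nodes track.

import torch

video_tokens = torch.randn(1, 560, 2048)  # (batch, n_video_tokens, dim), illustrative
audio_tokens = torch.randn(1, 80, 2048)   # (batch, n_audio_tokens, dim), illustrative

# Combine into one AV latent for sampling (LTXVConcatAVLatent, roughly).
av = torch.cat([video_tokens, audio_tokens], dim=1)

# Split back by the recorded video token count (LTXVSeparateAVLatent, roughly).
n_video = video_tokens.shape[1]
video_out, audio_out = av[:, :n_video], av[:, n_video:]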

Workflow Details

Setting | Value
Sampler | euler
CFG | 1
Stage 1 steps | 8
Stage 2 steps | 3
Negative prompt | pc game, console game, video game, cartoon, childish, ugly

Python Script

Setup

# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install packages
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e packages/ltx-core -e packages/ltx-pipelines -e packages/ltx-trainer

Files for the Python script:

File | Description
ltx-2.3-22b-distilled-1.1.safetensors | Distilled model checkpoint (v1.1)
ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors | LipDub IC-LoRA weights
ltx-2.3-spatial-upscaler-x2-1.1.safetensors | Spatial upscaler (v1.1)
google/gemma-3-12b-it | Text encoder; download the model directory and pass its path via --gemma-root
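
One way to fetch the Gemma directory from the last row above is huggingface_hub's snapshot_download. Note that google/gemma-3-12b-it is a gated repo, so you must accept its license on Hugging Face and be logged in first.

from huggingface_hub import snapshot_download

# Downloads the full model directory and returns its local path,
# which you then pass via --gemma-root.
gemma_root = snapshot_download(repo_id="google/gemma-3-12b-it")
print(gemma_root)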

Usage

python -m ltx_pipelines.lipdub \
  --reference-video ./source_video.mp4 \
  --prompt "A woman speaking in French saying: \"Bonjour, je teste les workflows de doublage avec LTX\"" \
  --distilled-checkpoint /path/to/ltx-2.3-22b-distilled-1.1.safetensors \
  --spatial-upsampler-path /path/to/ltx-2.3-spatial-upscaler-x2-1.1.safetensors \
  --lora /path/to/ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors \
  --gemma-root /path/to/gemma/ \
  --height 720 \
  --width 1280 \
  --num-frames 161 \
  --seed 42

The --reference-video provides both the source video frames (for IC-LoRA conditioning) and the audio track (for speaker identity).
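
To dub several clips in one go, a thin wrapper over the module entry point works; the clip paths and prompts below are placeholders.

import subprocess

JOBS = [
    ("./clip_01.mp4", 'A man speaking in German saying: "Guten Tag, das ist ein Test."'),
    ("./clip_02.mp4", 'A woman speaking in Spanish saying: "Hola, esto es una prueba."'),
]

for video, prompt in JOBS:
    # One full two-stage run per clip, reusing the same checkpoints.
    subprocess.run([
        "python", "-m", "ltx_pipelines.lipdub",
        "--reference-video", video,
        "--prompt", prompt,
        "--distilled-checkpoint", "/path/to/ltx-2.3-22b-distilled-1.1.safetensors",
        "--spatial-upsampler-path", "/path/to/ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
        "--lora", "/path/to/ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors",
        "--gemma-root", "/path/to/gemma/",
        "--height", "720", "--width", "1280",
        "--num-frames", "161", "--seed", "42",
    ], check=True)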

Options

Flag | Default | Description
--reference-video | required | Source video file; provides video frames and audio for identity
--reference-strength | 1.0 | IC-LoRA video reference conditioning strength
--seed N | - | Random seed
--height N | - | Output video height
--width N | - | Output video width
--num-frames N | - | Number of output frames
--frame-rate N | - | Output frame rate
--enhance-prompt | off | Use prompt enhancement
--quantization POLICY | - | Quantization policy
--compile | off | Enable torch.compile for faster inference

Prompting

The LipDub IC-LoRA follows the dialogue text in your prompt — it does not translate automatically. You must provide the translated or replacement dialogue directly.

Prompt template:

[Speaker] is speaking [Language/Accent], saying: "[Dialogue]"

Example:

A woman speaking in Russian saying: "Сегодня отличный день, чтобы протестировать рабочие процессы ComfyUI для дубляжа с использованием LTX."

You can add details about emotion or delivery style to the prompt.
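
If you generate prompts programmatically, a trivial helper keeps them on the template; the function name here is ours, not part of the package.

def build_lipdub_prompt(speaker: str, language: str, dialogue: str,
                        delivery: str = "") -> str:
    """Format: [Speaker] is speaking [Language], saying: "[Dialogue]", plus optional delivery notes."""
    prompt = f'{speaker} is speaking {language}, saying: "{dialogue}"'
    return f"{prompt} {delivery}".strip()

print(build_lipdub_prompt(
    "A woman", "Russian",
    "Сегодня отличный день, чтобы протестировать рабочие процессы ComfyUI.",
    delivery="She sounds cheerful and speaks at a relaxed pace.",
))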

Requirements

  • Provide the full dialogue text — the model will follow the content of the prompt. It does not translate dialogue for you.
  • Use native script — write dialogue in the alphabet of the target language (e.g., Cyrillic for Russian, Chinese characters for Mandarin).
  • Single speaker — the beta IC-LoRA does not distinguish between multiple speakers.

Best Practices

  • Match audio length — For best results, keep the new dialogue close to the timing and syllable count of the original. Slightly longer is better than too short. (A rough length check is sketched after this list.)
    • Prompt too long: The model might skip words.
    • Prompt too short: The output might sound slow and unnatural.
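
Since timing matters, a crude vowel-group count can sanity-check that replacement dialogue is roughly as long as the original. This heuristic is ours and only approximate (it covers Latin and Russian Cyrillic vowels); treat it as a hint, not a rule.

import re

def rough_syllables(text: str) -> int:
    """Cheap syllable proxy: count maximal vowel runs (Latin + Cyrillic vowels)."""
    return len(re.findall(r"[aeiouyàâäéèêëîïôöùûüаеёиоуыэюя]+", text.lower()))

original = "It's a great day to test dubbing workflows."
replacement = "Сегодня отличный день, чтобы протестировать дубляж."
print(rough_syllables(original), rough_syllables(replacement))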