Training Modes

The trainer uses the flexible training strategy (name: "flexible") — a unified conditioning framework that supports all training modes through configuration. Every scenario is expressed by setting is_generated on each modality and adding optional conditions, rather than choosing a separate strategy class.

Key Concepts

Before diving into individual modes, here are the core ideas behind the flexible strategy:

  • is_generated: true — the modality is denoised during training and contributes to the loss. This is the modality the model learns to generate.
  • is_generated: false — the modality is frozen (sigma=0, no noise, no loss). It passes through the transformer clean and acts as cross-modal conditioning for the generated modality.
  • At least one modality must have is_generated: true.
  • Conditions are per-modality and can be composed (e.g. reference + first_frame together on the video modality).
  • Audio does not support first_frame or spatial_crop conditions — only prefix, suffix, mask, and reference.
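The rules above can be expressed as a small pre-flight check. This is only a sketch: `validate_strategy` is a hypothetical helper, not part of the trainer, and it assumes the strategy has already been loaded into a plain dictionary with the same keys as the YAML examples on this page.

```python
# Hypothetical helper (not the trainer's API): checks a training_strategy
# mapping against the rules listed in Key Concepts.
AUDIO_UNSUPPORTED = {"first_frame", "spatial_crop"}


def validate_strategy(strategy: dict) -> list[str]:
    """Return a list of rule violations (empty if the config looks valid)."""
    errors = []
    modalities = {k: v for k, v in strategy.items() if k in ("video", "audio")}

    # Rule: at least one modality must have is_generated: true.
    if not any(m.get("is_generated", False) for m in modalities.values()):
        errors.append("at least one modality must set is_generated: true")

    # Rule: audio does not support first_frame or spatial_crop.
    for cond in modalities.get("audio", {}).get("conditions", []):
        if cond.get("type") in AUDIO_UNSUPPORTED:
            errors.append(f"audio does not support condition '{cond['type']}'")
    return errors


ok = {
    "name": "flexible",
    "video": {"is_generated": True, "conditions": [{"type": "first_frame"}]},
    "audio": {"is_generated": False},
}
bad = {
    "name": "flexible",
    "audio": {"is_generated": False, "conditions": [{"type": "spatial_crop"}]},
}
print(validate_strategy(ok))   # no violations
print(validate_strategy(bad))  # two violations
```

The trainer presumably performs its own validation; a check like this is mainly useful for catching mistakes before launching a long job.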

Quick Reference

Mode              | Video     | Audio     | Conditions       | Config
------------------|-----------|-----------|------------------|------------------------
T2V               | Generated | Generated | -                | t2v_lora
I2V               | Generated | Generated | first_frame      | i2v_lora
Video Extension   | Generated | Generated | prefix/suffix    | video_extend_lora
V2V IC-LoRA       | Generated | -         | reference        | v2v_ic_lora
A2V               | Generated | Frozen    | -                | a2v_lora
V2A (Foley)       | Frozen    | Generated | -                | v2a_lora
Video Inpainting  | Generated | Generated | mask             | video_inpainting_lora
Video Outpainting | Generated | Generated | spatial_crop     | video_outpainting_lora
T2A               | -         | Generated | -                | t2a_lora
Audio Extension   | -         | Generated | prefix/suffix    | audio_extend_lora
Audio Inpainting  | -         | Generated | mask             | audio_inpainting_lora
A2A IC-LoRA       | -         | Generated | reference        | a2a_ic_lora
AV2AV IC-LoRA     | Generated | Generated | reference (both) | av2av_ic_lora

Text-to-Video (T2V)

Generate video and audio from text prompts. Both modalities are denoised with no additional conditions.

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
  audio:
    is_generated: true
    latents_dir: "audio_latents"

Example config: t2v_lora.yaml

Image-to-Video (I2V)

Generate video conditioned on a starting image. The first frame is provided as a clean conditioning signal — no noise, timestep=0, excluded from loss. The probability parameter controls how often first-frame conditioning is applied; remaining samples train in pure T2V mode.

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
    conditions:
      - type: first_frame
        probability: 0.5
  audio:
    is_generated: true
    latents_dir: "audio_latents"

Example config: i2v_lora.yaml
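To illustrate how a probability of 0.5 splits training between I2V and pure T2V, the sketch below draws the decision independently for each sample. Whether the trainer draws per sample or per batch is an assumption here, and `apply_condition` is a hypothetical name.

```python
import random


def apply_condition(probability: float, rng: random.Random) -> bool:
    """Decide whether a condition is active for one sample.

    Assumption: the decision is an independent draw per sample.
    """
    return rng.random() < probability


rng = random.Random(0)
draws = [apply_condition(0.5, rng) for _ in range(10_000)]
fraction = sum(draws) / len(draws)
print(f"first_frame applied to {fraction:.1%} of samples; the rest train as T2V")
```

With probability: 1.0 the condition is always active, and with 0.0 it never is.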

Video Extension

Extend a video forward (or backward) in time. Prefix or suffix conditioning provides a span of existing latent frames as clean conditioning. The temporal_boundary sets the number of latent frames used as context (each latent frame = 8 pixel frames due to temporal compression).

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
    conditions:
      - type: prefix  # or "suffix" for backward extension
        temporal_boundary: 8  # 8 latent frames = 64 pixel frames
        probability: 1.0
  audio:
    is_generated: true
    latents_dir: "audio_latents"

The prefix and suffix conditions also work on the audio modality for audio extension. Add a prefix or suffix condition with a temporal_boundary to the audio modality's conditions list to condition on the first or last span of the audio latents.

Example configs: video_extend_lora.yaml (forward), video_suffix_lora.yaml (backward)
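As a rough illustration of what temporal_boundary selects, the sketch below marks which latent frames act as clean context for a prefix or suffix condition. `conditioning_frames` is a hypothetical helper, and the requirement that at least one frame is left to generate is an assumption rather than documented trainer behavior.

```python
def conditioning_frames(num_latent_frames: int, temporal_boundary: int, kind: str) -> list[bool]:
    """True marks a latent frame given as clean context; False marks a frame to generate."""
    # Assumption: the boundary must leave at least one frame to generate.
    if not 0 < temporal_boundary < num_latent_frames:
        raise ValueError("temporal_boundary must leave at least one frame to generate")
    if kind == "prefix":
        # Forward extension: the first frames are context.
        return [i < temporal_boundary for i in range(num_latent_frames)]
    if kind == "suffix":
        # Backward extension: the last frames are context.
        start = num_latent_frames - temporal_boundary
        return [i >= start for i in range(num_latent_frames)]
    raise ValueError(f"unknown condition type: {kind}")


print(conditioning_frames(6, 2, "prefix"))
print(conditioning_frames(6, 2, "suffix"))
```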

IC-LoRA / Video-to-Video (V2V)

In-Context LoRA learns transformations from paired videos. Pre-encoded reference latents are concatenated to the target sequence — reference tokens participate in bidirectional self-attention but receive no noise and are excluded from loss. This enables control adapters (depth, pose), style transfer, deblurring, colorization, and more.

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
    conditions:
      - type: reference
        latents_dir: "reference_latents"
        probability: 1.0
      - type: first_frame  # optional, composable with reference
        probability: 0.2

IC-LoRA is video-only by default (no audio modality block). Conditions can be composed — the example above also applies first-frame conditioning with 20% probability alongside the reference.

Example config: v2v_ic_lora.yaml
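The reference concatenation described above can be pictured with a small sketch. The token ordering (target first, then reference) and the function name `build_ic_lora_sequence` are assumptions made for illustration; only the "no noise, excluded from loss" behavior comes from the description above.

```python
def build_ic_lora_sequence(target_tokens: list, reference_tokens: list, sigma: float):
    """Concatenate target and reference tokens with per-token sigma and loss flags."""
    # Assumption: reference tokens are appended after the target tokens.
    tokens = target_tokens + reference_tokens
    # Target tokens are noised at the sampled sigma; reference tokens stay clean.
    sigmas = [sigma] * len(target_tokens) + [0.0] * len(reference_tokens)
    # Only target tokens contribute to the loss.
    in_loss = [True] * len(target_tokens) + [False] * len(reference_tokens)
    return tokens, sigmas, in_loss


tokens, sigmas, in_loss = build_ic_lora_sequence(["t0", "t1", "t2"], ["r0", "r1"], sigma=0.7)
print(tokens)
print(sigmas)
print(in_loss)
```

Because self-attention runs over the whole concatenated sequence, the target tokens can attend to the clean reference tokens even though the latter never receive a gradient through the loss mask.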

Dataset Requirements

  • Paired videos — each target video has a corresponding reference video
  • Same frame count between reference and target
  • Reference videos can optionally be at lower spatial resolution (see Scaled Reference below)
  • Both must be preprocessed before training

Dataset structure:

preprocessed_data_root/
├── latents/ # Target video latents
├── conditions/ # Text embeddings
└── reference_latents/ # Reference video latents (conditioning input)

Generating Reference Videos

Use the compute_reference.py script to generate reference videos (e.g. Canny edge maps) for a dataset:

uv run python scripts/compute_reference.py scenes_output_dir/ \
    --output scenes_output_dir/dataset.json

To compute a different condition (depth maps, pose skeletons, etc.), modify the compute_reference() function in the script.

Scaled Reference Conditioning

For more efficient training and inference, use downscaled reference videos while keeping targets at full resolution. The trainer automatically detects the scale factor from the dimension ratio between reference and target latents and adjusts positional encodings accordingly. This reduces conditioning tokens, leading to:

  • Faster training — shorter sequence lengths
  • Faster inference — reduced memory usage
  • Same aspect ratio maintained between reference and target
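A back-of-the-envelope calculation shows why downscaling helps. The sketch assumes one token per latent position and uses illustrative latent dimensions; neither is taken from the trainer.

```python
def token_count(frames: int, height: int, width: int) -> int:
    """Number of tokens, assuming one token per latent position."""
    return frames * height * width


def reference_tokens(frames: int, height: int, width: int, downscale_factor: int) -> int:
    """Tokens for a reference whose spatial size is divided by downscale_factor."""
    return token_count(frames, height // downscale_factor, width // downscale_factor)


# Illustrative latent shape: 4 frames of 24 x 24 positions.
target = token_count(4, 24, 24)
for factor in (1, 2, 4):
    ref = reference_tokens(4, 24, 24, factor)
    print(f"factor {factor}: {ref} reference tokens, {target + ref} total")
```

Halving the reference resolution cuts its token count to a quarter, since both spatial dimensions shrink while the frame count stays the same.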

Preprocess with the --reference-downscale-factor option:

uv run python scripts/process_dataset.py dataset.json \
    --resolution-buckets 768x768x25 \
    --model-path /path/to/ltx2.safetensors \
    --text-encoder-path /path/to/gemma \
    --reference-downscale-factor 2

The reference_video column is auto-detected by convention — no --reference-column flag needed.

Set downscale_factor on each reference validation condition to match:

validation:
  samples:
    - prompt: "..."
      conditions:
        - type: reference
          video: "/path/to/reference.mp4"
          downscale_factor: 2
          include_in_output: true

The scale factor must be a positive integer, and all dimensions must be divisible by 32. Common values are 1 (no scaling), 2 (half resolution), or 4 (quarter resolution).
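These constraints can be checked before preprocessing. The sketch below is a hypothetical helper, and it interprets "all dimensions" as the width and height of both the target and the downscaled reference, which is an assumption.

```python
def check_reference_scale(width: int, height: int, downscale_factor: int, multiple: int = 32) -> list[str]:
    """Return a list of problems with the chosen resolution and scale factor."""
    if isinstance(downscale_factor, bool) or not isinstance(downscale_factor, int) or downscale_factor < 1:
        return ["downscale_factor must be a positive integer"]
    problems = []
    for name, value in (("width", width), ("height", height)):
        if value % multiple:
            problems.append(f"target {name} {value} is not divisible by {multiple}")
        # Assumption: the downscaled reference must also be a multiple of 32.
        if value % downscale_factor or (value // downscale_factor) % multiple:
            problems.append(f"reference {name} {value}/{downscale_factor} is not divisible by {multiple}")
    return problems


print(check_reference_scale(768, 768, 2))  # no problems
print(check_reference_scale(800, 768, 2))  # reference width 400 is not a multiple of 32
```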

Audio-to-Video (A2V)

Generate video conditioned on frozen audio. Audio passes through the transformer clean (sigma=0) and influences video via the built-in cross-modal attention. Only video is denoised.

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
  audio:
    is_generated: false
    latents_dir: "audio_latents"

Example config: a2v_lora.yaml
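The effect of is_generated on noise and loss can be summarized per modality. The helper below is a hypothetical sketch that reads a strategy dictionary shaped like the YAML above; using a single sampled sigma for every generated modality is an assumption.

```python
def modality_noise_plan(strategy: dict, sampled_sigma: float) -> dict:
    """Map each configured modality to its noise level and loss participation."""
    plan = {}
    for name in ("video", "audio"):
        if name not in strategy:
            continue  # modality not configured (e.g. audio-only or video-only modes)
        generated = strategy[name]["is_generated"]
        plan[name] = {
            # Frozen modalities stay clean (sigma=0); generated ones are noised.
            "sigma": sampled_sigma if generated else 0.0,
            # Only generated modalities contribute to the loss.
            "in_loss": generated,
        }
    return plan


a2v = {
    "name": "flexible",
    "video": {"is_generated": True, "latents_dir": "latents"},
    "audio": {"is_generated": False, "latents_dir": "audio_latents"},
}
print(modality_noise_plan(a2v, sampled_sigma=0.6))
```

Swapping the two is_generated values gives the V2A (Foley) plan: video stays clean and only audio is noised and scored.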

Video-to-Audio / Foley (V2A)

Generate audio (Foley) conditioned on frozen video. Video passes through the transformer clean (sigma=0) and conditions audio via cross-modal attention. Only audio is denoised.

training_strategy:
  name: "flexible"
  video:
    is_generated: false
    latents_dir: "latents"
  audio:
    is_generated: true
    latents_dir: "audio_latents"

Example config: v2a_lora.yaml

Video Inpainting

Fill in masked regions of a video. Per-sample masks loaded from disk define which tokens are conditioning and which must be generated. Masks are float-valued in [0, 1]: fully masked tokens (1.0) receive clean latents and timestep=0, partially masked tokens receive blended latents with proportionally scaled timesteps, and unmasked tokens (0.0) are denoised normally. All conditioned tokens (mask > 0) are excluded from the training loss.

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
    conditions:
      - type: mask
        mask_dir: "video_masks"
        probability: 1.0
  audio:
    is_generated: true
    latents_dir: "audio_latents"

Dataset structure:

preprocessed_data_root/
├── latents/ # Video latents
├── conditions/ # Text embeddings
├── audio_latents/ # Audio latents
└── video_masks/ # Per-sample masks, float [0,1] (1 -> conditioning, 0 -> generate)

Example config: video_inpainting_lora.yaml
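The mask semantics can be made concrete with a per-token sketch. The linear interpolation used for noising and the exact blending formula are assumptions chosen to match the description (clean at mask 1.0, normal denoising at mask 0.0); the trainer's actual formulas may differ.

```python
def apply_mask(clean: list[float], noise: list[float], mask: list[float], sigma: float):
    """Return per-token model inputs, timesteps, and loss flags for inpainting."""
    latents, timesteps, in_loss = [], [], []
    for x, eps, m in zip(clean, noise, mask):
        # Assumption: noising is a linear interpolation between latent and noise.
        noisy = (1.0 - sigma) * x + sigma * eps
        # Blend clean and noisy latents according to the mask value.
        latents.append(m * x + (1.0 - m) * noisy)
        # Timestep shrinks proportionally: 0 when fully masked, sigma when unmasked.
        timesteps.append((1.0 - m) * sigma)
        # Any token with mask > 0 is conditioning and is excluded from the loss.
        in_loss.append(m == 0.0)
    return latents, timesteps, in_loss


latents, timesteps, in_loss = apply_mask(
    clean=[1.0, 1.0, 1.0], noise=[0.0, 0.0, 0.0], mask=[1.0, 0.5, 0.0], sigma=0.8
)
print(latents)
print(timesteps)
print(in_loss)
```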

Video Outpainting

Extend a video spatially beyond its original boundaries. A rectangular pixel region is provided as clean conditioning (no noise, timestep=0, excluded from loss) — the model learns to generate the surrounding content. The spatial_region is specified in pixel coordinates [y1, x1, y2, x2] and automatically converted to latent space.

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
    conditions:
      - type: spatial_crop
        spatial_region: [0, 0, 288, 576]  # y1, x1, y2, x2 in pixels
        probability: 1.0
  audio:
    is_generated: true
    latents_dir: "audio_latents"

spatial_crop is a video-only condition — it is not supported on the audio modality.

Example config: video_outpainting_lora.yaml
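The pixel-to-latent conversion of spatial_region might look like the sketch below. The spatial compression factor of 32 and the rejection of unaligned coordinates are both assumptions; the trainer performs its own conversion and may round instead.

```python
def pixel_region_to_latent(region: list[int], spatial_factor: int = 32) -> list[int]:
    """Convert [y1, x1, y2, x2] from pixel coordinates to latent coordinates."""
    # Assumption: coordinates must align with the latent grid.
    if any(v % spatial_factor for v in region):
        raise ValueError(f"coordinates must be multiples of {spatial_factor}")
    return [v // spatial_factor for v in region]


print(pixel_region_to_latent([0, 0, 288, 576]))
```

Under these assumptions, the region from the config above covers a 9 x 18 block of latent positions in each frame.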

Text-to-Audio (T2A)

Generate audio from text prompts with no video modality. Only the audio branch of the transformer is denoised. Since no video modality is configured, this mode uses audio-only LoRA targets — explicitly targeting audio_attn1, audio_attn2, and audio_ff modules.

training_strategy:
  name: "flexible"
  audio:
    is_generated: true
    latents_dir: "audio_latents"

With no video block in the strategy, the trainer only loads audio latents and text embeddings. LoRA adapters should explicitly target audio modules (e.g. audio_attn1.to_k) rather than short patterns like to_k which would also match video modules. See LoRA Target Modules Guidance below.

Example config: t2a_lora.yaml

Audio Extension

Extend audio forward (prefix) or backward (suffix) in time — the audio equivalent of Video Extension. A span of existing audio latent frames is provided as clean conditioning, and the model generates the continuation. The temporal_boundary sets the number of latent frames used as context. This mode uses audio-only LoRA targets.

training_strategy:
  name: "flexible"
  audio:
    is_generated: true
    latents_dir: "audio_latents"
    conditions:
      - type: prefix  # or "suffix" for backward extension
        temporal_boundary: 8
        probability: 1.0

Example configs: audio_extend_lora.yaml, audio_suffix_lora.yaml

Audio Inpainting

Fill in masked regions of audio. Per-sample masks loaded from disk define which audio tokens are conditioning and which must be generated — the audio equivalent of Video Inpainting. Masks are float-valued in [0, 1] with the same semantics as video inpainting. This mode uses audio-only LoRA targets.

training_strategy:
  name: "flexible"
  audio:
    is_generated: true
    latents_dir: "audio_latents"
    conditions:
      - type: mask
        mask_dir: "audio_masks"
        probability: 1.0

Dataset structure:

preprocessed_data_root/
├── conditions/ # Text embeddings
├── audio_latents/ # Audio latents
└── audio_masks/ # Per-sample masks, float [0,1] (1 -> conditioning, 0 -> generate)

Example config: audio_inpainting_lora.yaml

IC-LoRA / Audio-to-Audio (A2A)

In-Context LoRA for audio-to-audio transformations. Pre-encoded reference audio latents are concatenated to the target sequence — reference tokens participate in bidirectional self-attention but receive no noise and are excluded from loss. This enables audio style transfer, voice conversion, sound effect transformation, and more. This mode uses audio-only LoRA targets.

training_strategy:
  name: "flexible"
  audio:
    is_generated: true
    latents_dir: "audio_latents"
    conditions:
      - type: reference
        latents_dir: "reference_audio_latents"
        probability: 1.0

Dataset structure:

preprocessed_data_root/
├── conditions/ # Text embeddings
├── audio_latents/ # Target audio latents
└── reference_audio_latents/ # Reference audio latents (conditioning input)

Example config: a2a_ic_lora.yaml

AV2AV IC-LoRA

Joint audio-video In-Context LoRA — both modalities have reference conditioning. Pre-encoded reference latents are concatenated to each modality’s target sequence independently. This enables joint audiovisual transformations such as synchronized style transfer across both video and audio.

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
    conditions:
      - type: reference
        latents_dir: "reference_latents"
        probability: 1.0
  audio:
    is_generated: true
    latents_dir: "audio_latents"
    conditions:
      - type: reference
        latents_dir: "reference_audio_latents"
        probability: 1.0

Unlike audio-only IC-LoRA (A2A), AV2AV uses short LoRA target patterns like "to_k" to match all branches (video, audio, and cross-modal attention), since both modalities are trained.

Dataset structure:

preprocessed_data_root/
├── latents/ # Target video latents
├── audio_latents/ # Target audio latents
├── conditions/ # Text embeddings
├── reference_latents/ # Reference video latents (conditioning input)
└── reference_audio_latents/ # Reference audio latents (conditioning input)

Example config: av2av_ic_lora.yaml

Full Model Fine-tuning

All modes above default to training_mode: "lora". For full fine-tuning, set training_mode: "full" — this updates all model parameters rather than adding LoRA adapters.

model:
  training_mode: "full"

training_strategy:
  name: "flexible"
  video:
    is_generated: true
    latents_dir: "latents"
  audio:
    is_generated: true
    latents_dir: "audio_latents"

Full fine-tuning requires multiple high-end GPUs (e.g. 4–8× H100 80GB) and distributed training with FSDP. See the Training Guide for multi-GPU setup instructions.

LoRA Target Modules Guidance

The target_modules configuration determines which transformer modules receive LoRA adapters. The right choice depends on whether your training involves cross-modal (audio ↔ video) interaction.

For T2V, I2V, A2V, V2A, or any mode involving both modalities — use short patterns to match all branches (video, audio, and cross-modal attention):

target_modules:
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"

Short patterns like "to_k" match video modules (attn1.to_k, attn2.to_k), audio modules (audio_attn1.to_k, audio_attn2.to_k), and cross-modal modules (audio_to_video_attn.to_k, video_to_audio_attn.to_k). The cross-modal attention modules enable bidirectional information flow between audio and video, which is critical for synchronized audiovisual generation. See Understanding Target Modules for detailed guidance.
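The difference between short and explicit patterns can be demonstrated with a small matcher. This sketch assumes suffix matching on dot-separated module names, which is how list-valued target_modules are commonly matched in LoRA libraries; the trainer's exact matching rule is not described here, and the full module paths below are hypothetical.

```python
def matches(module_name: str, patterns: list[str]) -> bool:
    """True if the module name equals a pattern or ends with '.<pattern>'."""
    return any(module_name == p or module_name.endswith("." + p) for p in patterns)


# Hypothetical module paths built from the module names mentioned above.
MODULES = [
    "transformer_blocks.0.attn1.to_k",
    "transformer_blocks.0.attn2.to_k",
    "transformer_blocks.0.audio_attn1.to_k",
    "transformer_blocks.0.audio_attn2.to_k",
    "transformer_blocks.0.audio_to_video_attn.to_k",
    "transformer_blocks.0.video_to_audio_attn.to_k",
]


def select(patterns: list[str]) -> list[str]:
    return [name for name in MODULES if matches(name, patterns)]


print(len(select(["to_k"])))                      # all branches
print(select(["attn1.to_k", "attn2.to_k"]))        # video modules only
print(select(["audio_attn1.to_k", "audio_attn2.to_k"]))  # audio modules only
```

Under this matching rule, "attn1.to_k" does not match "audio_attn1.to_k" because the pattern must start at a dot boundary, which is what keeps explicit patterns scoped to one branch.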

For video-only IC-LoRA — explicitly target video modules (including FFN layers for better transformation quality):

target_modules:
  - "attn1.to_k"
  - "attn1.to_q"
  - "attn1.to_v"
  - "attn1.to_out.0"
  - "attn2.to_k"
  - "attn2.to_q"
  - "attn2.to_v"
  - "attn2.to_out.0"
  - "ff.net.0.proj"
  - "ff.net.2"

For audio-only modes (T2A, Audio Extension, Audio Inpainting, A2A IC-LoRA) — explicitly target audio modules:

target_modules:
  - "audio_attn1.to_k"
  - "audio_attn1.to_q"
  - "audio_attn1.to_v"
  - "audio_attn1.to_out.0"
  - "audio_attn2.to_k"
  - "audio_attn2.to_q"
  - "audio_attn2.to_v"
  - "audio_attn2.to_out.0"
  - "audio_ff.net.0.proj"
  - "audio_ff.net.2"

Audio-only modes have no video block in the strategy, so there is no need to train video or cross-modal attention modules. Targeting only audio_* modules keeps the LoRA small and focused.

Using Trained Models for Inference

After training, use the ltx-pipelines package for production inference with your trained LoRAs:

Training Mode                                           | Recommended Pipeline
--------------------------------------------------------|--------------------------------------------------
T2V / I2V / A2V / Extension / Inpainting / Outpainting  | TI2VidOneStagePipeline or TI2VidTwoStagesPipeline
IC-LoRA (V2V / A2A / AV2AV)                             | ICLoraPipeline
V2A (Foley) / T2A / Audio Extension / Audio Inpainting  | TI2VidOneStagePipeline or TI2VidTwoStagesPipeline

All pipelines support loading custom LoRAs via the loras parameter.

You can generate audio during validation even if you’re not training the audio branch. Set validation.generate_audio: true independently of whether audio has is_generated: true.

Migration from Legacy Strategies

Legacy text_to_video and video_to_video strategy configs remain supported and will continue to work, although they emit a deprecation warning. We recommend migrating to flexible for access to all conditioning modes.