Training Modes
The trainer uses the flexible training strategy (name: "flexible") — a unified conditioning framework that
supports all training modes through configuration. Every scenario is expressed by setting is_generated on each
modality and adding optional conditions, rather than choosing a separate strategy class.
Key Concepts
Before diving into individual modes, here are the core ideas behind the flexible strategy:
- is_generated: true means the modality is denoised during training and contributes to the loss. This is the modality the model learns to generate.
- is_generated: false means the modality is frozen (sigma=0, no noise, no loss). It passes through the transformer clean and acts as cross-modal conditioning for the generated modality.
- At least one modality must have is_generated: true.
- Conditions are per-modality and can be composed (e.g. reference + first_frame together on the video modality).
- Audio does not support first_frame or spatial_crop conditions; it supports only prefix, suffix, mask, and reference.
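The rules above can be expressed as a small validation sketch. The dictionary layout, the "type" key, and the function name validate_strategy are assumptions made for illustration; they are not the trainer's actual schema or API.

```python
# Condition names come from the text above; the dict layout is hypothetical.
AUDIO_UNSUPPORTED = {"first_frame", "spatial_crop"}


def validate_strategy(modalities):
    """Return a list of problems found in a {name: settings} modality mapping."""
    problems = []
    # Rule: at least one modality must be generated (denoised).
    if not any(m.get("is_generated", False) for m in modalities.values()):
        problems.append("at least one modality must have is_generated: true")
    # Rule: audio rejects first_frame and spatial_crop conditions.
    audio = modalities.get("audio", {})
    for cond in audio.get("conditions", []):
        if cond["type"] in AUDIO_UNSUPPORTED:
            problems.append(f"audio does not support the {cond['type']} condition")
    return problems
```

An empty list means the configuration satisfies both rules; the real trainer may enforce more constraints than these two.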
Quick Reference
Text-to-Video (T2V)
Generate video and audio from text prompts. Both modalities are denoised with no additional conditions.
Example config: t2v_lora.yaml
Image-to-Video (I2V)
Generate video conditioned on a starting image. The first frame is provided as a clean conditioning signal — no noise,
timestep=0, excluded from loss. The probability parameter controls how often first-frame conditioning is applied;
remaining samples train in pure T2V mode.
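As a rough illustration of how a probability parameter can split samples between I2V and pure T2V, the sketch below draws one random number per sample. The function name use_first_frame is hypothetical, and the trainer's actual sampling logic may differ.

```python
import random


def use_first_frame(probability, rng):
    """Decide for one sample whether first-frame conditioning is applied."""
    # rng.random() is in [0, 1), so probability=0 never fires and 1 always does.
    return rng.random() < probability


rng = random.Random(0)  # seeded only to make the sketch reproducible
applied = sum(use_first_frame(0.5, rng) for _ in range(10_000))
```

With probability 0.5, roughly half of the samples receive first-frame conditioning and the rest train as T2V.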
Example config: i2v_lora.yaml
Video Extension
Extend a video forward (or backward) in time. Prefix or suffix conditioning provides a span of existing latent frames
as clean conditioning. The temporal_boundary sets the number of latent frames used as context (each latent frame
= 8 pixel frames due to temporal compression).
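To get a feel for how much context a given temporal_boundary represents, the sketch below applies the 8-to-1 ratio stated above. The helper names are hypothetical, and the exact count may differ slightly if the encoder treats the first frame specially, which this text does not cover.

```python
PIXEL_FRAMES_PER_LATENT = 8  # ratio stated in the text above


def context_pixel_frames(temporal_boundary):
    """Approximate number of pixel frames covered by the conditioning span."""
    return temporal_boundary * PIXEL_FRAMES_PER_LATENT


def context_seconds(temporal_boundary, fps):
    """Approximate duration of the conditioning span at a given frame rate."""
    return context_pixel_frames(temporal_boundary) / fps
```

For example, a temporal_boundary of 3 at 24 fps corresponds to about 24 pixel frames, or one second of context.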
The prefix and suffix conditions also work on the audio modality for audio extension. Set temporal_boundary on
the audio modality’s conditions list to condition on a prefix or suffix of the audio latents.
Example configs: video_extend_lora.yaml (forward), video_suffix_lora.yaml (backward)
IC-LoRA / Video-to-Video (V2V)
In-Context LoRA learns transformations from paired videos. Pre-encoded reference latents are concatenated to the target sequence — reference tokens participate in bidirectional self-attention but receive no noise and are excluded from loss. This enables control adapters (depth, pose), style transfer, deblurring, colorization, and more.
IC-LoRA is video-only by default (no audio modality block). Conditions can be composed — the example above also applies first-frame conditioning with 20% probability alongside the reference.
Example config: v2v_ic_lora.yaml
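The token layout described for IC-LoRA can be pictured as a per-token noise level and a per-token loss mask over the concatenated sequence. This is a conceptual sketch with a hypothetical function name and reference-first ordering assumed; it does not reproduce the trainer's implementation.

```python
def build_ic_lora_sequence(num_ref_tokens, num_target_tokens, sigma):
    """Per-token noise level and loss mask for a [reference | target] sequence."""
    # Reference tokens stay clean (sigma=0); target tokens get the sampled sigma.
    sigmas = [0.0] * num_ref_tokens + [sigma] * num_target_tokens
    # Only target tokens contribute to the training loss.
    loss_mask = [False] * num_ref_tokens + [True] * num_target_tokens
    return sigmas, loss_mask
```

All tokens still attend to each other in self-attention; the sketch only captures which tokens are noised and which are scored.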
Dataset Requirements
- Paired videos — each target video has a corresponding reference video
- Same frame count between reference and target
- Reference videos can optionally be at lower spatial resolution (see Scaled Reference below)
- Both must be preprocessed before training
Dataset structure:
Generating Reference Videos
Use the compute_reference.py script to generate reference videos (e.g. Canny edge maps) for a dataset:
To compute a different condition (depth maps, pose skeletons, etc.), modify the compute_reference() function in the
script.
Scaled Reference Conditioning
For more efficient training and inference, use downscaled reference videos while keeping targets at full resolution. The trainer automatically detects the scale factor from the dimension ratio between reference and target latents and adjusts positional encodings accordingly. This reduces conditioning tokens, leading to:
- Faster training — shorter sequence lengths
- Faster inference — reduced memory usage
- Same aspect ratio maintained between reference and target
Preprocess with the --reference-downscale-factor option:
The reference_video column is auto-detected by convention — no --reference-column flag needed.
Set downscale_factor on each reference validation condition to match:
The scale factor must be a positive integer, and all dimensions must be divisible by 32. Common values are 1 (no scaling), 2 (half resolution), or 4 (quarter resolution).
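One way to picture the automatic detection is to divide target dimensions by reference dimensions and require a single positive integer ratio. The function name infer_reference_scale and the exact error handling are assumptions for illustration only.

```python
def infer_reference_scale(target_hw, reference_hw):
    """Infer an integer scale factor from (height, width) of target and reference."""
    th, tw = target_hw
    rh, rw = reference_hw
    # The target must be an exact integer multiple of the reference.
    if th % rh or tw % rw:
        raise ValueError("target dimensions must be integer multiples of the reference")
    scale_h, scale_w = th // rh, tw // rw
    # A single factor implies the same aspect ratio for both.
    if scale_h != scale_w:
        raise ValueError("reference and target must share the same aspect ratio")
    return scale_h
```

A result of 1 means no scaling, 2 means the reference is at half resolution, and so on.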
Audio-to-Video (A2V)
Generate video conditioned on frozen audio. Audio passes through the transformer clean (sigma=0) and influences video via the built-in cross-modal attention. Only video is denoised.
Example config: a2v_lora.yaml
Video-to-Audio / Foley (V2A)
Generate audio (Foley) conditioned on frozen video. Video passes through the transformer clean (sigma=0) and conditions audio via cross-modal attention. Only audio is denoised.
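For A2V and V2A alike, the frozen modality keeps sigma=0 while the generated one receives the sampled noise level. The helper below is a hypothetical illustration of that assignment, not the trainer's code.

```python
def modality_sigmas(is_generated, sampled_sigma):
    """Map each modality to its noise level: 0.0 if frozen, else the sampled sigma."""
    return {
        name: (sampled_sigma if generated else 0.0)
        for name, generated in is_generated.items()
    }
```

For V2A, passing {"video": False, "audio": True} leaves video clean and noises only audio; swapping the flags gives the A2V case.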
Example config: v2a_lora.yaml
Video Inpainting
Fill in masked regions of a video. Per-sample masks loaded from disk define which tokens are conditioning and which must be generated. Masks are float-valued in [0, 1]: fully masked tokens (1.0) receive clean latents and timestep=0, partially masked tokens receive blended latents with proportionally scaled timesteps, and unmasked tokens (0.0) are denoised normally. All conditioned tokens (mask > 0) are excluded from the training loss.
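The mask semantics can be sketched per token as follows. The blending formula is an assumption: it scales the noise level by (1 - mask) and interpolates linearly between clean latent and noise, which matches the description above but may not be the trainer's exact rule. The function name is hypothetical.

```python
def apply_inpainting_mask(clean, noise, mask, sigma):
    """Return (noisy latents, per-token timesteps, loss inclusion flags)."""
    noisy, timesteps, in_loss = [], [], []
    for x, n, m in zip(clean, noise, mask):
        # Assumed rule: the effective noise level shrinks as the mask grows.
        s = sigma * (1.0 - m)
        # Linear interpolation between clean latent and noise (assumption).
        noisy.append((1.0 - s) * x + s * n)
        timesteps.append(s)
        # Any conditioned token (mask > 0) is excluded from the loss.
        in_loss.append(m == 0.0)
    return noisy, timesteps, in_loss
```

Under this sketch a mask of 1.0 yields the clean latent with timestep 0, a mask of 0.0 yields the normally noised token, and values in between blend the two.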
Dataset structure:
Example config: video_inpainting_lora.yaml
Video Outpainting
Extend a video spatially beyond its original boundaries. A rectangular pixel region is provided as clean conditioning
(no noise, timestep=0, excluded from loss) — the model learns to generate the surrounding content. The spatial_region
is specified in pixel coordinates [y1, x1, y2, x2] and automatically converted to latent space.
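The pixel-to-latent conversion might look like the sketch below. The spatial compression factor of 32 and the rounding choice (floor for the start, ceiling for the end) are assumptions, and region_to_latent is a hypothetical name.

```python
def region_to_latent(spatial_region, spatial_factor=32):
    """Convert [y1, x1, y2, x2] in pixels to latent-grid coordinates."""
    y1, x1, y2, x2 = spatial_region

    def ceil_div(value):
        # Integer ceiling division without floating point.
        return -(-value // spatial_factor)

    # Round the start down and the end up so the region is fully covered.
    return [y1 // spatial_factor, x1 // spatial_factor, ceil_div(y2), ceil_div(x2)]
```

Choosing pixel coordinates that are multiples of the compression factor avoids any rounding ambiguity.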
spatial_crop is a video-only condition — it is not supported on the audio modality.
Example config: video_outpainting_lora.yaml
Text-to-Audio (T2A)
Generate audio from text prompts with no video modality. Only the audio branch of the transformer is denoised. Since
no video modality is configured, this mode uses audio-only LoRA targets — explicitly targeting audio_attn1,
audio_attn2, and audio_ff modules.
With no video block in the strategy, the trainer only loads audio latents and text embeddings. LoRA adapters should
explicitly target audio modules (e.g. audio_attn1.to_k) rather than short patterns like to_k which would also match
video modules. See LoRA Target Modules Guidance below.
Example config: t2a_lora.yaml
Audio Extension
Extend audio forward (prefix) or backward (suffix) in time — the audio equivalent of Video Extension. A span of
existing audio latent frames is provided as clean conditioning, and the model generates the continuation. The
temporal_boundary sets the number of latent frames used as context. This mode uses audio-only LoRA targets.
Example configs: audio_extend_lora.yaml, audio_suffix_lora.yaml
Audio Inpainting
Fill in masked regions of audio. Per-sample masks loaded from disk define which audio tokens are conditioning and which must be generated — the audio equivalent of Video Inpainting. Masks are float-valued in [0, 1] with the same semantics as video inpainting. This mode uses audio-only LoRA targets.
Dataset structure:
Example config: audio_inpainting_lora.yaml
IC-LoRA / Audio-to-Audio (A2A)
In-Context LoRA for audio-to-audio transformations. Pre-encoded reference audio latents are concatenated to the target sequence — reference tokens participate in bidirectional self-attention but receive no noise and are excluded from loss. This enables audio style transfer, voice conversion, sound effect transformation, and more. This mode uses audio-only LoRA targets.
Dataset structure:
Example config: a2a_ic_lora.yaml
AV2AV IC-LoRA
Joint audio-video In-Context LoRA — both modalities have reference conditioning. Pre-encoded reference latents are concatenated to each modality’s target sequence independently. This enables joint audiovisual transformations such as synchronized style transfer across both video and audio.
Unlike audio-only IC-LoRA (A2A), AV2AV uses short LoRA target patterns like "to_k" to match all branches (video,
audio, and cross-modal attention), since both modalities are trained.
Dataset structure:
Example config: av2av_ic_lora.yaml
Full Model Fine-tuning
All modes above default to training_mode: "lora". For full fine-tuning, set training_mode: "full" — this updates
all model parameters rather than adding LoRA adapters.
Full fine-tuning requires multiple high-end GPUs (e.g. 4–8× H100 80GB) and distributed training with FSDP. See the Training Guide for multi-GPU setup instructions.
LoRA Target Modules Guidance
The target_modules configuration determines which transformer modules receive LoRA adapters. The right choice depends
on whether your training involves cross-modal (audio ↔ video) interaction.
For T2V, I2V, A2V, V2A, or any mode involving both modalities — use short patterns to match all branches (video, audio, and cross-modal attention):
Short patterns like "to_k" match video modules (attn1.to_k, attn2.to_k), audio modules (audio_attn1.to_k,
audio_attn2.to_k), and cross-modal modules (audio_to_video_attn.to_k, video_to_audio_attn.to_k). The cross-modal
attention modules enable bidirectional information flow between audio and video, which is critical for synchronized
audiovisual generation. See Understanding Target Modules for detailed guidance.
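The difference between short and explicit patterns can be checked with a suffix-matching sketch. It assumes patterns match a module whose dotted name ends with the pattern, in the style used by PEFT for list-valued target_modules; the "blocks.0." prefix and the helper names are hypothetical.

```python
# Module suffixes come from the text above; the "blocks.0." prefix is made up.
MODULES = [
    "blocks.0.attn1.to_k",
    "blocks.0.attn2.to_k",
    "blocks.0.audio_attn1.to_k",
    "blocks.0.audio_attn2.to_k",
    "blocks.0.audio_to_video_attn.to_k",
    "blocks.0.video_to_audio_attn.to_k",
]


def matches(module_name, pattern):
    """True if the name equals the pattern or ends with it at a dot boundary."""
    return module_name == pattern or module_name.endswith("." + pattern)


def select(patterns):
    """Return the modules that any of the patterns would target."""
    return [m for m in MODULES if any(matches(m, p) for p in patterns)]
```

Here select(["to_k"]) picks up all six modules, while select(["audio_attn1.to_k"]) picks up only the audio self-attention projection, which is why audio-only modes use the explicit form.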
For video-only IC-LoRA — explicitly target video modules (including FFN layers for better transformation quality):
For audio-only modes (T2A, Audio Extension, Audio Inpainting, A2A IC-LoRA) — explicitly target audio modules:
Audio-only modes have no video block in the strategy, so there is no need to train video or cross-modal attention
modules. Targeting only audio_* modules keeps the LoRA small and focused.
Using Trained Models for Inference
After training, use the ltx-pipelines package for production inference with your trained LoRAs:
All pipelines support loading custom LoRAs via the loras parameter.
You can generate audio during validation even if you’re not training the audio branch. Set
validation.generate_audio: true independently of whether audio has is_generated: true.
Migration from Legacy Strategies
Legacy text_to_video and video_to_video strategy configs are still supported and will continue to work (with a
deprecation warning). We recommend migrating to flexible for access to all conditioning modes.