Trainer Quick Start | LTX Documentation

Get up and running with LTX-2 training in a few steps.

New to training? Start with the agent. You don’t have to run these steps by hand — open the repo in Claude Code and run /train-model. It makes the same decisions described below and explains each step as it goes, pausing for your approval before any heavy work. See the train-model skill for the full phase-by-phase reference.

Prerequisites

Before you begin, ensure you have:

LTX-2 model checkpoint — a local .safetensors file with the model weights. Download ltx-2.3-22b-dev.safetensors from the LTX-2.3 collection on HuggingFace.
Gemma text encoder — a local directory with the Gemma model (required for LTX-2). Download from HuggingFace.
Linux with CUDA — the trainer requires triton, which is Linux-only.
A GPU with enough VRAM — 80GB recommended for the standard config.

For 32GB GPUs (e.g. RTX 5090), use the low-VRAM config (t2v_lora_low_vram.yaml), which enables INT8 quantization and other memory optimizations.
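The file and platform prerequisites above can be checked before you install anything. The following is a minimal sketch using only the Python standard library. `check_prerequisites` is a hypothetical helper, not part of ltx-trainer, and it does not inspect CUDA or VRAM.

```python
import platform
from pathlib import Path


def check_prerequisites(model_path, text_encoder_path, system=None):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration, not part of ltx-trainer.
    """
    problems = []
    model = Path(model_path)
    encoder = Path(text_encoder_path)

    # The quick start asks for a local .safetensors checkpoint file.
    if model.suffix != ".safetensors" or not model.is_file():
        problems.append(f"checkpoint missing or not a .safetensors file: {model}")

    # The Gemma text encoder is expected to be a local directory.
    if not encoder.is_dir():
        problems.append(f"text encoder directory not found: {encoder}")

    # triton is Linux-only, so any other platform is reported.
    system = system or platform.system()
    if system != "Linux":
        problems.append(f"Linux is required, found: {system}")

    return problems


if __name__ == "__main__":
    found = check_prerequisites(
        "/path/to/ltx-2-model.safetensors", "/path/to/gemma-model"
    )
    for problem in found:
        print("WARNING:", problem)
```

The `system` parameter exists only so the platform check can be exercised on a non-Linux machine; in normal use, leave it unset.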

Installation

First install uv if you haven’t already, then clone the repository and change into it:

$ git clone https://github.com/Lightricks/LTX-2
$ cd LTX-2

The ltx-trainer package is part of the LTX-2 monorepo. Install dependencies from the repository root, then move into the trainer package:

$ # From the repository root
$ uv sync
$ cd packages/ltx-trainer

The trainer depends on the ltx-core and ltx-pipelines packages, which are installed automatically from the monorepo.

Training Workflow

1. Prepare your dataset

Organize your videos with captions, then preprocess them into cached latents and text embeddings:

$ uv run python scripts/process_dataset.py dataset.json \
>     --resolution-buckets "960x544x49" \
>     --model-path /path/to/ltx-2-model.safetensors \
>     --text-encoder-path /path/to/gemma-model

Audio latents are extracted from your videos automatically. Optional scene splitting (split_scenes.py) and automatic captioning (caption_videos.py) are also available. For the full preprocessing workflow — dataset format, captioning setup, resolution buckets, masks, and all CLI options — see the Dataset Preparation guide on GitHub.
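The --resolution-buckets value packs three numbers into one string. The sketch below assumes the order is width x height x frame count. It also assumes, without confirmation from this page, that spatial sizes should be multiples of 32 and frame counts of the form 8k+1 (both hold for "960x544x49"). Confirm both assumptions in the Dataset Preparation guide. `parse_bucket` and `bucket_warnings` are hypothetical helpers, not part of ltx-trainer.

```python
from typing import NamedTuple


class Bucket(NamedTuple):
    width: int
    height: int
    frames: int


def parse_bucket(spec: str) -> Bucket:
    """Parse a 'WIDTHxHEIGHTxFRAMES' string such as '960x544x49'.

    The field order is an assumption, not confirmed by the quick start.
    """
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected WIDTHxHEIGHTxFRAMES, got {spec!r}")
    width, height, frames = (int(part) for part in parts)
    return Bucket(width, height, frames)


def bucket_warnings(bucket: Bucket, spatial_multiple: int = 32, temporal_multiple: int = 8):
    """Check ASSUMED sizing rules: spatial multiples of 32, frames = 8k + 1."""
    warnings = []
    if bucket.width % spatial_multiple or bucket.height % spatial_multiple:
        warnings.append(
            f"{bucket.width}x{bucket.height} is not a multiple of {spatial_multiple}"
        )
    if bucket.frames % temporal_multiple != 1:
        warnings.append(
            f"{bucket.frames} frames is not of the form {temporal_multiple}k+1"
        )
    return warnings


if __name__ == "__main__":
    bucket = parse_bucket("960x544x49")
    print(bucket, bucket_warnings(bucket))
```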

2. Configure training

Create or modify a configuration YAML file. Start from one of the example configs in the configs/ directory:

t2v_lora.yaml — text-to-video LoRA
t2v_lora_low_vram.yaml — same, tuned for ~32GB VRAM (INT8 quantization and memory optimizations)
v2v_ic_lora.yaml — IC-LoRA video-to-video

Key settings to update:

model:
  model_path: "/path/to/ltx-2-model.safetensors"
  text_encoder_path: "/path/to/gemma-model"

data:
  preprocessed_data_root: "/path/to/preprocessed/data"

output_dir: "outputs/my_training_run"

See the Configuration Reference for all available options.
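If you create one config per run, the key settings above can be filled in with a script. The following is a minimal sketch that edits the YAML as text with the standard library, so it needs no YAML parser. `set_yaml_scalar` and `apply_overrides` are hypothetical helpers, and the approach only suits keys that appear exactly once and hold a single-line string.

```python
import json
import re

EXAMPLE_CONFIG = """\
model:
  model_path: "/path/to/ltx-2-model.safetensors"
  text_encoder_path: "/path/to/gemma-model"

data:
  preprocessed_data_root: "/path/to/preprocessed/data"

output_dir: "outputs/my_training_run"
"""


def set_yaml_scalar(text: str, key: str, value: str) -> str:
    """Set a one-line `key: value` entry, keeping its indentation.

    Text-based on purpose: other lines, including their comments and
    order, are left untouched. Anything after the key on its own line
    is replaced.
    """
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}:)[^\n]*$", re.MULTILINE)
    quoted = json.dumps(value)  # JSON strings are valid YAML double-quoted scalars
    new_text, count = pattern.subn(lambda m: f"{m.group(1)} {quoted}", text)
    if count != 1:
        raise KeyError(f"expected exactly one {key!r} entry, found {count}")
    return new_text


def apply_overrides(text: str, overrides: dict) -> str:
    """Apply several key/value overrides in turn."""
    for key, value in overrides.items():
        text = set_yaml_scalar(text, key, value)
    return text


if __name__ == "__main__":
    print(apply_overrides(EXAMPLE_CONFIG, {"output_dir": "outputs/run_001"}))
```

For anything beyond these simple string settings, a real YAML library is the safer choice.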

3. Start training

$ uv run python scripts/train.py configs/t2v_lora.yaml

For multi-GPU training:

$ uv run accelerate launch scripts/train.py configs/t2v_lora.yaml

See the Training Guide for distributed training (DDP/FSDP), HuggingFace Hub uploads, and Weights & Biases logging.
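When training is launched from a script, the two commands above differ only in the launcher. The following is a minimal sketch that builds either command as an argument list. `build_train_command` is a hypothetical helper, and it only constructs the command without running it.

```python
import shlex


def build_train_command(config_path: str, multi_gpu: bool = False) -> list:
    """Build the single-GPU or multi-GPU training command from the quick start."""
    # Multi-GPU runs swap `python` for `accelerate launch`; the rest is identical.
    launcher = ["accelerate", "launch"] if multi_gpu else ["python"]
    return ["uv", "run", *launcher, "scripts/train.py", config_path]


if __name__ == "__main__":
    print(shlex.join(build_train_command("configs/t2v_lora.yaml")))
    print(shlex.join(build_train_command("configs/t2v_lora.yaml", multi_gpu=True)))
```

To actually start a run, pass the list to `subprocess.run(command, check=True)` from the packages/ltx-trainer directory, since the script and config paths are relative to it.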

Training Modes

First time? Start with t2v_lora.yaml — it’s the simplest mode and only requires videos with captions. Explore other modes once you’ve confirmed your setup works.

All modes are expressed through the single flexible training strategy. The trainer supports:

Mode	Description	Example Config
Text-to-Video	Generate video+audio from text prompts	`t2v_lora.yaml`
Image-to-Video	Animate from a starting image	`i2v_lora.yaml`
Video Extension	Extend videos temporally (forward/backward)	`video_extend_lora.yaml`
IC-LoRA (V2V)	Video-to-video transformations	`v2v_ic_lora.yaml`
Audio-to-Video	Generate video conditioned on audio	`a2v_lora.yaml`
Video-to-Audio	Generate audio/foley from video	`v2a_lora.yaml`
Video Inpainting	Fill in masked regions of video	`video_inpainting_lora.yaml`
Video Outpainting	Extend video spatially	`video_outpainting_lora.yaml`
Text-to-Audio	Generate audio from text prompts	`t2a_lora.yaml`
Audio Extension	Extend audio temporally	`audio_extend_lora.yaml`
Audio Inpainting	Fill in masked regions of audio	`audio_inpainting_lora.yaml`
IC-LoRA (A2A)	Audio-to-audio transformations	`a2a_ic_lora.yaml`
AV2AV IC-LoRA	Audio+video IC-LoRA transformations	`av2av_ic_lora.yaml`
Full Fine-tuning	Full model training (any mode above)	Set `model.training_mode: "full"`

See Training Modes for detailed explanations of each mode.
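For scripted workflows, the table above can be mirrored as a lookup from mode name to example config. The following is a minimal sketch. `MODE_CONFIGS` and `config_for_mode` are hypothetical names, and the file names are copied from the table. Full fine-tuning is left out because it is a setting (`model.training_mode: "full"`) rather than a separate example file.

```python
# Mode names (lowercased) and example config file names, copied from the table.
MODE_CONFIGS = {
    "text-to-video": "t2v_lora.yaml",
    "image-to-video": "i2v_lora.yaml",
    "video extension": "video_extend_lora.yaml",
    "ic-lora (v2v)": "v2v_ic_lora.yaml",
    "audio-to-video": "a2v_lora.yaml",
    "video-to-audio": "v2a_lora.yaml",
    "video inpainting": "video_inpainting_lora.yaml",
    "video outpainting": "video_outpainting_lora.yaml",
    "text-to-audio": "t2a_lora.yaml",
    "audio extension": "audio_extend_lora.yaml",
    "audio inpainting": "audio_inpainting_lora.yaml",
    "ic-lora (a2a)": "a2a_ic_lora.yaml",
    "av2av ic-lora": "av2av_ic_lora.yaml",
}


def config_for_mode(mode: str) -> str:
    """Return the example config file name for a mode, ignoring case."""
    key = mode.strip().lower()
    if key not in MODE_CONFIGS:
        known = ", ".join(sorted(MODE_CONFIGS))
        raise ValueError(f"unknown mode {mode!r}; known modes: {known}")
    return MODE_CONFIGS[key]


if __name__ == "__main__":
    print(config_for_mode("Text-to-Video"))
```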

Reference docs on GitHub

Dataset Preparation: preprocess videos, generate captions, and build resolution buckets.
Configuration Reference: every available training parameter.
Training Guide: distributed training (DDP/FSDP), HuggingFace Hub, and W&B logging.
Utility Scripts: tools for dataset management and debugging.
Custom Training Strategies: go beyond the flexible strategy with your own training logic.
Troubleshooting: solutions to common training problems.

🎬 Happy training! May your loss curves trend down and your VRAM never run out.