Custom Workflows for More Control

This guide walks you through the Two-Stage Distilled workflow: a sample workflow for ComfyUI from LTX that generates video with synchronized audio using a two-pass pipeline (first at low resolution, then upscaled and refined at full resolution). It supports both Text-to-Video and Image-to-Video in a single workflow.

Compared to the default ComfyUI templates covered in the Text-to-Video and Image-to-Video beginner guides, this workflow uses higher-precision model files and additional nodes that give you more control over your video generation. Use it when you want higher-quality output, more control over resolution and VRAM usage, or a starting point for building your own custom workflows.

If you haven’t used the default templates yet, start there. This guide assumes you’re familiar with the basics of prompting, generating, and iterating in ComfyUI.

What’s Different from the Default ComfyUI Templates

The default templates are designed to get you generating as quickly as possible with minimal setup. This workflow uses the same two-stage pipeline architecture but upgrades three of the four model files and adds nodes that give you direct control over settings the templates handle automatically.

| Component | Default Template | This Workflow |
| --- | --- | --- |
| Model checkpoint | FP8-quantized (`ltx-2.3-22b-dev-fp8.safetensors`) | Full-precision (`ltx-2.3-22b-dev.safetensors`) |
| Distilled LoRA | `ltx-2.3-22b-distilled-lora-384.safetensors` | v1.1 (`ltx-2.3-22b-distilled-lora-384-1.1.safetensors`) |
| Text encoder | FP4-quantized Gemma 3 12B | Standard-precision Gemma 3 12B |
| Spatial upscaler | `ltx-2.3-spatial-upscaler-x2-1.1.safetensors` | Same |

The full-precision checkpoint and higher-precision text encoder produce more detailed output with better color fidelity, and the v1.1 LoRA improves motion quality. The trade-off is higher VRAM usage and larger model file downloads.
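To get a feel for the size of that trade-off, the weights-only footprint can be estimated as parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes that "22b" and "12B" in the model names mean roughly 22 and 12 billion parameters, and that "full precision" and "standard precision" mean 16-bit weights; neither assumption is confirmed by this guide. Actual VRAM usage also depends on activations, offloading, and the other loaded models.

```python
# Rough weights-only size estimate: parameters x bytes per parameter.
# ASSUMPTIONS (not stated in the guide): "22b"/"12B" are parameter counts
# in billions, and full/standard precision means 16-bit weights.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "16-bit": 2.0}


def weight_gib(params_billions: float, precision: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 2**30


comparisons = [
    ("checkpoint, default template", 22, "fp8"),
    ("checkpoint, this workflow", 22, "16-bit"),
    ("text encoder, default template", 12, "fp4"),
    ("text encoder, this workflow", 12, "16-bit"),
]
for label, params, precision in comparisons:
    print(f"{label}: ~{weight_gib(params, precision):.1f} GiB")
```

Under these assumptions, moving the checkpoint from 8-bit to 16-bit weights doubles its size, and moving the text encoder from 4-bit to 16-bit quadruples it.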

Beyond the model upgrades, the workflow also exposes:

Tiled VAE decoding — adjust tile count and overlap to balance VRAM usage against decode speed
API text encoding — optionally offload text encoding to a free API, freeing local VRAM for generation
Direct frame count control — set the exact number of frames rather than a duration in seconds

Step-by-Step Guide

Check the system requirements to make sure you have sufficient hardware to support running this workflow.

1. Download and Load the Workflow

Download the Two-Stage Distilled workflow JSON from our GitHub repository and drag it into ComfyUI to load it.

2. Install Custom Nodes and Download Models

This workflow requires the ComfyUI-LTXVideo custom node package. Open the Workflow Overview panel (right sidebar). If you’re missing any custom nodes or model files, it will list them and let you install or download them directly.

The workflow uses these model files:

| File | Description | Placement |
| --- | --- | --- |
| `ltx-2.3-22b-dev.safetensors` | Full-precision model checkpoint | `ComfyUI/models/checkpoints/` |
| `ltx-2.3-22b-distilled-lora-384-1.1.safetensors` | Distilled LoRA (v1.1) | `ComfyUI/models/loras/` |
| `comfy_gemma_3_12B_it.safetensors` | Text encoder (Gemma 3 12B) | `ComfyUI/models/text_encoders/` |
| `ltx-2.3-spatial-upscaler-x2-1.1.safetensors` | Spatial upscaler (2x) | `ComfyUI/models/latent_upscale_models/` |
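If you place files manually, a small script can confirm that each one landed in the expected folder. This is a minimal sketch: the file names and folders come from the table above, while the `missing_models` helper and the `"ComfyUI"` root path are illustrative and should be adapted to your install location.

```python
from pathlib import Path

# File name -> folder relative to the ComfyUI root (from the table above).
MODEL_FILES = {
    "ltx-2.3-22b-dev.safetensors": "models/checkpoints",
    "ltx-2.3-22b-distilled-lora-384-1.1.safetensors": "models/loras",
    "comfy_gemma_3_12B_it.safetensors": "models/text_encoders",
    "ltx-2.3-spatial-upscaler-x2-1.1.safetensors": "models/latent_upscale_models",
}


def missing_models(comfy_root: str) -> list[str]:
    """Return the relative paths of expected model files that are not present."""
    root = Path(comfy_root)
    return [
        (Path(folder) / name).as_posix()
        for name, folder in MODEL_FILES.items()
        if not (root / folder / name).is_file()
    ]


# Hypothetical install location; change it to where your ComfyUI folder lives.
for rel_path in missing_models("ComfyUI"):
    print("missing:", rel_path)
```

The script prints nothing when all four files are in place.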

Model downloads may take some time depending on your connection. You only need to download these once.

All files are also available in the LTX-2.3 HuggingFace collection if you prefer to download them manually.

3. Choose Text-to-Video or Image-to-Video

This workflow supports both modes. Find the bypass_i2v toggle:

true = Text-to-Video mode (default). The model generates the video entirely from your prompt.
false = Image-to-Video mode. The model uses your source image as the first frame.

For Image-to-Video, also find the LoadImage node and select your source image. The workflow includes a resize node that scales your image to fit the configured resolution automatically.

4. Write Your Prompt

Find the CLIP Text Encode (Positive Prompt) node and write your prompt.

For Text-to-Video, describe the full scene: setting, characters, camera movement, and audio
For Image-to-Video, focus on motion, action, and audio: the image already provides the visual context

See the Prompting Guide for detailed tips and examples.

The workflow also includes a CLIP Text Encode (Negative Prompt) node, pre-filled with "pc game, console game, video game, cartoon, childish, ugly". This steers the model away from common visual artifacts. You can leave it as-is or edit it to exclude specific styles or qualities from your output.

5. Set Frame Count and Resolution

This workflow uses frame count directly rather than duration in seconds. The default is 121 frames at 24 fps, which produces roughly 5 seconds of video. To calculate frames for a different duration: frames = duration in seconds × frame rate + 1.
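The formula above is easy to wrap in a helper. This is a minimal sketch; the function names are illustrative, and it only applies the stated formula without checking any other constraints the model may place on frame counts.

```python
def frames_for_duration(seconds: float, fps: int = 24) -> int:
    """frames = duration in seconds x frame rate + 1."""
    return round(seconds * fps) + 1


def duration_for_frames(frames: int, fps: int = 24) -> float:
    """Inverse of the formula: seconds covered by a given frame count."""
    return (frames - 1) / fps


print(frames_for_duration(5))    # 121, the workflow default
print(frames_for_duration(10))   # 241
print(duration_for_frames(121))  # 5.0
```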

The default resolution is 960×544, which is upscaled to approximately 1920×1088 by the spatial upscaler. Width and height must each be divisible by 32 (960 = 30 × 32, 544 = 17 × 32).

| Parameter | Node | Default |
| --- | --- | --- |
| Frame count | `number of frames` | 121 |
| Frame rate | `fps` | 24 |
| Width | `EmptyLTXVLatentVideo` | 960 |
| Height | `EmptyLTXVLatentVideo` | 544 |
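When you pick a custom resolution, it helps to check the divisible-by-32 rule before generating. The sketch below validates a size and snaps an arbitrary value to the nearest multiple of 32. The helper names are illustrative, and rounding to the nearest multiple is one reasonable choice rather than something the workflow does for you.

```python
def is_valid_size(width: int, height: int, multiple: int = 32) -> bool:
    """True when both dimensions are divisible by `multiple`."""
    return width % multiple == 0 and height % multiple == 0


def snap(value: int, multiple: int = 32) -> int:
    """Round a dimension to the nearest multiple (ties round up), minimum one multiple."""
    return max(multiple, (value + multiple // 2) // multiple * multiple)


print(is_valid_size(960, 544))  # True
print(is_valid_size(960, 540))  # False
print(snap(540), snap(1080))    # 544 1088
```

Note that 540 and 1080 snap to 544 and 1088, which is consistent with the default height of 544 and the upscaled height of 1088.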

The full-precision checkpoint requires more VRAM than the FP8 version. If you run into memory issues, try reducing the resolution, lowering the frame count, or enabling API text encoding (see API Text Encoding under Customization Options).

6. Generate

Click Run to start generation. The pipeline runs the same two-stage process as the default templates:

Stage 1 — Generates video and audio jointly at the base resolution (960×544 at default settings) in 8 steps
Upscale — The spatial upscaler doubles the video resolution
Stage 2 — Refines the upscaled video at full resolution (~1920×1088) in 3 steps
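The three stages above can be summarized as plain data, which is handy when you change the base resolution and want to see what each stage will work on. This is a descriptive sketch only: the `Stage` class and `two_stage_plan` function are illustrative and do not correspond to nodes in the workflow. The step counts and the 2x factor are the defaults listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    width: int
    height: int
    steps: int  # sampling steps; 0 for the upscale, which is not a sampling pass


def two_stage_plan(width: int = 960, height: int = 544) -> list[Stage]:
    """Describe the pipeline for a given base resolution, using the guide's defaults."""
    full_w, full_h = width * 2, height * 2  # the spatial upscaler doubles each dimension
    return [
        Stage("Stage 1 (joint video + audio)", width, height, 8),
        Stage("Upscale (2x)", full_w, full_h, 0),
        Stage("Stage 2 (refine)", full_w, full_h, 3),
    ]


plan = two_stage_plan()
for stage in plan:
    print(f"{stage.name}: {stage.width}x{stage.height}, {stage.steps} steps")
print("total sampling steps:", sum(stage.steps for stage in plan))  # 11
```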

For a detailed explanation of how each stage works, see the Text-to-Video or Image-to-Video guides.

7. Review and Iterate

The output is saved as an MP4 with synchronized audio. To iterate:

Adjust the prompt to change the motion, action, or audio
Change the frame count for longer or shorter video
Switch between T2V and I2V using the bypass toggle
Try different resolutions to match your target format

Customization Options

This workflow exposes more settings than the default ComfyUI templates, but you don’t need to change most of them. The defaults are tuned to produce good results out of the box. The settings in the step-by-step guide above (prompt, mode, frame count, and resolution) are the ones you’ll adjust for every generation. The options below are worth exploring once you’re comfortable with the basics.

Negative Prompt

The negative prompt is pre-filled with defaults that work well for most use cases. If you’re seeing specific unwanted qualities in your output (a particular visual style, lighting issue, or motion artifact), try adding descriptive terms to the negative prompt to steer the model away from them.

Tiled VAE Decode

The LTXVTiledVAEDecode node splits the video decoding into tiles, reducing peak VRAM usage at the cost of slightly slower decoding. You can adjust the tile count and overlap between tiles. The defaults work well for most hardware — adjust these only if you’re running into memory issues during the decode step, in which case try increasing the tile count.
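To see why more tiles lowers peak memory, consider how one dimension can be split into overlapping tiles. The sketch below is a generic illustration of tiling with overlap, not the algorithm used by the LTXVTiledVAEDecode node. The `tile_spans` function, its parameters, and the pixel units are all assumptions made for the example.

```python
def tile_spans(length: int, tiles: int, overlap: int) -> list[tuple[int, int]]:
    """Split [0, length) into `tiles` equal-size spans whose neighbors share
    at least `overlap` units. Generic illustration only."""
    if tiles < 1 or not 0 <= overlap < length:
        raise ValueError("need tiles >= 1 and 0 <= overlap < length")
    if tiles == 1:
        return [(0, length)]
    # Smallest tile size that covers the length with the requested overlap (ceiling division).
    tile_size = -(-(length + (tiles - 1) * overlap) // tiles)
    slack = length - tile_size  # start position of the last tile
    starts = [i * slack // (tiles - 1) for i in range(tiles)]
    return [(start, start + tile_size) for start in starts]


print(tile_spans(1920, 4, 64))  # [(0, 528), (464, 992), (928, 1456), (1392, 1920)]
print(tile_spans(1920, 8, 64))  # eight spans of width 296
```

In this illustration, doubling the tile count from 4 to 8 shrinks each tile from 528 to 296 units, so each decode pass handles less data at once, while the number of passes (and the overlapping work) goes up.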

CFG

Both stages use CFG 1. The distilled model was trained to produce good results at this value because guidance is baked into the distillation process. Raising CFG does not improve output the way it would with a standard diffusion model. Increasing it above 1 adds computational overhead (doubling the forward passes per step) and can cause oversaturation or distortion. If you want to experiment, stay in the 1.0–1.5 range.
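The overhead follows from the standard classifier-free guidance formula, which blends a conditional and an unconditional prediction. The sketch below shows that formula on plain lists of numbers. It is an illustration of the general technique, not code from the workflow, and the claim that the unconditional pass can be skipped at scale 1 describes a common sampler optimization rather than anything this guide states about specific nodes.

```python
def cfg_combine(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes_per_step(scale: float) -> int:
    """At scale 1 the result equals the conditional prediction, so one pass is enough."""
    return 1 if scale == 1.0 else 2


cond, uncond = [1.0, -2.0], [0.5, 0.0]
print(cfg_combine(cond, uncond, 1.0))  # [1.0, -2.0], identical to cond
print(cfg_combine(cond, uncond, 1.5))  # [1.25, -3.0], pushed further from uncond
print(forward_passes_per_step(1.0), forward_passes_per_step(1.5))  # 1 2
```

At scale 1 the unconditional term cancels out, which is why the negative prompt has no mathematical effect in this formula at that setting and why any value above 1 requires a second model evaluation per step.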

API Text Encoding

If you’re running low on VRAM, the workflow includes two GemmaAPITextEncode nodes (one for the positive prompt, one for the negative prompt) that are bypassed by default. When enabled, these offload text encoding to a free LTX API instead of running the Gemma 3 model locally, freeing significant VRAM for generation.

To enable API text encoding, right-click each GemmaAPITextEncode node and set it to active, then bypass the local LTXAVTextEncoderLoader and CLIPTextEncode nodes. You’ll need an API key from the LTX API Console — enter it in the LTX API KEY node.

Using LoRAs

LoRAs can be added to further customize the model’s output style, motion characteristics, or character appearance. Add a LoRALoader node to your workflow to apply:

Style LoRAs for artistic or visual aesthetics
Motion LoRAs for specific types of movement
Character LoRAs for consistent character appearance

See the LoRA guide for usage instructions.

Python

Video generation is also available through the PyTorch API for programmatic use and custom pipelines. See the PyTorch API documentation for setup and usage.
