This guide walks you through the Two-Stage Distilled workflow: a sample workflow for ComfyUI from LTX that generates video with synchronized audio using a two-pass pipeline (first at low resolution, then upscaled and refined at full resolution). It supports both Text-to-Video and Image-to-Video in a single workflow.
Compared to the default ComfyUI templates covered in the Text-to-Video and Image-to-Video beginner guides, this workflow uses higher-precision model files and additional nodes that give you more control over your video generation. Use it when you want higher-quality output, more control over resolution and VRAM usage, or a starting point for building your own custom workflows.
If you haven’t used the default templates yet, start there. This guide assumes you’re familiar with the basics of prompting, generating, and iterating in ComfyUI.
The default templates are designed to get you generating as quickly as possible with minimal setup. This workflow uses the same two-stage pipeline architecture but upgrades three of the four model files and adds nodes that give you direct control over settings the templates handle automatically.
The full-precision checkpoint and higher-precision text encoder produce more detailed output with better color fidelity, and the v1.1 LoRA improves motion quality. The trade-off is higher VRAM usage and larger model file downloads.
Beyond the model upgrades, the workflow also exposes:
Check the system requirements to make sure your hardware can run this workflow.
Download the Two-Stage Distilled workflow JSON from our GitHub repository and drag it into ComfyUI to load it.
This workflow requires the ComfyUI-LTXVideo custom node package. Open the Workflow Overview panel (right sidebar). If you’re missing any custom nodes or model files, it will list them and let you install or download them directly.
The workflow uses these model files:
Model downloads may take some time depending on your connection. You only need to download these once.
All files are also available in the LTX-2.3 HuggingFace collection if you prefer to download them manually.
This workflow supports both modes. Find the bypass_i2v toggle:
true = Text-to-Video mode (default). The model generates the video entirely from your prompt.
false = Image-to-Video mode. The model uses your source image as the first frame.
For Image-to-Video, also find the LoadImage node and select your source image. The workflow includes a resize node that scales your image to fit the configured resolution automatically.
Find the CLIP Text Encode (Positive Prompt) node and write your prompt.
See the Prompting Guide for detailed tips and examples.
The workflow also includes a CLIP Text Encode (Negative Prompt) node, pre-filled with "pc game, console game, video game, cartoon, childish, ugly". This steers the model away from common visual artifacts. You can leave it as-is or edit it to exclude specific styles or qualities from your output.
This workflow takes a frame count directly rather than a duration in seconds. The default is 121 frames at 24 fps, which produces roughly 5 seconds of video. To calculate the frame count for a different duration: frames = duration in seconds × frame rate + 1. For example, 5 seconds at 24 fps gives 5 × 24 + 1 = 121 frames.
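As a rough sketch of the frame-count formula above, the following Python helpers convert between duration and frame count. The function names are illustrative and are not part of the workflow or of ComfyUI; the 24 fps default simply mirrors the workflow default.

```python
def frames_for_duration(seconds: float, fps: int = 24) -> int:
    """Frame count for a clip: duration in seconds x frame rate + 1."""
    return round(seconds * fps) + 1


def duration_for_frames(frames: int, fps: int = 24) -> float:
    """Inverse of the formula above: clip length in seconds."""
    return (frames - 1) / fps


if __name__ == "__main__":
    # The workflow default: 5 seconds at 24 fps.
    print(frames_for_duration(5))    # 121
    print(duration_for_frames(121))  # 5.0
```

Enter the resulting integer in the workflow's frame count setting.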
The default resolution is 960×544, which is upscaled to approximately 1920×1088 by the spatial upscaler. Both the width and the height must be divisible by 32.
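If you want a different output size, a small helper can pick a first-stage resolution that satisfies the divisible-by-32 rule. This is a minimal sketch with hypothetical function names; it assumes the spatial upscaler doubles each dimension, which is inferred from the 960×544 to 1920×1088 default rather than stated by the workflow.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round to the nearest multiple (ties round up), never below one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def is_valid_resolution(width: int, height: int, multiple: int = 32) -> bool:
    """True when both dimensions are divisible by the required multiple."""
    return width % multiple == 0 and height % multiple == 0


def first_stage_resolution(target_width: int, target_height: int,
                           upscale: int = 2, multiple: int = 32) -> tuple[int, int]:
    """First-stage size for a desired final size (assumes a 2x upscaler)."""
    return (
        snap_to_multiple(target_width // upscale, multiple),
        snap_to_multiple(target_height // upscale, multiple),
    )


if __name__ == "__main__":
    width, height = first_stage_resolution(1920, 1080)
    print(width, height)                       # 960 544
    print(is_valid_resolution(width, height))  # True
```

Note that a 1920×1080 target yields 960×544, not 960×540, because 540 is not divisible by 32. That is why the final output lands near 1920×1088.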
The full-precision checkpoint requires more VRAM than the FP8 version. If you run into memory issues, try reducing the resolution, lowering the frame count, or enabling API text encoding (see below).
Click Run to start generation. The pipeline runs the same two-stage process as the default templates:
For a detailed explanation of how each stage works, see the Text-to-Video or Image-to-Video guides.
The output is saved as an MP4 with synchronized audio. To iterate:
This workflow exposes more settings than the default ComfyUI templates, but you don’t need to change most of them. The defaults are tuned to produce good results out of the box. The settings in the step-by-step guide above (prompt, mode, frame count, and resolution) are the ones you’ll adjust for every generation. The options below are worth exploring once you’re comfortable with the basics.
The negative prompt is pre-filled with defaults that work well for most use cases. If you’re seeing specific unwanted qualities in your output (a particular visual style, lighting issue, or motion artifact), try adding descriptive terms to the negative prompt to steer the model away from them.
The LTXVTiledVAEDecode node splits video decoding into tiles, reducing peak VRAM usage at the cost of slightly slower decoding. You can adjust the tile count and the overlap between tiles. The defaults work well for most hardware. Adjust them only if you run out of memory during the decode step, in which case try increasing the tile count.
Both stages use CFG 1. The distilled model was trained to produce good results at this value because guidance is baked into the distillation process. Raising CFG does not improve output the way it would with a standard diffusion model. Increasing it above 1 adds computational overhead (doubling the forward passes per step) and can cause oversaturation or distortion. If you want to experiment, stay in the 1.0–1.5 range.
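To make the overhead concrete, here is a small sketch of the forward-pass arithmetic described above. The function name and the step count of 8 are hypothetical (check your sampler settings for the real value); the only assumption taken from this guide is that any CFG value above 1 doubles the forward passes per step.

```python
def forward_passes(steps: int, cfg: float) -> int:
    """Model forward passes for one sampling stage.

    At CFG 1 each step needs a single pass. Any other value adds a
    second (unconditional) pass per step.
    """
    passes_per_step = 1 if cfg == 1.0 else 2
    return steps * passes_per_step


if __name__ == "__main__":
    steps = 8  # hypothetical step count, for illustration only
    print(forward_passes(steps, cfg=1.0))  # 8
    print(forward_passes(steps, cfg=1.5))  # 16
```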
If you’re running low on VRAM, the workflow includes two GemmaAPITextEncode nodes (one for the positive prompt, one for the negative prompt) that are bypassed by default. When enabled, these offload text encoding to a free LTX API instead of running the Gemma 3 model locally, freeing significant VRAM for generation.
To enable API text encoding, right-click each GemmaAPITextEncode node and set it to active, then bypass the local LTXAVTextEncoderLoader and CLIPTextEncode nodes. You’ll need an API key from the LTX API Console — enter it in the LTX API KEY node.
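If you prefer to prepare the workflow file outside the UI, the same toggles can be sketched in Python. This is an illustration under stated assumptions, not an official tool: it assumes a UI-format workflow JSON with a top-level "nodes" list in which each node has "type" and "mode" fields, that mode 0 means active and mode 4 means bypassed, and that the node type strings match the names used in this guide. Verify all of these against your own file before relying on it. The function name is hypothetical.

```python
import copy

MODE_ACTIVE = 0   # assumed value for an active node
MODE_BYPASS = 4   # assumed value for a bypassed node

API_NODE_TYPES = {"GemmaAPITextEncode"}
LOCAL_NODE_TYPES = {"LTXAVTextEncoderLoader", "CLIPTextEncode"}


def use_api_text_encoding(workflow: dict) -> dict:
    """Return a copy with API encoders active and local encoders bypassed."""
    updated = copy.deepcopy(workflow)
    for node in updated.get("nodes", []):
        node_type = node.get("type")
        if node_type in API_NODE_TYPES:
            node["mode"] = MODE_ACTIVE
        elif node_type in LOCAL_NODE_TYPES:
            node["mode"] = MODE_BYPASS
    return updated


if __name__ == "__main__":
    # Tiny stand-in for a real workflow file, for illustration only.
    sample = {"nodes": [
        {"id": 1, "type": "GemmaAPITextEncode", "mode": MODE_BYPASS},
        {"id": 2, "type": "CLIPTextEncode", "mode": MODE_ACTIVE},
        {"id": 3, "type": "LoadImage", "mode": MODE_ACTIVE},
    ]}
    for node in use_api_text_encoding(sample)["nodes"]:
        print(node["type"], node["mode"])
```

The sketch only flips node modes. You still need to enter your API key in the LTX API KEY node.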
LoRAs can be added to further customize the model’s output style, motion characteristics, or character appearance. Add a LoRALoader node to your workflow to apply:
See the LoRA guide for usage instructions.
Video generation is also available through the PyTorch API for programmatic use and custom pipelines. See the PyTorch API documentation for setup and usage.