Text-to-Video
Generate synchronized video and audio entirely from text prompts.
What This Workflow Is For
Use Text-to-Video when you want:
- Full creative freedom to explore concepts from scratch
- No fixed reference - there’s no specific character or frame to anchor the scene
- Narrative exploration to test styles, moods, or story ideas
Before Starting
ComfyUI and the LTX-2 nodes need to be installed. See our Installation guide for full instructions.
Step-by-Step Guide
1. Load the Workflow
- Open ComfyUI
- Load the Text-to-Video workflow JSON from the LTX-2 repository
- The workflow will display as a node graph
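If you'd rather drive generations from a script, ComfyUI also accepts workflows over its local HTTP API. A minimal sketch, assuming a default local install on port 8188 and a workflow exported in API format (Save (API Format) in ComfyUI, not the regular graph JSON); the filename is illustrative:

```python
import json
import urllib.request

# Load the API-format workflow export (filename is hypothetical).
with open("ltx2_text_to_video_api.json") as f:
    workflow = json.load(f)

# POST it to ComfyUI's /prompt endpoint to queue a generation.
payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # default local ComfyUI address
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # queue confirmation including a prompt_id
```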
2. Load Model Checkpoints
Locate the LTXVCheckpointLoader node and configure the following (a scripted sketch follows this list):
- Model selection - Choose between the full and distilled checkpoints
- Full model: Higher quality, slower generation
- Distilled model: Faster generation, slightly reduced quality
- VAE: Automatically loaded with the checkpoint
- Text encoder: Gemma 3 encoder for processing prompts
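Continuing the scripted sketch from step 1, these choices are plain entries in the node's inputs dictionary. The node id and filename below are placeholders, and the input name follows ComfyUI's stock checkpoint loader; inspect your own export for the real values:

```python
# Point the loader at the distilled checkpoint for faster iteration.
ckpt = workflow["1"]["inputs"]                     # "1" is a placeholder node id
ckpt["ckpt_name"] = "ltx2_distilled.safetensors"   # hypothetical filename
```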
3. Configure Parameters
Find the LTXVImgToVideoConditioning node (the same node is used for T2V) and set the following (a scripted sketch follows these settings):
Resolution
Choose from standard aspect ratios:
- 768×512 (3:2 landscape)
- 512×768 (2:3 portrait)
- 704×512 (≈4:3 standard)
- 512×704 (≈3:4 vertical)
- 640×640 (1:1 square)
Important: Higher resolutions require more VRAM. Start with lower resolutions for testing.
Frame Count
- Maximum: 257 frames (~10 seconds at 25 fps)
- Recommended range: 121-161 frames for balanced quality and memory usage
- Shorter videos: Use 65-97 frames for quick iterations
Frame Rate
- Standard: 25 fps (default, works for most content)
- Smooth motion: 30 fps (better for fast action)
- Cinematic: 24 fps (film-like feel)
Critical: The frame rate must match across all nodes in your workflow, including the upscaler and video decoder.
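As promised above, here is how those settings look in the scripted sketch. The node id and input names are illustrative, so check them against your own export:

```python
cond = workflow["7"]["inputs"]   # placeholder id for the conditioning node
cond["width"] = 768              # 3:2 landscape
cond["height"] = 512
cond["length"] = 121             # frame count: ~4.8 seconds at 25 fps
cond["frame_rate"] = 25          # keep identical in the upscaler and decoder nodes
```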
4. Write Your Prompt
Prompt quality matters: without a starting image, the model relies entirely on your description.
Strong Prompts Include
- Scene description - Environment, lighting, time of day
- Example: “A misty forest at dawn with rays of sunlight filtering through tall pine trees”
- Character details - Appearance, behavior, actions
- Example: “A red fox with bright amber eyes carefully stepping through fallen leaves”
- Camera motion - Shot type, movement, angles
- Example: “Slow dolly shot moving forward, low angle perspective”
- Audio cues - Dialogue, music, ambient sound effects
- Example: “Gentle rustling of leaves, distant bird calls, soft wind”
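Put together, these elements read as one continuous description, for example: “A misty forest at dawn with rays of sunlight filtering through tall pine trees. A red fox with bright amber eyes carefully steps through fallen leaves. Slow dolly shot moving forward, low angle perspective. Gentle rustling of leaves, distant bird calls, soft wind.”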
Longer, descriptive prompts consistently produce better motion, coherence, and audio alignment. See our Prompting guide for additional tips on writing good prompts.
5. Prompt Enhancement
Locate the Prompt Enhancer node:
- Enabled by default: Automatically expands and improves your prompt
- Bypass option: Disable when you need exact control over phrasing
- Best for: Shorter prompts that need more detail
The enhancer adds:
- Visual details and scene elements
- Motion descriptions and dynamics
- Audio cues and atmospheric details
6. Configure Sampling
Find the KSampler or LTXVSampler node and configure the following (a scripted sketch follows):
Steps
- Distilled model: 4-8 steps (optimized for speed)
- Full model: 20-50 steps (higher quality)
- Start with lower values and increase if quality is insufficient
CFG (Classifier-Free Guidance)
- Range: 2.0-5.0
- Lower values (2.0-3.0): More creative, less prompt adherence
- Higher values (4.0-5.0): Stronger prompt adherence, less variation
- Recommended: 3.0-3.5 for balanced results
Sampler Type
- euler: Fast, good for testing
- dpmpp_2m: Higher quality, slightly slower
- Experiment to find your preference
Seed
- Fixed seed: Reproducible results for iteration
- Random seed: Explore variations
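And the matching piece of the scripted sketch. The input names below follow ComfyUI's stock KSampler and the node id is a placeholder; verify both against your export:

```python
sampler = workflow["3"]["inputs"]   # placeholder node id
sampler["steps"] = 8                # distilled model; use 20-50 for the full model
sampler["cfg"] = 3.0                # balanced prompt adherence
sampler["sampler_name"] = "euler"   # fast; try "dpmpp_2m" for higher quality
sampler["seed"] = 42                # fix for reproducibility; randomize to explore
```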
7. Two-Stage Generation
The workflow uses a multi-scale approach:
Base Generation
- Generates at your specified resolution
- Fast iteration for testing prompts and parameters
Upscale Pass
- Increases resolution and refines details
- Uses the LTXVUpscale node
- Scale factor: Typically 2x
- Frame rate: Must match base generation
This approach provides:
- Faster experimentation at lower resolution
- A high-quality final output without generating at full resolution from the start
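A tiny sketch of the arithmetic, assuming the default 768×512 base resolution and a 2x upscale:

```python
# The upscale pass multiplies resolution but must not change the frame rate.
base_w, base_h, base_fps = 768, 512, 25
scale = 2
final_w, final_h = base_w * scale, base_h * scale    # 1536 x 1024
upscale_fps = 25                                     # as configured on the upscaler node
assert upscale_fps == base_fps, "frame rate must match across stages"
print(f"final output: {final_w}x{final_h} @ {base_fps} fps")
```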
8. Decoding
The workflow uses LTXVDecoder nodes:
Audio Decoder
- Processes audio latents separately
- Outputs synchronized audio stream
- Supports dialogue, music, and ambient sound
Video Decoder
- Uses tiled decoding to minimize VRAM usage
- Processes video latents in manageable chunks
- Maintains quality while reducing memory requirements
Note: Audio and video are generated separately, then merged during decoding for synchronized output.
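Conceptually, tiled decoding just splits the latents into bounded pieces and decodes them one at a time. A simplified sketch, chunking along the time axis for clarity (the real node may also tile spatially, and decode_chunk is a hypothetical stand-in for the actual VAE decode):

```python
def decode_chunk(chunk):
    # Stand-in for the VAE decode the LTXVDecoder performs internally.
    return list(chunk)

def tiled_decode(latents, chunk_size=16):
    # Decoding in chunks keeps peak VRAM bounded by chunk_size rather than
    # the full video length, at the cost of a little overhead per chunk.
    frames = []
    for start in range(0, len(latents), chunk_size):
        frames.extend(decode_chunk(latents[start:start + chunk_size]))
    return frames
```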
9. Save Output
Configure the SaveVideo node:
- Format: MP4 (default), MOV, or WebM
- Codec: H.264 (compatibility) or H.265 (smaller files)
- Audio: Automatically embedded from audio decoder
- Filename: Use descriptive names for organization
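If you later want smaller files than the default H.264 output, you can also re-encode outside ComfyUI. A sketch assuming ffmpeg is on your PATH; filenames are illustrative:

```python
import subprocess

# Re-encode the video track with H.265 and copy the embedded audio unchanged.
subprocess.run([
    "ffmpeg", "-i", "ltx2_output.mp4",
    "-c:v", "libx265", "-crf", "26",   # H.265: smaller files at similar quality
    "-c:a", "copy",                    # keep the synchronized audio stream as-is
    "ltx2_output_h265.mp4",
], check=True)
```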
Advanced Techniques
Full Model Variant
The full model workflow provides higher quality at the cost of longer generation time.
Key Differences
- Uses the full LTX-2 checkpoint and specialized VAE
- Stage 1: 15-20 steps (up to 40 for experimentation)
- Uses LTXV Scheduler instead of manual sigmas
- Applies the distilled LoRA in Stage 2 (recommended strength: 0.6)
Using LoRAs
Add LoRALoader nodes to customize:
- Style LoRAs: Apply artistic styles or visual aesthetics
- Motion LoRAs: Enhance specific types of movement
- Character LoRAs: Maintain consistent character appearance
See the LoRA guide for training and usage.
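In the scripted sketch, a LoRA is just another loader node entry. The input names below follow ComfyUI's stock LoraLoader, and the node id and filename are placeholders; the LTX-2 workflow may ship its own loader node, so verify against the actual graph:

```python
lora = workflow["12"]["inputs"]                  # placeholder node id
lora["lora_name"] = "my_style_lora.safetensors"  # hypothetical filename
lora["strength_model"] = 0.6                     # e.g., the Stage 2 strength above
```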