Text-to-Audio Workflow

This guide walks you through the Text-to-Audio (T2A) workflow: a ComfyUI workflow that generates audio from a text prompt using the LTX-2.3 model in audio-only mode. No video input or output is involved. The model generates audio directly from your description.

The workflow uses LTXVAudioOnlyModel to disable the model’s video pathway, running only the audio stream of the joint audio-video transformer. This produces audio with zero dependence on visual content, while reusing the same model checkpoint and text encoder as standard video generation.

When to Use

Text-to-Audio is the right workflow when you want to generate audio from a text description without any video. If you need audio synchronized to video, use the standard Text-to-Video workflow instead, which generates audio and video jointly.

Prerequisites

This guide assumes you’re familiar with ComfyUI basics. If you’re new to LTX-2 in ComfyUI, start with the ComfyUI setup guide.

Model Files

| File | Description | Placement |
| --- | --- | --- |
| ltx-2.3-22b-dev.safetensors | Full-precision model checkpoint | ComfyUI/models/checkpoints/ |
| ltx-2.3-22b-distilled-lora-384-1.1.safetensors | Distilled LoRA (v1.1) | ComfyUI/models/loras/ |
| comfy_gemma_3_12B_it.safetensors | Text encoder (Gemma 3 12B) | ComfyUI/models/text_encoders/ |

All files are available in the LTX-2.3 Hugging Face collection.

No additional audio-specific model files are required. The audio VAE is loaded from the same ltx-2.3-22b-dev.safetensors checkpoint used for video generation.
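Before loading the workflow, it can help to confirm that each file sits in the expected folder. The following is a minimal sketch, not part of the LTX tooling: the helper name `missing_model_files` and the `ComfyUI` root path are assumptions for illustration, and only the file names and subfolders come from the placement list above.

```python
from pathlib import Path

# File name -> folder relative to the ComfyUI root, copied from the list above.
REQUIRED_FILES = {
    "ltx-2.3-22b-dev.safetensors": "models/checkpoints",
    "ltx-2.3-22b-distilled-lora-384-1.1.safetensors": "models/loras",
    "comfy_gemma_3_12B_it.safetensors": "models/text_encoders",
}


def missing_model_files(comfyui_root):
    """Return the expected file paths that do not exist under comfyui_root."""
    root = Path(comfyui_root)
    missing = []
    for name, subdir in REQUIRED_FILES.items():
        path = root / subdir / name
        if not path.is_file():
            missing.append(path)
    return missing


if __name__ == "__main__":
    # "ComfyUI" is an assumed install location; point this at your own root.
    for path in missing_model_files("ComfyUI"):
        print(f"missing: {path}")
```

An empty result means all three files were found. The check only looks at file names and locations and does not validate the file contents.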

Step-by-Step

1. Download and Load the Workflow

[Download the Text-to-Audio workflow JSON](https://github.com/Lightricks/ComfyUI-LTXVideo/tree/master/example_workflows/2.3) from our GitHub repository and drag it into ComfyUI to load it.

2. Install Custom Nodes and Download Models

This workflow requires the ComfyUI-LTXVideo custom node package. Open the Workflow Overview panel (right sidebar) to check for missing nodes or model files, then install or download anything it reports as missing.

3. Write Your Prompt

Find the CLIP Text Encode (Positive Prompt) node and describe the audio you want to generate. The prompt should describe the sound, speech, or audio scene, as the model generates audio to match your description.

Example prompt:

A woman saying: “Oh, what a lovely day we are having!”

4. Set Duration

Find the number of frames node and set the frame count for your desired audio length. At the default 24 fps:

  • 121 frames ≈ 5 seconds (default)
  • 97 frames ≈ 4 seconds
  • 49 frames ≈ 2 seconds
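The frame counts above follow from duration = frames / fps. The sketch below converts in both directions at the default 24 fps. Note that the 8k + 1 rounding in `seconds_to_frames` is an assumption inferred from the listed values (49, 97, 121), not a requirement stated in this guide, and both function names are illustrative.

```python
FPS = 24  # default frame rate used in this guide


def frames_to_seconds(frames, fps=FPS):
    """Audio duration in seconds for a given frame count."""
    return frames / fps


def seconds_to_frames(seconds, fps=FPS, step=8):
    """Nearest frame count of the form step * k + 1 for a target duration.

    The step * k + 1 pattern is an assumption based on the example values
    in the list above (49, 97, 121).
    """
    k = max(1, round((seconds * fps - 1) / step))
    return step * k + 1


if __name__ == "__main__":
    for target in (2, 4, 5):
        frames = seconds_to_frames(target)
        print(target, frames, round(frames_to_seconds(frames), 2))
```

For targets of 2, 4, and 5 seconds, this reproduces the 49, 97, and 121 frame values listed above.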

5. Generate

Click Run to start generation. With the 8-step distilled schedule, generation typically takes only seconds, depending on your hardware.
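If you prefer to queue the job from a script instead of clicking Run, ComfyUI's built-in HTTP server accepts workflows on its `/prompt` endpoint. The sketch below makes several assumptions: the server is running at the default local address `http://127.0.0.1:8188`, the workflow has been exported from ComfyUI in API format (the workflow JSON you drag into the UI is a different format), and the function names are illustrative.

```python
import json
import urllib.request


def build_queue_request(workflow, server="http://127.0.0.1:8188"):
    """Build a POST request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path, server="http://127.0.0.1:8188"):
    """Load an API-format workflow file and send it to a running server."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    request = build_queue_request(workflow, server)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling `queue_workflow` with the path to your exported file sends the job to the running server and returns its JSON reply. The generated audio still appears in the workflow's PreviewAudio node.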

6. Preview

The PreviewAudio node at the end of the workflow plays the generated audio directly in ComfyUI.

How the Pipeline Works

Understanding the pipeline helps when troubleshooting or fine-tuning results.

LTX-2.3 is a joint audio-video transformer that processes both modalities in a single model. The T2A workflow puts this model into audio-only mode using LTXVAudioOnlyModel, which disables the video stream and all cross-modal attention:

  1. Load model — The standard LTX-2.3 checkpoint and distilled LoRA are loaded, then LTXVAudioOnlyModel disables the video stream, audio→video cross-attention, and video→audio cross-attention. Only the audio pathway runs.
  2. Encode prompt — Your text description is encoded with Gemma 3 12B, the same text encoder used for video generation.
  3. Prepare latents — An empty audio latent is created for the target duration. A minimal dummy video latent (64×64, 1 frame) is also created — the model architecture expects a video input, but with audio-only mode enabled, it is never processed and adds negligible cost.
  4. Sample — The audio is denoised using the distilled 8-step schedule with euler_ancestral_cfg_pp and CFG 1.
  5. Decode — The audio latent is separated from the dummy video and decoded through the audio VAE.
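To make the effect of audio-only mode concrete, here is a toy illustration of steps 3 and 4. It is not the LTXVAudioOnlyModel implementation: the latents are plain lists of floats, the "attention" is simple arithmetic, and every name is hypothetical. It shows only the property described above: with the video stream and both cross-attention directions disabled, the dummy video passes through untouched and cannot influence the audio.

```python
def toy_joint_step(video, audio, audio_only=False):
    """One toy denoising step over plain lists of floats (not real latents)."""
    # Stand-in for the audio stream, which runs in both modes.
    new_audio = [a * 0.5 for a in audio]
    if audio_only:
        # Video stream and both cross-attention directions are skipped,
        # so the dummy video comes back unchanged.
        return list(video), new_audio
    # Joint mode: the video stream runs and the streams exchange information.
    video_mean = sum(video) / len(video)
    audio_mean = sum(audio) / len(audio)
    new_video = [v * 0.5 + 0.25 * audio_mean for v in video]  # audio->video
    new_audio = [a + 0.25 * video_mean for a in new_audio]    # video->audio
    return new_video, new_audio


def toy_sample(video, audio, steps=8, audio_only=False):
    """Run the toy step repeatedly, mirroring the 8-step schedule."""
    for _ in range(steps):
        video, audio = toy_joint_step(video, audio, audio_only)
    return video, audio


if __name__ == "__main__":
    noise = [1.0, -2.0, 4.0]
    _, audio_a = toy_sample([0.0] * 4, noise, audio_only=True)
    _, audio_b = toy_sample([9.0] * 4, noise, audio_only=True)
    # Compare audio produced with two different dummy videos.
    print(audio_a == audio_b)
```

In joint mode the same two calls would yield different audio, because the video values feed into the audio update.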

Tips & Troubleshooting

  • Dummy video latent — Don’t remove the EmptyLTXVLatentVideo node (64×64, 1 frame). The model architecture requires a video input even in audio-only mode. It’s never processed and adds negligible cost.
  • Audio length — Adjust the number of frames node to control duration. The audio latent length is determined by frames and frame rate.

Technical Notes

  • The LTX-2.3 model is a single joint audio-video transformer that splits its input into [video, audio] streams. LTXVAudioOnlyModel disables the video stream entirely rather than creating a separate audio model.
  • The dummy video latent (64×64, 1 frame ≈ 4 tokens) must be concatenated with the audio latent via LTXVConcatAVLatent before sampling. With run_vx=False, these tokens are never attended to.
  • After sampling, LTXVSeparateAVLatent splits the output and only the audio latent is decoded through LTXVAudioVAEDecode.
  • The audio VAE is loaded from the main checkpoint via LTXVAudioVAELoader. No separate audio model file is needed.
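The "≈ 4 tokens" figure for the dummy video latent can be reproduced with simple arithmetic. The sketch below assumes one token per latent position, a 32× spatial downsampling factor, and an 8× temporal factor. These factors are assumptions chosen to match the number quoted above, and the function name is illustrative.

```python
def video_token_count(width, height, frames, spatial_factor=32, temporal_factor=8):
    """Token count for a video latent, assuming one token per latent position.

    spatial_factor and temporal_factor are assumed values, not taken
    from this guide.
    """
    latent_w = width // spatial_factor
    latent_h = height // spatial_factor
    latent_t = (frames - 1) // temporal_factor + 1
    return latent_w * latent_h * latent_t


if __name__ == "__main__":
    # Dummy latent from the workflow: 64x64 pixels, 1 frame.
    print(video_token_count(64, 64, 1))
```

Under these assumptions the 64×64, 1-frame latent works out to 2 × 2 × 1 = 4 tokens, which is why keeping it in the graph adds negligible cost.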