From 31f7df3b3adcbfdc5174b3d3109dcb64ee17e6c6 Mon Sep 17 00:00:00 2001
From: Kohya S <52813779+kohya-ss@users.noreply.github.com>
Date: Tue, 23 Sep 2025 18:53:36 +0900
Subject: [PATCH] doc: add --network_train_unet_only option for HunyuanImage-2.1 training

---
 docs/hunyuan_image_train_network.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/hunyuan_image_train_network.md b/docs/hunyuan_image_train_network.md
index 165c3df4..b2bf113d 100644
--- a/docs/hunyuan_image_train_network.md
+++ b/docs/hunyuan_image_train_network.md
@@ -123,6 +123,7 @@ accelerate launch --num_cpu_threads_per_process 1 hunyuan_image_train_network.py
   --network_module=networks.lora_hunyuan_image \
   --network_dim=16 \
   --network_alpha=1 \
+  --network_train_unet_only \
   --learning_rate=1e-4 \
   --optimizer_type="AdamW8bit" \
   --lr_scheduler="constant" \
@@ -139,6 +140,8 @@ accelerate launch --num_cpu_threads_per_process 1 hunyuan_image_train_network.py
   --cache_latents
 ```

+**HunyuanImage-2.1 training does not support LoRA modules for Text Encoders, so `--network_train_unet_only` is required.**
+
 日本語

@@ -165,6 +168,8 @@ The script adds HunyuanImage-2.1 specific arguments. For common arguments (like

 #### HunyuanImage-2.1 Training Parameters

+* `--network_train_unet_only` **[Required]**
+  - Specifies that only the DiT model will be trained. LoRA modules for Text Encoders are not supported.
 * `--discrete_flow_shift=`
   - Specifies the shift value for the scheduler used in Flow Matching. Default is `5.0`.
 * `--model_prediction_type=`

@@ -181,7 +186,7 @@ The script adds HunyuanImage-2.1 specific arguments. For common arguments (like
 * `--split_attn`
   - Splits the batch during attention computation to process one item at a time, reducing VRAM usage by avoiding attention mask computation. Can improve speed when using `torch`. Required when using `xformers` with batch size greater than 1.
 * `--fp8_scaled`
-  - Enables training the DiT model in scaled FP8 format. This can significantly reduce VRAM usage (can run with as little as 8GB VRAM when combined with `--blocks_to_swap`), but the training results may vary. This is a newer alternative to the unsupported `--fp8_base` option.
+  - Enables training the DiT model in scaled FP8 format. This can significantly reduce VRAM usage (can run with as little as 8GB VRAM when combined with `--blocks_to_swap`), but the training results may vary. This is a newer alternative to the unsupported `--fp8_base` option. See [Musubi Tuner's documentation](https://github.com/kohya-ss/musubi-tuner/blob/main/docs/advanced_config.md#fp8-weight-optimization-for-models--%E3%83%A2%E3%83%87%E3%83%AB%E3%81%AE%E9%87%8D%E3%81%BF%E3%81%AEfp8%E3%81%B8%E3%81%AE%E6%9C%80%E9%81%A9%E5%8C%96) for details.
 * `--fp8_vl`
   - Use FP8 for the VLM (Qwen2.5-VL) text encoder.
 * `--text_encoder_cpu`

@@ -449,7 +454,7 @@ python hunyuan_image_minimal_inference.py \

 **Key Options:**
 - `--fp8_scaled`: Use scaled FP8 format for reduced VRAM usage during inference
 - `--blocks_to_swap`: Swap blocks to CPU to reduce VRAM usage
-- `--image_size`: Resolution (inference is most stable at 2048x2048)
+- `--image_size`: Resolution in **height width** (inference is most stable at 2560x1536, 2304x1792, 2048x2048, 1792x2304, 1536x2560 according to the official repo)
 - `--guidance_scale`: CFG scale (default: 3.5)
 - `--flow_shift`: Flow matching shift parameter (default: 5.0)
 - `--text_encoder_cpu`: Run the text encoders on CPU to reduce VRAM usage
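
For reviewers, the inference options touched by the last hunk might be combined roughly as follows. This is a hedged sketch only: the checkpoint/encoder paths and the `--blocks_to_swap` value are placeholder assumptions, and the exact flag set should be checked against the script's `--help` output.

```shell
# Low-VRAM sketch of the minimal inference script, using the flags
# documented above. Paths and the swap count are placeholders (assumed).
python hunyuan_image_minimal_inference.py \
  --fp8_scaled \
  --blocks_to_swap 20 \
  --text_encoder_cpu \
  --image_size 2048 2048 \
  --guidance_scale 3.5 \
  --flow_shift 5.0
```

Note that `--image_size` takes height first, then width, matching the "height width" wording introduced by the patch.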