update README
README.md
@@ -184,7 +184,7 @@ Options are almost the same as LoRA training. The difference is `--full_bf16`, `
`--blockwise_fused_optimizers` enables the fusing of the optimizer step into the backward pass for each block. This is similar to `--fused_backward_pass`. Any optimizer can be used, but Adafactor is recommended for memory efficiency. `--blockwise_fused_optimizers` cannot be used with `--fused_backward_pass`. Stochastic rounding is not supported for now.
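To illustrate the idea, here is a minimal PyTorch sketch (not the script's actual implementation) of per-block fused optimizer steps. The helper name is hypothetical and it uses `AdamW` for brevity, whereas Adafactor is recommended above; the point is only that each block gets its own optimizer, stepped from gradient hooks as soon as that block's gradients are ready, so the gradients can be freed during the backward pass instead of after it.

```python
import torch

def install_blockwise_fused_optimizers(blocks, optimizer_cls=torch.optim.AdamW, **opt_kwargs):
    """Hypothetical helper: step one optimizer per block from inside backward.
    Requires PyTorch 2.1+ for register_post_accumulate_grad_hook."""
    optimizers = []
    for block in blocks:
        params = [p for p in block.parameters() if p.requires_grad]
        optimizer = optimizer_cls(params, **opt_kwargs)
        optimizers.append(optimizer)
        state = {"remaining": len(params)}

        def hook(param, optimizer=optimizer, state=state, total=len(params)):
            # called once per parameter, right after its .grad has been accumulated
            state["remaining"] -= 1
            if state["remaining"] == 0:                # the whole block has its grads
                optimizer.step()                       # update only this block
                optimizer.zero_grad(set_to_none=True)  # free the block's gradients now
                state["remaining"] = total

        for p in params:
            p.register_post_accumulate_grad_hook(hook)
    return optimizers
```

With hooks like these installed, a training step only calls `loss.backward()`; there is no separate global `optimizer.step()` afterwards.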
`--double_blocks_to_swap` and `--single_blocks_to_swap` set the number of double blocks and single blocks to swap. The default is None (no swap). These options must be combined with `--fused_backward_pass` or `--blockwise_fused_optimizers`. `--double_blocks_to_swap` can be specified together with `--single_blocks_to_swap`. The recommended maximum is 9 double blocks and 18 single blocks. Please see the next chapter for details.
`--cpu_offload_checkpointing` offloads gradient checkpointing to the CPU. This reduces VRAM usage by about 2GB.
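As a rough analogue (not the actual mechanism in `flux_train.py`), PyTorch's `torch.autograd.graph.save_on_cpu` shows what keeping activations saved for backward in CPU memory looks like:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Tensors that autograd would normally keep on the GPU for the backward pass
# are stored in pinned CPU memory and copied back only when backward needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = model(x)

loss = y.square().mean()
loss.backward()
```

The trade-off is the same as for block swapping: less VRAM in exchange for extra host-device transfers.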
@@ -198,24 +198,32 @@ The learning rate and the number of epochs are not optimized yet. Please adjust
#### Key Features for FLUX.1 fine-tuning
1. Technical details of double/single block swap:
   - Reduces memory usage by transferring double and single blocks of FLUX.1 from the GPU to the CPU when they are not needed (a rough sketch of the idea follows this list).
   - During the forward pass, the weights of blocks that have finished their calculations are transferred to the CPU, and the weights of the blocks about to be calculated are transferred to the GPU.
   - The same is true for the backward pass, but the order is reversed. The gradients remain on the GPU.
   - Since the transfer between CPU and GPU takes time, training will be slower.
   - `--double_blocks_to_swap` and `--single_blocks_to_swap` specify the number of blocks to swap. For example, `--double_blocks_to_swap 6` swaps 6 blocks at each step of training, but the remaining 13 blocks are always on the GPU.
   - About 640MB of memory can be saved per double block, and about 320MB of memory can be saved per single block.
2. Sample Image Generation:
   - Sample image generation during training is now supported.
   - The prompts are cached and reused for generation if `--cache_latents` is specified, so changing the prompts during training will not affect the generated images.
   - Specify options such as `--sample_prompts` and `--sample_every_n_epochs`.
   - Note: generation will be very slow when `--split_mode` is specified.
3. Experimental Memory-Efficient Saving:
   - The `--mem_eff_save` option can further reduce memory consumption during model saving (about 22GB).
   - This is a custom implementation and may cause unexpected issues. Use with caution.
4. T5XXL Token Length Control:
   - Added the `--t5xxl_max_token_length` option to specify the maximum token length of T5XXL.
   - The default is 512 for the dev model and 256 for the schnell model.
5. Multi-GPU Training Support:
   - Note: `--double_blocks_to_swap` and `--single_blocks_to_swap` cannot be used in multi-GPU training.
6. Disable mmap Load for Safetensors:
   - The `--disable_mmap_load_safetensors` option now works in `flux_train.py` (a sketch of a non-mmap load also follows this list).
   - Speeds up model loading during training in WSL2.
   - Effective in reducing memory usage when loading models during multi-GPU training.
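The sketch below illustrates the block-swap idea from item 1 above in plain PyTorch. It is a simplified, forward-only illustration with a hypothetical helper name (`forward_with_block_swap`); the real implementation also restores blocks for the backward pass and keeps the gradients on the GPU, which is omitted here.

```python
import torch

def forward_with_block_swap(blocks, x, blocks_to_swap, device="cuda"):
    """Run a stack of transformer blocks while keeping only part of them resident
    on the GPU; the last `blocks_to_swap` blocks are moved in just before use and
    moved back to the CPU as soon as they are done."""
    num_blocks = len(blocks)
    first_swapped = num_blocks - blocks_to_swap  # blocks before this index stay on the GPU
    for i, block in enumerate(blocks):
        if i >= first_swapped:
            block.to(device)     # weights needed now -> GPU
        x = block(x)
        if i >= first_swapped:
            block.to("cpu")      # finished -> back to CPU to free VRAM
    return x
```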
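And for item 6, a minimal sketch of what a non-mmap safetensors load looks like (illustrative only; `load_model_weights` is a hypothetical helper, not the function used by `flux_train.py`):

```python
from safetensors.torch import load, load_file

def load_model_weights(path, disable_mmap=False):
    if disable_mmap:
        # read the whole file into memory and deserialize from bytes, avoiding mmap
        with open(path, "rb") as f:
            return load(f.read())
    # default: memory-mapped load
    return load_file(path)
```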