diff --git a/README.md b/README.md
index 331951ef..5dd916aa 100644
--- a/README.md
+++ b/README.md
@@ -184,7 +184,7 @@ Options are almost the same as LoRA training. The difference is `--full_bf16`, `
 
 `--blockwise_fused_optimizers` enables the fusing of the optimizer step into the backward pass for each block. This is similar to `--fused_backward_pass`. Any optimizer can be used, but Adafactor is recommended for memory efficiency. `--blockwise_fused_optimizers` cannot be used with `--fused_backward_pass`. Stochastic rounding is not supported for now.
 
-`--double_blocks_to_swap` and `--single_blocks_to_swap` are the number of double blocks and single blocks to swap. The default is None (no swap). These options must be combined with `--fused_backward_pass` or `--blockwise_fused_optimizers`. `--double_blocks_to_swap` can be specified with `--single_blocks_to_swap`. The recommended maximum number of blocks to swap is 9 for double blocks and 18 for single blocks.
+`--double_blocks_to_swap` and `--single_blocks_to_swap` are the number of double blocks and single blocks to swap. The default is None (no swap). These options must be combined with `--fused_backward_pass` or `--blockwise_fused_optimizers`. `--double_blocks_to_swap` can be specified with `--single_blocks_to_swap`. The recommended maximum number of blocks to swap is 9 for double blocks and 18 for single blocks. Please see the next chapter for details.
 
 `--cpu_offload_checkpointing` is to offload the gradient checkpointing to CPU. This reduces about 2GB of VRAM usage.
 
@@ -198,24 +198,32 @@ The learning rate and the number of epochs are not optimized yet. Please adjust
 
 #### Key Features for FLUX.1 fine-tuning
 
-1. Sample Image Generation:
+1. Technical details of double/single block swap:
+   - Reduce memory usage by transferring double and single blocks of FLUX.1 from GPU to CPU when they are not needed.
+   - During forward pass, the weights of the blocks that have finished calculation are transferred to CPU, and the weights of the blocks to be calculated are transferred to GPU.
+   - The same is true for the backward pass, but the order is reversed. The gradients remain on the GPU.
+   - Since the transfer between CPU and GPU takes time, the training will be slower.
+   - `--double_blocks_to_swap` and `--single_blocks_to_swap` specify the number of blocks to swap. For example, `--double_blocks_to_swap 6` swaps 6 blocks at each step of training, but the remaining 13 blocks are always on the GPU.
+   - About 640MB of memory can be saved per double block, and about 320MB of memory can be saved per single block.
+
+2. Sample Image Generation:
    - Sample image generation during training is now supported.
    - The prompts are cached and used for generation if `--cache_latents` is specified. So changing the prompts during training will not affect the generated images.
    - Specify options such as `--sample_prompts` and `--sample_every_n_epochs`.
    - Note: It will be very slow when `--split_mode` is specified.
 
-2. Experimental Memory-Efficient Saving:
+3. Experimental Memory-Efficient Saving:
    - `--mem_eff_save` option can further reduce memory consumption during model saving (about 22GB).
    - This is a custom implementation and may cause unexpected issues. Use with caution.
 
-3. T5XXL Token Length Control:
+4. T5XXL Token Length Control:
    - Added `--t5xxl_max_token_length` option to specify the maximum token length of T5XXL.
    - Default is 512 in dev and 256 in schnell models.
 
-4. Multi-GPU Training Support:
+5. Multi-GPU Training Support:
    - Note: `--double_blocks_to_swap` and `--single_blocks_to_swap` cannot be used in multi-GPU training.
 
-5. Disable mmap Load for Safetensors:
+6. Disable mmap Load for Safetensors:
    - `--disable_mmap_load_safetensors` option now works in `flux_train.py`.
    - Speeds up model loading during training in WSL2.
    - Effective in reducing memory usage when loading models during multi-GPU training.
 
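For reference, the block-swap behavior described in the new list item can be pictured with a short PyTorch sketch. This is only an illustration, not the actual `flux_train.py` implementation: the `BlockSwapper` class, its methods, and the toy `nn.Linear` blocks are hypothetical names invented here. A real training loop would additionally have to move each swapped block back to the GPU (in reverse order) for its backward pass, typically via autograd hooks, while the gradients stay on the GPU; to stay correct without that machinery, the sketch runs the forward pass under `torch.no_grad()`.

```python
# Minimal sketch of the double/single block swap idea (illustrative only,
# not the flux_train.py implementation).
import torch
import torch.nn as nn


class BlockSwapper:
    """Keep only the blocks currently being computed on the GPU.

    The last `blocks_to_swap` blocks live on the CPU while idle and are
    paged onto the GPU one at a time, mirroring the idea behind
    --double_blocks_to_swap / --single_blocks_to_swap.
    """

    def __init__(self, blocks: nn.ModuleList, blocks_to_swap: int, device: torch.device):
        self.blocks = blocks
        self.device = device
        # Blocks before this index stay resident on the GPU for the whole run.
        self.swap_start = len(blocks) - blocks_to_swap
        for i, block in enumerate(blocks):
            block.to(device if i < self.swap_start else "cpu")

    @torch.no_grad()  # forward-only sketch; training also needs the backward-pass staging
    def forward_all(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            if i >= self.swap_start:
                block.to(self.device)   # weights needed now -> move to GPU
            x = block(x)
            if i >= self.swap_start:
                block.to("cpu")         # finished with this block -> back to CPU
        return x


# Toy usage: 19 "double blocks", 6 swapped and 13 always resident,
# matching the `--double_blocks_to_swap 6` example above.
if __name__ == "__main__" and torch.cuda.is_available():
    device = torch.device("cuda")
    blocks = nn.ModuleList(nn.Linear(64, 64) for _ in range(19))
    swapper = BlockSwapper(blocks, blocks_to_swap=6, device=device)
    out = swapper.forward_all(torch.randn(2, 64, device=device))
    print(out.shape)
```

Going by the per-block figures in the new list item, swapping the recommended maximum of 9 double blocks and 18 single blocks saves roughly 9 × 640MB + 18 × 320MB ≈ 11.5GB of VRAM, at the cost of slower steps due to the CPU-GPU transfers.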