Merge pull request #2192 from kohya-ss/doc-update-for-latest-features

Doc update for latest features
2026-04-06 13:47:06 +00:00 · 2025-09-12 20:28:42 +09:00
parent f8337726cf ee8e670765
commit 419a9c4af4
11 changed files with 1341 additions and 790 deletions
--- a/README.md
+++ b/README.md
@@ -18,75 +18,27 @@ If you are using DeepSpeed, please install DeepSpeed with `pip install deepspeed

 ### Recent Updates

+Sep 4, 2025:
+- The information about FLUX.1 and SD3/SD3.5 training that was described in the README has been organized and divided into the following documents:
+    - [LoRA Training Overview](./docs/train_network.md)
+    - [SDXL Training](./docs/sdxl_train_network.md)
+    - [Advanced Training](./docs/train_network_advanced.md)
+    - [FLUX.1 Training](./docs/flux_train_network.md)
+    - [SD3 Training](./docs/sd3_train_network.md)
+    - [LUMINA Training](./docs/lumina_train_network.md)
+    - [Validation](./docs/validation.md)
+    - [Fine-tuning](./docs/fine_tune.md)
+    - [Textual Inversion Training](./docs/train_textual_inversion.md)
+
 Aug 28, 2025:
 - In order to support the latest GPUs and features, we have updated the **PyTorch and library versions**. PR [#2178](https://github.com/kohya-ss/sd-scripts/pull/2178) There are many changes, so please let us know if you encounter any issues.
 - The PyTorch version used for testing has been updated to 2.6.0. We have confirmed that it works with PyTorch 2.6.0 and later.
 - The `requirements.txt` has been updated, so please update your dependencies.
-  - You can update the dependencies with `pip install -r requirements.txt`.
-  - The version specification for `bitsandbytes` has been removed. If you encounter errors on RTX 50 series GPUs, please update it with `pip install -U bitsandbytes`.
+    - You can update the dependencies with `pip install -r requirements.txt`.
+    - The version specification for `bitsandbytes` has been removed. If you encounter errors on RTX 50 series GPUs, please update it with `pip install -U bitsandbytes`.
 - We have modified each script to minimize warnings as much as possible.
-  - The modified scripts will work in the old environment (library versions), but please update them when convenient.
+    - The modified scripts will work in the old environment (library versions), but please update them when convenient.

-Jul 30, 2025:
- **Breaking Change**: For FLUX.1 and Chroma training, the CFG (Classifier-Free Guidance, using negative prompts) scale option for sample image generation during training has been changed from `--g` to `--l`. The `--g` option is now used for the embedded guidance scale. Please update your prompts accordingly. See [Sample Image Generation During Training](#sample-image-generation-during-training) for details.
-
- Support for [Chroma](https://huggingface.co/lodestones/Chroma) has been added in PR [#2157](https://github.com/kohya-ss/sd-scripts/pull/2157). Thank you to lodestones for the high-quality model.
-    - Chroma is a new model based on FLUX.1 schnell. In this repository, `flux_train_network.py` is used for training LoRAs for Chroma with `--model_type chroma`. `--apply_t5_attn_mask` is also needed for Chroma training.
-    - Please refer to the [FLUX.1 LoRA training documentation](./docs/flux_train_network.md) for more details.
-
-Jul 21, 2025:
- Support for [Lumina-Image 2.0](https://github.com/Alpha-VLLM/Lumina-Image-2.0) has been added in PR [#1927](https://github.com/kohya-ss/sd-scripts/pull/1927) and [#2138](https://github.com/kohya-ss/sd-scripts/pull/2138). Special thanks to sdbds and RockerBOO for their contributions.
-    - Please refer to the [Lumina-Image 2.0 documentation](./docs/lumina_train_network.md) for more details.
- We have started adding comprehensive training-related documentation to [docs](./docs). These documents are being created with the help of generative AI and will be updated over time. While there are still many gaps at this stage, we plan to improve them gradually.
-
-    Currently, the following documents are available:
-    - train_network.md
-    - sdxl_train_network.md
-    - sdxl_train_network_advanced.md
-    - flux_train_network.md
-    - sd3_train_network.md
-    - lumina_train_network.md
-    
-Jul 10, 2025:
- [AI Coding Agents](#for-developers-using-ai-coding-agents) section is added to the README. This section provides instructions for developers using AI coding agents like Claude and Gemini to understand the project context and coding standards.
-
-May 1, 2025:
- The error when training FLUX.1 with mixed precision in flux_train.py with DeepSpeed enabled has been resolved. Thanks to sharlynxy for PR [#2060](https://github.com/kohya-ss/sd-scripts/pull/2060). Please refer to the PR for details.
-  - If you enable DeepSpeed, please install DeepSpeed with `pip install deepspeed==0.16.7`.
-
-Apr 27, 2025:
- FLUX.1 training now supports CFG scale in the sample generation during training. Please use the `--g` option to specify the CFG scale (note that `--l` is used for the embedded guidance scale). PR [#2064](https://github.com/kohya-ss/sd-scripts/pull/2064).
-    - See [here](#sample-image-generation-during-training) for details.
-    - If you have any issues with this, please let us know.
-
-Apr 6, 2025:
- IP noise gamma has been enabled in FLUX.1. Thanks to rockerBOO for PR [#1992](https://github.com/kohya-ss/sd-scripts/pull/1992). See the PR for details.
-    - `--ip_noise_gamma` and `--ip_noise_gamma_random_strength` are available.
-  
-Mar 30, 2025:
- LoRA-GGPO is added for FLUX.1 LoRA training. Thank you to rockerBOO for PR [#1974](https://github.com/kohya-ss/sd-scripts/pull/1974). 
-  - Specify `--network_args ggpo_sigma=0.03 ggpo_beta=0.01` in the command line or `network_args = ["ggpo_sigma=0.03", "ggpo_beta=0.01"]` in .toml file. See PR for details.
- The interpolation method for resizing the original image to the training size can now be specified. Thank you to rockerBOO for PR [#1936](https://github.com/kohya-ss/sd-scripts/pull/1936).
-
-Mar 20, 2025:
- `pytorch-optimizer` is added to requirements.txt. Thank you to gesen2egee for PR [#1985](https://github.com/kohya-ss/sd-scripts/pull/1985). 
-  - For example, you can use CAME optimizer with `--optimizer_type "pytorch_optimizer.CAME" --optimizer_args "weight_decay=0.01"`.
-
-Mar 6, 2025:
-
- Added a utility script to merge the weights of SD3's DiT, VAE (optional), CLIP-L, CLIP-G, and T5XXL into a single .safetensors file. Run `tools/merge_sd3_safetensors.py`. See `--help` for usage. PR [#1960](https://github.com/kohya-ss/sd-scripts/pull/1960)
-
-Feb 26, 2025:
-
- Improve the validation loss calculation in `train_network.py`, `sdxl_train_network.py`, `flux_train_network.py`, and `sd3_train_network.py`. PR [#1903](https://github.com/kohya-ss/sd-scripts/pull/1903)
-  - The validation loss uses fixed timestep sampling and a fixed random seed, so that it does not fluctuate with random values.
-
-Jan 25, 2025:
-
- `train_network.py`, `sdxl_train_network.py`, `flux_train_network.py`, and `sd3_train_network.py` now support validation loss. PR [#1864](https://github.com/kohya-ss/sd-scripts/pull/1864) Thank you to rockerBOO!
-  - For details on how to set it up, please refer to the PR. The documentation will be updated as needed.
-  - It will be added to other scripts as well.
-  - As a current limitation, validation loss is not supported when `--blocks_to_swap` is specified, or when a schedule-free optimizer is used.

 ## For Developers Using AI Coding Agents

@@ -113,678 +65,6 @@ To use them, you need to opt-in by creating your own configuration file in the p

 This approach ensures that you have full control over the instructions given to your agent while benefiting from the shared project context. Your `CLAUDE.md` and `GEMINI.md` are already listed in `.gitignore`, so they won't be committed to the repository.

-## FLUX.1 training
-
- [FLUX.1 LoRA training](#flux1-lora-training)
-  - [Key Options for FLUX.1 LoRA training](#key-options-for-flux1-lora-training)
-  - [Distribution of timesteps](#distribution-of-timesteps)
-  - [Key Features for FLUX.1 LoRA training](#key-features-for-flux1-lora-training)
-  - [Specify rank for each layer in FLUX.1](#specify-rank-for-each-layer-in-flux1)
-  - [Specify blocks to train in FLUX.1 LoRA training](#specify-blocks-to-train-in-flux1-lora-training)
- [FLUX.1 ControlNet training](#flux1-controlnet-training)
- [FLUX.1 OFT training](#flux1-oft-training)
- [Inference for FLUX.1 with LoRA model](#inference-for-flux1-with-lora-model)
- [FLUX.1 fine-tuning](#flux1-fine-tuning)
-  - [Key Features for FLUX.1 fine-tuning](#key-features-for-flux1-fine-tuning)
- [Extract LoRA from FLUX.1 Models](#extract-lora-from-flux1-models)
- [Convert FLUX LoRA](#convert-flux-lora)
- [Merge LoRA to FLUX.1 checkpoint](#merge-lora-to-flux1-checkpoint)
- [FLUX.1 Multi-resolution training](#flux1-multi-resolution-training)
- [Convert Diffusers to FLUX.1](#convert-diffusers-to-flux1)
-
-### FLUX.1 LoRA training
-
-We have added a new training script for LoRA training. The script is `flux_train_network.py`. See `--help` for options. 
-
-FLUX.1 model, CLIP-L, and T5XXL models are recommended to be in bf16/fp16 format. If you specify `--fp8_base`, you can use fp8 models for FLUX.1. The fp8 model is only compatible with `float8_e4m3fn` format.
-
-Sample command is below. It will work with 24GB VRAM GPUs. 
-
-```
-accelerate launch  --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network.py 
--pretrained_model_name_or_path flux1-dev.safetensors --clip_l sd3/clip_l.safetensors --t5xxl sd3/t5xxl_fp16.safetensors 
--ae ae.safetensors --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers 
--max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 
--network_module networks.lora_flux --network_dim 4 --network_train_unet_only 
--optimizer_type adamw8bit --learning_rate 1e-4 
--cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base 
--highvram --max_train_epochs 4 --save_every_n_epochs 1 --dataset_config dataset_1024_bs2.toml 
--output_dir path/to/output/dir --output_name flux-lora-name 
--timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 
-```
-(The command is multi-line for readability. Please combine it into one line.)
-
-We are also not sure how many epochs are needed for convergence, or how the learning rate should be adjusted.
-
-The trained LoRA model can be used with ComfyUI. 
-
-When training LoRA for Text Encoder (without `--network_train_unet_only`), more VRAM is required. Please refer to the settings below to reduce VRAM usage.
-
-__Options for GPUs with less VRAM:__
-
-By specifying `--blocks_to_swap`, you can save VRAM by swapping some blocks between CPU and GPU. See [FLUX.1 fine-tuning](#flux1-fine-tuning) for details.
-
-Specify a number like `--blocks_to_swap 10`. A larger number will swap more blocks, saving more VRAM, but training will be slower. In FLUX.1, you can swap up to 35 blocks.
-
-`--cpu_offload_checkpointing` offloads gradient checkpointing to CPU. This reduces VRAM usage by up to 1GB but slows down the training by about 15%. Cannot be used with `--blocks_to_swap`.
-
-The Adafactor optimizer may use less VRAM than 8bit AdamW. Please use settings like below:
-
-```
--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --lr_scheduler constant_with_warmup --max_grad_norm 0.0
-```
-
-The training can be done with 16GB VRAM GPUs with the batch size of 1. Please change your dataset configuration.
-
-The training can be done with 12GB VRAM GPUs with `--blocks_to_swap 16` with 8bit AdamW. Please use settings like below:
-
-```
--blocks_to_swap 16 
-```
-
-For GPUs with less than 10GB of VRAM, it is recommended to use an fp8 checkpoint for T5XXL. You can download `t5xxl_fp8_e4m3fn.safetensors` from [comfyanonymous/flux_text_encoders](https://huggingface.co/comfyanonymous/flux_text_encoders) (please use without `scaled`).
-
-10GB VRAM GPUs will work with 22 blocks swapped, and 8GB VRAM GPUs will work with 28 blocks swapped.
-
-__`--split_mode` is deprecated. This option is still available, but it will be removed in the future. Please use `--blocks_to_swap` instead. If this option is specified and `--blocks_to_swap` is not specified, `--blocks_to_swap 18` is automatically enabled.__
-
-#### Key Options for FLUX.1 LoRA training
-
-There are many unknown points in FLUX.1 training, so some settings can be specified by arguments. Here are the arguments. The arguments and sample settings are still experimental and may change in the future. Feedback on the settings is welcome.
-
- `--pretrained_model_name_or_path` is the path to the pretrained model (FLUX.1). bf16 (original BFL model) is recommended (`flux1-dev.safetensors` or `flux1-dev.sft`). If you specify `--fp8_base`, you can use fp8 models for FLUX.1. The fp8 model is only compatible with `float8_e4m3fn` format.
- `--clip_l` is the path to the CLIP-L model. 
- `--t5xxl` is the path to the T5XXL model. If you specify `--fp8_base`, you can use fp8 (float8_e4m3fn) models for T5XXL. However, it is recommended to use fp16 models for caching.
- `--ae` is the path to the autoencoder model (`ae.safetensors` or `ae.sft`).
-
- `--timestep_sampling` is the method to sample timesteps (0-1):
-  - `sigma`: sigma-based, same as SD3
-  - `uniform`: uniform random
-  - `sigmoid`: sigmoid of random normal, same as x-flux, AI-toolkit etc.
-  - `shift`: shifts the value of sigmoid of normal distribution random number
-  - `flux_shift`: shifts the value of sigmoid of normal distribution random number, depending on the resolution (same as FLUX.1 dev inference). `--discrete_flow_shift` is ignored when `flux_shift` is specified.
- `--sigmoid_scale` is the scale factor for sigmoid timestep sampling (only used when timestep-sampling is "sigmoid"). The default is 1.0. Larger values will make the sampling more uniform.
-  - This option is effective even when `--timestep_sampling shift` is specified.
-  - Normally, leave it at 1.0. Larger values make the value before shift closer to a uniform distribution.
- `--model_prediction_type` is how to interpret and process the model prediction:
-  - `raw`: use as is, same as x-flux
-  - `additive`: add to noisy input
-  - `sigma_scaled`: apply sigma scaling, same as SD3
- `--discrete_flow_shift` is the discrete flow shift for the Euler Discrete Scheduler, default is 3.0 (same as SD3).
- `--blocks_to_swap`. See [FLUX.1 fine-tuning](#flux1-fine-tuning) for details.
-
-The existing `--loss_type` option may be useful for FLUX.1 training. The default is `l2`.
-
-~~In our experiments, `--timestep_sampling sigma --model_prediction_type raw --discrete_flow_shift 1.0` with `--loss_type l2` seems to work better than the default (SD3) settings. The multiplier of LoRA should be adjusted.~~
-
-In our experiments, `--timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0` (with the default `l2` loss_type) seems to work better. 
-
-The settings in [AI Toolkit by Ostris](https://github.com/ostris/ai-toolkit) seem to be equivalent to `--timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0` (with the default `l2` loss_type). 
-
-Other settings may work better, so please try different settings.
-
-Other options are described below.
-
-#### Distribution of timesteps
-
-`--timestep_sampling`, `--sigmoid_scale`, and `--discrete_flow_shift` adjust the distribution of timesteps. The distribution is shown in the figures below.
-
-The effect of `--discrete_flow_shift` with `--timestep_sampling shift` (when `--sigmoid_scale` is not specified, the default is 1.0):
-![Figure_2](https://github.com/user-attachments/assets/d9de42f9-f17d-40da-b88d-d964402569c6)
-
-The difference between `--timestep_sampling sigmoid` and `--timestep_sampling uniform` (when `--timestep_sampling sigmoid` or `uniform` is specified, `--discrete_flow_shift` is ignored):
-![Figure_3](https://github.com/user-attachments/assets/27029009-1f5d-4dc0-bb24-13d02ac4fdad)
-
-The effect of `--timestep_sampling sigmoid` and `--sigmoid_scale` (when `--timestep_sampling sigmoid` is specified, `--discrete_flow_shift` is ignored):
-![Figure_4](https://github.com/user-attachments/assets/08a2267c-e47e-48b7-826e-f9a080787cdc)
-
-#### Key Features for FLUX.1 LoRA training
-
-1. CLIP-L and T5XXL LoRA Support:
-   - FLUX.1 LoRA training now supports CLIP-L and T5XXL LoRA training.
-   - Remove `--network_train_unet_only` from your command.
-   - Add `train_t5xxl=True` to `--network_args` to train T5XXL LoRA. CLIP-L is also trained at the same time.
-   - T5XXL output can be cached for CLIP-L LoRA training. So, `--cache_text_encoder_outputs` or `--cache_text_encoder_outputs_to_disk` is also available.
-   - The learning rates for CLIP-L and T5XXL can be specified separately. Multiple numbers can be specified in `--text_encoder_lr`. For example, `--text_encoder_lr 1e-4 1e-5`. The first value is the learning rate for CLIP-L, and the second value is for T5XXL. If you specify only one, the learning rates for CLIP-L and T5XXL will be the same. If `--text_encoder_lr` is not specified, the default learning rate `--learning_rate` is used for both CLIP-L and T5XXL.
-   - The trained LoRA can be used with ComfyUI.
-   - Note: `flux_extract_lora.py`, `convert_flux_lora.py` and `merge_flux_lora.py` do not support CLIP-L and T5XXL LoRA yet.
-
-    | trained LoRA|option|network_args|cache_text_encoder_outputs (*1)|
-    |---|---|---|---|
-    |FLUX.1|`--network_train_unet_only`|-|o|
-    |FLUX.1 + CLIP-L|-|-|o (*2)|
-    |FLUX.1 + CLIP-L + T5XXL|-|`train_t5xxl=True`|-|
-    |CLIP-L (*3)|`--network_train_text_encoder_only`|-|o (*2)|
-    |CLIP-L + T5XXL (*3)|`--network_train_text_encoder_only`|`train_t5xxl=True`|-|
-
-    - *1: `--cache_text_encoder_outputs` or `--cache_text_encoder_outputs_to_disk` is also available.
-    - *2: T5XXL output can be cached for CLIP-L LoRA training.
-    - *3: Not tested yet.
-
-2. Experimental FP8/FP16 mixed training:
-   - `--fp8_base_unet` enables training with fp8 for FLUX and bf16/fp16 for CLIP-L/T5XXL.
-   - FLUX can be trained with fp8, and CLIP-L/T5XXL can be trained with bf16/fp16.
-   - When specifying this option, the `--fp8_base` option is automatically enabled.
-
-3. Split Q/K/V Projection Layers (Experimental):
-   - Added an option to split the projection layers of q/k/v/txt in the attention and apply LoRA to each of them.
-   - Specify `"split_qkv=True"` in network_args like `--network_args "split_qkv=True"` (`train_blocks` is also available).
-   - May increase expressiveness but also training time.
-   - The trained model is compatible with normal LoRA models in sd-scripts and can be used in environments like ComfyUI.
-   - Converting to AI-toolkit (Diffusers) format with `convert_flux_lora.py` will reduce the size.
-   
-4. T5 Attention Mask Application:
-   - T5 attention mask is applied when `--apply_t5_attn_mask` is specified.
-   - Now applies mask when encoding T5 and in the attention of Double and Single Blocks
-   - Affects fine-tuning, LoRA training, and inference in `flux_minimal_inference.py`.
-
-5. Multi-resolution Training Support:
-   - FLUX.1 now supports multi-resolution training, even with caching latents to disk.
-
-
-Technical details of Q/K/V split: 
-
-In the implementation of Black Forest Labs' model, the projection layers of q/k/v (and txt in single blocks) are concatenated into one. If LoRA is added there as it is, the LoRA module is only one, and the dimension is large. In contrast, in the implementation of Diffusers, the projection layers of q/k/v/txt are separated. Therefore, the LoRA module is applied to q/k/v/txt separately, and the dimension is smaller. This option is for training LoRA similar to the latter.
-
-The compatibility of the saved model (state dict) is ensured by concatenating the weights of multiple LoRAs. However, since there are zero weights in some parts, the model size will be large.
-
-#### Specify rank for each layer in FLUX.1
-
-You can specify the rank for each layer in FLUX.1 by specifying the following network_args. If you specify `0`, LoRA will not be applied to that layer.
-
-When network_args is not specified, the default value (`network_dim`) is applied, same as before.
-
-|network_args|target layer|
-|---|---|
-|img_attn_dim|img_attn in DoubleStreamBlock|
-|txt_attn_dim|txt_attn in DoubleStreamBlock|
-|img_mlp_dim|img_mlp in DoubleStreamBlock|
-|txt_mlp_dim|txt_mlp in DoubleStreamBlock|
-|img_mod_dim|img_mod in DoubleStreamBlock|
-|txt_mod_dim|txt_mod in DoubleStreamBlock|
-|single_dim|linear1 and linear2 in SingleStreamBlock|
-|single_mod_dim|modulation in SingleStreamBlock|
-
-`"verbose=True"` is also available for debugging. It shows the rank of each layer.
-
-example: 
-```
--network_args "img_attn_dim=4" "img_mlp_dim=8" "txt_attn_dim=2" "txt_mlp_dim=2" 
-"img_mod_dim=2" "txt_mod_dim=2" "single_dim=4" "single_mod_dim=2" "verbose=True"
-```
-
-You can apply LoRA to the conditioning layers of Flux by specifying `in_dims` in network_args. When specifying, be sure to specify 5 numbers in `[]` as a comma-separated list.
-
-example: 
-```
--network_args "in_dims=[4,2,2,2,4]"
-```
-
-Each number corresponds to `img_in`, `time_in`, `vector_in`, `guidance_in`, `txt_in`. The above example applies LoRA to all conditioning layers, with rank 4 for `img_in`, 2 for `time_in`, `vector_in`, `guidance_in`, and 4 for `txt_in`.
-
-If you specify `0`, LoRA will not be applied to that layer. For example, `[4,0,0,0,4]` applies LoRA only to `img_in` and `txt_in`.
-
-#### Specify blocks to train in FLUX.1 LoRA training
-
-You can specify the blocks to train in FLUX.1 LoRA training by specifying `train_double_block_indices` and `train_single_block_indices` in network_args. The indices are 0-based. The default (when omitted) is to train all blocks. The indices are specified as a list of integers or a range of integers, like `0,1,5,8` or `0,1,4-5,7`. The number of double blocks is 19, and the number of single blocks is 38, so the valid range is 0-18 and 0-37, respectively. `all` is also available to train all blocks, `none` is also available to train no blocks.
-
-example: 
-```
--network_args "train_double_block_indices=0,1,8-12,18" "train_single_block_indices=3,10,20-25,37"
-```
-
-```
--network_args "train_double_block_indices=none" "train_single_block_indices=10-15"
-```
-
-If you specify one of `train_double_block_indices` or `train_single_block_indices`, the other will be trained as usual. 
-
-### FLUX.1 ControlNet training
-We have added a new training script for ControlNet training. The script is `flux_train_control_net.py`. See `--help` for options.
-
-Sample command is below. It will work with 80GB VRAM GPUs.
-```
-accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_control_net.py
--pretrained_model_name_or_path flux1-dev.safetensors --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors
--ae ae.safetensors --save_model_as safetensors --sdpa --persistent_data_loader_workers
--max_data_loader_n_workers 1 --seed 42 --gradient_checkpointing --mixed_precision bf16
--optimizer_type adamw8bit --learning_rate 2e-5 
--highvram --max_train_epochs 1 --save_every_n_steps 1000 --dataset_config dataset.toml
--output_dir /path/to/output/dir --output_name flux-cn
--timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 --deepspeed
-```
-
-For 24GB VRAM GPUs, you can train with 16 blocks swapped, caching latents and text encoder outputs, with a batch size of 1. Remove `--deepspeed`. Sample options are below. Not fully tested.
-```
- --blocks_to_swap 16 --cache_latents_to_disk --cache_text_encoder_outputs_to_disk 
-```
-
-The training can be done with 16GB VRAM GPUs with around 30 blocks swapped. 
-
-`--gradient_accumulation_steps` is also available. The default value is 1 (no accumulation), but according to the original PR, 8 is used.
-
-### FLUX.1 OFT training
-
-You can train OFT with almost the same options as LoRA, such as `--timestep_sampling`. The following points are different.
-
- Change `--network_module` from `networks.lora_flux` to `networks.oft_flux`.
- `--network_dim` is the number of OFT blocks. Unlike LoRA rank, the smaller the dim, the larger the model. We recommend about 64 or 128. Please make the output dimension of the target layer of OFT divisible by the value of `--network_dim` (an error will occur if it is not divisible). Valid values are 64, 128, 256, 512, 1024, etc.
- `--network_alpha` is treated as a constraint for OFT. We recommend about 1e-2 to 1e-4. The default value when omitted is 1, which is too large, so be sure to specify it.
- CLIP/T5XXL is not supported. Specify `--network_train_unet_only`.
- `--network_args` specifies the hyperparameters of OFT. The following are valid:
-    - Specify `enable_all_linear=True` to target all linear connections in the MLP layer. The default is False, which targets only attention.
-
-Currently, there is no environment to infer FLUX.1 OFT. Inference is only possible with `flux_minimal_inference.py` (specify OFT model with `--lora`).
-
-Sample command is below. It will work with 24GB VRAM GPUs with the batch size of 1.
-
-```
--network_module networks.oft_flux  --network_dim 128 --network_alpha 1e-3 
--network_args "enable_all_linear=True" --learning_rate 1e-5 
-```
-
-The training can be done with 16GB VRAM GPUs without `--enable_all_linear` option and with Adafactor optimizer. 
-
-### Inference for FLUX.1 with LoRA model
-
-The inference script is also available. The script is `flux_minimal_inference.py`. See `--help` for options. 
-
-```
-python flux_minimal_inference.py --ckpt flux1-dev.safetensors --clip_l sd3/clip_l.safetensors --t5xxl sd3/t5xxl_fp16.safetensors --ae ae.safetensors --dtype bf16 --prompt "a cat holding a sign that says hello world" --out path/to/output/dir --seed 1 --flux_dtype fp8 --offload --lora lora-flux-name.safetensors;1.0
-```
-
-### FLUX.1 fine-tuning
-
-The memory-efficient training with block swap is based on 2kpr's implementation. Thanks to 2kpr!
-
-__`--double_blocks_to_swap` and `--single_blocks_to_swap` are deprecated. These options are still available, but they will be removed in the future. Please use `--blocks_to_swap` instead. These options are equivalent to specifying `double_blocks_to_swap + single_blocks_to_swap // 2` in `--blocks_to_swap`.__
-
-Sample command for FLUX.1 fine-tuning is below. This will work with 24GB VRAM GPUs, and 64GB main memory is recommended. 
-
-```
-accelerate launch  --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py   
--pretrained_model_name_or_path flux1-dev.safetensors  --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors --ae ae_dev.safetensors 
--save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 
--seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 
--dataset_config dataset_1024_bs1.toml  --output_dir path/to/output/dir --output_name output-name 
--learning_rate 5e-5 --max_train_epochs 4  --sdpa --highvram --cache_text_encoder_outputs_to_disk --cache_latents_to_disk --save_every_n_epochs 1 
--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" 
--lr_scheduler constant_with_warmup --max_grad_norm 0.0 
--timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 
--fused_backward_pass  --blocks_to_swap 8 --full_bf16 
-```
-(The command is multi-line for readability. Please combine it into one line.)
-
-Options are almost the same as LoRA training. The differences are `--full_bf16`, `--fused_backward_pass` and `--blocks_to_swap`. `--cpu_offload_checkpointing` is also available.
-
-`--full_bf16` enables the training with bf16 (weights and gradients). 
-
-`--fused_backward_pass` enables the fusing of the optimizer step into the backward pass for each parameter. This reduces the memory usage during training. Only Adafactor optimizer is supported for now. Stochastic rounding is also enabled when `--fused_backward_pass` and `--full_bf16` are specified.
-
-`--blockwise_fused_optimizers` enables the fusing of the optimizer step into the backward pass for each block. This is similar to `--fused_backward_pass`. Any optimizer can be used, but Adafactor is recommended for memory efficiency and stochastic rounding. `--blockwise_fused_optimizers` cannot be used with `--fused_backward_pass`. Stochastic rounding is not supported for now.
-
-`--blocks_to_swap` is the number of blocks to swap. The default is None (no swap). The maximum value is 35.
-
-`--cpu_offload_checkpointing` offloads gradient checkpointing to CPU. This reduces VRAM usage by about 2GB. This option cannot be used with `--blocks_to_swap`.
-
-All these options are experimental and may change in the future.
-
-Increasing the number of blocks to swap may reduce memory usage, but training will be slower. `--cpu_offload_checkpointing` also slows down the training.
-
-Swapping 8 blocks without CPU offload checkpointing may be a good starting point for 24GB VRAM GPUs. Please try different settings according to VRAM usage and training speed.
-
-The learning rate and the number of epochs are not optimized yet. Please adjust them according to the training results.
-
-#### How to use block swap
-
-There are two possible ways to use block swap. It is unknown which is better.
-
-1. Swap the minimum number of blocks that fit in VRAM with batch size 1 and shorten the training time per step.
-
-    The above command example is for this usage.
-
-2. Swap many blocks to increase the batch size and shorten the training time per sample.
-
-    For example, swapping 35 blocks seems to increase the batch size to about 5. In this case, the training time per sample will be relatively shorter than with option 1.
-  
-#### Training with <24GB VRAM GPUs
-
-Swapping 28 blocks without CPU offload checkpointing may work with 12GB VRAM GPUs. Please try different settings according to the VRAM size of your GPU.
-
-T5XXL requires about 10GB of VRAM, so 10GB of VRAM is the minimum requirement for FLUX.1 fine-tuning. 
-
-#### Key Features for FLUX.1 fine-tuning
-
-1.  Technical details of block swap:
-    - Reduce memory usage by transferring double and single blocks of FLUX.1 from GPU to CPU when they are not needed.
-    - During forward pass, the weights of the blocks that have finished calculation are transferred to CPU, and the weights of the blocks to be calculated are transferred to GPU.
-    - The same is true for the backward pass, but the order is reversed. The gradients remain on the GPU.
-    - Since the transfer between CPU and GPU takes time, the training will be slower.
-    - `--blocks_to_swap` specifies the number of blocks to swap. 
-    - About 640MB of memory can be saved per block.
-  - (Update 1: Nov 12, 2024) 
-    - The maximum number of blocks that can be swapped is 35.
-    - We are exchanging only the data of the weights (weight.data) in reference to the implementation of OneTrainer (thanks to OneTrainer). However, the mechanism of the exchange is a custom implementation.
-    - Since it takes time to free CUDA memory (torch.cuda.empty_cache()), we reuse the CUDA memory allocated to weight.data as it is and exchange the weights between modules.
-    - This shortens the time it takes to exchange weights between modules.
-    - Since the weights must be almost identical to be exchanged, FLUX.1 exchanges the weights between double blocks and single blocks.
-    - In SD3, all blocks are similar, but some weights are different, so there are weights that always remain on the GPU.
-
-2. Sample Image Generation:
-   - Sample image generation during training is now supported.
-   - The prompts are cached and used for generation if `--cache_latents` is specified. So changing the prompts during training will not affect the generated images.
-   - Specify options such as `--sample_prompts` and `--sample_every_n_epochs`.
-   - Note: It will be very slow when `--blocks_to_swap` is specified.
-
-3. Experimental Memory-Efficient Saving:
-   - `--mem_eff_save` option can further reduce memory consumption during model saving (about 22GB).
-   - This is a custom implementation and may cause unexpected issues. Use with caution.
-
-4. T5XXL Token Length Control:
-   - Added `--t5xxl_max_token_length` option to specify the maximum token length of T5XXL.
-   - Default is 512 in dev and 256 in schnell models.
-
-5. Multi-GPU Training Support:
-   - Note: `--double_blocks_to_swap` and `--single_blocks_to_swap` cannot be used in multi-GPU training.
-
-6. Disable mmap Load for Safetensors:
-   - `--disable_mmap_load_safetensors` option now works in `flux_train.py`.
-   - Speeds up model loading during training in WSL2.
-   - Effective in reducing memory usage when loading models during multi-GPU training.
-
-
-### Extract LoRA from FLUX.1 Models
-
-Script: `networks/flux_extract_lora.py`
-
-Extracts LoRA from the difference between two FLUX.1 models.
-
-Offers memory-efficient option with `--mem_eff_safe_open`.
-
-CLIP-L LoRA is not supported.
-
-### Convert FLUX LoRA
-
-Script: `convert_flux_lora.py`
-
-Converts LoRA between sd-scripts format (BFL-based) and AI-toolkit format (Diffusers-based).
-
-If you use LoRA in the inference environment, converting it to AI-toolkit format may reduce temporary memory usage.
-
-Note that re-conversion will increase the size of LoRA.
-
-CLIP-L/T5XXL LoRA is not supported.
-
-### Merge LoRA to FLUX.1 checkpoint
-
-`networks/flux_merge_lora.py` merges LoRA to FLUX.1 checkpoint, CLIP-L or T5XXL models. __The script is experimental.__ 
-
-```
-python networks/flux_merge_lora.py --flux_model flux1-dev.safetensors --save_to output.safetensors --models lora1.safetensors --ratios 2.0 --save_precision fp16 --loading_device cuda --working_device cpu
-```
-
-You can also merge multiple LoRA models into a FLUX.1 model. Specify multiple LoRA models in `--models`. Specify the same number of ratios in `--ratios`.
-
-CLIP-L and T5XXL LoRA are supported. `--clip_l` and `--clip_l_save_to` are for CLIP-L, `--t5xxl` and `--t5xxl_save_to` are for T5XXL. Sample command is below.
-
-```
--clip_l clip_l.safetensors --clip_l_save_to merged_clip_l.safetensors  --t5xxl t5xxl_fp16.safetensors --t5xxl_save_to merged_t5xxl.safetensors
-```
-
-FLUX.1, CLIP-L, and T5XXL can be merged together or separately for memory efficiency.
-
-An experimental option `--mem_eff_load_save` is available. This option is for memory-efficient loading and saving. It may also speed up loading and saving. 
-
-`--loading_device` is the device to load the LoRA models. `--working_device` is the device to merge (calculate) the models. Default is `cpu` for both. Loading / working device examples are below (in the case of `--save_precision fp16` or `--save_precision bf16`, `float32` will consume more memory):
-
- 'cpu' / 'cpu': Uses >50GB of RAM, but works on any machine.
- 'cuda' / 'cpu': Uses 24GB of VRAM, but requires 30GB of RAM.
- 'cpu' / 'cuda': Uses 4GB of VRAM, but requires 50GB of RAM, faster than 'cpu' / 'cpu' or 'cuda' / 'cpu'.
- 'cuda' / 'cuda': Uses 30GB of VRAM, but requires 30GB of RAM, faster than 'cpu' / 'cpu' or 'cuda' / 'cpu'.
-
-`--save_precision` is the precision used to save the merged model. If the LoRA models were trained with `bf16`, we are not sure whether `fp16` or `bf16` is the better choice for `--save_precision`.
-
-The script can merge multiple LoRA models. To do so, specify the `--concat` option so that the merged LoRA model works properly.
-
-### FLUX.1 Multi-resolution training
-
-You can define multiple resolutions in the dataset configuration file.
-
-The dataset configuration file is like below. You can define multiple resolutions with different batch sizes. The resolutions are defined in the `[[datasets]]` section. The `[[datasets.subsets]]` section is for the dataset directory. Please specify the same directory for each resolution.
-
-```
-[general]
-# define common settings here
-flip_aug = true
-color_aug = false
-keep_tokens_separator= "|||"
-shuffle_caption = false
-caption_tag_dropout_rate = 0
-caption_extension = ".txt"
-
-[[datasets]]
-# define the first resolution here
-batch_size = 2
-enable_bucket = true
-resolution = [1024, 1024]
-
-  [[datasets.subsets]]
-  image_dir = "path/to/image/dir"
-  num_repeats = 1
-
-[[datasets]]
-# define the second resolution here
-batch_size = 3
-enable_bucket = true
-resolution = [768, 768]
-
-  [[datasets.subsets]]
-  image_dir = "path/to/image/dir"
-  num_repeats = 1
-
-[[datasets]]
-# define the third resolution here
-batch_size = 4
-enable_bucket = true
-resolution = [512, 512]
-
-  [[datasets.subsets]]
-  image_dir = "path/to/image/dir"
-  num_repeats = 1
-```
-
-### Convert Diffusers to FLUX.1
-
-Script: `tools/convert_diffusers_to_flux.py`
-
-Converts Diffusers models to FLUX.1 models. The script is experimental. See `--help` for options. schnell and dev models are supported. AE/CLIP/T5XXL are not supported. The diffusers folder is the parent folder of the `transformer` folder.
-
-```
-python tools/convert_diffusers_to_flux.py --diffusers_path path/to/diffusers_folder_or_00001_safetensors --save_to path/to/flux1.safetensors --mem_eff_load_save --save_precision bf16
-```
-
-## SD3 training
-
-SD3.5L/M training is now available. 
-
-### SD3 LoRA training
-
-The script is `sd3_train_network.py`. See `--help` for options. 
-
-SD3 model, CLIP-L, CLIP-G, and T5XXL models are recommended to be in float/fp16 format. If you specify `--fp8_base`, you can use fp8 models for SD3. The fp8 model is only compatible with `float8_e4m3fn` format.
-
-Sample command is below. It will work with 16GB VRAM GPUs (SD3.5L).
-
-```
-accelerate launch  --mixed_precision bf16 --num_cpu_threads_per_process 1 sd3_train_network.py 
--pretrained_model_name_or_path path/to/sd3.5_large.safetensors --clip_l sd3/clip_l.safetensors --clip_g sd3/clip_g.safetensors --t5xxl sd3/t5xxl_fp16.safetensors 
--cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers 
--max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 
--network_module networks.lora_sd3 --network_dim 4 --network_train_unet_only 
--optimizer_type adamw8bit --learning_rate 1e-4 
--cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base 
--highvram --max_train_epochs 4 --save_every_n_epochs 1 --dataset_config dataset_1024_bs2.toml 
--output_dir path/to/output/dir --output_name sd3-lora-name 
-```
-(The command is multi-line for readability. Please combine it into one line.)
-
-Like FLUX.1 training, the `--blocks_to_swap` option for memory reduction is available. The maximum number of blocks that can be swapped is 36 for SD3.5L and 22 for SD3.5M.
-
-Adafactor optimizer is also available.
-
-`--cpu_offload_checkpointing` option is not available.
-
-We are also not sure how many epochs are needed for convergence, or how the learning rate should be adjusted.
-
-The trained LoRA model can be used with ComfyUI. 
-
-#### Key Options for SD3 LoRA training
-
-Here are the arguments. The arguments and sample settings are still experimental and may change in the future. Feedback on the settings is welcome.
-
- `--network_module` is the module for LoRA training. Specify `networks.lora_sd3` for SD3 LoRA training.
- `--pretrained_model_name_or_path` is the path to the pretrained model (SD3/3.5). If you specify `--fp8_base`, you can use fp8 models for SD3/3.5. The fp8 model is only compatible with `float8_e4m3fn` format.
- `--clip_l` is the path to the CLIP-L model. 
- `--clip_g` is the path to the CLIP-G model.
- `--t5xxl` is the path to the T5XXL model. If you specify `--fp8_base`, you can use fp8 (float8_e4m3fn) models for T5XXL. However, it is recommended to use fp16 models for caching.
- `--vae` is the path to the autoencoder model. __This option is not necessary for SD3.__ VAE is included in the standard SD3 model.
- `--disable_mmap_load_safetensors` is to disable memory mapping when loading safetensors. __This option significantly reduces the memory usage when loading models for Windows users.__
- `--clip_l_dropout_rate`, `--clip_g_dropout_rate` and `--t5_dropout_rate` are the dropout rates for the embeddings of CLIP-L, CLIP-G, and T5XXL, described in [SAI research paper](http://arxiv.org/pdf/2403.03206). The default is 0.0. For LoRA training, it seems better to keep them at 0.0.
- `--pos_emb_random_crop_rate` is the rate of random cropping of positional embeddings, described in [SD3.5M model card](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium). The default is 0. It seems better to keep it at 0 for LoRA training.
- `--enable_scaled_pos_embed` is to enable the scaled positional embeddings. The default is False. This option is an experimental feature for SD3.5M. Details are described below.
- `--training_shift` is the shift value for the training distribution of timesteps. The default is 1.0 (uniform distribution, no shift).  If less than 1.0, the side closer to the image is more sampled, and if more than 1.0, the side closer to noise is more sampled. 
-
-Other options are described below.
-
-#### Key Features for SD3 LoRA training
-
-1. CLIP-L, G and T5XXL LoRA Support:
-   - SD3 LoRA training now supports CLIP-L, CLIP-G and T5XXL LoRA training.
-   - Remove `--network_train_unet_only` from your command.
-   - Add `train_t5xxl=True` to `--network_args` to train T5XXL LoRA. CLIP-L and CLIP-G are also trained at the same time.
-   - T5XXL output can be cached for CLIP-L and G LoRA training. So, `--cache_text_encoder_outputs` or `--cache_text_encoder_outputs_to_disk` is also available.
-   - The learning rates for CLIP-L, CLIP-G and T5XXL can be specified separately. Multiple numbers can be specified in `--text_encoder_lr`. For example, `--text_encoder_lr 1e-4 1e-5 5e-6`. The first value is the learning rate for CLIP-L, the second value is for CLIP-G, and the third value is for T5XXL. If you specify only one, the learning rates for CLIP-L, CLIP-G and T5XXL will be the same. If the third value is not specified, the second value is used for T5XXL. If `--text_encoder_lr` is not specified, the default learning rate `--learning_rate` is used for CLIP-L, CLIP-G and T5XXL.
-   - The trained LoRA can be used with ComfyUI.
-
-    | trained LoRA|option|network_args|cache_text_encoder_outputs (*1)|
-    |---|---|---|---|
-    |MMDiT|`--network_train_unet_only`|-|o|
-    |MMDiT + CLIP-L + CLIP-G|-|-|o (*2)|
-    |MMDiT + CLIP-L + CLIP-G + T5XXL|-|`train_t5xxl=True`|-|
-    |CLIP-L + CLIP-G (*3)|`--network_train_text_encoder_only`|-|o (*2)|
-    |CLIP-L + CLIP-G + T5XXL (*3)|`--network_train_text_encoder_only`|`train_t5xxl=True`|-|
-
-    - *1: `--cache_text_encoder_outputs` or `--cache_text_encoder_outputs_to_disk` is also available.
-    - *2: T5XXL output can be cached for CLIP-L and G LoRA training.
-    - *3: Not tested yet.
-
-2. Experimental FP8/FP16 mixed training:
-   - `--fp8_base_unet` enables training with fp8 for MMDiT and bf16/fp16 for CLIP-L/G/T5XXL.
-   - When specifying this option, the `--fp8_base` option is automatically enabled.
-
-3. Split Q/K/V Projection Layers (Experimental):
-   - Same as FLUX.1.
-   
-4. CLIP-L/G and T5 Attention Mask Application:
-   - This function is planned to be implemented in the future.
-   
-5. Multi-resolution Training Support:
-   - Only for SD3.5M. 
-   - Same as FLUX.1 for data preparation.
-   - If you train with multiple resolutions, you can enable the scaled positional embeddings with `--enable_scaled_pos_embed`. The default is False. __This option is an experimental feature.__
-
-6. Weighting scheme and training shift:
-   - The weighting scheme is described in the section 3.1 of the [SD3 paper](https://arxiv.org/abs/2403.03206v1). 
-   - The uniform distribution is the default. If you want to change the distribution, see `--help` for options. 
-   - `--training_shift` is the shift value for the training distribution of timesteps.
-   - The effect of a shift in uniform distribution is shown in the figure below.
-   - ![Figure_1](https://github.com/user-attachments/assets/99a72c67-adfb-4440-81d4-a718985ff350)
-
-Technical details of multi-resolution training for SD3.5M:
-
-SD3.5M does not use scaled positional embeddings for multi-resolution training, and is trained with a single positional embedding. Therefore, this feature is very experimental.
-
-Generally, in multi-resolution training, the values of the positional embeddings must be the same for each resolution. That is, the same value must be in the same position for 512x512, 768x768, and 1024x1024. To achieve this, the positional embeddings for each resolution are calculated in advance and switched according to the resolution of the training data. This feature is enabled by `--enable_scaled_pos_embed`.
-
-This idea and the code for calculating scaled positional embeddings are contributed by KohakuBlueleaf. Thanks to KohakuBlueleaf!
-
-
-#### Specify rank for each layer in SD3 LoRA
-
-You can specify the rank for each layer in SD3 by specifying the following network_args. If you specify `0`, LoRA will not be applied to that layer.
-
-When network_args is not specified, the default value (`network_dim`) is applied, same as before.
-
-|network_args|target layer|
-|---|---|
-|context_attn_dim|attn in context_block|
-|context_mlp_dim|mlp in context_block|
-|context_mod_dim|adaLN_modulation in context_block|
-|x_attn_dim|attn in x_block|
-|x_mlp_dim|mlp in x_block|
-|x_mod_dim|adaLN_modulation in x_block|
-
-`"verbose=True"` is also available for debugging. It shows the rank of each layer.
-
-example: 
-```
--network_args "context_attn_dim=2" "context_mlp_dim=3" "context_mod_dim=4" "x_attn_dim=5" "x_mlp_dim=6" "x_mod_dim=7" "verbose=True"
-```
-
-You can apply LoRA to the conditioning layers of SD3 by specifying `emb_dims` in network_args. When specifying, be sure to specify 6 numbers in `[]` as a comma-separated list.
-
-example: 
-```
--network_args "emb_dims=[2,3,4,5,6,7]"
-```
-
-Each number corresponds to `context_embedder`, `t_embedder`, `x_embedder`, `y_embedder`, `final_layer_adaLN_modulation`, `final_layer_linear`. The above example applies LoRA to all conditioning layers, with rank 2 for `context_embedder`, 3 for `t_embedder`, 4 for `x_embedder`, 5 for `y_embedder`, 6 for `final_layer_adaLN_modulation`, and 7 for `final_layer_linear`.
-
-If you specify `0`, LoRA will not be applied to that layer. For example, `[4,0,0,4,0,0]` applies LoRA only to `context_embedder` and `y_embedder`.
-
-#### Specify blocks to train in SD3 LoRA training
-
-You can specify the blocks to train in SD3 LoRA training by specifying `train_block_indices` in network_args. The indices are 0-based. The default (when omitted) is to train all blocks. The indices are specified as a list of integers or a range of integers, like `0,1,5,8` or `0,1,4-5,7`. 
-
-The number of blocks depends on the model. The valid range is 0-(the number of blocks - 1). `all` is also available to train all blocks, `none` is also available to train no blocks.
-
-example: 
-```
--network_args "train_block_indices=1,2,6-8" 
-```
-
-### Inference for SD3 with LoRA model
-
-The inference script is also available. The script is `sd3_minimal_inference.py`. See `--help` for options. 
-
-### SD3 fine-tuning
-
-Documentation is not available yet. Please refer to the FLUX.1 fine-tuning guide for now. The major differences are as follows:
-
- `--clip_g` is also available for SD3 fine-tuning.
- `--timestep_sampling`, `--discrete_flow_shift`, `--model_prediction_type` and `--guidance_scale` are not necessary for SD3 fine-tuning.
- Use `--vae` instead of `--ae` if necessary. __This option is not necessary for SD3.__ VAE is included in the standard SD3 model.
- `--disable_mmap_load_safetensors` is available. __This option significantly reduces the memory usage when loading models for Windows users.__
- `--cpu_offload_checkpointing` is not available for SD3 fine-tuning.
- `--clip_l_dropout_rate`, `--clip_g_dropout_rate` and `--t5_dropout_rate` are available same as LoRA training. 
- `--pos_emb_random_crop_rate` and `--enable_scaled_pos_embed` are available for SD3.5M fine-tuning.
- Training text encoders is available with `--train_text_encoder` option, similar to SDXL training.
-  - CLIP-L and G can be trained with `--train_text_encoder` option. Training T5XXL needs `--train_t5xxl` option.
-  - To use cached text encoder outputs for T5XXL while training CLIP-L and CLIP-G, specify `--use_t5xxl_cache_only`. With this option, cached text encoder outputs are used for T5XXL only.
-  - The learning rates for CLIP-L, CLIP-G and T5XXL can be specified separately. `--text_encoder_lr1`, `--text_encoder_lr2` and `--text_encoder_lr3` are available. 
-
-### Extract LoRA from SD3 Models
-
-Not available yet.
-
-### Convert SD3 LoRA
-
-Not available yet.
-
-### Merge LoRA to SD3 checkpoint
-
-Not available yet.
-
 --- 

 [__Change History__](#change-history) is moved to the bottom of the page. 
--- a/docs/config_README-en.md
+++ b/docs/config_README-en.md
@@ -1,9 +1,6 @@
-Original Source by kohya-ss
+First version: A.I Translation by Model: NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO, editing by Darkstorm2150

-First version:
-A.I Translation by Model: NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO, editing by Darkstorm2150
-
-Some parts are manually added.
+This document is updated and maintained manually.

 # Config Readme

@@ -267,10 +264,10 @@ The following command line argument options are ignored if a configuration file
 * `--reg_data_dir`
 * `--in_json`

-The following command line argument options are given priority over the configuration file options if both are specified simultaneously. In most cases, they have the same names as the corresponding options in the configuration file.
+For the command line options listed below, if an option is specified in both the command line arguments and the configuration file, the value from the configuration file will be given priority. Unless otherwise noted, the option names are the same.

-| Command Line Argument Option   | Prioritized Configuration File Option |
-| ------------------------------- | ------------------------------------- |
+| Command Line Argument Option   | Corresponding Configuration File Option |
+| ------------------------------- | --------------------------------------- |
 | `--bucket_no_upscale`           |                                       |
 | `--bucket_reso_steps`           |                                       |
 | `--caption_dropout_every_n_epochs` |                                       |
--- a/docs/fine_tune.md
+++ b/docs/fine_tune.md
@@ -0,0 +1,347 @@
+# Fine-tuning Guide
+
+This document explains how to perform fine-tuning on various model architectures using the `*_train.py` scripts.
+
+<details>
+<summary>日本語</summary>
+
+# Fine-tuning ガイド
+
+このドキュメントでは、`*_train.py` スクリプトを用いた、各種モデルアーキテクチャのFine-tuningの方法について解説します。
+
+</details>
+
+### Difference between Fine-tuning and LoRA tuning
+
+This repository supports two methods for additional model training: **Fine-tuning** and **LoRA (Low-Rank Adaptation)**. Each method has distinct features and advantages.
+
+**Fine-tuning** is a method that retrains all (or most) of the weights of a pre-trained model.
+- **Pros**: It can improve the overall expressive power of the model and is suitable for learning styles or concepts that differ significantly from the original model.
+- **Cons**:
+    - It requires a large amount of VRAM and computational cost.
+    - The saved file size is large (same as the original model).
+    - It is prone to "overfitting," where the model loses the diversity of the original model if over-trained.
+- **Corresponding scripts**: Scripts named `*_train.py`, such as `sdxl_train.py`, `sd3_train.py`, `flux_train.py`, and `lumina_train.py`.
+
+**LoRA tuning** is a method that freezes the model's weights and only trains a small additional network called an "adapter."
+- **Pros**:
+    - It allows for fast training with low VRAM and computational cost.
+    - It is considered resistant to overfitting because it trains fewer weights.
+    - The saved file (LoRA network) is very small, ranging from tens to hundreds of MB, making it easy to manage.
+    - Multiple LoRAs can be used in combination.
+- **Cons**: Since it does not train the entire model, it may not achieve changes as significant as fine-tuning.
+- **Corresponding scripts**: Scripts named `*_train_network.py`, such as `sdxl_train_network.py`, `sd3_train_network.py`, and `flux_train_network.py`.
+
+| Feature | Fine-tuning | LoRA tuning |
+|:---|:---|:---|
+| **Training Target** | All model weights | Additional network (adapter) only |
+| **VRAM/Compute Cost**| High | Low |
+| **Training Time** | Long | Short |
+| **File Size** | Large (several GB) | Small (few MB to hundreds of MB) |
+| **Overfitting Risk** | High | Low |
+| **Suitable Use Case** | Major style changes, concept learning | Adding specific characters or styles |
+
+Generally, it is recommended to start with **LoRA tuning** if you want to add a specific character or style. **Fine-tuning** is a valid option for more fundamental style changes or when aiming for a high-quality model.
+
+<details>
+<summary>日本語</summary>
+
+### Fine-tuningとLoRA学習の違い
+
+このリポジトリでは、モデルの追加学習手法として**Fine-tuning**と**LoRA (Low-Rank Adaptation)**学習の2種類をサポートしています。それぞれの手法には異なる特徴と利点があります。
+
+**Fine-tuning**は、事前学習済みモデルの重み全体（または大部分）を再学習する手法です。
+- **利点**: モデル全体の表現力を向上させることができ、元のモデルから大きく変化した画風やコンセプトの学習に適しています。
+- **欠点**:
+    - 学習には多くのVRAMと計算コストが必要です。
+    - 保存されるファイルサイズが大きくなります（元のモデルと同じサイズ）。
+    - 学習させすぎると、元のモデルが持っていた多様性が失われる「過学習（overfitting）」に陥りやすい傾向があります。
+- **対応スクリプト**: `sdxl_train.py`, `sd3_train.py`, `flux_train.py`, `lumina_train.py` など、`*_train.py` という命名規則のスクリプトが対応します。
+
+**LoRA学習**は、モデルの重みは凍結（固定）したまま、「アダプター」と呼ばれる小さな追加ネットワークのみを学習する手法です。
+- **利点**:
+    - 少ないVRAMと計算コストで高速に学習できます。
+    - 学習する重みが少ないため、過学習に強いとされています。
+    - 保存されるファイル（LoRAネットワーク）は数十〜数百MBと非常に小さく、管理が容易です。
+    - 複数のLoRAを組み合わせて使用することも可能です。
+- **欠点**: モデル全体を学習するわけではないため、Fine-tuningほどの大きな変化は期待できない場合があります。
+- **対応スクリプト**: `sdxl_train_network.py`, `sd3_train_network.py`, `flux_train_network.py` など、`*_train_network.py` という命名規則のスクリプトが対応します。
+
+| 特徴 | Fine-tuning | LoRA学習 |
+|:---|:---|:---|
+| **学習対象** | モデルの全重み | 追加ネットワーク（アダプター）のみ |
+| **VRAM/計算コスト**| 大 | 小 |
+| **学習時間** | 長 | 短 |
+| **ファイルサイズ** | 大（数GB） | 小（数MB〜数百MB） |
+| **過学習リスク** | 高 | 低 |
+| **適した用途** | 大規模な画風変更、コンセプト学習 | 特定のキャラ、画風の追加学習 |
+
+一般的に、特定のキャラクターや画風を追加したい場合は**LoRA学習**から試すことが推奨されます。より根本的な画風の変更や、高品質なモデルを目指す場合は**Fine-tuning**が有効な選択肢となります。
+
+</details>
+
+--- 
+
+### Fine-tuning for each architecture
+
+Fine-tuning updates all of the model's weights, so its options and considerations differ from those of LoRA tuning. This section describes the fine-tuning scripts for the major architectures.
+
+The basic command structure is common to all architectures.
+
+```bash
+accelerate launch --mixed_precision bf16 {script_name}.py \
+  --pretrained_model_name_or_path <path_to_model> \
+  --dataset_config <path_to_config.toml> \
+  --output_dir <output_directory> \
+  --output_name <model_output_name> \
+  --save_model_as safetensors \
+  --max_train_steps 10000 \
+  --learning_rate 1e-5 \
+  --optimizer_type AdamW8bit
+```
+
+<details>
+<summary>日本語</summary>
+
+### 各アーキテクチャのFine-tuning
+
+Fine-tuningはモデルの重み全体を更新するため、LoRA学習とは異なるオプションや考慮事項があります。ここでは主要なアーキテクチャごとのFine-tuningスクリプトについて説明します。
+
+基本的なコマンドの構造は、どのアーキテクチャでも共通です。
+
+```bash
+accelerate launch --mixed_precision bf16 {script_name}.py \
+  --pretrained_model_name_or_path <path_to_model> \
+  --dataset_config <path_to_config.toml> \
+  --output_dir <output_directory> \
+  --output_name <model_output_name> \
+  --save_model_as safetensors \
+  --max_train_steps 10000 \
+  --learning_rate 1e-5 \
+  --optimizer_type AdamW8bit
+```
+
+</details>
+
+#### SDXL (`sdxl_train.py`)
+
+Performs fine-tuning for SDXL models. It is possible to train both the U-Net and the Text Encoders.
+
+**Key Options:**
+
+- `--train_text_encoder`: Includes the weights of the Text Encoders (CLIP ViT-L and OpenCLIP ViT-bigG) in the training. Effective for significant style changes or strongly learning specific concepts.
+- `--learning_rate_te1`, `--learning_rate_te2`: Set individual learning rates for each Text Encoder.
+- `--block_lr`: Divides the U-Net into 23 blocks and sets a different learning rate for each block. This allows for advanced adjustments, such as strengthening or weakening the learning of specific layers. (Not available in LoRA tuning).
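+
+As a minimal sketch of how `--block_lr` might be passed, the snippet below builds a comma-separated list of 23 learning rates in the shell and hands it to `sdxl_train.py`. The uniform `1e-5` value, the `BLOCK_LR` variable name, and the output name are illustrative assumptions, and the comma-separated format should be confirmed with `--help` before use.
+
+```bash
+# Build "1e-5,1e-5,...,1e-5" with 23 entries, one per U-Net block (illustrative values).
+BLOCK_LR=$(printf '1e-5,%.0s' $(seq 1 23))
+BLOCK_LR=${BLOCK_LR%,}  # drop the trailing comma
+
+# Sketch only: paths are placeholders.
+accelerate launch --mixed_precision bf16 sdxl_train.py \
+  --pretrained_model_name_or_path "sd_xl_base_1.0.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "sdxl_finetuned_block_lr" \
+  --learning_rate 1e-5 \
+  --block_lr "$BLOCK_LR"
+```
+
+In practice the 23 values would differ from block to block, for example lower rates for the blocks whose behavior should be preserved.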
+
+**Command Example:**
+
+```bash
+accelerate launch --mixed_precision bf16 sdxl_train.py \
+  --pretrained_model_name_or_path "sd_xl_base_1.0.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "sdxl_finetuned" \
+  --train_text_encoder \
+  --learning_rate 1e-5 \
+  --learning_rate_te1 5e-6 \
+  --learning_rate_te2 2e-6
+```
+
+<details>
+<summary>日本語</summary>
+
+#### SDXL (`sdxl_train.py`)
+
+SDXLモデルのFine-tuningを行います。U-NetとText Encoderの両方を学習させることが可能です。
+
+**主要なオプション:**
+
+- `--train_text_encoder`: Text Encoder（CLIP ViT-LとOpenCLIP ViT-bigG）の重みを学習対象に含めます。画風を大きく変えたい場合や、特定の概念を強く学習させたい場合に有効です。
+- `--learning_rate_te1`, `--learning_rate_te2`: それぞれのText Encoderに個別の学習率を設定します。
+- `--block_lr`: U-Netを23個のブロックに分割し、ブロックごとに異なる学習率を設定できます。特定の層の学習を強めたり弱めたりする高度な調整が可能です。（LoRA学習では利用できません）
+
+**コマンド例:**
+
+```bash
+accelerate launch --mixed_precision bf16 sdxl_train.py \
+  --pretrained_model_name_or_path "sd_xl_base_1.0.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "sdxl_finetuned" \
+  --train_text_encoder \
+  --learning_rate 1e-5 \
+  --learning_rate_te1 5e-6 \
+  --learning_rate_te2 2e-6
+```
+
+</details>
+
+#### SD3 (`sd3_train.py`)
+
+Performs fine-tuning for Stable Diffusion 3 Medium models. SD3 consists of three Text Encoders (CLIP-L, CLIP-G, T5-XXL) and an MMDiT (equivalent to the U-Net), all of which can be targeted for training.
+
+**Key Options:**
+
+- `--train_text_encoder`: Enables training for CLIP-L and CLIP-G.
+- `--train_t5xxl`: Enables training for T5-XXL. T5-XXL is a very large model and requires a lot of VRAM for training.
+- `--blocks_to_swap`: A memory optimization feature to reduce VRAM usage. It swaps some blocks of the MMDiT to CPU memory during training. Useful for using larger batch sizes in low VRAM environments. (Also available in LoRA tuning).
+- `--num_last_block_to_freeze`: Freezes the weights of the last N blocks of the MMDiT, excluding them from training. Useful for maintaining model stability while focusing on learning in the lower layers.
+
+**Command Example:**
+
+```bash
+accelerate launch --mixed_precision bf16 sd3_train.py \
+  --pretrained_model_name_or_path "sd3_medium.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "sd3_finetuned" \
+  --train_text_encoder \
+  --learning_rate 4e-6 \
+  --blocks_to_swap 10
+```
+
+<details>
+<summary>日本語</summary>
+
+#### SD3 (`sd3_train.py`)
+
+Stable Diffusion 3 MediumモデルのFine-tuningを行います。SD3は3つのText Encoder（CLIP-L, CLIP-G, T5-XXL）とMMDiT（U-Netに相当）で構成されており、これらを学習対象にできます。
+
+**主要なオプション:**
+
+- `--train_text_encoder`: CLIP-LとCLIP-Gの学習を有効にします。
+- `--train_t5xxl`: T5-XXLの学習を有効にします。T5-XXLは非常に大きなモデルのため、学習には多くのVRAMが必要です。
+- `--blocks_to_swap`: VRAM使用量を削減するためのメモリ最適化機能です。MMDiTの一部のブロックを学習中にCPUメモリに退避（スワップ）させます。VRAMが少ない環境で大きなバッチサイズを使いたい場合に有効です。（LoRA学習でも利用可能）
+- `--num_last_block_to_freeze`: MMDiTの最後のNブロックの重みを凍結し、学習対象から除外します。モデルの安定性を保ちつつ、下位層を中心に学習させたい場合に有効です。
+
+**コマンド例:**
+
+```bash
+accelerate launch --mixed_precision bf16 sd3_train.py \
+  --pretrained_model_name_or_path "sd3_medium.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "sd3_finetuned" \
+  --train_text_encoder \
+  --learning_rate 4e-6 \
+  --blocks_to_swap 10
+```
+
+</details>
+
+#### FLUX.1 (`flux_train.py`)
+
+Performs fine-tuning for FLUX.1 models. FLUX.1 is internally composed of two types of Transformer blocks (Double Blocks and Single Blocks).
+
+**Key Options:**
+
+- `--blocks_to_swap`: Similar to SD3, this feature swaps Transformer blocks to the CPU for memory optimization.
+- `--blockwise_fused_optimizers`: An experimental feature that aims to streamline training by applying individual optimizers to each block.
+
+**Command Example:**
+
+```bash
+accelerate launch --mixed_precision bf16 flux_train.py \
+  --pretrained_model_name_or_path "FLUX.1-dev.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "flux1_finetuned" \
+  --learning_rate 1e-5 \
+  --blocks_to_swap 18
+```
+
+<details>
+<summary>日本語</summary>
+
+#### FLUX.1 (`flux_train.py`)
+
+FLUX.1モデルのFine-tuningを行います。FLUX.1は内部的に2種類のTransformerブロック（Double Blocks, Single Blocks）で構成されています。
+
+**主要なオプション:**
+
+- `--blocks_to_swap`: SD3と同様に、メモリ最適化のためにTransformerブロックをCPUにスワップする機能です。
+- `--blockwise_fused_optimizers`: 実験的な機能で、各ブロックに個別のオプティマイザを適用し、学習を効率化することを目指します。
+
+**コマンド例:**
+
+```bash
+accelerate launch --mixed_precision bf16 flux_train.py \
+  --pretrained_model_name_or_path "FLUX.1-dev.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "flux1_finetuned" \
+  --learning_rate 1e-5 \
+  --blocks_to_swap 18
+```
+
+</details>
+
+#### Lumina (`lumina_train.py`)
+
+Performs fine-tuning for Lumina-Next DiT models.
+
+**Key Options:**
+
+- `--use_flash_attn`: Enables Flash Attention to speed up computation.
+- `lumina_train.py` is relatively new, and many of its options are shared with other scripts. Training can be performed following the basic command pattern.
+
+**Command Example:**
+
+```bash
+accelerate launch --mixed_precision bf16 lumina_train.py \
+  --pretrained_model_name_or_path "Lumina-Next-DiT-B.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "lumina_finetuned" \
+  --learning_rate 1e-5
+```
+
+<details>
+<summary>日本語</summary>
+
+#### Lumina (`lumina_train.py`)
+
+Lumina-Next DiTモデルのFine-tuningを行います。
+
+**主要なオプション:**
+
+- `--use_flash_attn`: Flash Attentionを有効にし、計算を高速化します。
+- `lumina_train.py`は比較的新しく、オプションは他のスクリプトと共通化されている部分が多いです。基本的なコマンドパターンに従って学習を行えます。
+
+**コマンド例:**
+
+```bash
+accelerate launch --mixed_precision bf16 lumina_train.py \
+  --pretrained_model_name_or_path "Lumina-Next-DiT-B.safetensors" \
+  --dataset_config "dataset_config.toml" \
+  --output_dir "output" \
+  --output_name "lumina_finetuned" \
+  --learning_rate 1e-5
+```
+
+</details>
+
+--- 
+
+### Differences between Fine-tuning and LoRA tuning per architecture
+
+| Architecture | Key Features/Options Specific to Fine-tuning | Main Differences from LoRA tuning |
+|:---|:---|:---|
+| **SDXL** | `--block_lr` | Only fine-tuning allows for granular control over the learning rate for each U-Net block. |
+| **SD3** | `--train_text_encoder`, `--train_t5xxl`, `--num_last_block_to_freeze` | Only fine-tuning can train the entire Text Encoders. LoRA only trains the adapter parts. |
+| **FLUX.1** | `--blockwise_fused_optimizers` | Since fine-tuning updates the entire model's weights, more experimental optimizer options are available. |
+| **Lumina** | (Few specific options) | Basic training options are common, but fine-tuning differs in that it updates all of the base model's weights. |
+
+<details>
+<summary>日本語</summary>
+
+### アーキテクチャごとのFine-tuningとLoRA学習の違い
+
+| アーキテクチャ | Fine-tuning特有の主要機能・オプション | LoRA学習との主な違い |
+|:---|:---|:---|
+| **SDXL** | `--block_lr` | U-Netのブロックごとに学習率を細かく制御できるのはFine-tuningのみです。 |
+| **SD3** | `--train_text_encoder`, `--train_t5xxl`, `--num_last_block_to_freeze` | Text Encoder全体を学習対象にできるのはFine-tuningです。LoRAではアダプター部分のみ学習します。 |
+| **FLUX.1** | `--blockwise_fused_optimizers` | Fine-tuningではモデル全体の重みを更新するため、より実験的なオプティマイザの選択肢が用意されています。 |
+| **Lumina** | （特有のオプションは少ない） | 基本的な学習オプションは共通ですが、Fine-tuningはモデルの基盤全体を更新する点で異なります。 |
+
+</details>
--- a/docs/flux_train_network.md
+++ b/docs/flux_train_network.md
@@ -641,6 +641,40 @@ interpolation_type = "lanczos" # Example: Use Lanczos interpolation

 </details>

+### 7.3. Other Training Options / その他の学習オプション
+
+- **`--controlnet_model_name_or_path`**: Specifies the path to a ControlNet model compatible with FLUX.1. This allows for training a LoRA that works in conjunction with ControlNet. This is an advanced feature and requires a compatible ControlNet model.
+
+- **`--loss_type`**: Specifies the loss function for training. The default is `l2`.
+  - `l1`: L1 loss.
+  - `l2`: L2 loss (mean squared error).
+  - `huber`: Huber loss.
+  - `smooth_l1`: Smooth L1 loss.
+
+- **`--huber_schedule`**, **`--huber_c`**, **`--huber_scale`**: These are parameters for Huber loss. They are used when `--loss_type` is set to `huber` or `smooth_l1`.
+
+- **`--t5xxl_max_token_length`**: Specifies the maximum token length for the T5-XXL text encoder. For details, refer to the [`sd3_train_network.md` guide](sd3_train_network.md).
+
+- **`--weighting_scheme`**, **`--logit_mean`**, **`--logit_std`**, **`--mode_scale`**: These options allow you to adjust the loss weighting for each timestep. For details, refer to the [`sd3_train_network.md` guide](sd3_train_network.md).
+
+- **`--fused_backward_pass`**: Fuses the backward pass and optimizer step to reduce VRAM usage. For details, refer to the [`sdxl_train_network.md` guide](sdxl_train_network.md).
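+
+As a rough illustration of how the loss options above combine, the sketch below switches training to Huber loss. The numeric values are placeholders chosen for the example, not recommendations from this guide, and the usual model, dataset, and network arguments are left out.
+
+```bash
+# Sketch only: extra loss-related arguments for flux_train_network.py.
+# The values for --huber_c and --huber_scale are illustrative assumptions.
+LOSS_ARGS=(--loss_type huber --huber_c 0.1 --huber_scale 1.0)
+
+# Add your usual model, dataset, and network arguments to this command.
+accelerate launch flux_train_network.py "${LOSS_ARGS[@]}"
+```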
+
+<details>
+<summary>日本語</summary>
+
+- **`--controlnet_model_name_or_path`**: FLUX.1互換のControlNetモデルへのパスを指定します。これにより、ControlNetと連携して動作するLoRAを学習できます。これは高度な機能であり、互換性のあるControlNetモデルが必要です。
+- **`--loss_type`**: 学習に用いる損失関数を指定します。デフォルトは `l2` です。
+  - `l1`: L1損失。
+  - `l2`: L2損失（平均二乗誤差）。
+  - `huber`: Huber損失。
+  - `smooth_l1`: Smooth L1損失。
+- **`--huber_schedule`**, **`--huber_c`**, **`--huber_scale`**: これらはHuber損失のパラメータです。`--loss_type` が `huber` または `smooth_l1` の場合に使用されます。
+- **`--t5xxl_max_token_length`**: T5-XXLテキストエンコーダの最大トークン長を指定します。詳細は [`sd3_train_network.md` ガイド](sd3_train_network.md) を参照してください。
+- **`--weighting_scheme`**, **`--logit_mean`**, **`--logit_std`**, **`--mode_scale`**: これらのオプションは、各タイムステップの損失の重み付けを調整するために使用されます。詳細は [`sd3_train_network.md` ガイド](sd3_train_network.md) を参照してください。
+- **`--fused_backward_pass`**: バックワードパスとオプティマイザステップを融合してVRAM使用量を削減します。詳細は [`sdxl_train_network.md` ガイド](sdxl_train_network.md) を参照してください。
+
+</details>
+
 ## 8. Related Tools / 関連ツール

 Several related scripts are provided for models trained with `flux_train_network.py` and to assist with the training process:
--- a/docs/gen_img_README-ja.md
+++ b/docs/gen_img_README-ja.md
@@ -3,7 +3,7 @@ SD 1.xおよび2.xのモデル、当リポジトリで学習したLoRA、Control
 # 概要

 * Diffusers (v0.10.2) ベースの推論（画像生成）スクリプト。
-* SD 1.xおよび2.x (base/v-parameterization)モデルに対応。
+* SD 1.x、2.x (base/v-parameterization)、およびSDXLモデルに対応。
 * txt2img、img2img、inpaintingに対応。
 * 対話モード、およびファイルからのプロンプト読み込み、連続生成に対応。
 * プロンプト1行あたりの生成枚数を指定可能。
@@ -96,14 +96,20 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>

 - `--ckpt <モデル名>`：モデル名を指定します。`--ckpt`オプションは必須です。Stable Diffusionのcheckpointファイル、またはDiffusersのモデルフォルダ、Hugging FaceのモデルIDを指定できます。

+- `--v1`：Stable Diffusion 1.x系のモデルを使う場合に指定します。これがデフォルトの動作です。
+
 - `--v2`：Stable Diffusion 2.x系のモデルを使う場合に指定します。1.x系の場合には指定不要です。

+- `--sdxl`：Stable Diffusion XLモデルを使う場合に指定します。
+
 - `--v_parameterization`：v-parameterizationを使うモデルを使う場合に指定します（`768-v-ema.ckpt`およびそこからの追加学習モデル、Waifu Diffusion v1.5など）。
    
-    `--v2`の指定有無が間違っているとモデル読み込み時にエラーになります。`--v_parameterization`の指定有無が間違っていると茶色い画像が表示されます。
+    `--v2`や`--sdxl`の指定有無が間違っているとモデル読み込み時にエラーになります。`--v_parameterization`の指定有無が間違っていると茶色い画像が表示されます。

 - `--vae`：使用するVAEを指定します。未指定時はモデル内のVAEを使用します。

+- `--tokenizer_cache_dir`：トークナイザーのキャッシュディレクトリを指定します（オフライン利用のため）。
+
 ## 画像生成と出力

 - `--interactive`：インタラクティブモードで動作します。プロンプトを入力すると画像が生成されます。
@@ -112,6 +118,10 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>

 - `--from_file <プロンプトファイル名>`：プロンプトが記述されたファイルを指定します。1行1プロンプトで記述してください。なお画像サイズやguidance scaleはプロンプトオプション（後述）で指定できます。

+- `--from_module <モジュールファイル>`：Pythonモジュールからプロンプトを読み込みます。モジュールは`get_prompter(args, pipe, networks)`関数を実装している必要があります。
+
+- `--prompter_module_args`：prompterモジュールに渡す追加の引数を指定します。
+
 - `--W <画像幅>`：画像の幅を指定します。デフォルトは`512`です。

 - `--H <画像高さ>`：画像の高さを指定します。デフォルトは`512`です。
@@ -132,6 +142,24 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>

 - `--negative_scale` : uncoditioningのguidance scaleを個別に指定します。[gcem156氏のこちらの記事](https://note.com/gcem156/n/ne9a53e4a6f43)を参考に実装したものです。

+- `--emb_normalize_mode`：embedding正規化モードを指定します。"original"（デフォルト）、"abs"、"none"から選択できます。プロンプトの重みの正規化方法に影響します。
+
+## SDXL固有のオプション
+
+SDXL モデル（`--sdxl`フラグ付き）を使用する場合、追加のコンディショニングオプションが利用できます：
+
+- `--original_height`：SDXL コンディショニング用の元の高さを指定します。これはモデルの対象解像度の理解に影響します。
+
+- `--original_width`：SDXL コンディショニング用の元の幅を指定します。これはモデルの対象解像度の理解に影響します。
+
+- `--original_height_negative`：SDXL ネガティブコンディショニング用の元の高さを指定します。
+
+- `--original_width_negative`：SDXL ネガティブコンディショニング用の元の幅を指定します。
+
+- `--crop_top`：SDXL コンディショニング用のクロップ上オフセットを指定します。
+
+- `--crop_left`：SDXL コンディショニング用のクロップ左オフセットを指定します。
+
 ## メモリ使用量や生成速度の調整

 - `--batch_size <バッチサイズ>`：バッチサイズを指定します。デフォルトは`1`です。バッチサイズが大きいとメモリを多く消費しますが、生成速度が速くなります。
@@ -139,8 +167,16 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>
 - `--vae_batch_size <VAEのバッチサイズ>`：VAEのバッチサイズを指定します。デフォルトはバッチサイズと同じです。
    VAEのほうがメモリを多く消費するため、デノイジング後（stepが100%になった後）でメモリ不足になる場合があります。このような場合にはVAEのバッチサイズを小さくしてください。

+- `--vae_slices <スライス数>`：VAE処理時に画像をスライスに分割してVRAM使用量を削減します。None（デフォルト）で分割なし。16や32のような値が推奨されます。有効にすると処理が遅くなりますが、VRAM使用量が少なくなります。
+
+- `--no_half_vae`：VAE処理でfp16/bf16精度の使用を防ぎます。代わりにfp32を使用します。VAE関連の問題やアーティファクトが発生した場合に使用してください。
+
 - `--xformers`：xformersを使う場合に指定します。

+- `--sdpa`：最適化のためにPyTorch 2のscaled dot-product attentionを使用します。
+
+- `--diffusers_xformers`：Diffusers経由でxformersを使用します（注：Hypernetworksと互換性がありません）。
+
 - `--fp16`：fp16（単精度）での推論を行います。`fp16`と`bf16`をどちらも指定しない場合はfp32（単精度）での推論を行います。

 - `--bf16`：bf16（bfloat16）での推論を行います。RTX 30系のGPUでのみ指定可能です。`--bf16`オプションはRTX 30系以外のGPUではエラーになります。`fp16`よりも`bf16`のほうが推論結果がNaNになる（真っ黒の画像になる）可能性が低いようです。
@@ -157,6 +193,12 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>

 - `--network_pre_calc`：使用する追加ネットワークの重みを生成ごとにあらかじめ計算します。プロンプトオプションの`--am`が使用できます。LoRA未使用時と同じ程度まで生成は高速化されますが、生成前に重みを計算する時間が必要で、またメモリ使用量も若干増加します。Regional LoRA使用時は無効になります 。

+- `--network_regional_mask_max_color_codes`：リージョナルマスクに使用する色コードの最大数を指定します。指定されていない場合、マスクはチャンネルごとに適用されます。Regional LoRAと組み合わせて、マスク内の色で定義できるリージョン数を制御するために使用されます。
+
+- `--network_args`：key=value形式でネットワークモジュールに渡す追加引数を指定します。例: `--network_args "alpha=1.0,dropout=0.1"`。
+
+- `--network_merge_n_models`：ネットワークマージを使用する場合、マージするモデル数を指定します（全ての読み込み済みネットワークをマージする代わりに）。
+
 # 主なオプションの指定例

 次は同一プロンプトで64枚をバッチサイズ4で一括生成する例です。
@@ -235,7 +277,9 @@ python gen_img_diffusers.py --ckpt model.safetensors

 - `--sequential_file_name`：ファイル名を連番にするかどうかを指定します。指定すると生成されるファイル名が`im_000001.png`からの連番になります。

- `--use_original_file_name`：指定すると生成ファイル名がオリジナルのファイル名と同じになります。
+- `--use_original_file_name`：指定すると生成ファイル名がオリジナルのファイル名の前に追加されます（img2imgモード用）。
+
+- `--clip_vision_strength`：指定した強度でimg2img用のCLIP Vision Conditioningを有効にします。CLIP Visionモデルを使用して入力画像からのコンディショニングを強化します。

 ## コマンドラインからの実行例

@@ -306,7 +350,9 @@ img2imgと併用できません。
 - `--highres_fix_upscaler`：2nd stageに任意のupscalerを利用します。現在は`--highres_fix_upscaler tools.latent_upscaler` のみ対応しています。

 - `--highres_fix_upscaler_args`：`--highres_fix_upscaler`で指定したupscalerに渡す引数を指定します。
-    `tools.latent_upscaler`の場合は、`--highres_fix_upscaler_args "weights=D:\Work\SD\Models\others\etc\upscaler-v1-e100-220.safetensors"`のように重みファイルを指定します。 
+    `tools.latent_upscaler`の場合は、`--highres_fix_upscaler_args "weights=D:\Work\SD\Models\others\etc\upscaler-v1-e100-220.safetensors"`のように重みファイルを指定します。
+
+- `--highres_fix_disable_control_net`：Highres fixの2nd stageでControlNetを無効にします。デフォルトでは、ControlNetは両ステージで使用されます。

 コマンドラインの例です。

@@ -319,6 +365,34 @@ python gen_img_diffusers.py  --ckpt trinart_characters_it4_v1_vae_merged.ckpt
    --highres_fix_scale 0.5 --highres_fix_steps 28 --strength 0.5
 ```

+## Deep Shrink
+
+Deep Shrinkは、異なるタイムステップで異なる深度のUNetを使用して生成プロセスを最適化する技術です。生成品質と効率を向上させることができます。
+
+以下のオプションがあります：
+
+- `--ds_depth_1`：第1フェーズでこの深度のDeep Shrinkを有効にします。有効な値は0から8です。
+
+- `--ds_timesteps_1`：このタイムステップまでDeep Shrink深度1を適用します。デフォルトは650です。
+
+- `--ds_depth_2`：Deep Shrinkの第2フェーズの深度を指定します。
+
+- `--ds_timesteps_2`：このタイムステップまでDeep Shrink深度2を適用します。デフォルトは650です。
+
+- `--ds_ratio`：Deep Shrinkでのダウンサンプリングの比率を指定します。デフォルトは0.5です。
+
+これらのパラメータはプロンプトオプションでも指定できます：
+
+- `--dsd1`：プロンプトからDeep Shrink深度1を指定します。
+  
+- `--dst1`：プロンプトからDeep Shrinkタイムステップ1を指定します。
+  
+- `--dsd2`：プロンプトからDeep Shrink深度2を指定します。
+  
+- `--dst2`：プロンプトからDeep Shrinkタイムステップ2を指定します。
+  
+- `--dsr`：プロンプトからDeep Shrink比率を指定します。
+
 ## ControlNet

 現在はControlNet 1.0のみ動作確認しています。プリプロセスはCannyのみサポートしています。
@@ -346,6 +420,20 @@ python gen_img_diffusers.py --ckpt model_ckpt --scale 8 --steps 48 --outdir txt2
    --guide_image_path guide.png --control_net_ratios 1.0 --interactive
 ```

+## ControlNet-LLLite
+
+ControlNet-LLLiteは、類似の誘導目的に使用できるControlNetの軽量な代替手段です。
+
+以下のオプションがあります：
+
+- `--control_net_lllite_models`：ControlNet-LLLiteモデルファイルを指定します。
+
+- `--control_net_multipliers`：ControlNet-LLLiteの倍率を指定します（重みに類似）。
+
+- `--control_net_ratios`：ControlNet-LLLiteを適用するステップの比率を指定します。
+
+注意：ControlNetとControlNet-LLLiteは同時に使用できません。
+
 ## Attention Couple + Reginal LoRA

 プロンプトをいくつかの部分に分割し、それぞれのプロンプトを画像内のどの領域に適用するかを指定できる機能です。個別のオプションはありませんが、`mask_path`とプロンプトで指定します。
@@ -450,7 +538,9 @@ python gen_img_diffusers.py --ckpt wd-v1-3-full-pruned-half.ckpt

 - `--opt_channels_last` : 推論時にテンソルのチャンネルを最後に配置します。場合によっては高速化されることがあります。

- `--network_show_meta` : 追加ネットワークのメタデータを表示します。
+- `--shuffle_prompts`：繰り返し時にプロンプトの順序をシャッフルします。`--from_file`で複数のプロンプトを使用する場合に便利です。
+
+- `--network_show_meta`：追加ネットワークのメタデータを表示します。


 --- 
@@ -478,6 +568,8 @@ latentのサイズを徐々に大きくしていくHires fixです。`gen_img.py
 - `--gradual_latent_ratio` : latentの初期サイズを指定します。デフォルトは 0.5 で、デフォルトの latent サイズの半分のサイズから始めます。
 - `--gradual_latent_ratio_step`: latentのサイズを大きくする割合を指定します。デフォルトは 0.125 で、latentのサイズを 0.625, 0.75, 0.875, 1.0 と徐々に大きくします。
 - `--gradual_latent_ratio_every_n_steps`: latentのサイズを大きくする間隔を指定します。デフォルトは 3 で、3ステップごとに latent のサイズを大きくします。
+- `--gradual_latent_s_noise`：Gradual LatentのS_noiseパラメータを指定します。デフォルトは1.0です。
+- `--gradual_latent_unsharp_params`：Gradual Latentのアンシャープマスクパラメータをksize,sigma,strength,target-x形式で指定します（target-x: 1=True, 0=False）。推奨値：`3,0.5,0.5,1`または`3,1.0,1.0,0`。

 それぞれのオプションは、プロンプトオプション、`--glt`、`--glr`、`--gls`、`--gle` でも指定できます。

--- a/docs/gen_img_README.md
+++ b/docs/gen_img_README.md
@@ -4,7 +4,7 @@ This is an inference (image generation) script that supports SD 1.x and 2.x mode
 # Overview

 * Inference (image generation) script.
-* Supports SD 1.x and 2.x (base/v-parameterization) models.
+* Supports SD 1.x, 2.x (base/v-parameterization), and SDXL models.
 * Supports txt2img, img2img, and inpainting.
 * Supports interactive mode, prompt reading from files, and continuous generation.
 * The number of images generated per prompt line can be specified.
@@ -13,7 +13,7 @@ This is an inference (image generation) script that supports SD 1.x and 2.x mode
 * Supports xformers for high-speed generation.
    * Although xformers are used for memory-saving generation, it is not as optimized as Automatic 1111's Web UI, so it uses about 6GB of VRAM for 512*512 image generation.
 * Extension of prompts to 225 tokens. Supports negative prompts and weighting.
-* Supports various samplers from Diffusers (fewer samplers than Web UI).
+* Supports various samplers from Diffusers including ddim, pndm, lms, euler, euler_a, heun, dpm_2, dpm_2_a, dpmsolver, dpmsolver++, dpmsingle.
 * Supports clip skip (uses the output of the nth layer from the end) of Text Encoder.
 * Separate loading of VAE.
 * Supports CLIP Guided Stable Diffusion, VGG16 Guided Stable Diffusion, Highres. fix, and upscale.
@@ -100,14 +100,20 @@ Specify from the command line.

 - `--ckpt <model_name>`: Specifies the model name. The `--ckpt` option is mandatory. You can specify a Stable Diffusion checkpoint file, a Diffusers model folder, or a Hugging Face model ID.

+- `--v1`: Specify when using Stable Diffusion 1.x series models. This is the default behavior.
+
 - `--v2`: Specify when using Stable Diffusion 2.x series models. Not required for 1.x series.

+- `--sdxl`: Specify when using Stable Diffusion XL models.
+
 - `--v_parameterization`: Specify when using models that use v-parameterization (`768-v-ema.ckpt` and models with additional training from it, Waifu Diffusion v1.5, etc.).

-    If the `--v2` specification is incorrect, an error will occur when loading the model. If the `--v_parameterization` specification is incorrect, a brown image will be displayed.
+    If the `--v2` or `--sdxl` specification is incorrect, an error will occur when loading the model. If the `--v_parameterization` specification is incorrect, a brown image will be displayed.

 - `--vae`: Specifies the VAE to use. If not specified, the VAE in the model will be used.

+- `--tokenizer_cache_dir`: Specifies the cache directory for the tokenizer (for offline usage).
+
 ## Image Generation and Output

 - `--interactive`: Operates in interactive mode. Images are generated when prompts are entered.
@@ -118,6 +124,8 @@ Specify from the command line.

 - `--from_module <module_file>`: Loads prompts from a Python module. The module should implement a `get_prompter(args, pipe, networks)` function.

+- `--prompter_module_args`: Specifies additional arguments to pass to the prompter module.
+
 - `--W <image_width>`: Specifies the width of the image. The default is `512`.

 - `--H <image_height>`: Specifies the height of the image. The default is `512`.
@@ -126,7 +134,7 @@ Specify from the command line.

 - `--scale <guidance_scale>`: Specifies the unconditional guidance scale. The default is `7.5`.

- `--sampler <sampler_name>`: Specifies the sampler. The default is `ddim`. ddim, pndm, dpmsolver, dpmsolver+++, lms, euler, euler_a provided by Diffusers can be specified (the last three can also be specified as k_lms, k_euler, k_euler_a).
+- `--sampler <sampler_name>`: Specifies the sampler. The default is `ddim`. The following samplers are supported: ddim, pndm, lms, euler, euler_a, heun, dpm_2, dpm_2_a, dpmsolver, dpmsolver++, dpmsingle. Some can also be specified with k_ prefix (k_lms, k_euler, k_euler_a, k_dpm_2, k_dpm_2_a).

 - `--outdir <image_output_destination_folder>`: Specifies the output destination for images.

@@ -140,6 +148,22 @@ Specify from the command line.

 - `--emb_normalize_mode`: Specifies the embedding normalization mode. Options are "original" (default), "abs", and "none". This affects how prompt weights are normalized.

+## SDXL-Specific Options
+
+When using SDXL models (with `--sdxl` flag), additional conditioning options are available:
+
+- `--original_height`: Specifies the original height for SDXL conditioning. This affects the model's understanding of the target resolution.
+
+- `--original_width`: Specifies the original width for SDXL conditioning. This affects the model's understanding of the target resolution.
+
+- `--original_height_negative`: Specifies the original height for SDXL negative conditioning.
+
+- `--original_width_negative`: Specifies the original width for SDXL negative conditioning.
+
+- `--crop_top`: Specifies the crop top offset for SDXL conditioning.
+
+- `--crop_left`: Specifies the crop left offset for SDXL conditioning.
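+
+A minimal sketch of how these options might be combined on the command line. It assumes the script is invoked as `gen_img.py`; the model file name, output folder, and all numeric values are placeholders rather than recommended settings.
+
+```bash
+# Sketch: SDXL generation with explicit size and crop conditioning.
+# sdxl_model.safetensors and outputs are hypothetical names.
+python gen_img.py --ckpt sdxl_model.safetensors --outdir outputs \
+  --sdxl --W 1024 --H 1024 \
+  --original_height 1024 --original_width 1024 \
+  --crop_top 0 --crop_left 0 \
+  --interactive
+```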
+
 ## Adjusting Memory Usage and Generation Speed

 - `--batch_size <batch_size>`: Specifies the batch size. The default is `1`. A larger batch size consumes more memory but speeds up generation.
@@ -149,12 +173,14 @@ Specify from the command line.

 - `--vae_slices <number_of_slices>`: Splits the image into slices for VAE processing to reduce VRAM usage. None (default) for no splitting. Values like 16 or 32 are recommended. Enabling this is slower but uses less VRAM.

- `--no_half_vae`: Prevents using fp16/bf16 precision for VAE processing. Uses fp32 instead.
+- `--no_half_vae`: Prevents using fp16/bf16 precision for VAE processing. Uses fp32 instead. Use this if you encounter VAE-related issues or artifacts.

 - `--xformers`: Specify when using xformers.

 - `--sdpa`: Use scaled dot-product attention in PyTorch 2 for optimization.

+- `--diffusers_xformers`: Use xformers via Diffusers (note: incompatible with Hypernetworks).
+
 - `--fp16`: Performs inference in fp16 (single precision). If neither `fp16` nor `bf16` is specified, inference is performed in fp32 (single precision).

 - `--bf16`: Performs inference in bf16 (bfloat16). Can only be specified for RTX 30 series GPUs. The `--bf16` option will cause an error on GPUs other than the RTX 30 series. It seems that `bf16` is less likely to result in NaN (black image) inference results than `fp16`.
@@ -173,6 +199,10 @@ Specify from the command line.

 - `--network_regional_mask_max_color_codes`: Specifies the maximum number of color codes to use for regional masks. If not specified, masks are applied by channel. Used with Regional LoRA to control the number of regions that can be defined by colors in the mask.

+- `--network_args`: Specifies additional arguments to pass to the network module in key=value format. For example: `--network_args "alpha=1.0,dropout=0.1"`.
+
+- `--network_merge_n_models`: When using network merging, specifies the number of models to merge (instead of merging all loaded networks).
+
 # Examples of Main Option Specifications

 The following is an example of batch generating 64 images with the same prompt and a batch size of 4.
@@ -259,7 +289,7 @@ Example:

 - `--sequential_file_name`: Specifies whether to make file names sequential. If specified, the generated file names will be sequential starting from `im_000001.png`.

- `--use_original_file_name`: If specified, the generated file name will be the same as the original file name.
+- `--use_original_file_name`: If specified, the generated file name will be prepended with the original file name (for img2img mode).

 - `--clip_vision_strength`: Enables CLIP Vision Conditioning for img2img with the specified strength. Uses the CLIP Vision model to enhance conditioning from the input image.

@@ -375,6 +405,16 @@ These parameters can also be specified through prompt options:
  
 - `--dsr`: Specifies Deep Shrink ratio from the prompt.

+*Additional prompt options for Gradual Latent (requires `euler_a` sampler):*
+
+- `--glt`: Specifies the timestep to start increasing the size of the latent for Gradual Latent. Overrides the command line specification.
+
+- `--glr`: Specifies the initial size of the latent for Gradual Latent as a ratio. Overrides the command line specification.
+
+- `--gls`: Specifies the ratio to increase the size of the latent for Gradual Latent. Overrides the command line specification.
+
+- `--gle`: Specifies the interval to increase the size of the latent for Gradual Latent. Overrides the command line specification.
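+
+The prompt options above might be used as in the following sketch. It assumes the script is invoked as `gen_img.py` and that prompt options are appended to the prompt text on the same line; the prompt, file names, and values are illustrative only.
+
+```bash
+# Sketch: write one prompt line that carries Gradual Latent prompt options.
+cat > prompts.txt <<'EOF'
+a castle on a hill --glt 750 --glr 0.5 --gls 0.125 --gle 3
+EOF
+
+# Gradual Latent requires the euler_a sampler.
+python gen_img.py --ckpt model.safetensors --outdir outputs \
+  --sampler euler_a --from_file prompts.txt
+```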
+
 ## ControlNet

 Currently, only ControlNet 1.0 has been confirmed to work. Only Canny is supported for preprocessing.
@@ -536,25 +576,10 @@ Gradual Latent is a Hires fix that gradually increases the size of the latent.
 - `--gradual_latent_ratio_step`: Specifies the ratio to increase the size of the latent. The default is 0.125, which means the latent size is gradually increased to 0.625, 0.75, 0.875, 1.0.
 - `--gradual_latent_ratio_every_n_steps`: Specifies the interval to increase the size of the latent. The default is 3, which means the latent size is increased every 3 steps.
 - `--gradual_latent_s_noise`: Specifies the s_noise parameter for Gradual Latent. Default is 1.0.
- `--gradual_latent_unsharp_params`: Specifies unsharp mask parameters for Gradual Latent: ksize, sigma, strength, target-x (1 means True). Values like `3,0.5,0.5,1` or `3,1.0,1.0,0` are recommended.
+- `--gradual_latent_unsharp_params`: Specifies unsharp mask parameters for Gradual Latent in the format: ksize,sigma,strength,target-x (where target-x: 1=True, 0=False). Recommended values: `3,0.5,0.5,1` or `3,1.0,1.0,0`.

 Each option can also be specified with prompt options, `--glt`, `--glr`, `--gls`, `--gle`.

 __Please specify `euler_a` for the sampler.__ Because the source code of the sampler is modified. It will not work with other samplers.

 It is more effective with SD 1.5. It is quite subtle with SDXL.
-
-# Gradual Latent について (Japanese section - kept for reference)
-
-latentのサイズを徐々に大きくしていくHires fixです。`gen_img.py` 、``sdxl_gen_img.py` 、`gen_img.py` に以下のオプションが追加されています。
-
- `--gradual_latent_timesteps` : latentのサイズを大きくし始めるタイムステップを指定します。デフォルトは None で、Gradual Latentを使用しません。750 くらいから始めてみてください。
- `--gradual_latent_ratio` : latentの初期サイズを指定します。デフォルトは 0.5 で、デフォルトの latent サイズの半分のサイズから始めます。
- `--gradual_latent_ratio_step`: latentのサイズを大きくする割合を指定します。デフォルトは 0.125 で、latentのサイズを 0.625, 0.75, 0.875, 1.0 と徐々に大きくします。
- `--gradual_latent_ratio_every_n_steps`: latentのサイズを大きくする間隔を指定します。デフォルトは 3 で、3ステップごとに latent のサイズを大きくします。
-
-それぞれのオプションは、プロンプトオプション、`--glt`、`--glr`、`--gls`、`--gle` でも指定できます。
-
-サンプラーに手を加えているため、__サンプラーに `euler_a` を指定してください。__ 他のサンプラーでは動作しません。
-
-SD 1.5 のほうが効果があります。SDXL ではかなり微妙です。
--- a/docs/lumina_train_network.md
+++ b/docs/lumina_train_network.md
@@ -170,6 +170,8 @@ Besides the arguments explained in the [train_network.py guide](train_network.md
 * `--model_prediction_type=<choice>` – Model prediction processing method. Options: `raw`, `additive`, `sigma_scaled`. Default `raw`. **Recommended: `raw`**
 * `--system_prompt=<string>` – System prompt to prepend to all prompts. Recommended: `"You are an assistant designed to generate high-quality images based on user prompts."` or `"You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts."`
 * `--use_flash_attn` – Use Flash Attention. Requires `pip install flash-attn` (may not be supported in all environments). If installed correctly, it speeds up training. 
+* `--use_sage_attn` – Use Sage Attention for the model.
+* `--sample_batch_size=<integer>` – Batch size to use for sampling. Defaults to the `--train_batch_size` value. Sample batches are bucketed by width, height, guidance scale, and seed.
 * `--sigmoid_scale=<float>` – Scale factor for sigmoid timestep sampling. Default `1.0`.

 #### Memory and Speed / メモリ・速度関連
@@ -216,6 +218,8 @@ For Lumina Image 2.0, you can specify different dimensions for various component
 *   `--model_prediction_type=<choice>` – モデル予測の処理方法を指定します。`raw`, `additive`, `sigma_scaled`から選択します。デフォルトは`raw`です。**推奨: `raw`**
 *   `--system_prompt=<string>` – 全てのプロンプトに前置するシステムプロンプトを指定します。推奨: `"You are an assistant designed to generate high-quality images based on user prompts."` または `"You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts."`
 *   `--use_flash_attn` – Flash Attentionを使用します。`pip install flash-attn`でインストールが必要です（環境によってはサポートされていません）。正しくインストールされている場合は、指定すると学習が高速化されます。
+*   `--use_sage_attn` – Sage Attentionを使用します。
+*   `--sample_batch_size=<integer>` – サンプリングに使用するバッチサイズ。デフォルトは `--train_batch_size` の値です。サンプルバッチは、幅、高さ、ガイダンススケール、シードによってバケット化されます。
 *   `--sigmoid_scale=<float>` – sigmoidタイムステップサンプリングのスケール係数を指定します。デフォルトは`1.0`です。

 #### メモリ・速度関連
--- a/docs/sd3_train_network.md
+++ b/docs/sd3_train_network.md
@@ -1,5 +1,3 @@
-Status: reviewed
-
 # LoRA Training Guide for Stable Diffusion 3/3.5 using `sd3_train_network.py` / `sd3_train_network.py` を用いたStable Diffusion 3/3.5モデルのLoRA学習ガイド

 This document explains how to train LoRA (Low-Rank Adaptation) models for Stable Diffusion 3 (SD3) and Stable Diffusion 3.5 (SD3.5) using `sd3_train_network.py` in the `sd-scripts` repository.
@@ -18,7 +16,6 @@ This guide assumes you already understand the basics of LoRA training. For commo

 <details>
 <summary>日本語</summary>
-ステータス：内容を一通り確認した

 `sd3_train_network.py`は、Stable Diffusion 3/3.5モデルに対してLoRAなどの追加ネットワークを学習させるためのスクリプトです。SD3は、MMDiT (Multi-Modal Diffusion Transformer) と呼ばれる新しいアーキテクチャを採用しており、従来のStable Diffusionモデルとは構造が異なります。このスクリプトを使用することで、SD3/3.5モデルに特化したLoRAモデルを作成できます。

@@ -106,6 +103,7 @@ accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py \

 <details>
 <summary>日本語</summary>
+
 学習は、ターミナルから`sd3_train_network.py`を実行することで開始します。基本的なコマンドラインの構造は`train_network.py`と同様ですが、SD3/3.5特有の引数を指定する必要があります。

 以下に、基本的なコマンドライン実行例を示します。
@@ -136,6 +134,7 @@ accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py
 ```

 ※実際には1行で書くか、適切な改行文字（`\` または `^`）を使用してください。
+
 </details>

 ### 4.1. Explanation of Key Options / 主要なコマンドライン引数の解説
@@ -157,11 +156,19 @@ Besides the arguments explained in the [train_network.py guide](train_network.md
 * `--enable_scaled_pos_embed` **[SD3.5][experimental]** – Scale positional embeddings when training with multiple resolutions.
 * `--training_shift=<float>` – Shift applied to the timestep distribution. Default `1.0`.
 * `--weighting_scheme=<choice>` – Weighting method for loss by timestep. Default `uniform`.
-* `--logit_mean`, `--logit_std`, `--mode_scale` – Parameters for `logit_normal` or `mode` weighting.
+* `--logit_mean=<float>` – Mean value for `logit_normal` weighting scheme. Default `0.0`.
+* `--logit_std=<float>` – Standard deviation for `logit_normal` weighting scheme. Default `1.0`.
+* `--mode_scale=<float>` – Scale factor for `mode` weighting scheme. Default `1.29`.

 #### Memory and Speed / メモリ・速度関連

 * `--blocks_to_swap=<integer>` **[experimental]** – Swap a number of Transformer blocks between CPU and GPU. More blocks reduce VRAM but slow training. Cannot be used with `--cpu_offload_checkpointing`.
+* `--cache_text_encoder_outputs` – Caches the outputs of the text encoders to reduce VRAM usage and speed up training. This is particularly effective for SD3, which uses three text encoders. Recommended when not training the text encoder LoRA. For more details, see the [`sdxl_train_network.py` guide](sdxl_train_network.md).
+* `--cache_text_encoder_outputs_to_disk` – Caches the text encoder outputs to disk when the above option is enabled.
+* `--t5xxl_device=<device>` **[not supported yet]** – Specifies the device for T5-XXL model. If not specified, uses accelerator's device.
+* `--t5xxl_dtype=<dtype>` **[not supported yet]** – Specifies the dtype for T5-XXL model. If not specified, uses default dtype from mixed precision.
+* `--save_clip` **[not supported yet]** – Saves CLIP models to checkpoint (unified checkpoint format not yet supported).
+* `--save_t5xxl` **[not supported yet]** – Saves T5-XXL model to checkpoint (unified checkpoint format not yet supported).
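+
+As a sketch of how the caching options above are typically added, the fragment below appends both flags to the launch command. The remaining model, dataset, and network arguments are omitted here and must be supplied as in the basic command example.
+
+```bash
+# Sketch: cache text encoder outputs, and persist the cache to disk.
+CACHE_ARGS=(--cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk)
+
+# Add your usual model, dataset, and network arguments to this command.
+accelerate launch --num_cpu_threads_per_process 1 sd3_train_network.py \
+  "${CACHE_ARGS[@]}"
+```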

 #### Incompatible or Deprecated Options / 非互換・非推奨の引数

@@ -169,6 +176,7 @@ Besides the arguments explained in the [train_network.py guide](train_network.md

 <details>
 <summary>日本語</summary>
+
 [`train_network.py`のガイド](train_network.md)で説明されている引数に加え、以下のSD3/3.5特有の引数を指定します。共通の引数については、上記ガイドを参照してください。

 #### モデル関連
@@ -189,34 +197,159 @@ Besides the arguments explained in the [train_network.py guide](train_network.md
 *   `--enable_scaled_pos_embed` **[SD3.5向け][実験的機能]** – マルチ解像度学習時に解像度に応じてPositional Embeddingをスケーリングします。
 *   `--training_shift=<float>` – タイムステップ分布を調整するためのシフト値です。デフォルトは`1.0`です。
 *   `--weighting_scheme=<choice>` – タイムステップに応じた損失の重み付け方法を指定します。デフォルトは`uniform`です。
-*   `--logit_mean`, `--logit_std`, `--mode_scale` – `logit_normal`または`mode`使用時のパラメータです。
+*   `--logit_mean=<float>` – `logit_normal`重み付けスキームの平均値です。デフォルトは`0.0`です。
+*   `--logit_std=<float>` – `logit_normal`重み付けスキームの標準偏差です。デフォルトは`1.0`です。
+*   `--mode_scale=<float>` – `mode`重み付けスキームのスケール係数です。デフォルトは`1.29`です。

 #### メモリ・速度関連

 *   `--blocks_to_swap=<integer>` **[実験的機能]** – TransformerブロックをCPUとGPUでスワップしてVRAMを節約します。`--cpu_offload_checkpointing`とは併用できません。
+*   `--cache_text_encoder_outputs` – Text Encoderの出力をキャッシュし、VRAM使用量削減と学習高速化を図ります。SD3は3つのText Encoderを持つため特に効果的です。Text EncoderのLoRAを学習しない場合に推奨されます。詳細は[`sdxl_train_network.py`のガイド](sdxl_train_network.md)を参照してください。
+*   `--cache_text_encoder_outputs_to_disk` – 上記オプションと併用し、Text Encoderの出力をディスクにキャッシュします。
+*   `--t5xxl_device=<device>` **[未サポート]** – T5-XXLモデルのデバイスを指定します。指定しない場合はacceleratorのデバイスを使用します。
+*   `--t5xxl_dtype=<dtype>` **[未サポート]** – T5-XXLモデルのdtypeを指定します。指定しない場合はデフォルトのdtype（mixed precisionから）を使用します。
+*   `--save_clip` **[未サポート]** – CLIPモデルをチェックポイントに保存します（統合チェックポイント形式は未サポート）。
+*   `--save_t5xxl` **[未サポート]** – T5-XXLモデルをチェックポイントに保存します（統合チェックポイント形式は未サポート）。

 #### 非互換・非推奨の引数

 *   `--v2`, `--v_parameterization`, `--clip_skip` – Stable Diffusion v1/v2向けの引数のため、SD3/3.5学習では使用されません。
+
 </details>

 ### 4.2. Starting Training / 学習の開始

 After setting the required arguments, run the command to begin training. The overall flow and how to check logs are the same as in the [train_network.py guide](train_network.md#32-starting-the-training--学習の開始).

-## 5. Using the Trained Model / 学習済みモデルの利用
+<details>
+<summary>日本語</summary>
+
+必要な引数を設定したら、コマンドを実行して学習を開始します。全体の流れやログの確認方法は、[train_network.pyのガイド](train_network.md#32-starting-the-training--学習の開始)と同様です。
+
+</details>
+
+## 5. LoRA Target Modules / LoRAの学習対象モジュール
+
+When training LoRA with `sd3_train_network.py`, the following modules are targeted by default:
+
+*   **MMDiT (replaces U-Net)**:
+    *   `qkv` (Query, Key, Value) matrices and `proj_out` (output projection) in the attention blocks.
+*   **final_layer**:
+    *   The output layer at the end of MMDiT.
+
+By using `--network_args`, you can apply more detailed controls, such as setting different ranks (dimensions) for each module.
+
+### Specify rank for each layer in SD3 LoRA / 各層のランクを指定する
+
+You can specify the rank for each layer in SD3 by specifying the following network_args. If you specify `0`, LoRA will not be applied to that layer.
+
+When network_args is not specified, the default value (`network_dim`) is applied, same as before.
+
+|network_args|target layer|
+|---|---|
+|context_attn_dim|attn in context_block|
+|context_mlp_dim|mlp in context_block|
+|context_mod_dim|adaLN_modulation in context_block|
+|x_attn_dim|attn in x_block|
+|x_mlp_dim|mlp in x_block|
+|x_mod_dim|adaLN_modulation in x_block|
+
+`"verbose=True"` is also available for debugging. It shows the rank of each layer.
+
+example: 
+```
+--network_args "context_attn_dim=2" "context_mlp_dim=3" "context_mod_dim=4" "x_attn_dim=5" "x_mlp_dim=6" "x_mod_dim=7" "verbose=True"
+```
+
+You can apply LoRA to the conditioning layers of SD3 by specifying `emb_dims` in network_args. When specifying, be sure to specify 6 numbers in `[]` as a comma-separated list.
+
+example: 
+```
+--network_args "emb_dims=[2,3,4,5,6,7]"
+```
+
+Each number corresponds to `context_embedder`, `t_embedder`, `x_embedder`, `y_embedder`, `final_layer_adaLN_modulation`, `final_layer_linear`. The above example applies LoRA to all conditioning layers, with rank 2 for `context_embedder`, 3 for `t_embedder`, 4 for `x_embedder`, 5 for `y_embedder`, 6 for `final_layer_adaLN_modulation`, and 7 for `final_layer_linear`.
+
+If you specify `0`, LoRA will not be applied to that layer. For example, `[4,0,0,4,0,0]` applies LoRA only to `context_embedder` and `y_embedder`.
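+
+The selective case described above, written out as an argument (a sketch that reuses the ranks from the sentence above):
+
+```
+--network_args "emb_dims=[4,0,0,4,0,0]"
+```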
+
+### Specify blocks to train in SD3 LoRA training
+
+You can specify the blocks to train in SD3 LoRA training by specifying `train_block_indices` in network_args. The indices are 0-based. The default (when omitted) is to train all blocks. The indices are specified as a list of integers or a range of integers, like `0,1,5,8` or `0,1,4-5,7`. 
+
+The number of blocks depends on the model. The valid range is 0 to (the number of blocks - 1). You can also specify `all` to train all blocks or `none` to train no blocks.
+
+example: 
+```
+--network_args "train_block_indices=1,2,6-8" 
+```
+
+<details>
+<summary>日本語</summary>
+
+`sd3_train_network.py`でLoRAを学習させる場合、デフォルトでは以下のモジュールが対象となります。
+
+*   **MMDiT (U-Netの代替)**:
+    *   Attentionブロック内の`qkv`（Query, Key, Value）行列と、`proj_out`（出力Projection）。
+*   **final_layer**:
+    *   MMDiTの最後にある出力層。
+
+`--network_args` を使用することで、モジュールごとに異なるランク（次元数）を設定するなど、より詳細な制御が可能です。
+
+### SD3 LoRAで各層のランクを指定する
+
+各層のランクを指定するには、`--network_args`オプションを使用します。`0`を指定すると、その層にはLoRAが適用されません。
+
+network_argsが指定されない場合、デフォルト値（`network_dim`）が適用されます。
+
+|network_args|target layer|
+|---|---|
+|context_attn_dim|attn in context_block|
+|context_mlp_dim|mlp in context_block|
+|context_mod_dim|adaLN_modulation in context_block|
+|x_attn_dim|attn in x_block|
+|x_mlp_dim|mlp in x_block|
+|x_mod_dim|adaLN_modulation in x_block|
+
+`"verbose=True"`を指定すると、各層のランクが表示されます。
+
+例：
+
+```bash
+--network_args "context_attn_dim=2" "context_mlp_dim=3" "context_mod_dim=4" "x_attn_dim=5" "x_mlp_dim=6" "x_mod_dim=7" "verbose=True"
+```
+
+また、`emb_dims`を指定することで、SD3の条件付け層にLoRAを適用することもできます。指定する際は、必ず`[]`内にカンマ区切りで6つの数字を指定してください。
+
+```bash
+--network_args "emb_dims=[2,3,4,5,6,7]"
+```
+
+各数字は、`context_embedder`、`t_embedder`、`x_embedder`、`y_embedder`、`final_layer_adaLN_modulation`、`final_layer_linear`に対応しています。上記の例では、すべての条件付け層にLoRAを適用し、`context_embedder`に2、`t_embedder`に3、`x_embedder`に4、`y_embedder`に5、`final_layer_adaLN_modulation`に6、`final_layer_linear`に7のランクを設定しています。
+
+`0`を指定すると、その層にはLoRAが適用されません。例えば、`[4,0,0,4,0,0]`と指定すると、`context_embedder`と`y_embedder`のみにLoRAが適用されます。
+
+</details>
+
+
+## 6. Using the Trained Model / 学習済みモデルの利用

 When training finishes, a LoRA model file (e.g. `my_sd3_lora.safetensors`) is saved in the directory specified by `output_dir`. Use this file with inference environments that support SD3/3.5, such as ComfyUI.

-## 6. Others / その他
+<details>
+<summary>日本語</summary>
+
+学習が完了すると、指定した`output_dir`にLoRAモデルファイル（例: `my_sd3_lora.safetensors`）が保存されます。このファイルは、SD3/3.5モデルに対応した推論環境（例: ComfyUIなど）で使用できます。
+
+</details>
+
+
+## 7. Others / その他

 `sd3_train_network.py` shares many features with `train_network.py`, such as sample image generation (`--sample_prompts`, etc.) and detailed optimizer settings. For these, see the [train_network.py guide](train_network.md#5-other-features--その他の機能) or run `python sd3_train_network.py --help`.

 <details>
 <summary>日本語</summary>
-必要な引数を設定し、コマンドを実行すると学習が開始されます。基本的な流れやログの確認方法は[`train_network.py`のガイド](train_network.md#32-starting-the-training--学習の開始)と同様です。
-
-学習が完了すると、指定した`output_dir`にLoRAモデルファイル（例: `my_sd3_lora.safetensors`）が保存されます。このファイルは、SD3/3.5モデルに対応した推論環境（例: ComfyUIなど）で使用できます。

 `sd3_train_network.py`には、サンプル画像の生成 (`--sample_prompts`など) や詳細なオプティマイザ設定など、`train_network.py`と共通の機能も多く存在します。これらについては、[`train_network.py`のガイド](train_network.md#5-other-features--その他の機能)やスクリプトのヘルプ (`python sd3_train_network.py --help`) を参照してください。
+
 </details>
--- a/docs/sdxl_train_network_advanced.md
+++ b/docs/sdxl_train_network_advanced.md
@@ -1,5 +1,3 @@
-Status: under review
-
 # Advanced Settings: Detailed Guide for SDXL LoRA Training Script `sdxl_train_network.py` / 高度な設定: SDXL LoRA学習スクリプト `sdxl_train_network.py` 詳細ガイド

 This document describes the advanced options available when training LoRA models for SDXL (Stable Diffusion XL) with `sdxl_train_network.py` in the `sd-scripts` repository. For the basics, please read [How to Use the LoRA Training Script `train_network.py`](train_network.md) and [How to Use the SDXL LoRA Training Script `sdxl_train_network.py`](sdxl_train_network.md).
@@ -130,18 +128,65 @@ Basic options are common with `train_network.py`.
 *   `--huber_c=C` / `--huber_scale=S`: Parameters for `huber` or `smooth_l1` loss.
 *   `--masked_loss`: Limits loss calculation area based on a mask image. Requires specifying mask images (black and white) in `conditioning_data_dir` in dataset settings. See [About Masked Loss](masked_loss_README.md) for details.

-### 1.10. Distributed Training and Others
+### 1.10. Distributed Training and Other Training Related Options

 *   `--seed=N`: Specifies the random seed. Set this to ensure training reproducibility.
 *   `--max_token_length=N` (`75`, `150`, `225`): Maximum token length processed by Text Encoders. For SDXL, typically `75` (default), `150`, or `225`. Longer lengths can handle more complex prompts but increase VRAM usage.
 *   `--clip_skip=N`: Uses the output from N layers skipped from the final layer of Text Encoders. **Not typically used for SDXL**.
 *   `--lowram` / `--highvram`: Options for memory usage optimization. `--lowram` is for environments like Colab where RAM < VRAM, `--highvram` is for environments with ample VRAM.
 *   `--persistent_data_loader_workers` / `--max_data_loader_n_workers=N`: Settings for DataLoader worker processes. Affects wait time between epochs and memory usage.
-*   `--config_file=\"<config file>\"` / `--output_config`: Options to use/output a `.toml` file instead of command line arguments.
+*   `--config_file="<config file>"` / `--output_config`: Options to use/output a `.toml` file instead of command line arguments.
 *   **Accelerate/DeepSpeed related:** (`--ddp_timeout`, `--ddp_gradient_as_bucket_view`, `--ddp_static_graph`): Detailed settings for distributed training. Accelerate settings (`accelerate config`) are usually sufficient. DeepSpeed requires separate configuration.
+* `--initial_epoch=<integer>` – Sets the initial epoch number. `1` means the first epoch (same as not specifying). Note: `initial_epoch`/`initial_step` do not affect the LR scheduler, so without `--resume` the LR scheduler starts from 0.
+* `--initial_step=<integer>` – Sets the initial step number including all epochs. `0` means first step (same as not specifying). Overwrites `initial_epoch`.
+* `--skip_until_initial_step` – Skips training until `initial_step` is reached.
+
+### 1.11. Console and Logging / コンソールとログ
+
+* `--console_log_level`: Sets the logging level for the console output. Choose from `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
+* `--console_log_file`: Redirects console logs to a specified file.
+* `--console_log_simple`: Enables a simpler log format.
+
+### 1.12. Hugging Face Hub Integration / Hugging Face Hub 連携
+
+* `--huggingface_repo_id`: The repository name on Hugging Face Hub to upload the model to (e.g., `your-username/your-model`).
+* `--huggingface_repo_type`: The type of repository on Hugging Face Hub. Usually `model`.
+* `--huggingface_path_in_repo`: The path within the repository to upload files to.
+* `--huggingface_token`: Your Hugging Face Hub authentication token.
+* `--huggingface_repo_visibility`: Sets the visibility of the repository (`public` or `private`).
+* `--resume_from_huggingface`: Resumes training from a state saved on Hugging Face Hub.
+* `--async_upload`: Enables asynchronous uploading of models to the Hub, preventing it from blocking the training process.
+* `--save_n_epoch_ratio`: Saves the model at a certain ratio of total epochs. For example, `5` will save at least 5 checkpoints throughout the training.
+
+### 1.13. Advanced Attention Settings / 高度なAttention設定
+
+* `--mem_eff_attn`: Uses a memory-efficient attention mechanism. This is an older implementation; `sdpa` or `xformers` is generally recommended instead.
+* `--xformers`: Uses the xformers library for memory-efficient attention. Requires `pip install xformers`.
+
+### 1.14. Advanced LR Scheduler Settings / 高度な学習率スケジューラ設定
+
+* `--lr_scheduler_type`: Specifies a custom scheduler module.
+* `--lr_scheduler_args`: Provides additional arguments to the custom scheduler (e.g., `"T_max=100"`).
+* `--lr_decay_steps`: Sets the number of steps for the learning rate to decay.
+* `--lr_scheduler_timescale`: The timescale for the inverse square root scheduler.
+* `--lr_scheduler_min_lr_ratio`: Sets the minimum learning rate as a ratio of the initial learning rate for certain schedulers.
+
+### 1.15. Differential Learning with LoRA / LoRAの差分学習
+
+This technique involves merging a pre-trained LoRA into the base model before starting a new training session. This is useful for fine-tuning an existing LoRA or for learning the 'difference' from it.
+
+* `--base_weights`: Path to one or more LoRA weight files to be merged into the base model before training begins.
+* `--base_weights_multiplier`: A multiplier for the weights of the LoRA specified by `--base_weights`. You can specify multiple values if you provide multiple weights.
+
+### 1.16. Other Miscellaneous Options / その他のオプション
+
+* `--tokenizer_cache_dir`: Specifies a directory to cache the tokenizer, which is useful for offline training.
+* `--scale_weight_norms`: Scales the weight norms of the LoRA modules. This can help prevent overfitting by controlling the magnitude of the weights. A value of `1.0` is a good starting point.
+* `--disable_mmap_load_safetensors`: Disables memory-mapped loading for `.safetensors` files. This can speed up model loading in some environments like WSL.

 ## 2. Other Tips / その他のTips

+
 *   **VRAM Usage:** SDXL LoRA training requires a lot of VRAM. Even with 24GB VRAM, you might run out of memory depending on settings. Reduce VRAM usage with these settings:
    *   `--mixed_precision=\"bf16\"` or `\"fp16\"` (essential)
    *   `--gradient_checkpointing` (strongly recommended)
@@ -165,8 +210,6 @@ Basic options are common with `train_network.py`.
 <details>
 <summary>日本語</summary>

---
-
 # 高度な設定: SDXL LoRA学習スクリプト `sdxl_train_network.py` 詳細ガイド

 このドキュメントでは、`sd-scripts` リポジトリに含まれる `sdxl_train_network.py` を使用した、SDXL (Stable Diffusion XL) モデルに対する LoRA (Low-Rank Adaptation) モデル学習の高度な設定オプションについて解説します。
@@ -381,7 +424,7 @@ SDXLは計算コストが高いため、キャッシュ機能が効果的です
 *   `--masked_loss`
    *   マスク画像に基づいてLoss計算領域を限定します。データセット設定で`conditioning_data_dir`にマスク画像（白黒）を指定する必要があります。詳細は[マスクロスについて](masked_loss_README.md)を参照してください。

-### 1.10. 分散学習・その他
+### 1.10. 分散学習、その他学習関連

 *   `--seed=N`
    *   乱数シードを指定します。学習の再現性を確保したい場合に設定します。
@@ -397,9 +440,56 @@ SDXLは計算コストが高いため、キャッシュ機能が効果的です
    *   コマンドライン引数の代わりに`.toml`ファイルを使用/出力するオプション。
 *   **Accelerate/DeepSpeed関連:** (`--ddp_timeout`, `--ddp_gradient_as_bucket_view`, `--ddp_static_graph`)
    *   分散学習時の詳細設定。通常はAccelerateの設定 (`accelerate config`) で十分です。DeepSpeedを使用する場合は、別途設定が必要です。
+*   `--initial_epoch=<integer>` – 開始エポック番号を設定します。`1`で最初のエポック（未指定時と同じ）。注意：`initial_epoch`/`initial_step`はlr schedulerに影響しないため、`--resume`しない場合はlr schedulerは0から始まります。
+*   `--initial_step=<integer>` – 全エポックを含む開始ステップ番号を設定します。`0`で最初のステップ（未指定時と同じ）。`initial_epoch`を上書きします。
+*   `--skip_until_initial_step` – `initial_step`に到達するまで学習をスキップします。
+
+### 1.11. コンソールとログ
+
+* `--console_log_level`: コンソール出力のログレベルを設定します。`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`から選択します。
+* `--console_log_file`: コンソールのログを指定されたファイルに出力します。
+* `--console_log_simple`: よりシンプルなログフォーマットを有効にします。
+
+### 1.12. Hugging Face Hub 連携
+
+* `--huggingface_repo_id`: モデルをアップロードするHugging Face Hubのリポジトリ名 (例: `your-username/your-model`)。
+* `--huggingface_repo_type`: Hugging Face Hubのリポジトリの種類。通常は`model`です。
+* `--huggingface_path_in_repo`: リポジトリ内でファイルをアップロードするパス。
+* `--huggingface_token`: Hugging Face Hubの認証トークン。
+* `--huggingface_repo_visibility`: リポジトリの公開設定 (`public`または`private`)。
+* `--resume_from_huggingface`: Hugging Face Hubに保存された状態から学習を再開します。
+* `--async_upload`: Hubへのモデルの非同期アップロードを有効にし、学習プロセスをブロックしないようにします。
+* `--save_n_epoch_ratio`: 総エポック数に対する特定の比率でモデルを保存します。例えば`5`を指定すると、学習全体で少なくとも5つのチェックポイントが保存されます。
+
+### 1.13. 高度なAttention設定
+
+* `--mem_eff_attn`: メモリ効率の良いAttentionメカニズムを使用します。これは古い実装であり、一般的には`sdpa`や`xformers`の使用が推奨されます。
+* `--xformers`: メモリ効率の良いAttentionのためにxformersライブラリを使用します。`pip install xformers`が必要です。
+
+### 1.14. 高度な学習率スケジューラ設定
+
+* `--lr_scheduler_type`: カスタムスケジューラモジュールを指定します。
+* `--lr_scheduler_args`: カスタムスケジューラに追加の引数を渡します (例: `"T_max=100"`)。
+* `--lr_decay_steps`: 学習率が減衰するステップ数を設定します。
+* `--lr_scheduler_timescale`: 逆平方根スケジューラのタイムスケール。
+* `--lr_scheduler_min_lr_ratio`: 特定のスケジューラについて、初期学習率に対する最小学習率の比率を設定します。
+
+### 1.15. LoRAの差分学習
+
+既存の学習済みLoRAをベースモデルにマージしてから、新たな学習を開始する手法です。既存LoRAのファインチューニングや、差分を学習させたい場合に有効です。
+
+* `--base_weights`: 学習開始前にベースモデルにマージするLoRAの重みファイルを1つ以上指定します。
+* `--base_weights_multiplier`: `--base_weights`で指定したLoRAの重みの倍率。複数指定も可能です。
+
+### 1.16. その他のオプション
+
+* `--tokenizer_cache_dir`: オフラインでの学習に便利なように、tokenizerをキャッシュするディレクトリを指定します。
+* `--scale_weight_norms`: LoRAモジュールの重みのノルムをスケーリングします。重みの大きさを制御することで過学習を防ぐ助けになります。`1.0`が良い出発点です。
+* `--disable_mmap_load_safetensors`: `.safetensors`ファイルのメモリマップドローディングを無効にします。WSLなどの一部環境でモデルの読み込みを高速化できます。

 ## 2. その他のTips

+
 *   **VRAM使用量:** SDXL LoRA学習は多くのVRAMを必要とします。24GB VRAMでも設定によってはメモリ不足になることがあります。以下の設定でVRAM使用量を削減できます。
    *   `--mixed_precision="bf16"` または `"fp16"` (必須級)
    *   `--gradient_checkpointing` (強く推奨)
@@ -422,7 +512,4 @@ SDXLは計算コストが高いため、キャッシュ機能が効果的です

 不明な点や詳細については、各スクリプトの `--help` オプションや、リポジトリ内の他のドキュメント、実装コード自体を参照してください。

---
-
-
 </details>
--- a/docs/train_textual_inversion.md
+++ b/docs/train_textual_inversion.md
@@ -0,0 +1,291 @@
+# How to use Textual Inversion training scripts / Textual Inversion学習スクリプトの使い方
+
+This document explains how to train Textual Inversion embeddings using the `train_textual_inversion.py` and `sdxl_train_textual_inversion.py` scripts included in the `sd-scripts` repository.
+
+<details>
+<summary>日本語</summary>
+このドキュメントでは、`sd-scripts` リポジトリに含まれる `train_textual_inversion.py` および `sdxl_train_textual_inversion.py` を使用してTextual Inversionの埋め込みを学習する方法について解説します。
+</details>
+
+## 1. Introduction / はじめに
+
+[Textual Inversion](https://textual-inversion.github.io/) is a technique that teaches Stable Diffusion new concepts by learning new token embeddings. Instead of fine-tuning the entire model, it only optimizes the text encoder's token embeddings, making it a lightweight approach to teaching the model specific characters, objects, or artistic styles.
+
+**Available Scripts:**
+- `train_textual_inversion.py`: For Stable Diffusion v1.x and v2.x models
+- `sdxl_train_textual_inversion.py`: For Stable Diffusion XL models
+
+**Prerequisites:**
+* The `sd-scripts` repository has been cloned and the Python environment has been set up.
+* The training dataset has been prepared. For dataset preparation, please refer to the [Dataset Configuration Guide](config_README-en.md).
+
+<details>
+<summary>日本語</summary>
+
+[Textual Inversion](https://textual-inversion.github.io/) は、新しいトークンの埋め込みを学習することで、Stable Diffusionに新しい概念を教える技術です。モデル全体をファインチューニングする代わりに、テキストエンコーダのトークン埋め込みのみを最適化するため、特定のキャラクター、オブジェクト、芸術的スタイルをモデルに教えるための軽量なアプローチです。
+
+**利用可能なスクリプト:**
+- `train_textual_inversion.py`: Stable Diffusion v1.xおよびv2.xモデル用
+- `sdxl_train_textual_inversion.py`: Stable Diffusion XLモデル用
+
+**前提条件:**
+* `sd-scripts` リポジトリのクローンとPython環境のセットアップが完了していること。
+* 学習用データセットの準備が完了していること。データセットの準備については[データセット設定ガイド](config_README-en.md)を参照してください。
+</details>
+
+## 2. Basic Usage / 基本的な使用方法
+
+### 2.1. For Stable Diffusion v1.x/v2.x Models / Stable Diffusion v1.x/v2.xモデル用
+
+```bash
+accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py \
+  --pretrained_model_name_or_path="path/to/model.safetensors" \
+  --dataset_config="dataset_config.toml" \
+  --output_dir="output" \
+  --output_name="my_textual_inversion" \
+  --save_model_as="safetensors" \
+  --token_string="mychar" \
+  --init_word="girl" \
+  --num_vectors_per_token=4 \
+  --max_train_steps=1600 \
+  --learning_rate=1e-6 \
+  --optimizer_type="AdamW8bit" \
+  --mixed_precision="fp16" \
+  --cache_latents \
+  --sdpa
+```
+
+### 2.2. For SDXL Models / SDXLモデル用
+
+```bash
+accelerate launch --num_cpu_threads_per_process 1 sdxl_train_textual_inversion.py \
+  --pretrained_model_name_or_path="path/to/sdxl_model.safetensors" \
+  --dataset_config="dataset_config.toml" \
+  --output_dir="output" \
+  --output_name="my_sdxl_textual_inversion" \
+  --save_model_as="safetensors" \
+  --token_string="mychar" \
+  --init_word="girl" \
+  --num_vectors_per_token=4 \
+  --max_train_steps=1600 \
+  --learning_rate=1e-6 \
+  --optimizer_type="AdamW8bit" \
+  --mixed_precision="fp16" \
+  --cache_latents \
+  --sdpa
+```
+
+<details>
+<summary>日本語</summary>
+上記のコマンドは実際には1行で書く必要がありますが、見やすさのために改行しています（LinuxやMacでは行末に `\` を追加することで改行できます）。Windowsの場合は、改行せずに1行で書くか、`^` を行末に追加してください。
+</details>
+
+## 3. Key Command-Line Arguments / 主要なコマンドライン引数
+
+### 3.1. Textual Inversion Specific Arguments / Textual Inversion固有の引数
+
+#### Core Parameters / コアパラメータ
+
+* `--token_string="mychar"` **[Required]**
+  * Specifies the token string used in training. This must not exist in the tokenizer's vocabulary. In your training prompts, include this token string (e.g., if token_string is "mychar", use prompts like "mychar 1girl").
+  * 学習時に使用されるトークン文字列を指定します。tokenizerの語彙に存在しない文字である必要があります。学習時のプロンプトには、このトークン文字列を含める必要があります（例：token_stringが"mychar"なら、"mychar 1girl"のようなプロンプトを使用）。
+
+* `--init_word="girl"`
+  * Specifies the word to use for initializing the embedding vector. Choose a word that is conceptually close to what you want to teach. Must be a single token.
+  * 埋め込みベクトルの初期化に使用する単語を指定します。教えたい概念に近い単語を選ぶとよいでしょう。単一のトークンである必要があります。
+
+* `--num_vectors_per_token=4`
+  * Specifies how many embedding vectors to use for this token. More vectors provide greater expressiveness but consume more tokens from the 77-token limit.
+  * このトークンに使用する埋め込みベクトルの数を指定します。多いほど表現力が増しますが、77トークン制限からより多くのトークンを消費します。
+
+* `--weights="path/to/existing_embedding.safetensors"`
+  * Loads pre-trained embeddings to continue training from. Optional parameter for transfer learning.
+  * 既存の埋め込みを読み込んで、そこから追加で学習します。転移学習のオプションパラメータです。
+
+#### Template Options / テンプレートオプション
+
+* `--use_object_template`
+  * Ignores captions and uses predefined object templates (e.g., "a photo of a {}"). Same as the original implementation.
+  * キャプションを無視して、事前定義された物体用テンプレート（例："a photo of a {}"）を使用します。公式実装と同じです。
+
+* `--use_style_template`
+  * Ignores captions and uses predefined style templates (e.g., "a painting in the style of {}"). Same as the original implementation.
+  * キャプションを無視して、事前定義されたスタイル用テンプレート（例："a painting in the style of {}"）を使用します。公式実装と同じです。
+
+### 3.2. Model and Dataset Arguments / モデル・データセット引数
+
+For common model and dataset arguments, please refer to [LoRA Training Guide](train_network.md#31-main-command-line-arguments--主要なコマンドライン引数). The following arguments work the same way:
+
+* `--pretrained_model_name_or_path`
+* `--dataset_config`
+* `--v2`, `--v_parameterization`
+* `--resolution`
+* `--cache_latents`, `--vae_batch_size`
+* `--enable_bucket`, `--min_bucket_reso`, `--max_bucket_reso`
+
+<details>
+<summary>日本語</summary>
+一般的なモデル・データセット引数については、[LoRA学習ガイド](train_network.md#31-main-command-line-arguments--主要なコマンドライン引数)を参照してください。以下の引数は同様に動作します：
+
+* `--pretrained_model_name_or_path`
+* `--dataset_config`
+* `--v2`, `--v_parameterization`
+* `--resolution`
+* `--cache_latents`, `--vae_batch_size`
+* `--enable_bucket`, `--min_bucket_reso`, `--max_bucket_reso`
+</details>
+
+### 3.3. Training Parameters / 学習パラメータ
+
+For training parameters, please refer to [LoRA Training Guide](train_network.md#31-main-command-line-arguments--主要なコマンドライン引数). Textual Inversion typically uses these settings:
+
+* `--learning_rate=1e-6`: Lower learning rates are often used compared to LoRA training
+* `--max_train_steps=1600`: Fewer steps are usually sufficient
+* `--optimizer_type="AdamW8bit"`: Memory-efficient optimizer
+* `--mixed_precision="fp16"`: Reduces memory usage
+
+**Note:** Textual Inversion has lower memory requirements compared to full model fine-tuning, so you can often use larger batch sizes.
+
+<details>
+<summary>日本語</summary>
+学習パラメータについては、[LoRA学習ガイド](train_network.md#31-main-command-line-arguments--主要なコマンドライン引数)を参照してください。Textual Inversionでは通常以下の設定を使用します：
+
+* `--learning_rate=1e-6`: LoRA学習と比べて低い学習率がよく使用されます
+* `--max_train_steps=1600`: より少ないステップで十分な場合が多いです
+* `--optimizer_type="AdamW8bit"`: メモリ効率的なオプティマイザ
+* `--mixed_precision="fp16"`: メモリ使用量を削減
+
+**注意:** Textual Inversionはモデル全体のファインチューニングと比べてメモリ要件が低いため、多くの場合、より大きなバッチサイズを使用できます。
+</details>
+
+## 4. Dataset Preparation / データセット準備
+
+### 4.1. Dataset Configuration / データセット設定
+
+Create a TOML configuration file as described in the [Dataset Configuration Guide](config_README-en.md). Here's an example for Textual Inversion:
+
+```toml
+[general]
+shuffle_caption = false
+caption_extension = ".txt"
+keep_tokens = 1
+
+[[datasets]]
+resolution = 512                    # 1024 for SDXL
+batch_size = 4                      # Can use larger values than LoRA training
+enable_bucket = true
+
+  [[datasets.subsets]]
+  image_dir = "path/to/images"
+  caption_extension = ".txt"
+  num_repeats = 10
+```
+
+### 4.2. Caption Guidelines / キャプションガイドライン
+
+**Important:** Your captions must include the token string you specified. For example:
+
+* If `--token_string="mychar"`, captions should be like: "mychar, 1girl, blonde hair, blue eyes"
+* The token string can appear anywhere in the caption, but including it is essential
+
+You can verify that your token string is being recognized by using `--debug_dataset`, which will show token IDs. Look for tokens with IDs ≥ 49408 (these are the new custom tokens).
+
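Before starting a run, it can help to check the caption files directly. The following is a minimal sketch, assuming captions are stored as `.txt` files next to the images (as in the dataset config above). `captions_missing_token` is a hypothetical helper and does a plain substring check, not real tokenization.

```python
import tempfile
from pathlib import Path


def captions_missing_token(image_dir, token_string, caption_extension=".txt"):
    """Return the caption files in image_dir that do not contain token_string."""
    missing = []
    for path in sorted(Path(image_dir).glob(f"*{caption_extension}")):
        if token_string not in path.read_text(encoding="utf-8"):
            missing.append(path.name)
    return missing


# Demo with a throwaway directory containing one good and one bad caption.
with tempfile.TemporaryDirectory() as d:
    Path(d, "a.txt").write_text("mychar, 1girl, blonde hair", encoding="utf-8")
    Path(d, "b.txt").write_text("1girl, blue eyes", encoding="utf-8")
    missing = captions_missing_token(d, "mychar")
    print(missing)
```

Any file name it reports is a caption that would train without the new token.
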
+<details>
+<summary>日本語</summary>
+
+**重要:** キャプションには指定したトークン文字列を含める必要があります。例：
+
+* `--token_string="mychar"` の場合、キャプションは "mychar, 1girl, blonde hair, blue eyes" のようにします
+* トークン文字列はキャプション内のどこに配置しても構いませんが、含めることが必須です
+
+`--debug_dataset` を使用してトークン文字列が認識されているかを確認できます。これによりトークンIDが表示されます。ID ≥ 49408 のトークン（これらは新しいカスタムトークン）を探してください。
+</details>
+
+## 5. Advanced Configuration / 高度な設定
+
+### 5.1. Multiple Token Vectors / 複数トークンベクトル
+
+When using `--num_vectors_per_token` > 1, the system creates additional token variations:
+- `--token_string="mychar"` with `--num_vectors_per_token=4` creates: "mychar", "mychar1", "mychar2", "mychar3"
+
+For generation, you can use either the base token or all tokens together.
+
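The naming pattern described above can be written as a short Python sketch. `expand_token_strings` is a hypothetical helper that only reproduces the pattern from the example; it is not taken from the training scripts.

```python
def expand_token_strings(token_string, num_vectors_per_token):
    """Return the base token followed by its numbered variants."""
    return [token_string] + [f"{token_string}{i}" for i in range(1, num_vectors_per_token)]


print(expand_token_strings("mychar", 4))
```

With `num_vectors_per_token=4` this yields four token strings, which is why larger values use up more of the 77-token limit.
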
+### 5.2. Memory Optimization / メモリ最適化
+
+* Use `--cache_latents` to cache VAE outputs and reduce VRAM usage
+* Use `--gradient_checkpointing` for additional memory savings
+* For SDXL, use `--cache_text_encoder_outputs` to cache text encoder outputs
+* Consider using `--mixed_precision="bf16"` on newer GPUs (RTX 30 series and later)
+
+### 5.3. Training Tips / 学習のコツ
+
+* **Learning Rate:** Start with 1e-6 and adjust based on results. Rates lower than those used for LoRA training often work better.
+* **Steps:** 1000-2000 steps are usually sufficient, but this varies by dataset size and complexity.
+* **Batch Size:** Textual Inversion can handle larger batch sizes than full fine-tuning due to lower memory requirements.
+* **Templates:** Use `--use_object_template` for characters/objects, `--use_style_template` for artistic styles.
+
+<details>
+<summary>日本語</summary>
+
+* **学習率:** 1e-6から始めて、結果に基づいて調整してください。LoRA学習よりも低い率がよく機能します。
+* **ステップ数:** 通常1000-2000ステップで十分ですが、データセットのサイズと複雑さによって異なります。
+* **バッチサイズ:** メモリ要件が低いため、Textual Inversionは完全なファインチューニングよりも大きなバッチサイズを処理できます。
+* **テンプレート:** キャラクター/オブジェクトには `--use_object_template`、芸術的スタイルには `--use_style_template` を使用してください。
+</details>
+
+## 6. Usage After Training / 学習後の使用方法
+
+The trained Textual Inversion embeddings can be used in:
+
+* **Automatic1111 WebUI:** Place the `.safetensors` file in the `embeddings` folder
+* **ComfyUI:** Use the embedding file with appropriate nodes
+* **Other Diffusers-based applications:** Load using the embedding path
+
+In your prompts, simply use the token string you trained (e.g., "mychar") and the model will use the learned embedding.
+
+<details>
+<summary>日本語</summary>
+
+学習したTextual Inversionの埋め込みは以下で使用できます：
+
+* **Automatic1111 WebUI:** `.safetensors` ファイルを `embeddings` フォルダに配置
+* **ComfyUI:** 適切なノードで埋め込みファイルを使用
+* **その他のDiffusersベースアプリケーション:** 埋め込みパスを使用して読み込み
+
+プロンプトでは、学習したトークン文字列（例："mychar"）を単純に使用するだけで、モデルが学習した埋め込みを使用します。
+</details>
+
+## 7. Troubleshooting / トラブルシューティング
+
+### Common Issues / よくある問題
+
+1. **Token string already exists in tokenizer**
+   * Use a unique string that doesn't exist in the model's vocabulary
+   * Try adding numbers or special characters (e.g., "mychar123")
+
+2. **No improvement after training**
+   * Ensure your captions include the token string
+   * Try adjusting the learning rate (lower values like 5e-7)
+   * Increase the number of training steps
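+
+3. **Out of memory errors**
+   * Reduce the batch size in the dataset config
+   * Use `--gradient_checkpointing`
+   * Use `--cache_latents`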
+
+<details>
+<summary>日本語</summary>
+
+1. **トークン文字列がtokenizerに既に存在する**
+   * モデルの語彙に存在しない固有の文字列を使用してください
+   * 数字や特殊文字を追加してみてください（例："mychar123"）
+
+2. **学習後に改善が見られない**
+   * キャプションにトークン文字列が含まれていることを確認してください
+   * 学習率を調整してみてください（5e-7のような低い値）
+   * 学習ステップ数を増やしてください
+
+3. **メモリ不足エラー**
+   * データセット設定でバッチサイズを減らしてください
+   * `--gradient_checkpointing` を使用してください
+   * `--cache_latents` を使用してください
+</details>
+
+For additional training options and advanced configurations, please refer to the [LoRA Training Guide](train_network.md) as many parameters are shared between training methods.
--- a/docs/validation.md
+++ b/docs/validation.md
@@ -0,0 +1,261 @@
+# Validation Loss
+
+Validation loss is a crucial metric for monitoring the training process of a model. It helps you assess how well your model is generalizing to data it hasn't seen during training, which is essential for preventing overfitting. By periodically evaluating the model on a separate validation dataset, you can gain insights into its performance and make more informed decisions about when to stop training or adjust hyperparameters.
+
+This feature provides a stable and reliable validation loss metric by ensuring the validation process is deterministic.
+
+<details>
+<summary>日本語</summary>
+
+Validation loss（検証損失）は、モデルの学習過程を監視するための重要な指標です。モデルが学習中に見ていないデータに対してどの程度汎化できているかを評価するのに役立ち、過学習を防ぐために不可欠です。個別の検証データセットで定期的にモデルを評価することで、そのパフォーマンスに関する洞察を得て、学習をいつ停止するか、またはハイパーパラメータを調整するかについて、より多くの情報に基づいた決定を下すことができます。
+
+この機能は、検証プロセスが決定論的であることを保証することにより、安定して信頼性の高い検証損失指標を提供します。
+
+</details>
+
+## How It Works
+
+When validation is enabled, a portion of your dataset is set aside specifically for this purpose. The script then runs a validation step at regular intervals, calculating the loss on this validation data.
+
+To ensure that the validation loss is a reliable indicator of model performance, the process is deterministic. This means that for every validation run, the same random seed is used for noise generation and timestep selection. This consistency ensures that any fluctuations in the validation loss are due to changes in the model's weights, not random variations in the validation process itself.
+
+The average loss across all validation steps is then logged, providing a single, clear metric to track.
+
+For more technical details, please refer to the original pull request: [PR #1903](https://github.com/kohya-ss/sd-scripts/pull/1903).
+
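The idea of a deterministic validation pass can be illustrated with plain Python. This is a conceptual sketch only: the real scripts work with tensors, and `validation_draws`, the seed value and the timestep count of 1000 are assumptions made for the example.

```python
import random


def validation_draws(seed, num_batches, num_timesteps=1000):
    """Draw one (timestep, noise_seed) pair per validation batch from a fixed seed."""
    rng = random.Random(seed)  # a fresh generator is created for every validation run
    return [(rng.randrange(num_timesteps), rng.getrandbits(32)) for _ in range(num_batches)]


first_run = validation_draws(seed=42, num_batches=4)
second_run = validation_draws(seed=42, num_batches=4)
print(first_run == second_run)
```

Because both runs start from the same seed, they draw the same timesteps and noise seeds, so any change in the measured loss comes from the model weights.
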
+<details>
+<summary>日本語</summary>
+
+検証が有効になると、データセットの一部がこの目的のために特別に確保されます。スクリプトは定期的な間隔で検証ステップを実行し、この検証データに対する損失を計算します。
+
+検証損失がモデルのパフォーマンスの信頼できる指標であることを保証するために、プロセスは決定論的です。つまり、すべての検証実行で、ノイズ生成とタイムステップ選択に同じランダムシードが使用されます。この一貫性により、検証損失の変動が、検証プロセス自体のランダムな変動ではなく、モデルの重みの変化によるものであることが保証されます。
+
+すべての検証ステップにわたる平均損失がログに記録され、追跡するための単一の明確な指標が提供されます。
+
+より技術的な詳細については、元のプルリクエストを参照してください: [PR #1903](https://github.com/kohya-ss/sd-scripts/pull/1903).
+
+</details>
+
+## How to Use
+
+### Enabling Validation
+
+There are two primary ways to enable validation:
+
+1.  **Using a Dataset Config File (Recommended)**: You can specify a validation set directly within your dataset `.toml` file. This method offers the most control, allowing you to designate entire directories as validation sets or split a percentage of a specific subset for validation.
+
+    To use a whole directory for validation, add a subset and set `validation_split = 1.0`.
+
+    **Example: Separate Validation Set**
+    ```toml
+    [[datasets]]
+      # ... training subset ...
+      [[datasets.subsets]]
+        image_dir = "path/to/train_images"
+        # ... other settings ...
+
+      # Validation subset
+      [[datasets.subsets]]
+        image_dir = "path/to/validation_images"
+        validation_split = 1.0  # Use this entire subset for validation
+    ```
+
+    To use a fraction of a subset for validation, set `validation_split` to a value between 0.0 and 1.0.
+
+    **Example: Splitting a Subset**
+    ```toml
+    [[datasets]]
+      # ... dataset settings ...
+      [[datasets.subsets]]
+        image_dir = "path/to/images"
+        validation_split = 0.1  # Use 10% of this subset for validation
+    ```
+
+2.  **Using a Command-Line Argument**: For a simpler setup, you can use the `--validation_split` argument. This sets aside a randomly selected fraction of your *entire* training dataset for validation. The argument is ignored if `validation_split` is defined in your dataset config file.
+
+    **Example Command:**
+    ```bash
+    accelerate launch train_network.py ... --validation_split 0.1
+    ```
+    This command will use 10% of the total training data for validation.
+
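As a rough picture of what a split fraction means, the sketch below shuffles a list of items with a fixed seed and sets aside the requested share. `split_dataset` is a hypothetical helper, and the rounding and shuffling details are assumptions; the actual scripts may split differently.

```python
import random


def split_dataset(items, validation_split, seed=0):
    """Shuffle with a fixed seed and set aside a fraction of items for validation."""
    if not 0.0 <= validation_split <= 1.0:
        raise ValueError("validation_split must be between 0.0 and 1.0")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_split)
    # The first n_val shuffled items become the validation set, the rest are for training.
    return shuffled[n_val:], shuffled[:n_val]


train, val = split_dataset([f"img_{i:03d}.png" for i in range(100)], 0.1)
print(len(train), len(val))
```

With 100 items and a split of `0.1`, 10 items go to validation and 90 stay in training. A split of `1.0` moves everything to validation, which corresponds to the "separate validation set" configuration shown earlier.
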
+<details>
+<summary>日本語</summary>
+
+### 検証を有効にする
+
+検証を有効にする主な方法は2つあります。
+
+1.  **データセット設定ファイルを使用する（推奨）**: データセットの`.toml`ファイル内で直接検証セットを指定できます。この方法は最も制御性が高く、ディレクトリ全体を検証セットとして指定したり、特定のサブセットのパーセンテージを検証用に分割したりすることができます。
+
+    ディレクトリ全体を検証に使用するには、サブセットを追加して`validation_split = 1.0`と設定します。
+
+    **例：個別の検証セット**
+    ```toml
+    [[datasets]]
+      # ... training subset ...
+      [[datasets.subsets]]
+        image_dir = "path/to/train_images"
+        # ... other settings ...
+
+      # Validation subset
+      [[datasets.subsets]]
+        image_dir = "path/to/validation_images"
+        validation_split = 1.0  # このサブセット全体を検証に使用します
+    ```
+
+    サブセットの一部を検証に使用するには、`validation_split`を0.0から1.0の間の値に設定します。
+
+    **例：サブセットの分割**
+    ```toml
+    [[datasets]]
+      # ... dataset settings ...
+      [[datasets.subsets]]
+        image_dir = "path/to/images"
+        validation_split = 0.1  # このサブセットの10%を検証に使用します
+    ```
+
+2.  **コマンドライン引数を使用する**: より簡単な設定のために、`--validation_split`引数を使用できます。これにより、*全*学習データセットのランダムなパーセンテージが検証に使用されます。この方法は、データセット設定ファイルで`validation_split`が定義されている場合は無視されます。
+
+    **コマンド例:**
+    ```bash
+    accelerate launch train_network.py ... --validation_split 0.1
+    ```
+    このコマンドは、全学習データの10%を検証に使用します。
+
+</details>
+
+### Configuration Options
+
+| Argument                    | TOML Option         | Description                                                                                                                            |
+| --------------------------- | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
+| `--validation_split`        | `validation_split`  | The fraction of the dataset to use for validation. The command-line argument applies globally, while the TOML option applies per-subset. The TOML setting takes precedence. |
+| `--validate_every_n_steps`  |                     | Run validation every N steps.                                                                                                          |
+| `--validate_every_n_epochs` |                     | Run validation every N epochs. If not specified, validation runs once per epoch by default.                                            |
+| `--max_validation_steps`    |                     | The maximum number of batches to use for a single validation run. If not set, the entire validation dataset is used.                     |
+| `--validation_seed`         | `validation_seed`   | A specific seed for the validation dataloader shuffling. If not set in the TOML file, the main training `--seed` is used.                 |
+
+<details>
+<summary>日本語</summary>
+
+### 設定オプション
+
+| 引数                        | TOMLオプション      | 説明                                                                                                                                   |
+| --------------------------- | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
+| `--validation_split`        | `validation_split`  | 検証に使用するデータセットの割合。コマンドライン引数は全体に適用され、TOMLオプションはサブセットごとに適用されます。TOML設定が優先されます。 |
+| `--validate_every_n_steps`  |                     | Nステップごとに検証を実行します。                                                                                                      |
+| `--validate_every_n_epochs` |                     | Nエポックごとに検証を実行します。指定しない場合、デフォルトでエポックごとに1回検証が実行されます。                                       |
+| `--max_validation_steps`    |                     | 1回の検証実行に使用するバッチの最大数。設定しない場合、検証データセット全体が使用されます。                                            |
+| `--validation_seed`         | `validation_seed`   | 検証データローダーのシャッフル用の特定のシード。TOMLファイルで設定されていない場合、メインの学習`--seed`が使用されます。                 |
+
+</details>
+
+### Viewing the Results
+
+The validation loss is logged to your tracking tool of choice (TensorBoard or Weights & Biases). To monitor performance, look for the `loss/validation` metric.
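+
+For TensorBoard, one common way to open the logs is sketched below. It assumes TensorBoard is installed and that training was started with `--logging_dir=logs`; substitute your own log directory if it differs:
+
+```bash
+# Start TensorBoard on the training log directory (assumed here to be `logs`),
+# then open the URL it prints and look for `loss/validation` among the scalars.
+tensorboard --logdir logs
+```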
+
+<details>
+<summary>日本語</summary>
+
+### 結果の表示
+
+検証損失は、選択した追跡ツール（TensorBoardまたはWeights & Biases）に記録されます。パフォーマンスを監視するには、`loss/validation`という指標を探してください。
+
+</details>
+
+### Practical Example
+
+Here is a complete example of how to run LoRA training with validation enabled:
+
+**1. Prepare your `dataset_config.toml`:**
+
+```toml
+[general]
+shuffle_caption = true
+keep_tokens = 1
+
+[[datasets]]
+resolution = "1024,1024"
+batch_size = 2
+
+  [[datasets.subsets]]
+  image_dir = 'path/to/your_images'
+  caption_extension = '.txt'
+  num_repeats = 10
+
+  [[datasets.subsets]]
+  image_dir = 'path/to/your_validation_images'
+  caption_extension = '.txt'
+  validation_split = 1.0 # Use this entire subset for validation
+```
+
+**2. Run the training command:**
+
+```bash
+accelerate launch sdxl_train_network.py \
+  --pretrained_model_name_or_path="sd_xl_base_1.0.safetensors" \
+  --dataset_config="dataset_config.toml" \
+  --output_dir="output" \
+  --output_name="my_lora" \
+  --network_module=networks.lora \
+  --network_dim=32 \
+  --network_alpha=16 \
+  --save_every_n_epochs=1 \
+  --learning_rate=1e-4 \
+  --optimizer_type="AdamW8bit" \
+  --mixed_precision="bf16" \
+  --logging_dir=logs
+```
+
+Because the command specifies neither `--validate_every_n_steps` nor `--validate_every_n_epochs`, the validation loss is calculated once per epoch. It is written to the `logs` directory, which you can view with TensorBoard.
+
+<details>
+<summary>日本語</summary>
+
+### 実践的な例
+
+検証を有効にしてLoRAの学習を実行する完全な例を次に示します。
+
+**1. `dataset_config.toml`を準備します:**
+
+```toml
+[general]
+shuffle_caption = true
+keep_tokens = 1
+
+[[datasets]]
+resolution = "1024,1024"
+batch_size = 2
+
+  [[datasets.subsets]]
+  image_dir = 'path/to/your_images'
+  caption_extension = '.txt'
+  num_repeats = 10
+
+  [[datasets.subsets]]
+  image_dir = 'path/to/your_validation_images'
+  caption_extension = '.txt'
+  validation_split = 1.0 # このサブセット全体を検証に使用します
+```
+
+**2. 学習コマンドを実行します:**
+
+```bash
+accelerate launch sdxl_train_network.py \
+  --pretrained_model_name_or_path="sd_xl_base_1.0.safetensors" \
+  --dataset_config="dataset_config.toml" \
+  --output_dir="output" \
+  --output_name="my_lora" \
+  --network_module=networks.lora \
+  --network_dim=32 \
+  --network_alpha=16 \
+  --save_every_n_epochs=1 \
+  --learning_rate=1e-4 \
+  --optimizer_type="AdamW8bit" \
+  --mixed_precision="bf16" \
+  --logging_dir=logs
+```
+
+検証損失はエポックごとに1回計算され、`logs`ディレクトリに保存されます。これはTensorBoardで表示できます。
+
+</details>