Merge pull request #388 from kohya-ss/dev

add weighted caption for training
update readme
2026-04-06 21:52:27 +00:00 · 2023-04-08 22:00:46 +09:00 · 2023-04-08 21:58:22 +09:00 · 2023-04-08 21:45:57 +09:00 · 2023-04-08 21:36:35 +09:00 · 2023-04-08 21:31:05 +09:00
36 changed files with 6916 additions and 3963 deletions
--- a/README.md
+++ b/README.md
@@ -127,80 +127,100 @@ The majority of scripts is licensed under ASL 2.0 (including codes from Diffuser

 ## Change History

+### 8 Apr. 2021, 2021/4/8:

- 19 Mar. 2023, 2023/3/19:
-  - Add a function to load training config with `.toml` to each training script. Thanks to Linaqruf for this great contribution!
-    - Specify `.toml` file with `--config_file`. `.toml` file has `key=value` entries. Keys are same as command line options. See [#241](https://github.com/kohya-ss/sd-scripts/pull/241) for details.
-    - All sub-sections are combined to a single dictionary (the section names are ignored.)
-    - Omitted arguments are the default values for command line arguments.
-    - Command line args override the arguments in `.toml`.
-    - With `--output_config` option, you can output current command line options  to the `.toml` specified with`--config_file`. Please use as a template.
-  - Add `--lr_scheduler_type` and `--lr_scheduler_args` arguments for custom LR scheduler to each training script. Thanks to Isotr0py! [#271](https://github.com/kohya-ss/sd-scripts/pull/271)
-    - Same as the optimizer.
-  - Add sample image generation with weight and no length limit. Thanks to mio2333! [#288](https://github.com/kohya-ss/sd-scripts/pull/288)
-    - `( )`, `(xxxx:1.2)` and `[ ]` can be used.
-  - Fix exception on training model in diffusers format with `train_network.py` Thanks to orenwang! [#290](https://github.com/kohya-ss/sd-scripts/pull/290)
+- Added support for training with weighted captions. Thanks to AI-Casanova for the great contribution! 
+  - Please refer to the PR for details: [PR #336](https://github.com/kohya-ss/sd-scripts/pull/336)
+  - Specify the `--weighted_captions` option. It is available for all training scripts except Textual Inversion and XTI.
+  - This option is also applicable to token strings of the DreamBooth method.
+  - The syntax for weighted captions is almost the same as the Web UI, and you can use things like `(abc)`, `[abc]`, and `(abc:1.23)`. Nesting is also possible.
+  - If you include a comma in the parentheses, the parentheses will not be properly matched in the prompt shuffle/dropout, so do not include a comma in the parentheses.

-  - 各学習スクリプトでコマンドライン引数の代わりに`.toml` ファイルで引数を指定できるようになりました。Linaqruf氏の多大な貢献に感謝します。
-    - `--config_file` で `.toml` ファイルを指定してください。ファイルは `key=value` 形式の行で指定し、key はコマンドラインオプションと同じです。詳細は [#241](https://github.com/kohya-ss/sd-scripts/pull/241) をご覧ください。
-    - ファイル内のサブセクションはすべて無視されます。
-    - 省略した引数はコマンドライン引数のデフォルト値になります。
-    - コマンドライン引数で `.toml` の設定を上書きできます。
-    - `--output_config` オプションを指定すると、現在のコマンドライン引数を`--config_file` オプションで指定した `.toml` ファイルに出力します。ひな形としてご利用ください。
-  - 任意のスケジューラを使うための `--lr_scheduler_type` と `--lr_scheduler_args` オプションを各学習スクリプトに追加しました。Isotr0py氏に感謝します。 [#271](https://github.com/kohya-ss/sd-scripts/pull/271)
-    - 任意のオプティマイザ指定と同じ形式です。
-  - 学習中のサンプル画像出力でプロンプトの重みづけができるようになりました。また長さ制限も緩和されています。mio2333氏に感謝します。 [#288](https://github.com/kohya-ss/sd-scripts/pull/288)
-    - `( )`、 `(xxxx:1.2)` や `[ ]` が使えます。
-  -  `train_network.py` でローカルのDiffusersモデルを指定した時のエラーを修正しました。orenwang氏に感謝します。 [#290](https://github.com/kohya-ss/sd-scripts/pull/290)
+- 重みづけキャプションによる学習に対応しました。 AI-Casanova 氏の素晴らしい貢献に感謝します。
+  - 詳細はこちらをご確認ください。[PR #336](https://github.com/kohya-ss/sd-scripts/pull/336)
+  - `--weighted_captions` オプションを指定してください。Textual InversionおよびXTIを除く学習スクリプトで使用可能です。
+  - キャプションだけでなく DreamBooth 手法の token string でも有効です。
+  - 重みづけキャプションの記法はWeb UIとほぼ同じで、`(abc)`や`[abc]`、`(abc:1.23)`などが使用できます。入れ子も可能です。
+  - 括弧内にカンマを含めるとプロンプトのshuffle/dropoutで括弧の対応付けがおかしくなるため、括弧内にはカンマを含めないでください。
+  
+### 6 Apr. 2023, 2023/4/6:
+- There may be bugs because I changed a lot. If you cannot revert the script to the previous version when a problem occurs, please wait for the update for a while.

- 11 Mar. 2023, 2023/3/11:
-    - Fix `svd_merge_lora.py` causes an error about the device.
-    - `svd_merge_lora.py` でデバイス関連のエラーが発生する不具合を修正しました。
+- Added a feature to upload model and state to HuggingFace. Thanks to ddPn08 for the contribution! [PR #348](https://github.com/kohya-ss/sd-scripts/pull/348)
+  - When `--huggingface_repo_id` is specified, the model is uploaded to HuggingFace at the same time as saving the model.
+  - Please note that the access token is handled with caution. Please refer to the [HuggingFace documentation](https://huggingface.co/docs/hub/security-tokens).
+  - For example, specify other arguments as follows.
+    - `--huggingface_repo_id "your-hf-name/your-model" --huggingface_path_in_repo "path" --huggingface_repo_type model --huggingface_repo_visibility private --huggingface_token hf_YourAccessTokenHere`
+  - If `public` is specified for `--huggingface_repo_visibility`, the repository will be public. If the option is omitted or `private` (or anything other than `public`) is specified, it will be private.
+  - If you specify `--save_state` and `--save_state_to_huggingface`, the state will also be uploaded.
+  - If you specify `--resume` and `--resume_from_huggingface`, the state will be downloaded from HuggingFace and resumed.
+    - In this case, the `--resume` option is `--resume {repo_id}/{path_in_repo}:{revision}:{repo_type}`. For example: `--resume_from_huggingface --resume your-hf-name/your-model/path/test-000002-state:main:model`
+  - If you specify `--async_upload`, the upload will be done asynchronously.
+- Added the documentation for applying LoRA to generate with the standard pipeline of Diffusers.   [training LoRA](./train_network_README-ja.md#diffusersのpipelineで生成する) (Japanese only)
+- Support for Attention Couple and regional LoRA in `gen_img_diffusers.py`.
+  - If you use ` AND ` to separate the prompts, each sub-prompt is sequentially applied to LoRA. `--mask_path` is treated as a mask image. The number of sub-prompts and the number of LoRA must match.


-  - Sample image generation:
-    A prompt file might look like this, for example
+- 大きく変更したため不具合があるかもしれません。問題が起きた時にスクリプトを前のバージョンに戻せない場合は、しばらく更新を控えてください。

-    ```
-    # prompt 1
-    masterpiece, best quality, (1girl), in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
+- モデルおよびstateをHuggingFaceにアップロードする機能を各スクリプトに追加しました。 [PR #348](https://github.com/kohya-ss/sd-scripts/pull/348) ddPn08 氏の貢献に感謝します。
+  - `--huggingface_repo_id`が指定されているとモデル保存時に同時にHuggingFaceにアップロードします。
+  - アクセストークンの取り扱いに注意してください。[HuggingFaceのドキュメント](https://huggingface.co/docs/hub/security-tokens)を参照してください。
+  - 他の引数をたとえば以下のように指定してください。
+    - `--huggingface_repo_id "your-hf-name/your-model" --huggingface_path_in_repo "path" --huggingface_repo_type model --huggingface_repo_visibility private --huggingface_token hf_YourAccessTokenHere`
+  - `--huggingface_repo_visibility`に`public`を指定するとリポジトリが公開されます。省略時または`private`（など`public`以外）を指定すると非公開になります。
+  - `--save_state`オプション指定時に`--save_state_to_huggingface`を指定するとstateもアップロードします。
+  - `--resume`オプション指定時に`--resume_from_huggingface`を指定するとHuggingFaceからstateをダウンロードして再開します。
+    - その時の `--resume`オプションは `--resume {repo_id}/{path_in_repo}:{revision}:{repo_type}`になります。例: `--resume_from_huggingface --resume your-hf-name/your-model/path/test-000002-state:main:model`
+  - `--async_upload`オプションを指定するとアップロードを非同期で行います。
+- [LoRAの文書](./train_network_README-ja.md#diffusersのpipelineで生成する)に、LoRAを適用してDiffusersの標準的なパイプラインで生成する方法を追記しました。
+- `gen_img_diffusers.py` で Attention Couple および領域別LoRAに対応しました。
+  - プロンプトを` AND `で区切ると各サブプロンプトが順にLoRAに適用されます。`--mask_path` がマスク画像として扱われます。サブプロンプトの数とLoRAの数は一致している必要があります。

-    # prompt 2
-    masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n (low quality, worst quality), bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
-    ```

-    Lines beginning with `#` are comments. You can specify options for the generated image with options like `--n` after the prompt. The following can be used.
+## Sample image generation during training
+  A prompt file might look like this, for example

-    * `--n` Negative prompt up to the next option.
-    * `--w` Specifies the width of the generated image.
-    * `--h` Specifies the height of the generated image.
-    * `--d` Specifies the seed of the generated image.
-    * `--l` Specifies the CFG scale of the generated image.
-    * `--s` Specifies the number of steps in the generation.
+```
+# prompt 1
+masterpiece, best quality, (1girl), in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28

-    The prompt weighting such as `( )` and `[ ]` are working.
+# prompt 2
+masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n (low quality, worst quality), bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
+```

-  - サンプル画像生成：
-    プロンプトファイルは例えば以下のようになります。
+  Lines beginning with `#` are comments. You can specify options for the generated image with options like `--n` after the prompt. The following can be used.

-    ```
-    # prompt 1
-    masterpiece, best quality, 1girl, in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
+  * `--n` Negative prompt up to the next option.
+  * `--w` Specifies the width of the generated image.
+  * `--h` Specifies the height of the generated image.
+  * `--d` Specifies the seed of the generated image.
+  * `--l` Specifies the CFG scale of the generated image.
+  * `--s` Specifies the number of steps in the generation.

-    # prompt 2
-    masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
-    ```
+  The prompt weighting such as `( )` and `[ ]` are working.

-    `#` で始まる行はコメントになります。`--n` のように「ハイフン二個＋英小文字」の形でオプションを指定できます。以下が使用可能できます。
+## サンプル画像生成
+プロンプトファイルは例えば以下のようになります。

-    * `--n` Negative prompt up to the next option.
-    * `--w` Specifies the width of the generated image.
-    * `--h` Specifies the height of the generated image.
-    * `--d` Specifies the seed of the generated image.
-    * `--l` Specifies the CFG scale of the generated image.
-    * `--s` Specifies the number of steps in the generation.
+```
+# prompt 1
+masterpiece, best quality, (1girl), in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28

-    `( )` や `[ ]` などの重みづけは動作しません。
+# prompt 2
+masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n (low quality, worst quality), bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
+```
+
+  `#` で始まる行はコメントになります。`--n` のように「ハイフン二個＋英小文字」の形でオプションを指定できます。以下が使用可能できます。
+
+  * `--n` Negative prompt up to the next option.
+  * `--w` Specifies the width of the generated image.
+  * `--h` Specifies the height of the generated image.
+  * `--d` Specifies the seed of the generated image.
+  * `--l` Specifies the CFG scale of the generated image.
+  * `--s` Specifies the number of steps in the generation.
+
+  `( )` や `[ ]` などの重みづけも動作します。

 Please read [Releases](https://github.com/kohya-ss/sd-scripts/releases) for recent updates.
 最近の更新情報は [Release](https://github.com/kohya-ss/sd-scripts/releases) をご覧ください。
--- a/XTI_hijack.py
+++ b/XTI_hijack.py
@@ -0,0 +1,209 @@
+import torch
+from typing import Union, List, Optional, Dict, Any, Tuple
+from diffusers.models.unet_2d_condition import UNet2DConditionOutput
+
+def unet_forward_XTI(self,
+        sample: torch.FloatTensor,
+        timestep: Union[torch.Tensor, float, int],
+        encoder_hidden_states: torch.Tensor,
+        class_labels: Optional[torch.Tensor] = None,
+        return_dict: bool = True,
+    ) -> Union[UNet2DConditionOutput, Tuple]:
+        r"""
+        Args:
+            sample (`torch.FloatTensor`): (batch, channel, height, width) noisy inputs tensor
+            timestep (`torch.FloatTensor` or `float` or `int`): (batch) timesteps
+            encoder_hidden_states (`torch.FloatTensor`): (batch, sequence_length, feature_dim) encoder hidden states
+            return_dict (`bool`, *optional*, defaults to `True`):
+                Whether or not to return a [`models.unet_2d_condition.UNet2DConditionOutput`] instead of a plain tuple.
+
+        Returns:
+            [`~models.unet_2d_condition.UNet2DConditionOutput`] or `tuple`:
+            [`~models.unet_2d_condition.UNet2DConditionOutput`] if `return_dict` is True, otherwise a `tuple`. When
+            returning a tuple, the first element is the sample tensor.
+        """
+        # By default samples have to be AT least a multiple of the overall upsampling factor.
+        # The overall upsampling factor is equal to 2 ** (# num of upsampling layears).
+        # However, the upsampling interpolation output size can be forced to fit any upsampling size
+        # on the fly if necessary.
+        default_overall_up_factor = 2**self.num_upsamplers
+
+        # upsample size should be forwarded when sample is not a multiple of `default_overall_up_factor`
+        forward_upsample_size = False
+        upsample_size = None
+
+        if any(s % default_overall_up_factor != 0 for s in sample.shape[-2:]):
+            logger.info("Forward upsample size to force interpolation output size.")
+            forward_upsample_size = True
+
+        # 0. center input if necessary
+        if self.config.center_input_sample:
+            sample = 2 * sample - 1.0
+
+        # 1. time
+        timesteps = timestep
+        if not torch.is_tensor(timesteps):
+            # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
+            # This would be a good case for the `match` statement (Python 3.10+)
+            is_mps = sample.device.type == "mps"
+            if isinstance(timestep, float):
+                dtype = torch.float32 if is_mps else torch.float64
+            else:
+                dtype = torch.int32 if is_mps else torch.int64
+            timesteps = torch.tensor([timesteps], dtype=dtype, device=sample.device)
+        elif len(timesteps.shape) == 0:
+            timesteps = timesteps[None].to(sample.device)
+
+        # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
+        timesteps = timesteps.expand(sample.shape[0])
+
+        t_emb = self.time_proj(timesteps)
+
+        # timesteps does not contain any weights and will always return f32 tensors
+        # but time_embedding might actually be running in fp16. so we need to cast here.
+        # there might be better ways to encapsulate this.
+        t_emb = t_emb.to(dtype=self.dtype)
+        emb = self.time_embedding(t_emb)
+
+        if self.config.num_class_embeds is not None:
+            if class_labels is None:
+                raise ValueError("class_labels should be provided when num_class_embeds > 0")
+            class_emb = self.class_embedding(class_labels).to(dtype=self.dtype)
+            emb = emb + class_emb
+
+        # 2. pre-process
+        sample = self.conv_in(sample)
+
+        # 3. down
+        down_block_res_samples = (sample,)
+        down_i = 0
+        for downsample_block in self.down_blocks:
+            if hasattr(downsample_block, "has_cross_attention") and downsample_block.has_cross_attention:
+                sample, res_samples = downsample_block(
+                    hidden_states=sample,
+                    temb=emb,
+                    encoder_hidden_states=encoder_hidden_states[down_i:down_i+2],
+                )
+                down_i += 2
+            else:
+                sample, res_samples = downsample_block(hidden_states=sample, temb=emb)
+
+            down_block_res_samples += res_samples
+
+        # 4. mid
+        sample = self.mid_block(sample, emb, encoder_hidden_states=encoder_hidden_states[6])
+
+        # 5. up
+        up_i = 7
+        for i, upsample_block in enumerate(self.up_blocks):
+            is_final_block = i == len(self.up_blocks) - 1
+
+            res_samples = down_block_res_samples[-len(upsample_block.resnets) :]
+            down_block_res_samples = down_block_res_samples[: -len(upsample_block.resnets)]
+
+            # if we have not reached the final block and need to forward the
+            # upsample size, we do it here
+            if not is_final_block and forward_upsample_size:
+                upsample_size = down_block_res_samples[-1].shape[2:]
+
+            if hasattr(upsample_block, "has_cross_attention") and upsample_block.has_cross_attention:
+                sample = upsample_block(
+                    hidden_states=sample,
+                    temb=emb,
+                    res_hidden_states_tuple=res_samples,
+                    encoder_hidden_states=encoder_hidden_states[up_i:up_i+3],
+                    upsample_size=upsample_size,
+                )
+                up_i += 3
+            else:
+                sample = upsample_block(
+                    hidden_states=sample, temb=emb, res_hidden_states_tuple=res_samples, upsample_size=upsample_size
+                )
+        # 6. post-process
+        sample = self.conv_norm_out(sample)
+        sample = self.conv_act(sample)
+        sample = self.conv_out(sample)
+
+        if not return_dict:
+            return (sample,)
+
+        return UNet2DConditionOutput(sample=sample)
+
+def downblock_forward_XTI(
+    self, hidden_states, temb=None, encoder_hidden_states=None, attention_mask=None, cross_attention_kwargs=None
+):
+    output_states = ()
+    i = 0
+
+    for resnet, attn in zip(self.resnets, self.attentions):
+        if self.training and self.gradient_checkpointing:
+
+            def create_custom_forward(module, return_dict=None):
+                def custom_forward(*inputs):
+                    if return_dict is not None:
+                        return module(*inputs, return_dict=return_dict)
+                    else:
+                        return module(*inputs)
+
+                return custom_forward
+
+            hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states, temb)
+            hidden_states = torch.utils.checkpoint.checkpoint(
+                create_custom_forward(attn, return_dict=False), hidden_states, encoder_hidden_states[i]
+            )[0]
+        else:
+            hidden_states = resnet(hidden_states, temb)
+            hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states[i]).sample
+
+        output_states += (hidden_states,)
+        i += 1
+
+    if self.downsamplers is not None:
+        for downsampler in self.downsamplers:
+            hidden_states = downsampler(hidden_states)
+
+        output_states += (hidden_states,)
+
+    return hidden_states, output_states
+
+def upblock_forward_XTI(
+    self,
+    hidden_states,
+    res_hidden_states_tuple,
+    temb=None,
+    encoder_hidden_states=None,
+    upsample_size=None,
+):
+    i = 0
+    for resnet, attn in zip(self.resnets, self.attentions):
+        # pop res hidden states
+        res_hidden_states = res_hidden_states_tuple[-1]
+        res_hidden_states_tuple = res_hidden_states_tuple[:-1]
+        hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
+
+        if self.training and self.gradient_checkpointing:
+
+            def create_custom_forward(module, return_dict=None):
+                def custom_forward(*inputs):
+                    if return_dict is not None:
+                        return module(*inputs, return_dict=return_dict)
+                    else:
+                        return module(*inputs)
+
+                return custom_forward
+
+            hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states, temb)
+            hidden_states = torch.utils.checkpoint.checkpoint(
+                create_custom_forward(attn, return_dict=False), hidden_states, encoder_hidden_states[i]
+            )[0]
+        else:
+            hidden_states = resnet(hidden_states, temb)
+            hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states[i]).sample
+        
+        i += 1
+
+    if self.upsamplers is not None:
+        for upsampler in self.upsamplers:
+            hidden_states = upsampler(hidden_states, upsample_size)
+
+    return hidden_states
--- a/fine_tune.py
+++ b/fine_tune.py
@@ -6,6 +6,7 @@ import gc
 import math
 import os
 import toml
+from multiprocessing import Value

 from tqdm import tqdm
 import torch
@@ -19,10 +20,8 @@ from library.config_util import (
    ConfigSanitizer,
    BlueprintGenerator,
 )
-
-
-def collate_fn(examples):
-    return examples[0]
+import library.custom_train_functions as custom_train_functions
+from library.custom_train_functions import apply_snr_weight, get_weighted_text_embeddings


 def train(args):
@@ -64,6 +63,11 @@ def train(args):
    blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
    train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)

+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collater = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collater = train_util.collater_class(current_epoch, current_step, ds_for_collater)
+
    if args.debug_dataset:
        train_util.debug_dataset(train_dataset_group)
        return
@@ -138,7 +142,7 @@ def train(args):
        vae.requires_grad_(False)
        vae.eval()
        with torch.no_grad():
-            train_dataset_group.cache_latents(vae)
+            train_dataset_group.cache_latents(vae, args.vae_batch_size)
        vae.to("cpu")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
@@ -187,16 +191,21 @@ def train(args):
        train_dataset_group,
        batch_size=1,
        shuffle=True,
-        collate_fn=collate_fn,
+        collate_fn=collater,
        num_workers=n_workers,
        persistent_workers=args.persistent_data_loader_workers,
    )

    # 学習ステップ数を計算する
    if args.max_train_epochs is not None:
-        args.max_train_steps = args.max_train_epochs * len(train_dataloader)
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
+        )
        print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")

+    # データセット側にも学習ステップを送信
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
    # lr schedulerを用意する
    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)

@@ -222,9 +231,7 @@ def train(args):
        train_util.patch_accelerator_for_fp16_training(accelerator)

    # resumeする
-    if args.resume is not None:
-        print(f"resume training from state: {args.resume}")
-        accelerator.load_state(args.resume)
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)

    # epoch数を計算する
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
@@ -240,7 +247,7 @@ def train(args):
    print(f"  num epochs / epoch数: {num_train_epochs}")
    print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
    print(f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}")
-    print(f"  gradient ccumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
+    print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
    print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")

    progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
@@ -255,17 +262,18 @@ def train(args):

    for epoch in range(num_train_epochs):
        print(f"epoch {epoch+1}/{num_train_epochs}")
-        train_dataset_group.set_current_epoch(epoch + 1)
+        current_epoch.value = epoch + 1

        for m in training_models:
            m.train()

        loss_total = 0
        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
            with accelerator.accumulate(training_models[0]):  # 複数モデルに対応していない模様だがとりあえずこうしておく
                with torch.no_grad():
                    if "latents" in batch and batch["latents"] is not None:
-                        latents = batch["latents"].to(accelerator.device)
+                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
                    else:
                        # latentに変換
                        latents = vae.encode(batch["images"].to(dtype=weight_dtype)).latent_dist.sample()
@@ -274,10 +282,19 @@ def train(args):

                with torch.set_grad_enabled(args.train_text_encoder):
                    # Get the text embedding for conditioning
-                    input_ids = batch["input_ids"].to(accelerator.device)
-                    encoder_hidden_states = train_util.get_hidden_states(
-                        args, input_ids, tokenizer, text_encoder, None if not args.full_fp16 else weight_dtype
-                    )
+                    if args.weighted_captions:
+                      encoder_hidden_states = get_weighted_text_embeddings(tokenizer,
+                        text_encoder,
+                        batch["captions"],
+                        accelerator.device,
+                        args.max_token_length // 75 if args.max_token_length else 1,
+                        clip_skip=args.clip_skip,
+                        )
+                    else:
+                      input_ids = batch["input_ids"].to(accelerator.device)
+                      encoder_hidden_states = train_util.get_hidden_states(
+                          args, input_ids, tokenizer, text_encoder, None if not args.full_fp16 else weight_dtype
+                      )

                # Sample noise that we'll add to the latents
                noise = torch.randn_like(latents, device=latents.device)
@@ -302,7 +319,14 @@ def train(args):
                else:
                    target = noise

-                loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="mean")
+                if args.min_snr_gamma:
+                    # do not mean over batch dimension for snr weight
+                    loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
+                    loss = loss.mean([1, 2, 3])
+                    loss = apply_snr_weight(loss, timesteps, noise_scheduler, args.min_snr_gamma)
+                    loss = loss.mean()  # mean over batch dimension
+                else:
+                    loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="mean")

                accelerator.backward(loss)
                if accelerator.sync_gradients and args.max_grad_norm != 0.0:
@@ -387,7 +411,7 @@ def train(args):
        print("model saved.")


-if __name__ == "__main__":
+def setup_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()

    train_util.add_sd_models_arguments(parser)
@@ -396,11 +420,18 @@ if __name__ == "__main__":
    train_util.add_sd_saving_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
+    custom_train_functions.add_custom_train_arguments(parser)

    parser.add_argument("--diffusers_xformers", action="store_true", help="use xformers by diffusers / Diffusersでxformersを使用する")
    parser.add_argument("--train_text_encoder", action="store_true", help="train text encoder / text encoderも学習する")

+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
    args = parser.parse_args()
    args = train_util.read_config_from_file(args, parser)

-    train(args)
+    train(args)
--- a/finetune/clean_captions_and_tags.py
+++ b/finetune/clean_captions_and_tags.py
@@ -163,13 +163,19 @@ def main(args):
  print("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  # parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
  parser.add_argument("in_json", type=str, help="metadata file to input / 読み込むメタデータファイル")
  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
  parser.add_argument("--debug", action="store_true", help="debug mode")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args, unknown = parser.parse_known_args()
  if len(unknown) == 1:
    print("WARNING: train_data_dir argument is removed. This script will not work with three arguments in future. Please specify two arguments: in_json and out_json.")
--- a/finetune/make_captions.py
+++ b/finetune/make_captions.py
@@ -133,7 +133,7 @@ def main(args):
  print("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
  parser.add_argument("--caption_weights", type=str, default="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth",
@@ -153,6 +153,12 @@ if __name__ == '__main__':
  parser.add_argument('--seed', default=42, type=int, help='seed for reproducibility / 再現性を確保するための乱数seed')
  parser.add_argument("--debug", action="store_true", help="debug mode")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()

  # スペルミスしていたオプションを復元する
--- a/finetune/make_captions_by_git.py
+++ b/finetune/make_captions_by_git.py
@@ -127,7 +127,7 @@ def main(args):
  print("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
  parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 出力されるキャプションファイルの拡張子")
@@ -141,5 +141,11 @@ if __name__ == '__main__':
                      help="remove like `with the words xxx` from caption / `with the words xxx`のような部分をキャプションから削除する")
  parser.add_argument("--debug", action="store_true", help="debug mode")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()
  main(args)
--- a/finetune/merge_captions_to_metadata.py
+++ b/finetune/merge_captions_to_metadata.py
@@ -46,7 +46,7 @@ def main(args):
  print("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
@@ -61,6 +61,12 @@ if __name__ == '__main__':
                      help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す")
  parser.add_argument("--debug", action="store_true", help="debug mode")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()

  # スペルミスしていたオプションを復元する
--- a/finetune/merge_dd_tags_to_metadata.py
+++ b/finetune/merge_dd_tags_to_metadata.py
@@ -47,7 +47,7 @@ def main(args):
  print("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
@@ -61,5 +61,11 @@ if __name__ == '__main__':
                      help="extension of caption (tag) file / 読み込むキャプション（タグ）ファイルの拡張子")
  parser.add_argument("--debug", action="store_true", help="debug mode, print tags")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()
  main(args)
--- a/finetune/prepare_buckets_latents.py
+++ b/finetune/prepare_buckets_latents.py
@@ -229,7 +229,7 @@ def main(args):
  print("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
  parser.add_argument("in_json", type=str, help="metadata file to input / 読み込むメタデータファイル")
@@ -257,5 +257,11 @@ if __name__ == '__main__':
  parser.add_argument("--skip_existing", action="store_true",
                      help="skip images if npz already exists (both normal and flipped exists if flip_aug is enabled) / npzが既に存在する画像をスキップする（flip_aug有効時は通常、反転の両方が存在する画像をスキップ）")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()
  main(args)
--- a/finetune/tag_images_by_wd14_tagger.py
+++ b/finetune/tag_images_by_wd14_tagger.py
@@ -173,7 +173,7 @@ def main(args):
  print("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
  parser.add_argument("--repo_id", type=str, default=DEFAULT_WD14_TAGGER_REPO,
@@ -191,6 +191,12 @@ if __name__ == '__main__':
  parser.add_argument("--caption_extension", type=str, default=".txt", help="extension of caption file / 出力されるキャプションファイルの拡張子")
  parser.add_argument("--debug", action="store_true", help="debug mode")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()

  # スペルミスしていたオプションを復元する
--- a/gen_img_diffusers.py
+++ b/gen_img_diffusers.py
--- a/library/config_util.py
+++ b/library/config_util.py
@@ -4,6 +4,7 @@ from dataclasses import (
  dataclass,
 )
 import functools
+import random
 from textwrap import dedent, indent
 import json
 from pathlib import Path
@@ -56,6 +57,8 @@ class BaseSubsetParams:
  caption_dropout_rate: float = 0.0
  caption_dropout_every_n_epochs: int = 0
  caption_tag_dropout_rate: float = 0.0
+  token_warmup_min: int = 1
+  token_warmup_step: float = 0

@dataclass
 class DreamBoothSubsetParams(BaseSubsetParams):
@@ -137,6 +140,8 @@ class ConfigSanitizer:
    "random_crop": bool,
    "shuffle_caption": bool,
    "keep_tokens": int,
+    "token_warmup_min": int,
+    "token_warmup_step": Any(float,int),
  }
  # DO means DropOut
  DO_SUBSET_ASCENDABLE_SCHEMA = {
@@ -406,6 +411,8 @@ def generate_dataset_group_by_blueprint(dataset_group_blueprint: DatasetGroupBlu
          flip_aug: {subset.flip_aug}
          face_crop_aug_range: {subset.face_crop_aug_range}
          random_crop: {subset.random_crop}
+          token_warmup_min: {subset.token_warmup_min},
+          token_warmup_step: {subset.token_warmup_step},
      """), "  ")

      if is_dreambooth:
@@ -422,9 +429,12 @@ def generate_dataset_group_by_blueprint(dataset_group_blueprint: DatasetGroupBlu
  print(info)

  # make buckets first because it determines the length of dataset
+  # and set the same seed for all datasets
+  seed = random.randint(0, 2**31) # actual seed is seed + epoch_no
  for i, dataset in enumerate(datasets):
    print(f"[Dataset {i}]")
    dataset.make_buckets()
+    dataset.set_seed(seed)

  return DatasetGroup(datasets)

@@ -435,7 +445,7 @@ def generate_dreambooth_subsets_config_by_subdirs(train_data_dir: Optional[str]
    try:
      n_repeats = int(tokens[0])
    except ValueError as e:
-      print(f"ignore directory without repeats / 繰り返し回数のないディレクトリを無視します: {dir}")
+      print(f"ignore directory without repeats / 繰り返し回数のないディレクトリを無視します: {name}")
      return 0, ""
    caption_by_folder = '_'.join(tokens[1:])
    return n_repeats, caption_by_folder
@@ -476,7 +486,8 @@ def load_user_config(file: str) -> dict:

  if file.name.lower().endswith('.json'):
    try:
-      config = json.load(file)
+      with open(file, 'r') as f:
+        config = json.load(f)
    except Exception:
      print(f"Error on parsing JSON config file. Please check the format. / JSON 形式の設定ファイルの読み込みに失敗しました。文法が正しいか確認してください。: {file}")
      raise
@@ -491,7 +502,6 @@ def load_user_config(file: str) -> dict:

  return config

-
 # for config test
 if __name__ == "__main__":
  parser = argparse.ArgumentParser()
--- a/library/custom_train_functions.py
+++ b/library/custom_train_functions.py
@@ -0,0 +1,344 @@
+import torch
+import argparse
+import re
+from typing import List, Optional, Union
+
+
+def apply_snr_weight(loss, timesteps, noise_scheduler, gamma):
+    alphas_cumprod = noise_scheduler.alphas_cumprod
+    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
+    sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)
+    alpha = sqrt_alphas_cumprod
+    sigma = sqrt_one_minus_alphas_cumprod
+    all_snr = (alpha / sigma) ** 2
+    snr = torch.stack([all_snr[t] for t in timesteps])
+    gamma_over_snr = torch.div(torch.ones_like(snr) * gamma, snr)
+    snr_weight = torch.minimum(gamma_over_snr, torch.ones_like(gamma_over_snr)).float()  # from paper
+    loss = loss * snr_weight
+    return loss
+
+
+def add_custom_train_arguments(parser: argparse.ArgumentParser, support_weighted_captions: bool = True):
+    parser.add_argument(
+        "--min_snr_gamma",
+        type=float,
+        default=None,
+        help="gamma for reducing the weight of high loss timesteps. Lower numbers have stronger effect. 5 is recommended by paper. / 低いタイムステップでの高いlossに対して重みを減らすためのgamma値、低いほど効果が強く、論文では5が推奨",
+    )
+    if support_weighted_captions:
+        parser.add_argument(
+            "--weighted_captions",
+            action="store_true",
+            default=False,
+            help="Enable weighted captions in the standard style (token:1.3). No commas inside parens, or shuffle/dropout may break the decoder. / 「[token]」、「(token)」「(token:1.3)」のような重み付きキャプションを有効にする。カンマを括弧内に入れるとシャッフルやdropoutで重みづけがおかしくなるので注意",
+        )
+
+
+re_attention = re.compile(
+    r"""
+\\\(|
+\\\)|
+\\\[|
+\\]|
+\\\\|
+\\|
+\(|
+\[|
+:([+-]?[.\d]+)\)|
+\)|
+]|
+[^\\()\[\]:]+|
+:
+""",
+    re.X,
+)
+
+
+def parse_prompt_attention(text):
+    """
+    Parses a string with attention tokens and returns a list of pairs: text and its associated weight.
+    Accepted tokens are:
+      (abc) - increases attention to abc by a multiplier of 1.1
+      (abc:3.12) - increases attention to abc by a multiplier of 3.12
+      [abc] - decreases attention to abc by a multiplier of 1.1
+      \( - literal character '('
+      \[ - literal character '['
+      \) - literal character ')'
+      \] - literal character ']'
+      \\ - literal character '\'
+      anything else - just text
+    >>> parse_prompt_attention('normal text')
+    [['normal text', 1.0]]
+    >>> parse_prompt_attention('an (important) word')
+    [['an ', 1.0], ['important', 1.1], [' word', 1.0]]
+    >>> parse_prompt_attention('(unbalanced')
+    [['unbalanced', 1.1]]
+    >>> parse_prompt_attention('\(literal\]')
+    [['(literal]', 1.0]]
+    >>> parse_prompt_attention('(unnecessary)(parens)')
+    [['unnecessaryparens', 1.1]]
+    >>> parse_prompt_attention('a (((house:1.3)) [on] a (hill:0.5), sun, (((sky))).')
+    [['a ', 1.0],
+     ['house', 1.5730000000000004],
+     [' ', 1.1],
+     ['on', 1.0],
+     [' a ', 1.1],
+     ['hill', 0.55],
+     [', sun, ', 1.1],
+     ['sky', 1.4641000000000006],
+     ['.', 1.1]]
+    """
+
+    res = []
+    round_brackets = []
+    square_brackets = []
+
+    round_bracket_multiplier = 1.1
+    square_bracket_multiplier = 1 / 1.1
+
+    def multiply_range(start_position, multiplier):
+        for p in range(start_position, len(res)):
+            res[p][1] *= multiplier
+
+    for m in re_attention.finditer(text):
+        text = m.group(0)
+        weight = m.group(1)
+
+        if text.startswith("\\"):
+            res.append([text[1:], 1.0])
+        elif text == "(":
+            round_brackets.append(len(res))
+        elif text == "[":
+            square_brackets.append(len(res))
+        elif weight is not None and len(round_brackets) > 0:
+            multiply_range(round_brackets.pop(), float(weight))
+        elif text == ")" and len(round_brackets) > 0:
+            multiply_range(round_brackets.pop(), round_bracket_multiplier)
+        elif text == "]" and len(square_brackets) > 0:
+            multiply_range(square_brackets.pop(), square_bracket_multiplier)
+        else:
+            res.append([text, 1.0])
+
+    for pos in round_brackets:
+        multiply_range(pos, round_bracket_multiplier)
+
+    for pos in square_brackets:
+        multiply_range(pos, square_bracket_multiplier)
+
+    if len(res) == 0:
+        res = [["", 1.0]]
+
+    # merge runs of identical weights
+    i = 0
+    while i + 1 < len(res):
+        if res[i][1] == res[i + 1][1]:
+            res[i][0] += res[i + 1][0]
+            res.pop(i + 1)
+        else:
+            i += 1
+
+    return res
+
+
+def get_prompts_with_weights(tokenizer, prompt: List[str], max_length: int):
+    r"""
+    Tokenize a list of prompts and return its tokens with weights of each token.
+
+    No padding, starting or ending token is included.
+    """
+    tokens = []
+    weights = []
+    truncated = False
+    for text in prompt:
+        texts_and_weights = parse_prompt_attention(text)
+        text_token = []
+        text_weight = []
+        for word, weight in texts_and_weights:
+            # tokenize and discard the starting and the ending token
+            token = tokenizer(word).input_ids[1:-1]
+            text_token += token
+            # copy the weight by length of token
+            text_weight += [weight] * len(token)
+            # stop if the text is too long (longer than truncation limit)
+            if len(text_token) > max_length:
+                truncated = True
+                break
+        # truncate
+        if len(text_token) > max_length:
+            truncated = True
+            text_token = text_token[:max_length]
+            text_weight = text_weight[:max_length]
+        tokens.append(text_token)
+        weights.append(text_weight)
+    if truncated:
+        print("Prompt was truncated. Try to shorten the prompt or increase max_embeddings_multiples")
+    return tokens, weights
+
+
+def pad_tokens_and_weights(tokens, weights, max_length, bos, eos, no_boseos_middle=True, chunk_length=77):
+    r"""
+    Pad the tokens (with starting and ending tokens) and weights (with 1.0) to max_length.
+    """
+    max_embeddings_multiples = (max_length - 2) // (chunk_length - 2)
+    weights_length = max_length if no_boseos_middle else max_embeddings_multiples * chunk_length
+    for i in range(len(tokens)):
+        tokens[i] = [bos] + tokens[i] + [eos] * (max_length - 1 - len(tokens[i]))
+        if no_boseos_middle:
+            weights[i] = [1.0] + weights[i] + [1.0] * (max_length - 1 - len(weights[i]))
+        else:
+            w = []
+            if len(weights[i]) == 0:
+                w = [1.0] * weights_length
+            else:
+                for j in range(max_embeddings_multiples):
+                    w.append(1.0)  # weight for starting token in this chunk
+                    w += weights[i][j * (chunk_length - 2) : min(len(weights[i]), (j + 1) * (chunk_length - 2))]
+                    w.append(1.0)  # weight for ending token in this chunk
+                w += [1.0] * (weights_length - len(w))
+            weights[i] = w[:]
+
+    return tokens, weights
+
+
+def get_unweighted_text_embeddings(
+    tokenizer,
+    text_encoder,
+    text_input: torch.Tensor,
+    chunk_length: int,
+    clip_skip: int,
+    eos: int,
+    pad: int,
+    no_boseos_middle: Optional[bool] = True,
+):
+    """
+    When the length of tokens is a multiple of the capacity of the text encoder,
+    it should be split into chunks and sent to the text encoder individually.
+    """
+    max_embeddings_multiples = (text_input.shape[1] - 2) // (chunk_length - 2)
+    if max_embeddings_multiples > 1:
+        text_embeddings = []
+        for i in range(max_embeddings_multiples):
+            # extract the i-th chunk
+            text_input_chunk = text_input[:, i * (chunk_length - 2) : (i + 1) * (chunk_length - 2) + 2].clone()
+
+            # cover the head and the tail by the starting and the ending tokens
+            text_input_chunk[:, 0] = text_input[0, 0]
+            if pad == eos:  # v1
+                text_input_chunk[:, -1] = text_input[0, -1]
+            else:  # v2
+                for j in range(len(text_input_chunk)):
+                    if text_input_chunk[j, -1] != eos and text_input_chunk[j, -1] != pad:  # 最後に普通の文字がある
+                        text_input_chunk[j, -1] = eos
+                    if text_input_chunk[j, 1] == pad:  # BOSだけであとはPAD
+                        text_input_chunk[j, 1] = eos
+
+            if clip_skip is None or clip_skip == 1:
+                text_embedding = text_encoder(text_input_chunk)[0]
+            else:
+                enc_out = text_encoder(text_input_chunk, output_hidden_states=True, return_dict=True)
+                text_embedding = enc_out["hidden_states"][-clip_skip]
+                text_embedding = text_encoder.text_model.final_layer_norm(text_embedding)
+
+            # cover the head and the tail by the starting and the ending tokens
+            text_input_chunk[:, 0] = text_input[0, 0]
+            text_input_chunk[:, -1] = text_input[0, -1]
+            text_embedding = text_encoder(text_input_chunk, attention_mask=None)[0]
+
+            if no_boseos_middle:
+                if i == 0:
+                    # discard the ending token
+                    text_embedding = text_embedding[:, :-1]
+                elif i == max_embeddings_multiples - 1:
+                    # discard the starting token
+                    text_embedding = text_embedding[:, 1:]
+                else:
+                    # discard both starting and ending tokens
+                    text_embedding = text_embedding[:, 1:-1]
+
+            text_embeddings.append(text_embedding)
+        text_embeddings = torch.concat(text_embeddings, axis=1)
+    else:
+        text_embeddings = text_encoder(text_input)[0]
+    return text_embeddings
+
+
+def get_weighted_text_embeddings(
+    tokenizer,
+    text_encoder,
+    prompt: Union[str, List[str]],
+    device,
+    max_embeddings_multiples: Optional[int] = 3,
+    no_boseos_middle: Optional[bool] = False,
+    clip_skip=None,
+):
+    r"""
+    Prompts can be assigned with local weights using brackets. For example,
+    prompt 'A (very beautiful) masterpiece' highlights the words 'very beautiful',
+    and the embedding tokens corresponding to the words get multiplied by a constant, 1.1.
+
+    Also, to regularize of the embedding, the weighted embedding would be scaled to preserve the original mean.
+
+    Args:
+        prompt (`str` or `List[str]`):
+            The prompt or prompts to guide the image generation.
+        max_embeddings_multiples (`int`, *optional*, defaults to `3`):
+            The max multiple length of prompt embeddings compared to the max output length of text encoder.
+        no_boseos_middle (`bool`, *optional*, defaults to `False`):
+            If the length of text token is multiples of the capacity of text encoder, whether reserve the starting and
+            ending token in each of the chunk in the middle.
+        skip_parsing (`bool`, *optional*, defaults to `False`):
+            Skip the parsing of brackets.
+        skip_weighting (`bool`, *optional*, defaults to `False`):
+            Skip the weighting. When the parsing is skipped, it is forced True.
+    """
+    max_length = (tokenizer.model_max_length - 2) * max_embeddings_multiples + 2
+    if isinstance(prompt, str):
+        prompt = [prompt]
+
+    prompt_tokens, prompt_weights = get_prompts_with_weights(tokenizer, prompt, max_length - 2)
+
+    # round up the longest length of tokens to a multiple of (model_max_length - 2)
+    max_length = max([len(token) for token in prompt_tokens])
+
+    max_embeddings_multiples = min(
+        max_embeddings_multiples,
+        (max_length - 1) // (tokenizer.model_max_length - 2) + 1,
+    )
+    max_embeddings_multiples = max(1, max_embeddings_multiples)
+    max_length = (tokenizer.model_max_length - 2) * max_embeddings_multiples + 2
+
+    # pad the length of tokens and weights
+    bos = tokenizer.bos_token_id
+    eos = tokenizer.eos_token_id
+    pad = tokenizer.pad_token_id
+    prompt_tokens, prompt_weights = pad_tokens_and_weights(
+        prompt_tokens,
+        prompt_weights,
+        max_length,
+        bos,
+        eos,
+        no_boseos_middle=no_boseos_middle,
+        chunk_length=tokenizer.model_max_length,
+    )
+    prompt_tokens = torch.tensor(prompt_tokens, dtype=torch.long, device=device)
+
+    # get the embeddings
+    text_embeddings = get_unweighted_text_embeddings(
+        tokenizer,
+        text_encoder,
+        prompt_tokens,
+        tokenizer.model_max_length,
+        clip_skip,
+        eos,
+        pad,
+        no_boseos_middle=no_boseos_middle,
+    )
+    prompt_weights = torch.tensor(prompt_weights, dtype=text_embeddings.dtype, device=device)
+
+    # assign weights to the prompts and normalize in the sense of mean
+    previous_mean = text_embeddings.float().mean(axis=[-2, -1]).to(text_embeddings.dtype)
+    text_embeddings = text_embeddings * prompt_weights.unsqueeze(-1)
+    current_mean = text_embeddings.float().mean(axis=[-2, -1]).to(text_embeddings.dtype)
+    text_embeddings = text_embeddings * (previous_mean / current_mean).unsqueeze(-1).unsqueeze(-1)
+
+    return text_embeddings
--- a/library/huggingface_util.py
+++ b/library/huggingface_util.py
@@ -0,0 +1,78 @@
+from typing import *
+from huggingface_hub import HfApi
+from pathlib import Path
+import argparse
+import os
+
+from library.utils import fire_in_thread
+
+
+def exists_repo(
+    repo_id: str, repo_type: str, revision: str = "main", token: str = None
+):
+    api = HfApi(
+        token=token,
+    )
+    try:
+        api.repo_info(repo_id=repo_id, revision=revision, repo_type=repo_type)
+        return True
+    except:
+        return False
+
+
+def upload(
+    args: argparse.Namespace,
+    src: Union[str, Path, bytes, BinaryIO],
+    dest_suffix: str = "",
+    force_sync_upload: bool = False,
+):
+    repo_id = args.huggingface_repo_id
+    repo_type = args.huggingface_repo_type
+    token = args.huggingface_token
+    path_in_repo = args.huggingface_path_in_repo + dest_suffix
+    private = args.huggingface_repo_visibility is None or args.huggingface_repo_visibility != "public"
+    api = HfApi(token=token)
+    if not exists_repo(repo_id=repo_id, repo_type=repo_type, token=token):
+        api.create_repo(repo_id=repo_id, repo_type=repo_type, private=private)
+
+    is_folder = (type(src) == str and os.path.isdir(src)) or (
+        isinstance(src, Path) and src.is_dir()
+    )
+
+    def uploader():
+        if is_folder:
+            api.upload_folder(
+                repo_id=repo_id,
+                repo_type=repo_type,
+                folder_path=src,
+                path_in_repo=path_in_repo,
+            )
+        else:
+            api.upload_file(
+                repo_id=repo_id,
+                repo_type=repo_type,
+                path_or_fileobj=src,
+                path_in_repo=path_in_repo,
+            )
+
+    if args.async_upload and not force_sync_upload:
+        fire_in_thread(uploader)
+    else:
+        uploader()
+
+
+def list_dir(
+    repo_id: str,
+    subfolder: str,
+    repo_type: str,
+    revision: str = "main",
+    token: str = None,
+):
+    api = HfApi(
+        token=token,
+    )
+    repo_info = api.repo_info(repo_id=repo_id, revision=revision, repo_type=repo_type)
+    file_list = [
+        file for file in repo_info.siblings if file.rfilename.startswith(subfolder)
+    ]
+    return file_list
--- a/library/model_util.py
+++ b/library/model_util.py
--- a/library/train_util.py
+++ b/library/train_util.py
@@ -2,6 +2,7 @@

 import argparse
 import ast
+import asyncio
 import importlib
 import json
 import pathlib
@@ -49,6 +50,7 @@ from diffusers import (
    KDPM2DiscreteScheduler,
    KDPM2AncestralDiscreteScheduler,
 )
+from huggingface_hub import hf_hub_download
 import albumentations as albu
 import numpy as np
 from PIL import Image
@@ -58,6 +60,7 @@ from torch import einsum
 import safetensors.torch
 from library.lpw_stable_diffusion import StableDiffusionLongPromptWeightingPipeline
 import library.model_util as model_util
+import library.huggingface_util as huggingface_util

 # Tokenizer: checkpointから読み込むのではなくあらかじめ提供されているものを使う
 TOKENIZER_PATH = "openai/clip-vit-large-patch14"
@@ -73,8 +76,7 @@ DEFAULT_LAST_OUTPUT_NAME = "last"

 # region dataset

-IMAGE_EXTENSIONS = [".png", ".jpg", ".jpeg", ".webp", ".bmp"]
-# , ".PNG", ".JPG", ".JPEG", ".WEBP", ".BMP"]         # Linux?
+IMAGE_EXTENSIONS = [".png", ".jpg", ".jpeg", ".webp", ".bmp", ".PNG", ".JPG", ".JPEG", ".WEBP", ".BMP"]


 class ImageInfo:
@@ -277,6 +279,8 @@ class BaseSubset:
        caption_dropout_rate: float,
        caption_dropout_every_n_epochs: int,
        caption_tag_dropout_rate: float,
+        token_warmup_min: int,
+        token_warmup_step: Union[float, int],
    ) -> None:
        self.image_dir = image_dir
        self.num_repeats = num_repeats
@@ -290,6 +294,9 @@ class BaseSubset:
        self.caption_dropout_every_n_epochs = caption_dropout_every_n_epochs
        self.caption_tag_dropout_rate = caption_tag_dropout_rate

+        self.token_warmup_min = token_warmup_min  # step=0におけるタグの数
+        self.token_warmup_step = token_warmup_step  # N（N<1ならN*max_train_steps）ステップ目でタグの数が最大になる
+
        self.img_count = 0


@@ -310,6 +317,8 @@ class DreamBoothSubset(BaseSubset):
        caption_dropout_rate,
        caption_dropout_every_n_epochs,
        caption_tag_dropout_rate,
+        token_warmup_min,
+        token_warmup_step,
    ) -> None:
        assert image_dir is not None, "image_dir must be specified / image_dirは指定が必須です"

@@ -325,6 +334,8 @@ class DreamBoothSubset(BaseSubset):
            caption_dropout_rate,
            caption_dropout_every_n_epochs,
            caption_tag_dropout_rate,
+            token_warmup_min,
+            token_warmup_step,
        )

        self.is_reg = is_reg
@@ -352,6 +363,8 @@ class FineTuningSubset(BaseSubset):
        caption_dropout_rate,
        caption_dropout_every_n_epochs,
        caption_tag_dropout_rate,
+        token_warmup_min,
+        token_warmup_step,
    ) -> None:
        assert metadata_file is not None, "metadata_file must be specified / metadata_fileは指定が必須です"

@@ -367,6 +380,8 @@ class FineTuningSubset(BaseSubset):
            caption_dropout_rate,
            caption_dropout_every_n_epochs,
            caption_tag_dropout_rate,
+            token_warmup_min,
+            token_warmup_step,
        )

        self.metadata_file = metadata_file
@@ -392,6 +407,8 @@ class BaseDataset(torch.utils.data.Dataset):

        self.token_padding_disabled = False
        self.tag_frequency = {}
+        self.XTI_layers = None
+        self.token_strings = None

        self.enable_bucket = False
        self.bucket_manager: BucketManager = None  # not initialized
@@ -405,6 +422,10 @@ class BaseDataset(torch.utils.data.Dataset):

        self.current_epoch: int = 0  # インスタンスがepochごとに新しく作られるようなので外側から渡さないとダメ

+        self.current_step: int = 0
+        self.max_train_steps: int = 0
+        self.seed: int = 0
+
        # augmentation
        self.aug_helper = AugHelper()

@@ -420,9 +441,19 @@ class BaseDataset(torch.utils.data.Dataset):

        self.replacements = {}

+    def set_seed(self, seed):
+        self.seed = seed
+
    def set_current_epoch(self, epoch):
+        if not self.current_epoch == epoch:  # epochが切り替わったらバケツをシャッフルする
+            self.shuffle_buckets()
        self.current_epoch = epoch
-        self.shuffle_buckets()
+
+    def set_current_step(self, step):
+        self.current_step = step
+
+    def set_max_train_steps(self, max_train_steps):
+        self.max_train_steps = max_train_steps

    def set_tag_frequency(self, dir_name, captions):
        frequency_for_dir = self.tag_frequency.get(dir_name, {})
@@ -438,6 +469,10 @@ class BaseDataset(torch.utils.data.Dataset):
    def disable_token_padding(self):
        self.token_padding_disabled = True

+    def enable_XTI(self, layers=None, token_strings=None):
+        self.XTI_layers = layers
+        self.token_strings = token_strings
+
    def add_replacement(self, str_from, str_to):
        self.replacements[str_from] = str_to

@@ -453,7 +488,16 @@ class BaseDataset(torch.utils.data.Dataset):
        if is_drop_out:
            caption = ""
        else:
-            if subset.shuffle_caption or subset.caption_tag_dropout_rate > 0:
+            if subset.shuffle_caption or subset.token_warmup_step > 0 or subset.caption_tag_dropout_rate > 0:
+                tokens = [t.strip() for t in caption.strip().split(",")]
+                if subset.token_warmup_step < 1:  # 初回に上書きする
+                    subset.token_warmup_step = math.floor(subset.token_warmup_step * self.max_train_steps)
+                if subset.token_warmup_step and self.current_step < subset.token_warmup_step:
+                    tokens_len = (
+                        math.floor((self.current_step) * ((len(tokens) - subset.token_warmup_min) / (subset.token_warmup_step)))
+                        + subset.token_warmup_min
+                    )
+                    tokens = tokens[:tokens_len]

                def dropout_tags(tokens):
                    if subset.caption_tag_dropout_rate <= 0:
@@ -465,10 +509,10 @@ class BaseDataset(torch.utils.data.Dataset):
                    return l

                fixed_tokens = []
-                flex_tokens = [t.strip() for t in caption.strip().split(",")]
+                flex_tokens = tokens[:]
                if subset.keep_tokens > 0:
                    fixed_tokens = flex_tokens[: subset.keep_tokens]
-                    flex_tokens = flex_tokens[subset.keep_tokens :]
+                    flex_tokens = tokens[subset.keep_tokens :]

                if subset.shuffle_caption:
                    random.shuffle(flex_tokens)
@@ -638,6 +682,9 @@ class BaseDataset(torch.utils.data.Dataset):
        self._length = len(self.buckets_indices)

    def shuffle_buckets(self):
+        # set random seed for this epoch
+        random.seed(self.seed + self.current_epoch)
+
        random.shuffle(self.buckets_indices)
        self.bucket_manager.shuffle()

@@ -675,10 +722,19 @@ class BaseDataset(torch.utils.data.Dataset):
    def is_latent_cacheable(self):
        return all([not subset.color_aug and not subset.random_crop for subset in self.subsets])

-    def cache_latents(self, vae):
-        # TODO ここを高速化したい
+    def cache_latents(self, vae, vae_batch_size=1):
+        # ちょっと速くした
        print("caching latents.")
-        for info in tqdm(self.image_data.values()):
+
+        image_infos = list(self.image_data.values())
+
+        # sort by resolution
+        image_infos.sort(key=lambda info: info.bucket_reso[0] * info.bucket_reso[1])
+
+        # split by resolution
+        batches = []
+        batch = []
+        for info in image_infos:
            subset = self.image_to_subset[info.image_key]

            if info.latents_npz is not None:
@@ -689,18 +745,42 @@ class BaseDataset(torch.utils.data.Dataset):
                    info.latents_flipped = torch.FloatTensor(info.latents_flipped)
                continue

-            image = self.load_image(info.absolute_path)
-            image = self.trim_and_resize_if_required(subset, image, info.bucket_reso, info.resized_size)
+            # if last member of batch has different resolution, flush the batch
+            if len(batch) > 0 and batch[-1].bucket_reso != info.bucket_reso:
+                batches.append(batch)
+                batch = []

-            img_tensor = self.image_transforms(image)
-            img_tensor = img_tensor.unsqueeze(0).to(device=vae.device, dtype=vae.dtype)
-            info.latents = vae.encode(img_tensor).latent_dist.sample().squeeze(0).to("cpu")
+            batch.append(info)
+
+            # if number of data in batch is enough, flush the batch
+            if len(batch) >= vae_batch_size:
+                batches.append(batch)
+                batch = []
+
+        if len(batch) > 0:
+            batches.append(batch)
+
+        # iterate batches
+        for batch in tqdm(batches, smoothing=1, total=len(batches)):
+            images = []
+            for info in batch:
+                image = self.load_image(info.absolute_path)
+                image = self.trim_and_resize_if_required(subset, image, info.bucket_reso, info.resized_size)
+                image = self.image_transforms(image)
+                images.append(image)
+
+            img_tensors = torch.stack(images, dim=0)
+            img_tensors = img_tensors.to(device=vae.device, dtype=vae.dtype)
+
+            latents = vae.encode(img_tensors).latent_dist.sample().to("cpu")
+            for info, latent in zip(batch, latents):
+                info.latents = latent

            if subset.flip_aug:
-                image = image[:, ::-1].copy()  # cannot convert to Tensor without copy
-                img_tensor = self.image_transforms(image)
-                img_tensor = img_tensor.unsqueeze(0).to(device=vae.device, dtype=vae.dtype)
-                info.latents_flipped = vae.encode(img_tensor).latent_dist.sample().squeeze(0).to("cpu")
+                img_tensors = torch.flip(img_tensors, dims=[3])
+                latents = vae.encode(img_tensors).latent_dist.sample().to("cpu")
+                for info, latent in zip(batch, latents):
+                    info.latents_flipped = latent

    def get_image_size(self, image_path):
        image = Image.open(image_path)
@@ -838,9 +918,22 @@ class BaseDataset(torch.utils.data.Dataset):
            latents_list.append(latents)

            caption = self.process_caption(subset, image_info.caption)
-            captions.append(caption)
+            if self.XTI_layers:
+                caption_layer = []
+                for layer in self.XTI_layers:
+                    token_strings_from = " ".join(self.token_strings)
+                    token_strings_to = " ".join([f"{x}_{layer}" for x in self.token_strings])
+                    caption_ = caption.replace(token_strings_from, token_strings_to)
+                    caption_layer.append(caption_)
+                captions.append(caption_layer)
+            else:
+                captions.append(caption)
            if not self.token_padding_disabled:  # this option might be omitted in future
-                input_ids_list.append(self.get_input_ids(caption))
+                if self.XTI_layers:
+                    token_caption = self.get_input_ids(caption_layer)
+                else:
+                    token_caption = self.get_input_ids(caption)
+                input_ids_list.append(token_caption)

        example = {}
        example["loss_weights"] = torch.FloatTensor(loss_weights)
@@ -860,10 +953,10 @@ class BaseDataset(torch.utils.data.Dataset):
        example["images"] = images

        example["latents"] = torch.stack(latents_list) if latents_list[0] is not None else None
+        example["captions"] = captions

        if self.debug_dataset:
            example["image_keys"] = bucket[image_index : image_index + self.batch_size]
-            example["captions"] = captions
        return example


@@ -1011,7 +1104,7 @@ class DreamBoothDataset(BaseDataset):
                        self.register_image(info, subset)
                        n += info.num_repeats
                    else:
-                        info.num_repeats += 1
+                        info.num_repeats += 1  # rewrite registered info
                        n += 1
                    if n >= num_train_images:
                        break
@@ -1072,6 +1165,8 @@ class FineTuningDataset(BaseDataset):
                # path情報を作る
                if os.path.exists(image_key):
                    abs_path = image_key
+                elif os.path.exists(os.path.splitext(image_key)[0] + ".npz"):
+                    abs_path = os.path.splitext(image_key)[0] + ".npz"
                else:
                    npz_path = os.path.join(subset.image_dir, image_key + ".npz")
                    if os.path.exists(npz_path):
@@ -1197,6 +1292,10 @@ class FineTuningDataset(BaseDataset):
                npz_file_flip = None
            return npz_file_norm, npz_file_flip

+        # if not full path, check image_dir. if image_dir is None, return None
+        if subset.image_dir is None:
+            return None, None
+
        # image_key is relative path
        npz_file_norm = os.path.join(subset.image_dir, image_key + ".npz")
        npz_file_flip = os.path.join(subset.image_dir, image_key + "_flip.npz")
@@ -1237,10 +1336,14 @@ class DatasetGroup(torch.utils.data.ConcatDataset):
    #   for dataset in self.datasets:
    #     dataset.make_buckets()

-    def cache_latents(self, vae):
+    def enable_XTI(self, *args, **kwargs):
+        for dataset in self.datasets:
+            dataset.enable_XTI(*args, **kwargs)
+
+    def cache_latents(self, vae, vae_batch_size=1):
        for i, dataset in enumerate(self.datasets):
            print(f"[Dataset {i}]")
-            dataset.cache_latents(vae)
+            dataset.cache_latents(vae, vae_batch_size)

    def is_latent_cacheable(self) -> bool:
        return all([dataset.is_latent_cacheable() for dataset in self.datasets])
@@ -1249,6 +1352,14 @@ class DatasetGroup(torch.utils.data.ConcatDataset):
        for dataset in self.datasets:
            dataset.set_current_epoch(epoch)

+    def set_current_step(self, step):
+        for dataset in self.datasets:
+            dataset.set_current_step(step)
+
+    def set_max_train_steps(self, max_train_steps):
+        for dataset in self.datasets:
+            dataset.set_max_train_steps(max_train_steps)
+
    def disable_token_padding(self):
        for dataset in self.datasets:
            dataset.disable_token_padding()
@@ -1256,37 +1367,55 @@ class DatasetGroup(torch.utils.data.ConcatDataset):

 def debug_dataset(train_dataset, show_input_ids=False):
    print(f"Total dataset length (steps) / データセットの長さ（ステップ数）: {len(train_dataset)}")
-    print("Escape for exit. / Escキーで中断、終了します")
+    print("`S` for next step, `E` for next epoch no. , Escape for exit. / Sキーで次のステップ、Eキーで次のエポック、Escキーで中断、終了します")

-    train_dataset.set_current_epoch(1)
-    k = 0
-    indices = list(range(len(train_dataset)))
-    random.shuffle(indices)
-    for i, idx in enumerate(indices):
-        example = train_dataset[idx]
-        if example["latents"] is not None:
-            print(f"sample has latents from npz file: {example['latents'].size()}")
-        for j, (ik, cap, lw, iid) in enumerate(
-            zip(example["image_keys"], example["captions"], example["loss_weights"], example["input_ids"])
-        ):
-            print(f'{ik}, size: {train_dataset.image_data[ik].image_size}, loss weight: {lw}, caption: "{cap}"')
-            if show_input_ids:
-                print(f"input ids: {iid}")
-            if example["images"] is not None:
-                im = example["images"][j]
-                print(f"image size: {im.size()}")
-                im = ((im.numpy() + 1.0) * 127.5).astype(np.uint8)
-                im = np.transpose(im, (1, 2, 0))  # c,H,W -> H,W,c
-                im = im[:, :, ::-1]  # RGB -> BGR (OpenCV)
-                if os.name == "nt":  # only windows
-                    cv2.imshow("img", im)
-                k = cv2.waitKey()
-                cv2.destroyAllWindows()
-                if k == 27:
-                    break
-        if k == 27 or (example["images"] is None and i >= 8):
+    epoch = 1
+    while True:
+        print(f"epoch: {epoch}")
+
+        steps = (epoch - 1) * len(train_dataset) + 1
+        indices = list(range(len(train_dataset)))
+        random.shuffle(indices)
+
+        k = 0
+        for i, idx in enumerate(indices):
+            train_dataset.set_current_epoch(epoch)
+            train_dataset.set_current_step(steps)
+            print(f"steps: {steps} ({i + 1}/{len(train_dataset)})")
+
+            example = train_dataset[idx]
+            if example["latents"] is not None:
+                print(f"sample has latents from npz file: {example['latents'].size()}")
+            for j, (ik, cap, lw, iid) in enumerate(
+                zip(example["image_keys"], example["captions"], example["loss_weights"], example["input_ids"])
+            ):
+                print(f'{ik}, size: {train_dataset.image_data[ik].image_size}, loss weight: {lw}, caption: "{cap}"')
+                if show_input_ids:
+                    print(f"input ids: {iid}")
+                if example["images"] is not None:
+                    im = example["images"][j]
+                    print(f"image size: {im.size()}")
+                    im = ((im.numpy() + 1.0) * 127.5).astype(np.uint8)
+                    im = np.transpose(im, (1, 2, 0))  # c,H,W -> H,W,c
+                    im = im[:, :, ::-1]  # RGB -> BGR (OpenCV)
+                    if os.name == "nt":  # only windows
+                        cv2.imshow("img", im)
+                    k = cv2.waitKey()
+                    cv2.destroyAllWindows()
+                    if k == 27 or k == ord("s") or k == ord("e"):
+                        break
+            steps += 1
+
+            if k == ord("e"):
+                break
+            if k == 27 or (example["images"] is None and i >= 8):
+                k = 27
+                break
+        if k == 27:
            break

+        epoch += 1
+

 def glob_images(directory, base="*"):
    img_paths = []
@@ -1295,8 +1424,8 @@ def glob_images(directory, base="*"):
            img_paths.extend(glob.glob(os.path.join(glob.escape(directory), base + ext)))
        else:
            img_paths.extend(glob.glob(glob.escape(os.path.join(directory, base + ext))))
-    # img_paths = list(set(img_paths))                    # 重複を排除
-    # img_paths.sort()
+    img_paths = list(set(img_paths))  # 重複を排除
+    img_paths.sort()
    return img_paths


@@ -1308,14 +1437,13 @@ def glob_images_pathlib(dir_path, recursive):
    else:
        for ext in IMAGE_EXTENSIONS:
            image_paths += list(dir_path.glob("*" + ext))
-    # image_paths = list(set(image_paths))        # 重複を排除
-    # image_paths.sort()
+    image_paths = list(set(image_paths))  # 重複を排除
+    image_paths.sort()
    return image_paths


 # endregion

-
 # region モジュール入れ替え部
 """
 高速化のためのモジュール入れ替え
@@ -1770,6 +1898,38 @@ def add_optimizer_arguments(parser: argparse.ArgumentParser):
 def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth: bool):
    parser.add_argument("--output_dir", type=str, default=None, help="directory to output trained model / 学習後のモデル出力先ディレクトリ")
    parser.add_argument("--output_name", type=str, default=None, help="base name of trained model file / 学習後のモデルの拡張子を除くファイル名")
+    parser.add_argument(
+        "--huggingface_repo_id", type=str, default=None, help="huggingface repo name to upload / huggingfaceにアップロードするリポジトリ名"
+    )
+    parser.add_argument(
+        "--huggingface_repo_type", type=str, default=None, help="huggingface repo type to upload / huggingfaceにアップロードするリポジトリの種類"
+    )
+    parser.add_argument(
+        "--huggingface_path_in_repo",
+        type=str,
+        default=None,
+        help="huggingface model path to upload files / huggingfaceにアップロードするファイルのパス",
+    )
+    parser.add_argument("--huggingface_token", type=str, default=None, help="huggingface token / huggingfaceのトークン")
+    parser.add_argument(
+        "--huggingface_repo_visibility",
+        type=str,
+        default=None,
+        help="huggingface repository visibility ('public' for public, 'private' or None for private) / huggingfaceにアップロードするリポジトリの公開設定（'public'で公開、'private'またはNoneで非公開）",
+    )
+    parser.add_argument(
+        "--save_state_to_huggingface", action="store_true", help="save state to huggingface / huggingfaceにstateを保存する"
+    )
+    parser.add_argument(
+        "--resume_from_huggingface",
+        action="store_true",
+        help="resume from huggingface (ex: --resume {repo_id}/{path_in_repo}:{revision}:{repo_type}) / huggingfaceから学習を再開する(例: --resume {repo_id}/{path_in_repo}:{revision}:{repo_type})",
+    )
+    parser.add_argument(
+        "--async_upload",
+        action="store_true",
+        help="upload to huggingface asynchronously / huggingfaceに非同期でアップロードする",
+    )
    parser.add_argument(
        "--save_precision",
        type=str,
@@ -1986,6 +2146,7 @@ def add_dataset_arguments(
        action="store_true",
        help="cache latents to reduce memory (augmentations must be disabled) / メモリ削減のためにlatentをcacheする（augmentationは使用不可）",
    )
+    parser.add_argument("--vae_batch_size", type=int, default=1, help="batch size for caching latents / latentのcache時のバッチサイズ")
    parser.add_argument(
        "--enable_bucket", action="store_true", help="enable buckets for multi aspect ratio training / 複数解像度学習のためのbucketを有効にする"
    )
@@ -2001,6 +2162,20 @@ def add_dataset_arguments(
        "--bucket_no_upscale", action="store_true", help="make bucket for each image without upscaling / 画像を拡大せずbucketを作成します"
    )

+    parser.add_argument(
+        "--token_warmup_min",
+        type=int,
+        default=1,
+        help="start learning at N tags (token means comma separated strinfloatgs) / タグ数をN個から増やしながら学習する",
+    )
+
+    parser.add_argument(
+        "--token_warmup_step",
+        type=float,
+        default=0,
+        help="tag length reaches maximum on N steps (or N*max_train_steps if N<1) / N（N<1ならN*max_train_steps）ステップでタグ長が最大になる。デフォルトは0（最初から最大）",
+    )
+
    if support_caption_dropout:
        # Textual Inversion はcaptionのdropoutをsupportしない
        # いわゆるtensorのDropoutと紛らわしいのでprefixにcaptionを付けておく　every_n_epochsは他と平仄を合わせてdefault Noneに
@@ -2120,6 +2295,57 @@ def read_config_from_file(args: argparse.Namespace, parser: argparse.ArgumentPar
 # region utils


+def resume_from_local_or_hf_if_specified(accelerator, args):
+    if not args.resume:
+        return
+
+    if not args.resume_from_huggingface:
+        print(f"resume training from local state: {args.resume}")
+        accelerator.load_state(args.resume)
+        return
+
+    print(f"resume training from huggingface state: {args.resume}")
+    repo_id = args.resume.split("/")[0] + "/" + args.resume.split("/")[1]
+    path_in_repo = "/".join(args.resume.split("/")[2:])
+    revision = None
+    repo_type = None
+    if ":" in path_in_repo:
+        divided = path_in_repo.split(":")
+        if len(divided) == 2:
+            path_in_repo, revision = divided
+            repo_type = "model"
+        else:
+            path_in_repo, revision, repo_type = divided
+    print(f"Downloading state from huggingface: {repo_id}/{path_in_repo}@{revision}")
+
+    list_files = huggingface_util.list_dir(
+        repo_id=repo_id,
+        subfolder=path_in_repo,
+        revision=revision,
+        token=args.huggingface_token,
+        repo_type=repo_type,
+    )
+
+    async def download(filename) -> str:
+        def task():
+            return hf_hub_download(
+                repo_id=repo_id,
+                filename=filename,
+                revision=revision,
+                repo_type=repo_type,
+                token=args.huggingface_token,
+            )
+
+        return await asyncio.get_event_loop().run_in_executor(None, task)
+
+    loop = asyncio.get_event_loop()
+    results = loop.run_until_complete(asyncio.gather(*[download(filename=filename.rfilename) for filename in list_files]))
+    if len(results) == 0:
+        raise ValueError("No files found in the specified repo id/path/revision / 指定されたリポジトリID/パス/リビジョンにファイルが見つかりませんでした")
+    dirname = os.path.dirname(results[0])
+    accelerator.load_state(dirname)
+
+
 def get_optimizer(args, trainable_params):
    # "Optimizer to use: AdamW, AdamW8bit, Lion, SGDNesterov, SGDNesterov8bit, DAdaptation, Adafactor"

@@ -2319,7 +2545,7 @@ def get_scheduler_fix(args, optimizer: Optimizer, num_processes: int):
    Unified API to get any scheduler from its name.
    """
    name = args.lr_scheduler
-    num_warmup_steps = args.lr_warmup_steps
+    num_warmup_steps: Optional[int] = args.lr_warmup_steps
    num_training_steps = args.max_train_steps * num_processes * args.gradient_accumulation_steps
    num_cycles = args.lr_scheduler_num_cycles
    power = args.lr_scheduler_power
@@ -2343,6 +2569,11 @@ def get_scheduler_fix(args, optimizer: Optimizer, num_processes: int):

            lr_scheduler_kwargs[key] = value

+    def wrap_check_needless_num_warmup_steps(return_vals):
+        if num_warmup_steps is not None and num_warmup_steps != 0:
+            raise ValueError(f"{name} does not require `num_warmup_steps`. Set None or 0.")
+        return return_vals
+
    # using any lr_scheduler from other library
    if args.lr_scheduler_type:
        lr_scheduler_type = args.lr_scheduler_type
@@ -2355,7 +2586,7 @@ def get_scheduler_fix(args, optimizer: Optimizer, num_processes: int):
            lr_scheduler_type = values[-1]
        lr_scheduler_class = getattr(lr_scheduler_module, lr_scheduler_type)
        lr_scheduler = lr_scheduler_class(optimizer, **lr_scheduler_kwargs)
-        return lr_scheduler
+        return wrap_check_needless_num_warmup_steps(lr_scheduler)

    if name.startswith("adafactor"):
        assert (
@@ -2363,12 +2594,12 @@ def get_scheduler_fix(args, optimizer: Optimizer, num_processes: int):
        ), f"adafactor scheduler must be used with Adafactor optimizer / adafactor schedulerはAdafactorオプティマイザと同時に使ってください"
        initial_lr = float(name.split(":")[1])
        # print("adafactor scheduler init lr", initial_lr)
-        return transformers.optimization.AdafactorSchedule(optimizer, initial_lr)
+        return wrap_check_needless_num_warmup_steps(transformers.optimization.AdafactorSchedule(optimizer, initial_lr))

    name = SchedulerType(name)
    schedule_func = TYPE_TO_SCHEDULER_FUNCTION[name]
    if name == SchedulerType.CONSTANT:
-        return schedule_func(optimizer)
+        return wrap_check_needless_num_warmup_steps(schedule_func(optimizer))

    # All other schedulers require `num_warmup_steps`
    if num_warmup_steps is None:
@@ -2499,14 +2730,15 @@ def prepare_dtype(args: argparse.Namespace):
    return weight_dtype, save_dtype


-def load_target_model(args: argparse.Namespace, weight_dtype):
+def load_target_model(args: argparse.Namespace, weight_dtype, device="cpu"):
    name_or_path = args.pretrained_model_name_or_path
    name_or_path = os.readlink(name_or_path) if os.path.islink(name_or_path) else name_or_path
    load_stable_diffusion_format = os.path.isfile(name_or_path)  # determine SD or Diffusers
    if load_stable_diffusion_format:
        print("load StableDiffusion checkpoint")
-        text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, name_or_path)
+        text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, name_or_path, device)
    else:
+        # Diffusers model is loaded to CPU
        print("load Diffusers pretrained models")
        try:
            pipe = StableDiffusionPipeline.from_pretrained(name_or_path, tokenizer=None, safety_checker=None)
@@ -2625,6 +2857,8 @@ def save_sd_model_on_epoch_end(
            model_util.save_stable_diffusion_checkpoint(
                args.v2, ckpt_file, text_encoder, unet, src_path, epoch_no, global_step, save_dtype, vae
            )
+            if args.huggingface_repo_id is not None:
+                huggingface_util.upload(args, ckpt_file, "/" + ckpt_name)

        def remove_sd(old_epoch_no):
            _, old_ckpt_name = get_epoch_ckpt_name(args, use_safetensors, old_epoch_no)
@@ -2644,6 +2878,8 @@ def save_sd_model_on_epoch_end(
            model_util.save_diffusers_checkpoint(
                args.v2, out_dir, text_encoder, unet, src_path, vae=vae, use_safetensors=use_safetensors
            )
+            if args.huggingface_repo_id is not None:
+                huggingface_util.upload(args, out_dir, "/" + model_name)

        def remove_du(old_epoch_no):
            out_dir_old = os.path.join(args.output_dir, EPOCH_DIFFUSERS_DIR_NAME.format(model_name, old_epoch_no))
@@ -2661,7 +2897,11 @@ def save_sd_model_on_epoch_end(

 def save_state_on_epoch_end(args: argparse.Namespace, accelerator, model_name, epoch_no):
    print("saving state.")
-    accelerator.save_state(os.path.join(args.output_dir, EPOCH_STATE_NAME.format(model_name, epoch_no)))
+    state_dir = os.path.join(args.output_dir, EPOCH_STATE_NAME.format(model_name, epoch_no))
+    accelerator.save_state(state_dir)
+    if args.save_state_to_huggingface:
+        print("uploading state to huggingface.")
+        huggingface_util.upload(args, state_dir, "/" + EPOCH_STATE_NAME.format(model_name, epoch_no))

    last_n_epochs = args.save_last_n_epochs_state if args.save_last_n_epochs_state else args.save_last_n_epochs
    if last_n_epochs is not None:
@@ -2672,6 +2912,17 @@ def save_state_on_epoch_end(args: argparse.Namespace, accelerator, model_name, e
            shutil.rmtree(state_dir_old)


+def save_state_on_train_end(args: argparse.Namespace, accelerator):
+    print("saving last state.")
+    os.makedirs(args.output_dir, exist_ok=True)
+    model_name = DEFAULT_LAST_OUTPUT_NAME if args.output_name is None else args.output_name
+    state_dir = os.path.join(args.output_dir, LAST_STATE_NAME.format(model_name))
+    accelerator.save_state(state_dir)
+    if args.save_state_to_huggingface:
+        print("uploading last state to huggingface.")
+        huggingface_util.upload(args, state_dir, "/" + LAST_STATE_NAME.format(model_name))
+
+
 def save_sd_model_on_train_end(
    args: argparse.Namespace,
    src_path: str,
@@ -2696,6 +2947,8 @@ def save_sd_model_on_train_end(
        model_util.save_stable_diffusion_checkpoint(
            args.v2, ckpt_file, text_encoder, unet, src_path, epoch, global_step, save_dtype, vae
        )
+        if args.huggingface_repo_id is not None:
+            huggingface_util.upload(args, ckpt_file, "/" + ckpt_name, force_sync_upload=True)
    else:
        out_dir = os.path.join(args.output_dir, model_name)
        os.makedirs(out_dir, exist_ok=True)
@@ -2704,13 +2957,8 @@ def save_sd_model_on_train_end(
        model_util.save_diffusers_checkpoint(
            args.v2, out_dir, text_encoder, unet, src_path, vae=vae, use_safetensors=use_safetensors
        )
-
-
-def save_state_on_train_end(args: argparse.Namespace, accelerator):
-    print("saving last state.")
-    os.makedirs(args.output_dir, exist_ok=True)
-    model_name = DEFAULT_LAST_OUTPUT_NAME if args.output_name is None else args.output_name
-    accelerator.save_state(os.path.join(args.output_dir, LAST_STATE_NAME.format(model_name)))
+        if args.huggingface_repo_id is not None:
+            huggingface_util.upload(args, out_dir, "/" + model_name, force_sync_upload=True)


 # scheduler:
@@ -2935,3 +3183,24 @@ class ImageLoadingDataset(torch.utils.data.Dataset):


 # endregion
+
+
+# collate_fn用 epoch,stepはmultiprocessing.Value
+class collater_class:
+    def __init__(self, epoch, step, dataset):
+        self.current_epoch = epoch
+        self.current_step = step
+        self.dataset = dataset  # not used if worker_info is not None, in case of multiprocessing
+
+    def __call__(self, examples):
+        worker_info = torch.utils.data.get_worker_info()
+        # worker_info is None in the main process
+        if worker_info is not None:
+            dataset = worker_info.dataset
+        else:
+            dataset = self.dataset
+
+        # set epoch and step
+        dataset.set_current_epoch(self.current_epoch.value)
+        dataset.set_current_step(self.current_step.value)
+        return examples[0]
--- a/library/utils.py
+++ b/library/utils.py
@@ -0,0 +1,6 @@
+import threading
+from typing import *
+
+
+def fire_in_thread(f, *args, **kwargs):
+    threading.Thread(target=f, args=args, kwargs=kwargs).start()
--- a/networks/check_lora_weights.py
+++ b/networks/check_lora_weights.py
@@ -24,9 +24,16 @@ def main(file):
    print(f"{key},{str(tuple(value.size())).replace(', ', '-')},{torch.mean(torch.abs(value))},{torch.min(torch.abs(value))}")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("file", type=str, help="model file to check / 重みを確認するモデルファイル")
+
+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()

  main(args.file)
--- a/networks/extract_lora_from_models.py
+++ b/networks/extract_lora_from_models.py
@@ -145,8 +145,8 @@ def svd(args):
    lora_sd[lora_name + '.alpha'] = torch.tensor(down_weight.size()[0])

  # load state dict to LoRA and save it
-  lora_network_save = lora.create_network_from_weights(1.0, None, None, text_encoder_o, unet_o, weights_sd=lora_sd)
-  lora_network_save.apply_to(text_encoder_o, unet_o)        # create internal module references for state_dict
+  lora_network_save, lora_sd = lora.create_network_from_weights(1.0, None, None, text_encoder_o, unet_o, weights_sd=lora_sd)
+  lora_network_save.apply_to(text_encoder_o, unet_o)  # create internal module references for state_dict  

  info = lora_network_save.load_state_dict(lora_sd)
  print(f"Loading extracted LoRA weights: {info}")
@@ -162,7 +162,7 @@ def svd(args):
  print(f"LoRA weights are saved to: {args.save_to}")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("--v2", action='store_true',
                      help='load Stable Diffusion v2.x model / Stable Diffusion 2.xのモデルを読み込む')
@@ -179,5 +179,11 @@ if __name__ == '__main__':
                      help="dimension (rank) of LoRA for Conv2d-3x3 (default None, disabled) / LoRAのConv2d-3x3の次元数（rank）（デフォルトNone、適用なし）")
  parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイス、cuda でGPUを使う")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()
  svd(args)
--- a/networks/lora.py
+++ b/networks/lora.py
--- a/networks/lora_interrogator.py
+++ b/networks/lora_interrogator.py
@@ -105,7 +105,7 @@ def interrogate(args):
    print(f"[{i:3d}]: {token:5d} {string:<20s}: {diff:.5f}")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("--v2", action='store_true',
                      help='load Stable Diffusion v2.x model / Stable Diffusion 2.xのモデルを読み込む')
@@ -118,5 +118,11 @@ if __name__ == '__main__':
  parser.add_argument("--clip_skip", type=int, default=None,
                      help="use output of nth layer from back of text encoder (n>=1) / text encoderの後ろからn番目の層の出力を用いる（nは1以上）")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()
  interrogate(args)
--- a/networks/merge_lora.py
+++ b/networks/merge_lora.py
@@ -1,4 +1,3 @@
-
 import math
 import argparse
 import os
@@ -9,210 +8,236 @@ import lora


 def load_state_dict(file_name, dtype):
-  if os.path.splitext(file_name)[1] == '.safetensors':
-    sd = load_file(file_name)
-  else:
-    sd = torch.load(file_name, map_location='cpu')
-  for key in list(sd.keys()):
-    if type(sd[key]) == torch.Tensor:
-      sd[key] = sd[key].to(dtype)
-  return sd
+    if os.path.splitext(file_name)[1] == ".safetensors":
+        sd = load_file(file_name)
+    else:
+        sd = torch.load(file_name, map_location="cpu")
+    for key in list(sd.keys()):
+        if type(sd[key]) == torch.Tensor:
+            sd[key] = sd[key].to(dtype)
+    return sd


 def save_to_file(file_name, model, state_dict, dtype):
-  if dtype is not None:
-    for key in list(state_dict.keys()):
-      if type(state_dict[key]) == torch.Tensor:
-        state_dict[key] = state_dict[key].to(dtype)
+    if dtype is not None:
+        for key in list(state_dict.keys()):
+            if type(state_dict[key]) == torch.Tensor:
+                state_dict[key] = state_dict[key].to(dtype)

-  if os.path.splitext(file_name)[1] == '.safetensors':
-    save_file(model, file_name)
-  else:
-    torch.save(model, file_name)
+    if os.path.splitext(file_name)[1] == ".safetensors":
+        save_file(model, file_name)
+    else:
+        torch.save(model, file_name)


 def merge_to_sd_model(text_encoder, unet, models, ratios, merge_dtype):
-  text_encoder.to(merge_dtype)
-  unet.to(merge_dtype)
+    text_encoder.to(merge_dtype)
+    unet.to(merge_dtype)

-  # create module map
-  name_to_module = {}
-  for i, root_module in enumerate([text_encoder, unet]):
-    if i == 0:
-      prefix = lora.LoRANetwork.LORA_PREFIX_TEXT_ENCODER
-      target_replace_modules = lora.LoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE
-    else:
-      prefix = lora.LoRANetwork.LORA_PREFIX_UNET
-      target_replace_modules = lora.LoRANetwork.UNET_TARGET_REPLACE_MODULE
-
-    for name, module in root_module.named_modules():
-      if module.__class__.__name__ in target_replace_modules:
-        for child_name, child_module in module.named_modules():
-          if child_module.__class__.__name__ == "Linear" or child_module.__class__.__name__ == "Conv2d":
-            lora_name = prefix + '.' + name + '.' + child_name
-            lora_name = lora_name.replace('.', '_')
-            name_to_module[lora_name] = child_module
-
-  for model, ratio in zip(models, ratios):
-    print(f"loading: {model}")
-    lora_sd = load_state_dict(model, merge_dtype)
-
-    print(f"merging...")
-    for key in lora_sd.keys():
-      if "lora_down" in key:
-        up_key = key.replace("lora_down", "lora_up")
-        alpha_key = key[:key.index("lora_down")] + 'alpha'
-
-        # find original module for this lora
-        module_name = '.'.join(key.split('.')[:-2])               # remove trailing ".lora_down.weight"
-        if module_name not in name_to_module:
-          print(f"no module found for LoRA weight: {key}")
-          continue
-        module = name_to_module[module_name]
-        # print(f"apply {key} to {module}")
-
-        down_weight = lora_sd[key]
-        up_weight = lora_sd[up_key]
-
-        dim = down_weight.size()[0]
-        alpha = lora_sd.get(alpha_key, dim)
-        scale = alpha / dim
-
-        # W <- W + U * D
-        weight = module.weight
-        # print(module_name, down_weight.size(), up_weight.size())
-        if len(weight.size()) == 2:
-          # linear
-          weight = weight + ratio * (up_weight @ down_weight) * scale
-        elif down_weight.size()[2:4] == (1, 1):
-          # conv2d 1x1
-          weight = weight + ratio * (up_weight.squeeze(3).squeeze(2) @ down_weight.squeeze(3).squeeze(2)
-                                     ).unsqueeze(2).unsqueeze(3) * scale
+    # create module map
+    name_to_module = {}
+    for i, root_module in enumerate([text_encoder, unet]):
+        if i == 0:
+            prefix = lora.LoRANetwork.LORA_PREFIX_TEXT_ENCODER
+            target_replace_modules = lora.LoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE
        else:
-          # conv2d 3x3
-          conved = torch.nn.functional.conv2d(down_weight.permute(1, 0, 2, 3), up_weight).permute(1, 0, 2, 3)
-          # print(conved.size(), weight.size(), module.stride, module.padding)
-          weight = weight + ratio * conved * scale
+            prefix = lora.LoRANetwork.LORA_PREFIX_UNET
+            target_replace_modules = (
+                lora.LoRANetwork.UNET_TARGET_REPLACE_MODULE + lora.LoRANetwork.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3
+            )

-        module.weight = torch.nn.Parameter(weight)
+        for name, module in root_module.named_modules():
+            if module.__class__.__name__ in target_replace_modules:
+                for child_name, child_module in module.named_modules():
+                    if child_module.__class__.__name__ == "Linear" or child_module.__class__.__name__ == "Conv2d":
+                        lora_name = prefix + "." + name + "." + child_name
+                        lora_name = lora_name.replace(".", "_")
+                        name_to_module[lora_name] = child_module
+
+    for model, ratio in zip(models, ratios):
+        print(f"loading: {model}")
+        lora_sd = load_state_dict(model, merge_dtype)
+
+        print(f"merging...")
+        for key in lora_sd.keys():
+            if "lora_down" in key:
+                up_key = key.replace("lora_down", "lora_up")
+                alpha_key = key[: key.index("lora_down")] + "alpha"
+
+                # find original module for this lora
+                module_name = ".".join(key.split(".")[:-2])  # remove trailing ".lora_down.weight"
+                if module_name not in name_to_module:
+                    print(f"no module found for LoRA weight: {key}")
+                    continue
+                module = name_to_module[module_name]
+                # print(f"apply {key} to {module}")
+
+                down_weight = lora_sd[key]
+                up_weight = lora_sd[up_key]
+
+                dim = down_weight.size()[0]
+                alpha = lora_sd.get(alpha_key, dim)
+                scale = alpha / dim
+
+                # W <- W + U * D
+                weight = module.weight
+                # print(module_name, down_weight.size(), up_weight.size())
+                if len(weight.size()) == 2:
+                    # linear
+                    weight = weight + ratio * (up_weight @ down_weight) * scale
+                elif down_weight.size()[2:4] == (1, 1):
+                    # conv2d 1x1
+                    weight = (
+                        weight
+                        + ratio
+                        * (up_weight.squeeze(3).squeeze(2) @ down_weight.squeeze(3).squeeze(2)).unsqueeze(2).unsqueeze(3)
+                        * scale
+                    )
+                else:
+                    # conv2d 3x3
+                    conved = torch.nn.functional.conv2d(down_weight.permute(1, 0, 2, 3), up_weight).permute(1, 0, 2, 3)
+                    # print(conved.size(), weight.size(), module.stride, module.padding)
+                    weight = weight + ratio * conved * scale
+
+                module.weight = torch.nn.Parameter(weight)


 def merge_lora_models(models, ratios, merge_dtype):
-  base_alphas = {}                          # alpha for merged model
-  base_dims = {}
+    base_alphas = {}  # alpha for merged model
+    base_dims = {}

-  merged_sd = {}
-  for model, ratio in zip(models, ratios):
-    print(f"loading: {model}")
-    lora_sd = load_state_dict(model, merge_dtype)
+    merged_sd = {}
+    for model, ratio in zip(models, ratios):
+        print(f"loading: {model}")
+        lora_sd = load_state_dict(model, merge_dtype)

-    # get alpha and dim
-    alphas = {}                             # alpha for current model
-    dims = {}                               # dims for current model
-    for key in lora_sd.keys():
-      if 'alpha' in key:
-        lora_module_name = key[:key.rfind(".alpha")]
-        alpha = float(lora_sd[key].detach().numpy())
-        alphas[lora_module_name] = alpha
-        if lora_module_name not in base_alphas:
-          base_alphas[lora_module_name] = alpha
-      elif "lora_down" in key:
-        lora_module_name = key[:key.rfind(".lora_down")]
-        dim = lora_sd[key].size()[0]
-        dims[lora_module_name] = dim
-        if lora_module_name not in base_dims:
-          base_dims[lora_module_name] = dim
+        # get alpha and dim
+        alphas = {}  # alpha for current model
+        dims = {}  # dims for current model
+        for key in lora_sd.keys():
+            if "alpha" in key:
+                lora_module_name = key[: key.rfind(".alpha")]
+                alpha = float(lora_sd[key].detach().numpy())
+                alphas[lora_module_name] = alpha
+                if lora_module_name not in base_alphas:
+                    base_alphas[lora_module_name] = alpha
+            elif "lora_down" in key:
+                lora_module_name = key[: key.rfind(".lora_down")]
+                dim = lora_sd[key].size()[0]
+                dims[lora_module_name] = dim
+                if lora_module_name not in base_dims:
+                    base_dims[lora_module_name] = dim

-    for lora_module_name in dims.keys():
-      if lora_module_name not in alphas:
-        alpha = dims[lora_module_name]
-        alphas[lora_module_name] = alpha
-        if lora_module_name not in base_alphas:
-          base_alphas[lora_module_name] = alpha
+        for lora_module_name in dims.keys():
+            if lora_module_name not in alphas:
+                alpha = dims[lora_module_name]
+                alphas[lora_module_name] = alpha
+                if lora_module_name not in base_alphas:
+                    base_alphas[lora_module_name] = alpha

-    print(f"dim: {list(set(dims.values()))}, alpha: {list(set(alphas.values()))}")
+        print(f"dim: {list(set(dims.values()))}, alpha: {list(set(alphas.values()))}")

-    # merge
-    print(f"merging...")
-    for key in lora_sd.keys():
-      if 'alpha' in key:
-        continue
+        # merge
+        print(f"merging...")
+        for key in lora_sd.keys():
+            if "alpha" in key:
+                continue

-      lora_module_name = key[:key.rfind(".lora_")]
+            lora_module_name = key[: key.rfind(".lora_")]

-      base_alpha = base_alphas[lora_module_name]
-      alpha = alphas[lora_module_name]
+            base_alpha = base_alphas[lora_module_name]
+            alpha = alphas[lora_module_name]

-      scale = math.sqrt(alpha / base_alpha) * ratio
+            scale = math.sqrt(alpha / base_alpha) * ratio

-      if key in merged_sd:
-        assert merged_sd[key].size() == lora_sd[key].size(
-        ), f"weights shape mismatch merging v1 and v2, different dims? / 重みのサイズが合いません。v1とv2、または次元数の異なるモデルはマージできません"
-        merged_sd[key] = merged_sd[key] + lora_sd[key] * scale
-      else:
-        merged_sd[key] = lora_sd[key] * scale
+            if key in merged_sd:
+                assert (
+                    merged_sd[key].size() == lora_sd[key].size()
+                ), f"weights shape mismatch merging v1 and v2, different dims? / 重みのサイズが合いません。v1とv2、または次元数の異なるモデルはマージできません"
+                merged_sd[key] = merged_sd[key] + lora_sd[key] * scale
+            else:
+                merged_sd[key] = lora_sd[key] * scale

-  # set alpha to sd
-  for lora_module_name, alpha in base_alphas.items():
-    key = lora_module_name + ".alpha"
-    merged_sd[key] = torch.tensor(alpha)
+    # set alpha to sd
+    for lora_module_name, alpha in base_alphas.items():
+        key = lora_module_name + ".alpha"
+        merged_sd[key] = torch.tensor(alpha)

-  print("merged model")
-  print(f"dim: {list(set(base_dims.values()))}, alpha: {list(set(base_alphas.values()))}")
+    print("merged model")
+    print(f"dim: {list(set(base_dims.values()))}, alpha: {list(set(base_alphas.values()))}")

-  return merged_sd
+    return merged_sd


 def merge(args):
-  assert len(args.models) == len(args.ratios), f"number of models must be equal to number of ratios / モデルの数と重みの数は合わせてください"
+    assert len(args.models) == len(args.ratios), f"number of models must be equal to number of ratios / モデルの数と重みの数は合わせてください"

-  def str_to_dtype(p):
-    if p == 'float':
-      return torch.float
-    if p == 'fp16':
-      return torch.float16
-    if p == 'bf16':
-      return torch.bfloat16
-    return None
+    def str_to_dtype(p):
+        if p == "float":
+            return torch.float
+        if p == "fp16":
+            return torch.float16
+        if p == "bf16":
+            return torch.bfloat16
+        return None

-  merge_dtype = str_to_dtype(args.precision)
-  save_dtype = str_to_dtype(args.save_precision)
-  if save_dtype is None:
-    save_dtype = merge_dtype
+    merge_dtype = str_to_dtype(args.precision)
+    save_dtype = str_to_dtype(args.save_precision)
+    if save_dtype is None:
+        save_dtype = merge_dtype

-  if args.sd_model is not None:
-    print(f"loading SD model: {args.sd_model}")
+    if args.sd_model is not None:
+        print(f"loading SD model: {args.sd_model}")

-    text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, args.sd_model)
+        text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, args.sd_model)

-    merge_to_sd_model(text_encoder, unet, args.models, args.ratios, merge_dtype)
+        merge_to_sd_model(text_encoder, unet, args.models, args.ratios, merge_dtype)

-    print(f"saving SD model to: {args.save_to}")
-    model_util.save_stable_diffusion_checkpoint(args.v2, args.save_to, text_encoder, unet,
-                                                args.sd_model, 0, 0, save_dtype, vae)
-  else:
-    state_dict = merge_lora_models(args.models, args.ratios, merge_dtype)
+        print(f"saving SD model to: {args.save_to}")
+        model_util.save_stable_diffusion_checkpoint(args.v2, args.save_to, text_encoder, unet, args.sd_model, 0, 0, save_dtype, vae)
+    else:
+        state_dict = merge_lora_models(args.models, args.ratios, merge_dtype)

-    print(f"saving model to: {args.save_to}")
-    save_to_file(args.save_to, state_dict, state_dict, save_dtype)
+        print(f"saving model to: {args.save_to}")
+        save_to_file(args.save_to, state_dict, state_dict, save_dtype)


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument("--v2", action='store_true',
-                      help='load Stable Diffusion v2.x model / Stable Diffusion 2.xのモデルを読み込む')
-  parser.add_argument("--save_precision", type=str, default=None,
-                      choices=[None, "float", "fp16", "bf16"], help="precision in saving, same to merging if omitted / 保存時に精度を変更して保存する、省略時はマージ時の精度と同じ")
-  parser.add_argument("--precision", type=str, default="float",
-                      choices=["float", "fp16", "bf16"], help="precision in merging (float is recommended) / マージの計算時の精度（floatを推奨）")
-  parser.add_argument("--sd_model", type=str, default=None,
-                      help="Stable Diffusion model to load: ckpt or safetensors file, merge LoRA models if omitted / 読み込むモデル、ckptまたはsafetensors。省略時はLoRAモデル同士をマージする")
-  parser.add_argument("--save_to", type=str, default=None,
-                      help="destination file name: ckpt or safetensors file / 保存先のファイル名、ckptまたはsafetensors")
-  parser.add_argument("--models", type=str, nargs='*',
-                      help="LoRA models to merge: ckpt or safetensors file / マージするLoRAモデル、ckptまたはsafetensors")
-  parser.add_argument("--ratios", type=float, nargs='*',
-                      help="ratios for each model / それぞれのLoRAモデルの比率")
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--v2", action="store_true", help="load Stable Diffusion v2.x model / Stable Diffusion 2.xのモデルを読み込む")
+    parser.add_argument(
+        "--save_precision",
+        type=str,
+        default=None,
+        choices=[None, "float", "fp16", "bf16"],
+        help="precision in saving, same to merging if omitted / 保存時に精度を変更して保存する、省略時はマージ時の精度と同じ",
+    )
+    parser.add_argument(
+        "--precision",
+        type=str,
+        default="float",
+        choices=["float", "fp16", "bf16"],
+        help="precision in merging (float is recommended) / マージの計算時の精度（floatを推奨）",
+    )
+    parser.add_argument(
+        "--sd_model",
+        type=str,
+        default=None,
+        help="Stable Diffusion model to load: ckpt or safetensors file, merge LoRA models if omitted / 読み込むモデル、ckptまたはsafetensors。省略時はLoRAモデル同士をマージする",
+    )
+    parser.add_argument(
+        "--save_to", type=str, default=None, help="destination file name: ckpt or safetensors file / 保存先のファイル名、ckptまたはsafetensors"
+    )
+    parser.add_argument(
+        "--models", type=str, nargs="*", help="LoRA models to merge: ckpt or safetensors file / マージするLoRAモデル、ckptまたはsafetensors"
+    )
+    parser.add_argument("--ratios", type=float, nargs="*", help="ratios for each model / それぞれのLoRAモデルの比率")

-  args = parser.parse_args()
-  merge(args)
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    merge(args)
--- a/networks/merge_lora_old.py
+++ b/networks/merge_lora_old.py
@@ -158,7 +158,7 @@ def merge(args):
    save_to_file(args.save_to, state_dict, state_dict, save_dtype)


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("--v2", action='store_true',
                      help='load Stable Diffusion v2.x model / Stable Diffusion 2.xのモデルを読み込む')
@@ -175,5 +175,11 @@ if __name__ == '__main__':
  parser.add_argument("--ratios", type=float, nargs='*',
                      help="ratios for each model / それぞれのLoRAモデルの比率")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()
  merge(args)
--- a/networks/resize_lora.py
+++ b/networks/resize_lora.py
@@ -11,6 +11,8 @@ import numpy as np

 MIN_SV = 1e-6

+# Model save and load functions
+
 def load_state_dict(file_name, dtype):
  if model_util.is_safetensors(file_name):
    sd = load_file(file_name)
@@ -39,12 +41,13 @@ def save_to_file(file_name, model, state_dict, dtype, metadata):
    torch.save(model, file_name)


+# Indexing functions
+
 def index_sv_cumulative(S, target):
  original_sum = float(torch.sum(S))
  cumulative_sums = torch.cumsum(S, dim=0)/original_sum
  index = int(torch.searchsorted(cumulative_sums, target)) + 1
-  if index >= len(S):
-    index = len(S) - 1
+  index = max(1, min(index, len(S)-1))

  return index

@@ -54,8 +57,16 @@ def index_sv_fro(S, target):
  s_fro_sq = float(torch.sum(S_squared))
  sum_S_squared = torch.cumsum(S_squared, dim=0)/s_fro_sq
  index = int(torch.searchsorted(sum_S_squared, target**2)) + 1
-  if index >= len(S):
-    index = len(S) - 1
+  index = max(1, min(index, len(S)-1))
+
+  return index
+
+
+def index_sv_ratio(S, target):
+  max_sv = S[0]
+  min_sv = max_sv/target
+  index = int(torch.sum(S > min_sv).item())
+  index = max(1, min(index, len(S)-1))

  return index

@@ -125,26 +136,24 @@ def merge_linear(lora_down, lora_up, device):
    return weight
  

+# Calculate new rank
+
 def rank_resize(S, rank, dynamic_method, dynamic_param, scale=1):
    param_dict = {}

    if dynamic_method=="sv_ratio":
        # Calculate new dim and alpha based off ratio
-        max_sv = S[0]
-        min_sv = max_sv/dynamic_param
-        new_rank = max(torch.sum(S > min_sv).item(),1)
+        new_rank = index_sv_ratio(S, dynamic_param) + 1
        new_alpha = float(scale*new_rank)

    elif dynamic_method=="sv_cumulative":
        # Calculate new dim and alpha based off cumulative sum
-        new_rank = index_sv_cumulative(S, dynamic_param)
-        new_rank = max(new_rank, 1)
+        new_rank = index_sv_cumulative(S, dynamic_param) + 1
        new_alpha = float(scale*new_rank)

    elif dynamic_method=="sv_fro":
        # Calculate new dim and alpha based off sqrt sum of squares
-        new_rank = index_sv_fro(S, dynamic_param)
-        new_rank = min(max(new_rank, 1), len(S)-1)
+        new_rank = index_sv_fro(S, dynamic_param) + 1
        new_alpha = float(scale*new_rank)
    else:
        new_rank = rank
@@ -172,7 +181,7 @@ def rank_resize(S, rank, dynamic_method, dynamic_param, scale=1):
    param_dict["new_alpha"] = new_alpha
    param_dict["sum_retained"] = (s_rank)/s_sum
    param_dict["fro_retained"] = fro_percent
-    param_dict["max_ratio"] = S[0]/S[new_rank]
+    param_dict["max_ratio"] = S[0]/S[new_rank - 1]

    return param_dict

@@ -208,18 +217,28 @@ def resize_lora_model(lora_sd, new_rank, save_dtype, device, dynamic_method, dyn

  with torch.no_grad():
    for key, value in tqdm(lora_sd.items()):
+      weight_name = None
      if 'lora_down' in key:
        block_down_name = key.split(".")[0]
+        weight_name = key.split(".")[-1]
        lora_down_weight = value
-      if 'lora_up' in key:
-        block_up_name = key.split(".")[0]
-        lora_up_weight = value
+      else:
+        continue
+
+      # find corresponding lora_up and alpha
+      block_up_name = block_down_name
+      lora_up_weight = lora_sd.get(block_up_name + '.lora_up.' + weight_name, None)
+      lora_alpha = lora_sd.get(block_down_name + '.alpha', None)

      weights_loaded = (lora_down_weight is not None and lora_up_weight is not None)

-      if (block_down_name == block_up_name) and weights_loaded:
+      if weights_loaded:

        conv2d = (len(lora_down_weight.size()) == 4)
+        if lora_alpha is None:
+          scale = 1.0
+        else:
+          scale = lora_alpha/lora_down_weight.size()[0]

        if conv2d:
          full_weight_matrix = merge_conv(lora_down_weight, lora_up_weight, device)
@@ -311,7 +330,7 @@ def resize(args):
  save_to_file(args.save_to, state_dict, state_dict, save_dtype, metadata)


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()

  parser.add_argument("--save_precision", type=str, default=None,
@@ -329,7 +348,12 @@ if __name__ == '__main__':
                      help="Specify dynamic resizing method, --new_rank is used as a hard limit for max rank")
  parser.add_argument("--dynamic_param", type=float, default=None,
                      help="Specify target for dynamic reduction")
-                                           
+       
+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()

  args = parser.parse_args()
  resize(args)
--- a/networks/svd_merge_lora.py
+++ b/networks/svd_merge_lora.py
@@ -164,7 +164,7 @@ def merge(args):
  save_to_file(args.save_to, state_dict, save_dtype)


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("--save_precision", type=str, default=None,
                      choices=[None, "float", "fp16", "bf16"], help="precision in saving, same to merging if omitted / 保存時に精度を変更して保存する、省略時はマージ時の精度と同じ")
@@ -182,5 +182,11 @@ if __name__ == '__main__':
                      help="Specify rank of output LoRA for Conv2d 3x3, None for same as new_rank / 出力するConv2D 3x3 LoRAのrank (dim)、Noneでnew_rankと同じ")
  parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイス、cuda でGPUを使う")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()
  merge(args)
--- a/requirements.txt
+++ b/requirements.txt
@@ -21,6 +21,6 @@ fairscale==0.4.13
 # for WD14 captioning
 # tensorflow<2.11
 tensorflow==2.10.1
-huggingface-hub==0.12.0
+huggingface-hub==0.13.3
 # for kohya_ss library
 .
--- a/tools/canny.py
+++ b/tools/canny.py
@@ -13,12 +13,18 @@ def canny(args):
  print("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("--input", type=str, default=None, help="input path")
  parser.add_argument("--output", type=str, default=None, help="output path")
  parser.add_argument("--thres1", type=int, default=32, help="thres1")
  parser.add_argument("--thres2", type=int, default=224, help="thres2")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()
  canny(args)
--- a/tools/convert_diffusers20_original_sd.py
+++ b/tools/convert_diffusers20_original_sd.py
@@ -61,7 +61,7 @@ def convert(args):
    print(f"model saved.")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("--v1", action='store_true',
                      help='load v1.x model (v1 or v2 is required to load checkpoint) / 1.xのモデルを読み込む')
@@ -84,6 +84,11 @@ if __name__ == '__main__':
                      help="model to load: checkpoint file or Diffusers model's directory / 読み込むモデル、checkpointかDiffusers形式モデルのディレクトリ")
  parser.add_argument("model_to_save", type=str, default=None,
                      help="model to save: checkpoint (with extension) or Diffusers model's directory (without extension) / 変換後のモデル、拡張子がある場合はcheckpoint、ない場合はDiffusesモデルとして保存")
+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()

  args = parser.parse_args()
  convert(args)
--- a/tools/detect_face_rotate.py
+++ b/tools/detect_face_rotate.py
@@ -214,7 +214,7 @@ def process(args):
        buf.tofile(f)


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  parser.add_argument("--src_dir", type=str, help="directory to load images / 画像を読み込むディレクトリ")
  parser.add_argument("--dst_dir", type=str, help="directory to save images / 画像を保存するディレクトリ")
@@ -234,6 +234,13 @@ if __name__ == '__main__':
  parser.add_argument("--multiple_faces", action="store_true",
                      help="output each faces / 複数の顔が見つかった場合、それぞれを切り出す")
  parser.add_argument("--debug", action="store_true", help="render rect for face / 処理後画像の顔位置に矩形を描画します")
+
+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args = parser.parse_args()

  process(args)
--- a/tools/resize_images_to_resolution.py
+++ b/tools/resize_images_to_resolution.py
@@ -98,7 +98,7 @@ def resize_images(src_img_folder, dst_img_folder, max_resolution="512x512", divi
          shutil.copy(os.path.join(src_img_folder, asoc_file), os.path.join(dst_img_folder, new_asoc_file))


-def main():
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser(
      description='Resize images in a folder to a specified max resolution(s) / 指定されたフォルダ内の画像を指定した最大画像サイズ（面積）以下にアスペクト比を維持したままリサイズします')
  parser.add_argument('src_img_folder', type=str, help='Source folder containing the images / 元画像のフォルダ')
@@ -113,6 +113,12 @@ def main():
  parser.add_argument('--copy_associated_files', action='store_true',
                      help='Copy files with same base name to images (captions etc) / 画像と同じファイル名（拡張子を除く）のファイルもコピーする')

+  return parser
+
+
+def main():
+  parser = setup_parser()
+
  args = parser.parse_args()
  resize_images(args.src_img_folder, args.dst_img_folder, args.max_resolution,
                args.divisible_by, args.interpolation, args.save_as_png, args.copy_associated_files)
--- a/train_README-ja.md
+++ b/train_README-ja.md
@@ -801,7 +801,7 @@ model_dirオプションでモデルの保存先フォルダを指定できま
 キャプションをメタデータに入れるには、作業フォルダ内で以下を実行してください（キャプションを学習に使わない場合は実行不要です）（実際は1行で記述します、以下同様）。`--full_path` オプションを指定してメタデータに画像ファイルの場所をフルパスで格納します。このオプションを省略すると相対パスで記録されますが、フォルダ指定が `.toml` ファイル内で別途必要になります。

 ```
-python merge_captions_to_metadata.py --full_apth <教師データフォルダ>
+python merge_captions_to_metadata.py --full_path <教師データフォルダ>
 　  --in_json <読み込むメタデータファイル名> <メタデータファイル名>
 ```

--- a/train_db.py
+++ b/train_db.py
@@ -8,6 +8,7 @@ import itertools
 import math
 import os
 import toml
+from multiprocessing import Value

 from tqdm import tqdm
 import torch
@@ -21,11 +22,8 @@ from library.config_util import (
    ConfigSanitizer,
    BlueprintGenerator,
 )
-
-
-def collate_fn(examples):
-    return examples[0]
-
+import library.custom_train_functions as custom_train_functions
+from library.custom_train_functions import apply_snr_weight, get_weighted_text_embeddings

 def train(args):
    train_util.verify_training_args(args)
@@ -59,6 +57,11 @@ def train(args):
    blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
    train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)

+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collater = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collater = train_util.collater_class(current_epoch, current_step, ds_for_collater)
+
    if args.no_token_padding:
        train_dataset_group.disable_token_padding()

@@ -114,7 +117,7 @@ def train(args):
        vae.requires_grad_(False)
        vae.eval()
        with torch.no_grad():
-            train_dataset_group.cache_latents(vae)
+            train_dataset_group.cache_latents(vae, args.vae_batch_size)
        vae.to("cpu")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
@@ -152,16 +155,21 @@ def train(args):
        train_dataset_group,
        batch_size=1,
        shuffle=True,
-        collate_fn=collate_fn,
+        collate_fn=collater,
        num_workers=n_workers,
        persistent_workers=args.persistent_data_loader_workers,
    )

    # 学習ステップ数を計算する
    if args.max_train_epochs is not None:
-        args.max_train_steps = args.max_train_epochs * len(train_dataloader)
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
+        )
        print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")

+    # データセット側にも学習ステップを送信
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
    if args.stop_text_encoder_training is None:
        args.stop_text_encoder_training = args.max_train_steps + 1  # do not stop until end

@@ -193,9 +201,7 @@ def train(args):
        train_util.patch_accelerator_for_fp16_training(accelerator)

    # resumeする
-    if args.resume is not None:
-        print(f"resume training from state: {args.resume}")
-        accelerator.load_state(args.resume)
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)

    # epoch数を計算する
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
@@ -229,7 +235,7 @@ def train(args):
    loss_total = 0.0
    for epoch in range(num_train_epochs):
        print(f"epoch {epoch+1}/{num_train_epochs}")
-        train_dataset_group.set_current_epoch(epoch + 1)
+        current_epoch.value = epoch + 1

        # 指定したステップ数までText Encoderを学習する：epoch最初の状態
        unet.train()
@@ -238,6 +244,7 @@ def train(args):
            text_encoder.train()

        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
            # 指定したステップ数でText Encoderの学習を止める
            if global_step == args.stop_text_encoder_training:
                print(f"stop text encoder training at step {global_step}")
@@ -263,10 +270,19 @@ def train(args):

                # Get the text embedding for conditioning
                with torch.set_grad_enabled(global_step < args.stop_text_encoder_training):
-                    input_ids = batch["input_ids"].to(accelerator.device)
-                    encoder_hidden_states = train_util.get_hidden_states(
-                        args, input_ids, tokenizer, text_encoder, None if not args.full_fp16 else weight_dtype
-                    )
+                    if args.weighted_captions:
+                      encoder_hidden_states = get_weighted_text_embeddings(tokenizer,
+                        text_encoder,
+                        batch["captions"],
+                        accelerator.device,
+                        args.max_token_length // 75 if args.max_token_length else 1,
+                        clip_skip=args.clip_skip,
+                        )
+                    else:
+                      input_ids = batch["input_ids"].to(accelerator.device)
+                      encoder_hidden_states = train_util.get_hidden_states(
+                          args, input_ids, tokenizer, text_encoder, None if not args.full_fp16 else weight_dtype
+                      )

                # Sample a random timestep for each image
                timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (b_size,), device=latents.device)
@@ -291,6 +307,9 @@ def train(args):
                loss_weights = batch["loss_weights"]  # 各sampleごとのweight
                loss = loss * loss_weights

+                if args.min_snr_gamma:
+                    loss = apply_snr_weight(loss, timesteps, noise_scheduler, args.min_snr_gamma)
+
                loss = loss.mean()  # 平均なのでbatch_sizeで割る必要なし

                accelerator.backward(loss)
@@ -381,7 +400,7 @@ def train(args):
        print("model saved.")


-if __name__ == "__main__":
+def setup_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()

    train_util.add_sd_models_arguments(parser)
@@ -390,6 +409,7 @@ if __name__ == "__main__":
    train_util.add_sd_saving_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
+    custom_train_functions.add_custom_train_arguments(parser)

    parser.add_argument(
        "--no_token_padding",
@@ -403,7 +423,13 @@ if __name__ == "__main__":
        help="steps to stop text encoder training, -1 for no training / Text Encoderの学習を止めるステップ数、-1で最初から学習しない",
    )

+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
    args = parser.parse_args()
    args = train_util.read_config_from_file(args, parser)

-    train(args)
+    train(args)
--- a/train_network.py
+++ b/train_network.py
@@ -8,6 +8,7 @@ import random
 import time
 import json
 import toml
+from multiprocessing import Value

 from tqdm import tqdm
 import torch
@@ -23,26 +24,40 @@ from library.config_util import (
    ConfigSanitizer,
    BlueprintGenerator,
 )
-
-
-def collate_fn(examples):
-    return examples[0]
+import library.huggingface_util as huggingface_util
+import library.custom_train_functions as custom_train_functions
+from library.custom_train_functions import apply_snr_weight, get_weighted_text_embeddings


 # TODO 他のスクリプトと共通化する
 def generate_step_logs(args: argparse.Namespace, current_loss, avr_loss, lr_scheduler):
    logs = {"loss/current": current_loss, "loss/average": avr_loss}

-    if args.network_train_unet_only:
-        logs["lr/unet"] = float(lr_scheduler.get_last_lr()[0])
-    elif args.network_train_text_encoder_only:
-        logs["lr/textencoder"] = float(lr_scheduler.get_last_lr()[0])
-    else:
-        logs["lr/textencoder"] = float(lr_scheduler.get_last_lr()[0])
-        logs["lr/unet"] = float(lr_scheduler.get_last_lr()[-1])  # may be same to textencoder
+    lrs = lr_scheduler.get_last_lr()

-    if args.optimizer_type.lower() == "DAdaptation".lower():  # tracking d*lr value of unet.
-        logs["lr/d*lr"] = lr_scheduler.optimizers[-1].param_groups[0]["d"] * lr_scheduler.optimizers[-1].param_groups[0]["lr"]
+    if args.network_train_text_encoder_only or len(lrs) <= 2:  # not block lr (or single block)
+        if args.network_train_unet_only:
+            logs["lr/unet"] = float(lrs[0])
+        elif args.network_train_text_encoder_only:
+            logs["lr/textencoder"] = float(lrs[0])
+        else:
+            logs["lr/textencoder"] = float(lrs[0])
+            logs["lr/unet"] = float(lrs[-1])  # may be same to textencoder
+
+        if args.optimizer_type.lower() == "DAdaptation".lower():  # tracking d*lr value of unet.
+            logs["lr/d*lr"] = lr_scheduler.optimizers[-1].param_groups[0]["d"] * lr_scheduler.optimizers[-1].param_groups[0]["lr"]
+    else:
+        idx = 0
+        if not args.network_train_unet_only:
+            logs["lr/textencoder"] = float(lrs[0])
+            idx = 1
+
+        for i in range(idx, len(lrs)):
+            logs[f"lr/group{i}"] = float(lrs[i])
+            if args.optimizer_type.lower() == "DAdaptation".lower():
+                logs[f"lr/d*lr/group{i}"] = (
+                    lr_scheduler.optimizers[-1].param_groups[i]["d"] * lr_scheduler.optimizers[-1].param_groups[i]["lr"]
+                )

    return logs

@@ -57,8 +72,9 @@ def train(args):
    use_dreambooth_method = args.in_json is None
    use_user_config = args.dataset_config is not None

-    if args.seed is not None:
-        set_seed(args.seed)
+    if args.seed is None:
+        args.seed = random.randint(0, 2**32)
+    set_seed(args.seed)

    tokenizer = train_util.load_tokenizer(args)

@@ -100,6 +116,11 @@ def train(args):
    blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
    train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)

+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collater = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collater = train_util.collater_class(current_epoch, current_step, ds_for_collater)
+
    if args.debug_dataset:
        train_util.debug_dataset(train_dataset_group)
        return
@@ -123,12 +144,24 @@ def train(args):
    weight_dtype, save_dtype = train_util.prepare_dtype(args)

    # モデルを読み込む
-    text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype)
+    for pi in range(accelerator.state.num_processes):
+        # TODO: modify other training scripts as well
+        if pi == accelerator.state.local_process_index:
+            print(f"loading model for process {accelerator.state.local_process_index}/{accelerator.state.num_processes}")

-    # work on low-ram device
-    if args.lowram:
-        text_encoder.to("cuda")
-        unet.to("cuda")
+            text_encoder, vae, unet, _ = train_util.load_target_model(
+                args, weight_dtype, accelerator.device if args.lowram else "cpu"
+            )
+
+            # work on low-ram device
+            if args.lowram:
+                text_encoder.to(accelerator.device)
+                unet.to(accelerator.device)
+                vae.to(accelerator.device)
+
+            gc.collect()
+            torch.cuda.empty_cache()
+        accelerator.wait_for_everyone()

    # モデルに xformers とか memory efficient attention を組み込む
    train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
@@ -139,7 +172,7 @@ def train(args):
        vae.requires_grad_(False)
        vae.eval()
        with torch.no_grad():
-            train_dataset_group.cache_latents(vae)
+            train_dataset_group.cache_latents(vae, args.vae_batch_size)
        vae.to("cpu")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
@@ -162,15 +195,18 @@ def train(args):
    network = network_module.create_network(1.0, args.network_dim, args.network_alpha, vae, text_encoder, unet, **net_kwargs)
    if network is None:
        return
-
-    if args.network_weights is not None:
-        print("load network weights from:", args.network_weights)
-        network.load_weights(args.network_weights)
+    
+    if hasattr(network, "prepare_network"):
+        network.prepare_network(args)

    train_unet = not args.network_train_text_encoder_only
    train_text_encoder = not args.network_train_unet_only
    network.apply_to(text_encoder, unet, train_text_encoder, train_unet)

+    if args.network_weights is not None:
+        info = network.load_weights(args.network_weights)
+        print(f"load network weights from {args.network_weights}: {info}")
+
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()
        text_encoder.gradient_checkpointing_enable()
@@ -179,27 +215,39 @@ def train(args):
    # 学習に必要なクラスを準備する
    print("prepare optimizer, data loader etc.")

-    trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
+    # 後方互換性を確保するよ
+    try:
+        trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr, args.learning_rate)
+    except TypeError:
+        print("Deprecated: use prepare_optimizer_params(text_encoder_lr, unet_lr, learning_rate) instead of prepare_optimizer_params(text_encoder_lr, unet_lr)")
+        trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
+
    optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable_params)

    # dataloaderを準備する
    # DataLoaderのプロセス数：0はメインプロセスになる
    n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)  # cpu_count-1 ただし最大で指定された数まで
+
    train_dataloader = torch.utils.data.DataLoader(
        train_dataset_group,
        batch_size=1,
        shuffle=True,
-        collate_fn=collate_fn,
+        collate_fn=collater,
        num_workers=n_workers,
        persistent_workers=args.persistent_data_loader_workers,
    )

    # 学習ステップ数を計算する
    if args.max_train_epochs is not None:
-        args.max_train_steps = args.max_train_epochs * math.ceil(len(train_dataloader) / accelerator.num_processes)
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
+        )
        if is_main_process:
            print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")

+    # データセット側にも学習ステップを送信
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
    # lr schedulerを用意する
    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)

@@ -262,9 +310,7 @@ def train(args):
        train_util.patch_accelerator_for_fp16_training(accelerator)

    # resumeする
-    if args.resume is not None:
-        print(f"resume training from state: {args.resume}")
-        accelerator.load_state(args.resume)
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)

    # epoch数を計算する
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
@@ -325,6 +371,7 @@ def train(args):
        "ss_caption_tag_dropout_rate": args.caption_tag_dropout_rate,
        "ss_face_crop_aug_range": args.face_crop_aug_range,
        "ss_prior_loss_weight": args.prior_loss_weight,
+        "ss_min_snr_gamma": args.min_snr_gamma,
    }

    if use_user_config:
@@ -453,8 +500,6 @@ def train(args):
    # add extra args
    if args.network_args:
        metadata["ss_network_args"] = json.dumps(net_kwargs)
-        # for key, value in net_kwargs.items():
-        #   metadata["ss_arg_" + key] = value

    # model name and hash
    if args.pretrained_model_name_or_path is not None:
@@ -488,22 +533,23 @@ def train(args):
    noise_scheduler = DDPMScheduler(
        beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000, clip_sample=False
    )
-
    if accelerator.is_main_process:
        accelerator.init_trackers("network_train")

    loss_list = []
    loss_total = 0.0
+    del train_dataset_group
    for epoch in range(num_train_epochs):
        if is_main_process:
            print(f"epoch {epoch+1}/{num_train_epochs}")
-        train_dataset_group.set_current_epoch(epoch + 1)
+        current_epoch.value = epoch + 1

        metadata["ss_epoch"] = str(epoch + 1)

        network.on_epoch_start(text_encoder, unet)

        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
            with accelerator.accumulate(network):
                with torch.no_grad():
                    if "latents" in batch and batch["latents"] is not None:
@@ -516,9 +562,17 @@ def train(args):

                with torch.set_grad_enabled(train_text_encoder):
                    # Get the text embedding for conditioning
-                    input_ids = batch["input_ids"].to(accelerator.device)
-                    encoder_hidden_states = train_util.get_hidden_states(args, input_ids, tokenizer, text_encoder, weight_dtype)
-
+                    if args.weighted_captions:
+                      encoder_hidden_states = get_weighted_text_embeddings(tokenizer,
+                        text_encoder,
+                        batch["captions"],
+                        accelerator.device,
+                        args.max_token_length // 75 if args.max_token_length else 1,
+                        clip_skip=args.clip_skip,
+                        )
+                    else:
+                      input_ids = batch["input_ids"].to(accelerator.device)
+                      encoder_hidden_states = train_util.get_hidden_states(args, input_ids, tokenizer, text_encoder, weight_dtype)
                # Sample noise that we'll add to the latents
                noise = torch.randn_like(latents, device=latents.device)
                if args.noise_offset:
@@ -528,7 +582,6 @@ def train(args):
                # Sample a random timestep for each image
                timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (b_size,), device=latents.device)
                timesteps = timesteps.long()
-
                # Add noise to the latents according to the noise magnitude at each timestep
                # (this is the forward diffusion process)
                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
@@ -549,6 +602,9 @@ def train(args):
                loss_weights = batch["loss_weights"]  # 各sampleごとのweight
                loss = loss * loss_weights

+                if args.min_snr_gamma:
+                    loss = apply_snr_weight(loss, timesteps, noise_scheduler, args.min_snr_gamma)
+
                loss = loss.mean()  # 平均なのでbatch_sizeで割る必要なし

                accelerator.backward(loss)
@@ -602,6 +658,8 @@ def train(args):
                metadata["ss_training_finished_at"] = str(time.time())
                print(f"saving checkpoint: {ckpt_file}")
                unwrap_model(network).save_weights(ckpt_file, save_dtype, minimum_metadata if args.no_metadata else metadata)
+                if args.huggingface_repo_id is not None:
+                    huggingface_util.upload(args, ckpt_file, "/" + ckpt_name)

            def remove_old_func(old_epoch_no):
                old_ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, old_epoch_no) + "." + args.save_model_as
@@ -641,10 +699,12 @@ def train(args):

        print(f"save trained model to {ckpt_file}")
        network.save_weights(ckpt_file, save_dtype, minimum_metadata if args.no_metadata else metadata)
+        if args.huggingface_repo_id is not None:
+            huggingface_util.upload(args, ckpt_file, "/" + ckpt_name, force_sync_upload=True)
        print("model saved.")


-if __name__ == "__main__":
+def setup_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()

    train_util.add_sd_models_arguments(parser)
@@ -652,6 +712,7 @@ if __name__ == "__main__":
    train_util.add_training_arguments(parser, True)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
+    custom_train_functions.add_custom_train_arguments(parser)

    parser.add_argument("--no_metadata", action="store_true", help="do not save metadata in output model / メタデータを出力先モデルに保存しない")
    parser.add_argument(
@@ -687,7 +748,13 @@ if __name__ == "__main__":
        "--training_comment", type=str, default=None, help="arbitrary comment string stored in metadata / メタデータに記録する任意のコメント文字列"
    )

+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
    args = parser.parse_args()
    args = train_util.read_config_from_file(args, parser)

-    train(args)
+    train(args)
--- a/train_network_README-ja.md
+++ b/train_network_README-ja.md
@@ -188,6 +188,73 @@ gen_img_diffusers.pyに、--network_module、--network_weightsの各オプショ

 --network_mulオプションで0~1.0の数値を指定すると、LoRAの適用率を変えられます。

+## Diffusersのpipelineで生成する
+
+以下の例を参考にしてください。必要なファイルはnetworks/lora.pyのみです。Diffusersのバージョンは0.10.2以外では動作しない可能性があります。
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+from networks.lora import LoRAModule, create_network_from_weights
+from safetensors.torch import load_file
+
+# if the ckpt is CompVis based, convert it to Diffusers beforehand with tools/convert_diffusers20_original_sd.py. See --help for more details.
+
+model_id_or_dir = r"model_id_on_hugging_face_or_dir"
+device = "cuda"
+
+# create pipe
+print(f"creating pipe from {model_id_or_dir}...")
+pipe = StableDiffusionPipeline.from_pretrained(model_id_or_dir, revision="fp16", torch_dtype=torch.float16)
+pipe = pipe.to(device)
+vae = pipe.vae
+text_encoder = pipe.text_encoder
+unet = pipe.unet
+
+# load lora networks
+print(f"loading lora networks...")
+
+lora_path1 = r"lora1.safetensors"
+sd = load_file(lora_path1)   # If the file is .ckpt, use torch.load instead.
+network1, sd = create_network_from_weights(0.5, None, vae, text_encoder,unet, sd)
+network1.apply_to(text_encoder, unet)
+network1.load_state_dict(sd)
+network1.to(device, dtype=torch.float16)
+
+# # You can merge weights instead of apply_to+load_state_dict. network.set_multiplier does not work
+# network.merge_to(text_encoder, unet, sd)
+
+lora_path2 = r"lora2.safetensors"
+sd = load_file(lora_path2) 
+network2, sd = create_network_from_weights(0.7, None, vae, text_encoder,unet, sd)
+network2.apply_to(text_encoder, unet)
+network2.load_state_dict(sd)
+network2.to(device, dtype=torch.float16)
+
+lora_path3 = r"lora3.safetensors"
+sd = load_file(lora_path3)
+network3, sd = create_network_from_weights(0.5, None, vae, text_encoder,unet, sd)
+network3.apply_to(text_encoder, unet)
+network3.load_state_dict(sd)
+network3.to(device, dtype=torch.float16)
+
+# prompts
+prompt = "masterpiece, best quality, 1girl, in white shirt, looking at viewer"
+negative_prompt = "bad quality, worst quality, bad anatomy, bad hands"
+
+# exec pipe
+print("generating image...")
+with torch.autocast("cuda"):
+    image = pipe(prompt, guidance_scale=7.5, negative_prompt=negative_prompt).images[0]
+
+# if not merged, you can use set_multiplier
+# network1.set_multiplier(0.8)
+# and generate image again...
+
+# save image
+image.save(r"by_diffusers..png")
+```
+
 ## 二つのモデルの差分からLoRAモデルを作成する

 [こちらのディスカッション](https://github.com/cloneofsimo/lora/discussions/56)を参考に実装したものです。数式はそのまま使わせていただきました（よく理解していませんが近似には特異値分解を用いるようです）。
--- a/train_textual_inversion.py
+++ b/train_textual_inversion.py
@@ -4,6 +4,7 @@ import gc
 import math
 import os
 import toml
+from multiprocessing import Value

 from tqdm import tqdm
 import torch
@@ -12,11 +13,14 @@ import diffusers
 from diffusers import DDPMScheduler

 import library.train_util as train_util
+import library.huggingface_util as huggingface_util
 import library.config_util as config_util
 from library.config_util import (
    ConfigSanitizer,
    BlueprintGenerator,
 )
+import library.custom_train_functions as custom_train_functions
+from library.custom_train_functions import apply_snr_weight

 imagenet_templates_small = [
    "a photo of a {}",
@@ -71,10 +75,6 @@ imagenet_style_templates_small = [
 ]


-def collate_fn(examples):
-    return examples[0]
-
-
 def train(args):
    if args.output_name is None:
        args.output_name = args.token_string
@@ -185,6 +185,11 @@ def train(args):
    blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
    train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)

+    current_epoch = Value('i',0)
+    current_step = Value('i',0)
+    ds_for_collater = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collater = train_util.collater_class(current_epoch,current_step, ds_for_collater)
+
    # make captions: tokenstring tokenstring1 tokenstring2 ...tokenstringn という文字列に書き換える超乱暴な実装
    if use_template:
        print("use template for training captions. is object: {args.use_object_template}")
@@ -228,7 +233,7 @@ def train(args):
        vae.requires_grad_(False)
        vae.eval()
        with torch.no_grad():
-            train_dataset_group.cache_latents(vae)
+            train_dataset_group.cache_latents(vae, args.vae_batch_size)
        vae.to("cpu")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
@@ -250,16 +255,19 @@ def train(args):
        train_dataset_group,
        batch_size=1,
        shuffle=True,
-        collate_fn=collate_fn,
+        collate_fn=collater,
        num_workers=n_workers,
        persistent_workers=args.persistent_data_loader_workers,
    )

    # 学習ステップ数を計算する
    if args.max_train_epochs is not None:
-        args.max_train_steps = args.max_train_epochs * len(train_dataloader)
+        args.max_train_steps = args.max_train_epochs * math.ceil(len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps)
        print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")

+    # データセット側にも学習ステップを送信
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
    # lr schedulerを用意する
    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)

@@ -297,9 +305,7 @@ def train(args):
        text_encoder.to(weight_dtype)

    # resumeする
-    if args.resume is not None:
-        print(f"resume training from state: {args.resume}")
-        accelerator.load_state(args.resume)
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)

    # epoch数を計算する
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
@@ -331,12 +337,14 @@ def train(args):

    for epoch in range(num_train_epochs):
        print(f"epoch {epoch+1}/{num_train_epochs}")
-        train_dataset_group.set_current_epoch(epoch + 1)
+        current_epoch.value = epoch+1

        text_encoder.train()

        loss_total = 0
+
        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
            with accelerator.accumulate(text_encoder):
                with torch.no_grad():
                    if "latents" in batch and batch["latents"] is not None:
@@ -377,6 +385,9 @@ def train(args):

                loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                loss = loss.mean([1, 2, 3])
+                
+                if args.min_snr_gamma:
+                  loss = apply_snr_weight(loss, timesteps, noise_scheduler, args.min_snr_gamma)

                loss_weights = batch["loss_weights"]  # 各sampleごとのweight
                loss = loss * loss_weights
@@ -440,6 +451,8 @@ def train(args):
                ckpt_file = os.path.join(args.output_dir, ckpt_name)
                print(f"saving checkpoint: {ckpt_file}")
                save_weights(ckpt_file, updated_embs, save_dtype)
+                if args.huggingface_repo_id is not None:
+                    huggingface_util.upload(args, ckpt_file, "/" + ckpt_name)

            def remove_old_func(old_epoch_no):
                old_ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, old_epoch_no) + "." + args.save_model_as
@@ -480,6 +493,8 @@ def train(args):

        print(f"save trained model to {ckpt_file}")
        save_weights(ckpt_file, updated_embs, save_dtype)
+        if args.huggingface_repo_id is not None:
+            huggingface_util.upload(args, ckpt_file, "/" + ckpt_name, force_sync_upload=True)
        print("model saved.")


@@ -526,7 +541,7 @@ def load_weights(file):
    return emb


-if __name__ == "__main__":
+def setup_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()

    train_util.add_sd_models_arguments(parser)
@@ -534,6 +549,7 @@ if __name__ == "__main__":
    train_util.add_training_arguments(parser, True)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
+    custom_train_functions.add_custom_train_arguments(parser, False)

    parser.add_argument(
        "--save_model_as",
@@ -565,6 +581,12 @@ if __name__ == "__main__":
        help="ignore caption and use default templates for stype / キャプションは使わずデフォルトのスタイル用テンプレートで学習する",
    )

+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
    args = parser.parse_args()
    args = train_util.read_config_from_file(args, parser)

--- a/train_textual_inversion_XTI.py
+++ b/train_textual_inversion_XTI.py
@@ -0,0 +1,647 @@
+import importlib
+import argparse
+import gc
+import math
+import os
+import toml
+from multiprocessing import Value
+
+from tqdm import tqdm
+import torch
+from accelerate.utils import set_seed
+import diffusers
+from diffusers import DDPMScheduler
+
+import library.train_util as train_util
+import library.huggingface_util as huggingface_util
+import library.config_util as config_util
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
+import library.custom_train_functions as custom_train_functions
+from library.custom_train_functions import apply_snr_weight
+from XTI_hijack import unet_forward_XTI, downblock_forward_XTI, upblock_forward_XTI
+
+imagenet_templates_small = [
+    "a photo of a {}",
+    "a rendering of a {}",
+    "a cropped photo of the {}",
+    "the photo of a {}",
+    "a photo of a clean {}",
+    "a photo of a dirty {}",
+    "a dark photo of the {}",
+    "a photo of my {}",
+    "a photo of the cool {}",
+    "a close-up photo of a {}",
+    "a bright photo of the {}",
+    "a cropped photo of a {}",
+    "a photo of the {}",
+    "a good photo of the {}",
+    "a photo of one {}",
+    "a close-up photo of the {}",
+    "a rendition of the {}",
+    "a photo of the clean {}",
+    "a rendition of a {}",
+    "a photo of a nice {}",
+    "a good photo of a {}",
+    "a photo of the nice {}",
+    "a photo of the small {}",
+    "a photo of the weird {}",
+    "a photo of the large {}",
+    "a photo of a cool {}",
+    "a photo of a small {}",
+]
+
+imagenet_style_templates_small = [
+    "a painting in the style of {}",
+    "a rendering in the style of {}",
+    "a cropped painting in the style of {}",
+    "the painting in the style of {}",
+    "a clean painting in the style of {}",
+    "a dirty painting in the style of {}",
+    "a dark painting in the style of {}",
+    "a picture in the style of {}",
+    "a cool painting in the style of {}",
+    "a close-up painting in the style of {}",
+    "a bright painting in the style of {}",
+    "a cropped painting in the style of {}",
+    "a good painting in the style of {}",
+    "a close-up painting in the style of {}",
+    "a rendition in the style of {}",
+    "a nice painting in the style of {}",
+    "a small painting in the style of {}",
+    "a weird painting in the style of {}",
+    "a large painting in the style of {}",
+]
+
+
+def train(args):
+    if args.output_name is None:
+        args.output_name = args.token_string
+    use_template = args.use_object_template or args.use_style_template
+
+    train_util.verify_training_args(args)
+    train_util.prepare_dataset_args(args, True)
+
+    if args.sample_every_n_steps is not None or args.sample_every_n_epochs is not None:
+        print(
+            "sample_every_n_steps and sample_every_n_epochs are not supported in this script currently / sample_every_n_stepsとsample_every_n_epochsは現在このスクリプトではサポートされていません"
+        )
+
+    cache_latents = args.cache_latents
+
+    if args.seed is not None:
+        set_seed(args.seed)
+
+    tokenizer = train_util.load_tokenizer(args)
+
+    # acceleratorを準備する
+    print("prepare accelerator")
+    accelerator, unwrap_model = train_util.prepare_accelerator(args)
+
+    # mixed precisionに対応した型を用意しておき適宜castする
+    weight_dtype, save_dtype = train_util.prepare_dtype(args)
+
+    # モデルを読み込む
+    text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype)
+
+    # Convert the init_word to token_id
+    if args.init_word is not None:
+        init_token_ids = tokenizer.encode(args.init_word, add_special_tokens=False)
+        if len(init_token_ids) > 1 and len(init_token_ids) != args.num_vectors_per_token:
+            print(
+                f"token length for init words is not same to num_vectors_per_token, init words is repeated or truncated / 初期化単語のトークン長がnum_vectors_per_tokenと合わないため、繰り返しまたは切り捨てが発生します: length {len(init_token_ids)}"
+            )
+    else:
+        init_token_ids = None
+
+    # add new word to tokenizer, count is num_vectors_per_token
+    token_strings = [args.token_string] + [f"{args.token_string}{i+1}" for i in range(args.num_vectors_per_token - 1)]
+    num_added_tokens = tokenizer.add_tokens(token_strings)
+    assert (
+        num_added_tokens == args.num_vectors_per_token
+    ), f"tokenizer has same word to token string. please use another one / 指定したargs.token_stringは既に存在します。別の単語を使ってください: {args.token_string}"
+
+    token_ids = tokenizer.convert_tokens_to_ids(token_strings)
+    print(f"tokens are added: {token_ids}")
+    assert min(token_ids) == token_ids[0] and token_ids[-1] == token_ids[0] + len(token_ids) - 1, f"token ids is not ordered"
+    assert len(tokenizer) - 1 == token_ids[-1], f"token ids is not end of tokenize: {len(tokenizer)}"
+
+    token_strings_XTI = []
+    XTI_layers = [
+        "IN01",
+        "IN02",
+        "IN04",
+        "IN05",
+        "IN07",
+        "IN08",
+        "MID",
+        "OUT03",
+        "OUT04",
+        "OUT05",
+        "OUT06",
+        "OUT07",
+        "OUT08",
+        "OUT09",
+        "OUT10",
+        "OUT11",
+    ]
+    for layer_name in XTI_layers:
+        token_strings_XTI += [f"{t}_{layer_name}" for t in token_strings]
+
+    tokenizer.add_tokens(token_strings_XTI)
+    token_ids_XTI = tokenizer.convert_tokens_to_ids(token_strings_XTI)
+    print(f"tokens are added (XTI): {token_ids_XTI}")
+    # Resize the token embeddings as we are adding new special tokens to the tokenizer
+    text_encoder.resize_token_embeddings(len(tokenizer))
+
+    # Initialise the newly added placeholder token with the embeddings of the initializer token
+    token_embeds = text_encoder.get_input_embeddings().weight.data
+    if init_token_ids is not None:
+        for i, token_id in enumerate(token_ids_XTI):
+            token_embeds[token_id] = token_embeds[init_token_ids[(i // 16) % len(init_token_ids)]]
+            # print(token_id, token_embeds[token_id].mean(), token_embeds[token_id].min())
+
+    # load weights
+    if args.weights is not None:
+        embeddings = load_weights(args.weights)
+        assert len(token_ids) == len(
+            embeddings
+        ), f"num_vectors_per_token is mismatch for weights / 指定した重みとnum_vectors_per_tokenの値が異なります: {len(embeddings)}"
+        # print(token_ids, embeddings.size())
+        for token_id, embedding in zip(token_ids_XTI, embeddings):
+            token_embeds[token_id] = embedding
+            # print(token_id, token_embeds[token_id].mean(), token_embeds[token_id].min())
+        print(f"weighs loaded")
+
+    print(f"create embeddings for {args.num_vectors_per_token} tokens, for {args.token_string}")
+
+    # データセットを準備する
+    blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False))
+    if args.dataset_config is not None:
+        print(f"Load dataset config from {args.dataset_config}")
+        user_config = config_util.load_user_config(args.dataset_config)
+        ignored = ["train_data_dir", "reg_data_dir", "in_json"]
+        if any(getattr(args, attr) is not None for attr in ignored):
+            print(
+                "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(
+                    ", ".join(ignored)
+                )
+            )
+    else:
+        use_dreambooth_method = args.in_json is None
+        if use_dreambooth_method:
+            print("Use DreamBooth method.")
+            user_config = {
+                "datasets": [
+                    {"subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(args.train_data_dir, args.reg_data_dir)}
+                ]
+            }
+        else:
+            print("Train with captions.")
+            user_config = {
+                "datasets": [
+                    {
+                        "subsets": [
+                            {
+                                "image_dir": args.train_data_dir,
+                                "metadata_file": args.in_json,
+                            }
+                        ]
+                    }
+                ]
+            }
+
+    blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizer)
+    train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+    train_dataset_group.enable_XTI(XTI_layers, token_strings=token_strings)
+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collater = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collater = train_util.collater_class(current_epoch, current_step, ds_for_collater)
+
+    # make captions: tokenstring tokenstring1 tokenstring2 ...tokenstringn という文字列に書き換える超乱暴な実装
+    if use_template:
+        print("use template for training captions. is object: {args.use_object_template}")
+        templates = imagenet_templates_small if args.use_object_template else imagenet_style_templates_small
+        replace_to = " ".join(token_strings)
+        captions = []
+        for tmpl in templates:
+            captions.append(tmpl.format(replace_to))
+        train_dataset_group.add_replacement("", captions)
+
+        if args.num_vectors_per_token > 1:
+            prompt_replacement = (args.token_string, replace_to)
+        else:
+            prompt_replacement = None
+    else:
+        if args.num_vectors_per_token > 1:
+            replace_to = " ".join(token_strings)
+            train_dataset_group.add_replacement(args.token_string, replace_to)
+            prompt_replacement = (args.token_string, replace_to)
+        else:
+            prompt_replacement = None
+
+    if args.debug_dataset:
+        train_util.debug_dataset(train_dataset_group, show_input_ids=True)
+        return
+    if len(train_dataset_group) == 0:
+        print("No data found. Please verify arguments / 画像がありません。引数指定を確認してください")
+        return
+
+    if cache_latents:
+        assert (
+            train_dataset_group.is_latent_cacheable()
+        ), "when caching latents, either color_aug or random_crop cannot be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
+
+    # モデルに xformers とか memory efficient attention を組み込む
+    train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
+    diffusers.models.UNet2DConditionModel.forward = unet_forward_XTI
+    diffusers.models.unet_2d_blocks.CrossAttnDownBlock2D.forward = downblock_forward_XTI
+    diffusers.models.unet_2d_blocks.CrossAttnUpBlock2D.forward = upblock_forward_XTI
+
+    # 学習を準備する
+    if cache_latents:
+        vae.to(accelerator.device, dtype=weight_dtype)
+        vae.requires_grad_(False)
+        vae.eval()
+        with torch.no_grad():
+            train_dataset_group.cache_latents(vae, args.vae_batch_size)
+        vae.to("cpu")
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+        gc.collect()
+
+    if args.gradient_checkpointing:
+        unet.enable_gradient_checkpointing()
+        text_encoder.gradient_checkpointing_enable()
+
+    # 学習に必要なクラスを準備する
+    print("prepare optimizer, data loader etc.")
+    trainable_params = text_encoder.get_input_embeddings().parameters()
+    _, _, optimizer = train_util.get_optimizer(args, trainable_params)
+
+    # dataloaderを準備する
+    # DataLoaderのプロセス数：0はメインプロセスになる
+    n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)  # cpu_count-1 ただし最大で指定された数まで
+    train_dataloader = torch.utils.data.DataLoader(
+        train_dataset_group,
+        batch_size=1,
+        shuffle=True,
+        collate_fn=collater,
+        num_workers=n_workers,
+        persistent_workers=args.persistent_data_loader_workers,
+    )
+
+    # 学習ステップ数を計算する
+    if args.max_train_epochs is not None:
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
+        )
+        print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
+
+    # データセット側にも学習ステップを送信
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
+    # lr schedulerを用意する
+    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
+
+    # acceleratorがなんかよろしくやってくれるらしい
+    text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+        text_encoder, optimizer, train_dataloader, lr_scheduler
+    )
+
+    index_no_updates = torch.arange(len(tokenizer)) < token_ids_XTI[0]
+    # print(len(index_no_updates), torch.sum(index_no_updates))
+    orig_embeds_params = unwrap_model(text_encoder).get_input_embeddings().weight.data.detach().clone()
+
+    # Freeze all parameters except for the token embeddings in text encoder
+    text_encoder.requires_grad_(True)
+    text_encoder.text_model.encoder.requires_grad_(False)
+    text_encoder.text_model.final_layer_norm.requires_grad_(False)
+    text_encoder.text_model.embeddings.position_embedding.requires_grad_(False)
+    # text_encoder.text_model.embeddings.token_embedding.requires_grad_(True)
+
+    unet.requires_grad_(False)
+    unet.to(accelerator.device, dtype=weight_dtype)
+    if args.gradient_checkpointing:  # according to TI example in Diffusers, train is required
+        unet.train()
+    else:
+        unet.eval()
+
+    if not cache_latents:
+        vae.requires_grad_(False)
+        vae.eval()
+        vae.to(accelerator.device, dtype=weight_dtype)
+
+    # 実験的機能：勾配も含めたfp16学習を行う　PyTorchにパッチを当ててfp16でのgrad scaleを有効にする
+    if args.full_fp16:
+        train_util.patch_accelerator_for_fp16_training(accelerator)
+        text_encoder.to(weight_dtype)
+
+    # resumeする
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)
+
+    # epoch数を計算する
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
+    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
+    if (args.save_n_epoch_ratio is not None) and (args.save_n_epoch_ratio > 0):
+        args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
+
+    # 学習する
+    total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    print("running training / 学習開始")
+    print(f"  num train images * repeats / 学習画像の数×繰り返し回数: {train_dataset_group.num_train_images}")
+    print(f"  num reg images / 正則化画像の数: {train_dataset_group.num_reg_images}")
+    print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
+    print(f"  num epochs / epoch数: {num_train_epochs}")
+    print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
+    print(f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}")
+    print(f"  gradient ccumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
+    print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
+
+    progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
+    global_step = 0
+
+    noise_scheduler = DDPMScheduler(
+        beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000, clip_sample=False
+    )
+
+    if accelerator.is_main_process:
+        accelerator.init_trackers("textual_inversion")
+
+    for epoch in range(num_train_epochs):
+        print(f"epoch {epoch+1}/{num_train_epochs}")
+        current_epoch.value = epoch + 1
+
+        text_encoder.train()
+
+        loss_total = 0
+
+        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
+            with accelerator.accumulate(text_encoder):
+                with torch.no_grad():
+                    if "latents" in batch and batch["latents"] is not None:
+                        latents = batch["latents"].to(accelerator.device)
+                    else:
+                        # latentに変換
+                        latents = vae.encode(batch["images"].to(dtype=weight_dtype)).latent_dist.sample()
+                    latents = latents * 0.18215
+                b_size = latents.shape[0]
+
+                # Get the text embedding for conditioning
+                input_ids = batch["input_ids"].to(accelerator.device)
+                # weight_dtype) use float instead of fp16/bf16 because text encoder is float
+                encoder_hidden_states = torch.stack(
+                    [
+                        train_util.get_hidden_states(args, s, tokenizer, text_encoder, weight_dtype)
+                        for s in torch.split(input_ids, 1, dim=1)
+                    ]
+                )
+
+                # Sample noise that we'll add to the latents
+                noise = torch.randn_like(latents, device=latents.device)
+                if args.noise_offset:
+                    # https://www.crosslabs.org//blog/diffusion-with-offset-noise
+                    noise += args.noise_offset * torch.randn((latents.shape[0], latents.shape[1], 1, 1), device=latents.device)
+
+                # Sample a random timestep for each image
+                timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (b_size,), device=latents.device)
+                timesteps = timesteps.long()
+
+                # Add noise to the latents according to the noise magnitude at each timestep
+                # (this is the forward diffusion process)
+                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
+
+                # Predict the noise residual
+                noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=encoder_hidden_states).sample
+
+                if args.v_parameterization:
+                    # v-parameterization training
+                    target = noise_scheduler.get_velocity(latents, noise, timesteps)
+                else:
+                    target = noise
+
+                loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
+                loss = loss.mean([1, 2, 3])
+
+                if args.min_snr_gamma:
+                    loss = apply_snr_weight(loss, timesteps, noise_scheduler, args.min_snr_gamma)
+
+                loss_weights = batch["loss_weights"]  # 各sampleごとのweight
+                loss = loss * loss_weights
+
+                loss = loss.mean()  # 平均なのでbatch_sizeで割る必要なし
+
+                accelerator.backward(loss)
+                if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                    params_to_clip = text_encoder.get_input_embeddings().parameters()
+                    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
+
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad(set_to_none=True)
+
+                # Let's make sure we don't update any embedding weights besides the newly added token
+                with torch.no_grad():
+                    unwrap_model(text_encoder).get_input_embeddings().weight[index_no_updates] = orig_embeds_params[
+                        index_no_updates
+                    ]
+
+            # Checks if the accelerator has performed an optimization step behind the scenes
+            if accelerator.sync_gradients:
+                progress_bar.update(1)
+                global_step += 1
+                # TODO: fix sample_images
+                # train_util.sample_images(
+                #     accelerator, args, None, global_step, accelerator.device, vae, tokenizer, text_encoder, unet, prompt_replacement
+                # )
+
+            current_loss = loss.detach().item()
+            if args.logging_dir is not None:
+                logs = {"loss": current_loss, "lr": float(lr_scheduler.get_last_lr()[0])}
+                if args.optimizer_type.lower() == "DAdaptation".lower():  # tracking d*lr value
+                    logs["lr/d*lr"] = (
+                        lr_scheduler.optimizers[0].param_groups[0]["d"] * lr_scheduler.optimizers[0].param_groups[0]["lr"]
+                    )
+                accelerator.log(logs, step=global_step)
+
+            loss_total += current_loss
+            avr_loss = loss_total / (step + 1)
+            logs = {"loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
+            progress_bar.set_postfix(**logs)
+
+            if global_step >= args.max_train_steps:
+                break
+
+        if args.logging_dir is not None:
+            logs = {"loss/epoch": loss_total / len(train_dataloader)}
+            accelerator.log(logs, step=epoch + 1)
+
+        accelerator.wait_for_everyone()
+
+        updated_embs = unwrap_model(text_encoder).get_input_embeddings().weight[token_ids_XTI].data.detach().clone()
+
+        if args.save_every_n_epochs is not None:
+            model_name = train_util.DEFAULT_EPOCH_NAME if args.output_name is None else args.output_name
+
+            def save_func():
+                ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, epoch + 1) + "." + args.save_model_as
+                ckpt_file = os.path.join(args.output_dir, ckpt_name)
+                print(f"saving checkpoint: {ckpt_file}")
+                save_weights(ckpt_file, updated_embs, save_dtype)
+                if args.huggingface_repo_id is not None:
+                    huggingface_util.upload(args, ckpt_file, "/" + ckpt_name)
+
+            def remove_old_func(old_epoch_no):
+                old_ckpt_name = train_util.EPOCH_FILE_NAME.format(model_name, old_epoch_no) + "." + args.save_model_as
+                old_ckpt_file = os.path.join(args.output_dir, old_ckpt_name)
+                if os.path.exists(old_ckpt_file):
+                    print(f"removing old checkpoint: {old_ckpt_file}")
+                    os.remove(old_ckpt_file)
+
+            saving = train_util.save_on_epoch_end(args, save_func, remove_old_func, epoch + 1, num_train_epochs)
+            if saving and args.save_state:
+                train_util.save_state_on_epoch_end(args, accelerator, model_name, epoch + 1)
+
+        # TODO: fix sample_images
+        # train_util.sample_images(
+        #     accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizer, text_encoder, unet, prompt_replacement
+        # )
+
+        # end of epoch
+
+    is_main_process = accelerator.is_main_process
+    if is_main_process:
+        text_encoder = unwrap_model(text_encoder)
+
+    accelerator.end_training()
+
+    if args.save_state:
+        train_util.save_state_on_train_end(args, accelerator)
+
+    updated_embs = text_encoder.get_input_embeddings().weight[token_ids_XTI].data.detach().clone()
+
+    del accelerator  # この後メモリを使うのでこれは消す
+
+    if is_main_process:
+        os.makedirs(args.output_dir, exist_ok=True)
+
+        model_name = train_util.DEFAULT_LAST_OUTPUT_NAME if args.output_name is None else args.output_name
+        ckpt_name = model_name + "." + args.save_model_as
+        ckpt_file = os.path.join(args.output_dir, ckpt_name)
+
+        print(f"save trained model to {ckpt_file}")
+        save_weights(ckpt_file, updated_embs, save_dtype)
+        if args.huggingface_repo_id is not None:
+            huggingface_util.upload(args, ckpt_file, "/" + ckpt_name, force_sync_upload=True)
+        print("model saved.")
+
+
+def save_weights(file, updated_embs, save_dtype):
+    updated_embs = updated_embs.reshape(16, -1, updated_embs.shape[-1])
+    updated_embs = updated_embs.chunk(16)
+    XTI_layers = [
+        "IN01",
+        "IN02",
+        "IN04",
+        "IN05",
+        "IN07",
+        "IN08",
+        "MID",
+        "OUT03",
+        "OUT04",
+        "OUT05",
+        "OUT06",
+        "OUT07",
+        "OUT08",
+        "OUT09",
+        "OUT10",
+        "OUT11",
+    ]
+    state_dict = {}
+    for i, layer_name in enumerate(XTI_layers):
+        state_dict[layer_name] = updated_embs[i].squeeze(0).detach().clone().to("cpu").to(save_dtype)
+
+    # if save_dtype is not None:
+    #     for key in list(state_dict.keys()):
+    #         v = state_dict[key]
+    #         v = v.detach().clone().to("cpu").to(save_dtype)
+    #         state_dict[key] = v
+
+    if os.path.splitext(file)[1] == ".safetensors":
+        from safetensors.torch import save_file
+
+        save_file(state_dict, file)
+    else:
+        torch.save(state_dict, file)  # can be loaded in Web UI
+
+
+def load_weights(file):
+    if os.path.splitext(file)[1] == ".safetensors":
+        from safetensors.torch import load_file
+
+        data = load_file(file)
+    else:
+        raise ValueError(f"NOT XTI: {file}")
+
+    if len(data.values()) != 16:
+        raise ValueError(f"NOT XTI: {file}")
+
+    emb = torch.concat([x for x in data.values()])
+
+    return emb
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+
+    train_util.add_sd_models_arguments(parser)
+    train_util.add_dataset_arguments(parser, True, True, False)
+    train_util.add_training_arguments(parser, True)
+    train_util.add_optimizer_arguments(parser)
+    config_util.add_config_arguments(parser)
+    custom_train_functions.add_custom_train_arguments(parser, False)
+
+    parser.add_argument(
+        "--save_model_as",
+        type=str,
+        default="pt",
+        choices=[None, "ckpt", "pt", "safetensors"],
+        help="format to save the model (default is .pt) / モデル保存時の形式（デフォルトはpt）",
+    )
+
+    parser.add_argument("--weights", type=str, default=None, help="embedding weights to initialize / 学習するネットワークの初期重み")
+    parser.add_argument(
+        "--num_vectors_per_token", type=int, default=1, help="number of vectors per token / トークンに割り当てるembeddingsの要素数"
+    )
+    parser.add_argument(
+        "--token_string",
+        type=str,
+        default=None,
+        help="token string used in training, must not exist in tokenizer / 学習時に使用されるトークン文字列、tokenizerに存在しない文字であること",
+    )
+    parser.add_argument("--init_word", type=str, default=None, help="words to initialize vector / ベクトルを初期化に使用する単語、複数可")
+    parser.add_argument(
+        "--use_object_template",
+        action="store_true",
+        help="ignore caption and use default templates for object / キャプションは使わずデフォルトの物体用テンプレートで学習する",
+    )
+    parser.add_argument(
+        "--use_style_template",
+        action="store_true",
+        help="ignore caption and use default templates for stype / キャプションは使わずデフォルトのスタイル用テンプレートで学習する",
+    )
+
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    args = train_util.read_config_from_file(args, parser)
+
+    train(args)
Author	SHA1	Message	Date
Kohya S	5050971ac6	Merge pull request #388 from kohya-ss/dev add weighted caption for training	2023-04-08 22:00:46 +09:00
Kohya S	08c54dcf22	update readme	2023-04-08 21:58:22 +09:00
Kohya S	6a5f87d874	disable weighted captions in TI/XTI training	2023-04-08 21:45:57 +09:00
Kohya S	a876f2d3fb	format by black	2023-04-08 21:36:35 +09:00
Kohya S	a75f5898e6	Merge pull request #336 from AI-Casanova/weighted_captions Proof of Concept: Weighted captions	2023-04-08 21:31:05 +09:00
AI-Casanova	dbab72153f	Clean up custom_train_functions.py Removed commented out lines from earlier bugfix.	2023-04-08 00:44:56 -05:00
AI-Casanova	0d54609435	Merge branch 'kohya-ss:main' into weighted_captions	2023-04-07 14:55:40 -05:00
Kohya S	b5c60d7d62	Merge pull request #381 from kohya-ss/dev feature to upload to huggingface etc.	2023-04-06 08:20:07 +09:00
Kohya S	defefd79c5	Merge branch 'main' into dev	2023-04-06 08:16:31 +09:00
Kohya S	27834df444	update readme	2023-04-06 08:16:02 +09:00
Kohya S	5c020bed49	Add attension couple+reginal LoRA	2023-04-06 08:11:54 +09:00
Kohya S	c775ec1255	Add about using LoRA with Diffusers standard pipe	2023-04-06 08:10:41 +09:00
AI-Casanova	7527436549	Merge branch 'kohya-ss:main' into weighted_captions	2023-04-05 17:07:15 -05:00
Kohya S	541539a144	change method name, repo is private in default etc	2023-04-05 23:16:49 +09:00
Kohya S	74220bb52c	Merge pull request #348 from ddPn08/dev Added a function to upload to Huggingface and resume from Huggingface.	2023-04-05 21:47:36 +09:00
Kohya S	8eb60baf3a	Merge pull request #374 from kohya-ss/dev block learning rate, block dim(rank) etc.	2023-04-04 08:33:18 +09:00
Kohya S	4b47e8ecb0	update readme	2023-04-04 08:27:30 +09:00
Kohya S	76bac2c1c5	add backward compatiblity	2023-04-04 08:27:11 +09:00
Kohya S	0fcdda7175	Merge pull request #373 from rockerBOO/meta-min_snr_gamma Add min_snr_gamma to metadata	2023-04-04 07:57:50 +09:00
Kohya S	e4eb3e63e6	improve compatibility	2023-04-04 07:48:48 +09:00
rockerBOO	626d4b433a	Add min_snr_gamma to metadata	2023-04-03 12:38:20 -04:00
Kohya S	83c7e03d05	Fix network_weights not working in train_network	2023-04-03 22:45:28 +09:00
Kohya S	959561473c	Merge branch 'main' into dev	2023-04-03 22:09:17 +09:00
Kohya S	7209eb74cc	update readme	2023-04-03 22:08:58 +09:00
Kohya S	53cc3583df	fix potential issue with dtype	2023-04-03 21:46:12 +09:00
Kohya S	82c2553f07	Merge pull request #353 from Riyaaaaa/patch-1 fix typo	2023-04-03 21:45:03 +09:00
Kohya S	6f6f9b537f	Merge pull request #364 from shirayu/check_needless_num_warmup_steps Check needless num_warmup_steps	2023-04-03 21:38:52 +09:00
Kohya S	f407f5a686	Merge pull request #352 from rockerBOO/dataset-config Open dataset_config json file before load	2023-04-03 21:31:55 +09:00
Kohya S	6134619998	Add block dim(rank) feature	2023-04-03 21:19:49 +09:00
Kohya S	817a9268ff	update readme for block weight lr	2023-04-03 08:43:26 +09:00
Kohya S	3beddf341e	Suppor LR graphs for each block, base lr	2023-04-03 08:43:11 +09:00
AI-Casanova	1892c82a60	Reinstantiate weighted captions after a necessary revert to Main	2023-04-02 19:43:34 +00:00
ddPn08	3f339cda6f	small fix	2023-04-02 23:21:17 +09:00
ddPn08	16ba1cec69	change async uploading to optional	2023-04-02 17:45:26 +09:00
ddPn08	8bfa50e283	small fix	2023-04-02 17:39:23 +09:00
ddPn08	c4a11e5a5a	fix help	2023-04-02 17:39:23 +09:00
ddPn08	3cc4939dd3	Implement huggingface upload for all scripts	2023-04-02 17:39:22 +09:00
ddPn08	b5c7937f8d	don't run when not needed	2023-04-02 17:39:21 +09:00
ddPn08	b5ff4e816f	resume from huggingface repository	2023-04-02 17:39:21 +09:00
ddPn08	a7d302e196	write a random seed to metadata	2023-04-02 17:39:20 +09:00
ddPn08	45381b188c	small fix	2023-04-02 17:39:20 +09:00
ddPn08	054fb3308c	use access token	2023-04-02 17:39:19 +09:00
ddPn08	d42431d73a	Added feature to upload to huggingface	2023-04-02 17:39:10 +09:00
Kohya S	c639cb7d5d	support older type hint	2023-04-02 16:18:04 +09:00
Kohya S	97e65bf93f	change 'stratify' to 'block', add en message	2023-04-02 16:10:09 +09:00
Kohya S	36c8a4aee7	Merge pull request #355 from u-haru/feature/stratified_lr LoRA レイヤー別学習率の実装、state_dict読み込みの際のdevice指定削除、typo修正	2023-04-02 15:25:17 +09:00
u-haru	19340d82e6	層別学習率を使わない場合にparamsをまとめる	2023-04-02 12:57:55 +09:00
u-haru	058e442072	レイヤー数変更(hako-mikan/sd-webui-lora-block-weight参考)	2023-04-02 04:02:34 +09:00
Yuta Hayashibe	9577a9f38d	Check needless num_warmup_steps	2023-04-01 20:33:20 +09:00
u-haru	786971d443	Merge branch 'dev' into feature/stratified_lr	2023-04-01 15:08:41 +09:00
Kohya S	f037b09c2d	Merge pull request #360 from kohya-ss/dev fix for merge_lora.py	2023-04-01 09:25:57 +09:00
Kohya S	18d69d8e5e	update readme	2023-04-01 09:21:27 +09:00
Kohya S	770a56193e	fix conv2d3x3 is not merged	2023-04-01 09:17:37 +09:00
Kohya S	4627b389ff	fix device not specified in merge_lora.py	2023-04-01 09:15:57 +09:00
Kohya S	1cd07770a4	format by black	2023-04-01 09:13:47 +09:00
u-haru	1e164b6ec3	specify device when loading state_dict	2023-03-31 12:52:39 +09:00
u-haru	41ecccb2a9	Merge branch 'kohya-ss:main' into feature/stratified_lr	2023-03-31 12:47:56 +09:00
Kohya S	c93cbbc373	Merge pull request #357 from kohya-ss/dev Fix device issue in load_file, reduce vram usage	2023-03-31 09:07:49 +09:00
Kohya S	8cecc676cf	Fix device issue in load_file, reduce vram usage	2023-03-31 09:05:51 +09:00
u-haru	94441fa746	繰り返し回数のないディレクトリの名前表示修正	2023-03-31 02:26:54 +09:00
Atsumu Ono	ccb0ef518a	fix typo	2023-03-31 01:45:49 +09:00
u-haru	3032a47af4	cosineをsineのreversedに変更	2023-03-31 01:42:57 +09:00
u-haru	1b75dbd4f2	引数名に_lrを追加	2023-03-31 01:40:29 +09:00
u-haru	dade23a414	stratified_zero_thresholdに変更	2023-03-31 01:14:03 +09:00
rockerBOO	313f3e8286	Open dataset_config json file before load	2023-03-30 12:08:04 -04:00
u-haru	4dacc52bde	implement stratified_lr	2023-03-31 00:39:35 +09:00
u-haru	b1dffe8d9a	ファイルロードができないバグ修正(Exception: device cuda is invalid)	2023-03-31 00:11:11 +09:00
Kohya S	ea1cf4acee	Merge pull request #350 from kohya-ss/dev fix gen not working	2023-03-30 22:30:47 +09:00
Kohya S	cd5e3baace	Merge branch 'main' into dev	2023-03-30 22:29:19 +09:00
Kohya S	e76ea7cd7d	fix not working	2023-03-30 22:28:55 +09:00
Kohya S	d68ba2f9de	Merge pull request #349 from kohya-ss/dev P+, reduce ram usage etc.	2023-03-30 22:07:03 +09:00
Kohya S	5fc80b7a5b	update readme	2023-03-30 22:03:13 +09:00
Kohya S	31069e1dc5	add comments about debice for clarify	2023-03-30 21:44:40 +09:00
Kohya S	6c28dfb417	Merge pull request #332 from guaneec/ddp-lowram Reduce peak RAM usage	2023-03-30 21:37:37 +09:00
Kohya S	2d6faa9860	support LoRA merge in advance	2023-03-30 21:34:36 +09:00
Kohya S	cb53a77334	show warning message for sample images in XTI	2023-03-30 21:33:57 +09:00
Kohya S	4d91dc0d30	Merge branch 'dev' of https://github.com/kohya-ss/sd-scripts into dev	2023-03-30 21:23:18 +09:00
Kohya S	935d4774a9	Merge pull request #327 from jakaline-dev/main P+: Extended Textual Conditioning in Text-to-Image Generation	2023-03-30 19:44:57 +09:00
Jakaline-dev	24e3d4b464	disabled sampling (for now)	2023-03-30 02:20:04 +09:00
Jakaline-dev	b0c33a4294	Merge remote-tracking branch 'upstream/main'	2023-03-30 01:35:38 +09:00
Kohya S	bf3674c1db	format by black	2023-03-29 21:23:27 +09:00
Kohya S	b996f5a6d6	Merge pull request #339 from kohya-ss/dev fix an issue with num_workers=0	2023-03-28 19:47:46 +09:00
Kohya S	472f516e7c	update readme	2023-03-28 19:44:43 +09:00
Kohya S	c838efcfa8	Merge branch 'main' into dev	2023-03-28 19:43:10 +09:00
Kohya S	4f70e5dca6	fix to work with num_workers=0	2023-03-28 19:42:47 +09:00
Kohya S	0138a917d8	Update README.md	2023-03-28 08:43:41 +09:00
Kohya S	49b29f2db2	Merge pull request #333 from kohya-ss/dev min snr weighting etc.	2023-03-27 22:44:13 +09:00
Kohya S	99eaf1fd65	fix typo	2023-03-27 21:38:01 +09:00
Kohya S	5fa20b5348	update readme	2023-03-27 21:37:10 +09:00
Kohya S	895b0b6ca7	Fix saving issue if epoch/step not in checkpoint	2023-03-27 21:22:32 +09:00
Kohya S	238f01bc9c	fix images are used twice, update debug dataset	2023-03-27 20:48:21 +09:00
Kohya S	43a08b4061	add ja comment	2023-03-27 20:47:27 +09:00
Kohya S	066b1bb57e	fix do not mean in batch dim when min_snr_gamma	2023-03-27 20:47:11 +09:00
guaneec	3cdae0cbd2	Reduce peak RAM usage	2023-03-27 14:34:17 +08:00
Kohya S	14891523ce	fix seed for each dataset to make shuffling same	2023-03-26 22:17:03 +09:00
Kohya S	559a1aeeda	Merge pull request #328 from mgz-dev/resize_lora-fixes update resize_lora.py (fix out of bounds and index)	2023-03-26 17:19:09 +09:00
Kohya S	a18558ddfe	Merge pull request #308 from AI-Casanova/min-SNR Efficient Diffusion Training via Min-SNR Weighting Strategy	2023-03-26 17:12:03 +09:00
Kohya S	6732df93e2	Merge branch 'dev' into min-SNR	2023-03-26 17:10:53 +09:00
Kohya S	4f42f759ea	Merge pull request #322 from u-haru/feature/token_warmup タグ数を徐々に増やしながら学習するオプションの追加、persistent_workersに関する軽微なバグ修正	2023-03-26 17:05:59 +09:00
mgz-dev	c9b157b536	update resize_lora.py (fix out of bounds and index) Fix error where index may go out of bounds when using certain dynamic parameters. Fix index and rank issue (previously some parts of code was incorrectly using python index position rather than rank, which is -1 dim).	2023-03-25 19:56:14 -05:00
AI-Casanova	4c06bfad60	Fix for TypeError from bf16 precision: Thanks to mgz-dev	2023-03-26 00:01:29 +00:00
Jakaline-dev	a35d7ef227	Implement XTI	2023-03-26 05:26:10 +09:00
u-haru	a4b34a9c3c	blueprint_args_conflictは不要なため削除、shuffleが毎回行われる不具合修正	2023-03-26 03:26:55 +09:00
u-haru	5a3d564a30	print削除	2023-03-26 02:26:08 +09:00
u-haru	4dc1124f93	lora以外も対応	2023-03-26 02:19:55 +09:00
u-haru	9c80da6ac5	Merge branch 'feature/token_warmup' of https://github.com/u-haru/sd-scripts into feature/token_warmup	2023-03-26 01:45:15 +09:00
u-haru	292cdb8379	データセットにepoch、stepが通達されないバグ修正	2023-03-26 01:44:25 +09:00
u-haru	5ec90990de	データセットにepoch、stepが通達されないバグ修正	2023-03-26 01:41:24 +09:00
Kohya S	e203270e31	support TI embeds trained by WebUI(?)	2023-03-24 20:46:42 +09:00
Kohya S	b2c5b96f2a	format by black	2023-03-24 20:19:05 +09:00
u-haru	1b89b2a10e	シャッフル前にタグを切り詰めるように変更	2023-03-24 13:44:30 +09:00
u-haru	143c26e552	競合時にpersistant_data_loader側を無効にするように変更	2023-03-24 13:08:56 +09:00
AI-Casanova	518a18aeff	(ACTUAL) Min-SNR Weighting Strategy: Fixed SNR calculation to authors implementation	2023-03-23 12:34:49 +00:00
AI-Casanova	a3c7d711e4	Min-SNR Weighting Strategy: Fixed SNR calculation to authors implementation	2023-03-23 05:43:46 +00:00
u-haru	dbadc40ec2	persistent_workersを有効にした際にキャプションが変化しなくなるバグ修正	2023-03-23 12:33:03 +09:00
u-haru	447c56bf50	typo修正、stepをglobal_stepに修正、バグ修正	2023-03-23 09:53:14 +09:00
u-haru	a9b26b73e0	implement token warmup	2023-03-23 07:37:14 +09:00
AI-Casanova	64c923230e	Min-SNR Weighting Strategy: Refactored and added to all trainers	2023-03-22 01:27:29 +00:00
AI-Casanova	795a6bd2d8	Merge branch 'kohya-ss:main' into min-SNR	2023-03-21 13:19:15 -05:00
Kohya S	aee343a9ee	Merge pull request #310 from kohya-ss/dev faster latents caching etc.	2023-03-21 22:19:26 +09:00
Kohya S	2c5949c155	update readme	2023-03-21 22:17:20 +09:00
Kohya S	193674e16c	fix to support dynamic rank/alpha	2023-03-21 21:59:51 +09:00
Kohya S	4f92b6266c	fix do not starting script	2023-03-21 21:29:10 +09:00
Kohya S	2d86f63e15	update steps calc with max_train_epochs	2023-03-21 21:21:12 +09:00
Kohya S	88751f58f6	Merge branch 'dev' of https://github.com/kohya-ss/sd-scripts into dev	2023-03-21 21:10:44 +09:00
Kohya S	7b324bcc3b	support extensions of image files with uppercases	2023-03-21 21:10:34 +09:00
Kohya S	1645698ec0	Merge pull request #306 from robertsmieja/main Extract parser setup to helper function	2023-03-21 21:09:23 +09:00
Kohya S	5aa5a07260	Merge pull request #292 from tsukimiya/hotfix/max_train_steps Fix: simultaneous use of gradient_accumulation_steps and max_train_epochs	2023-03-21 21:02:29 +09:00
Kohya S	6d9f3bc0b2	fix different reso in batch	2023-03-21 18:33:46 +09:00
Kohya S	1816ac3271	add vae_batch_size option for faster caching	2023-03-21 18:15:57 +09:00
Kohya S	cca3804503	Merge branch 'main' into dev	2023-03-21 15:05:41 +09:00
Kohya S	cb08fa0379	fix no npz with full path	2023-03-21 15:05:25 +09:00
AI-Casanova	a265225972	Min-SNR Weighting Strategy	2023-03-20 22:51:38 +00:00
Robert Smieja	eb66e5ebac	Extract parser setup to helper function - Allows users who `import` the scripts to examine the parser programmatically	2023-03-20 00:06:47 -04:00
tsukimiya	9d4cf8b03b	Merge remote-tracking branch 'origin/hotfix/max_train_steps' into hotfix/max_train_steps # Conflicts: # train_network.py	2023-03-19 23:55:51 +09:00
tsukimiya	a167a592e2	Fixed an issue where max_train_steps was not set correctly when max_train_epochs was specified and gradient_accumulation_steps was set to 2 or more.	2023-03-19 23:54:38 +09:00
tsukimiya	5dad64b684	Fixed an issue where max_train_steps was not set correctly when max_train_epochs was specified and gradient_accumulation_steps was set to 2 or more.	2023-03-13 14:37:28 +09:00