Merge branch 'dev' into stable-cascade

update readme
make LoRA compatible with ComfyUI #1119
2026-04-06 21:52:27 +00:00 · 2024-02-25 20:03:39 +09:00 · 2024-02-25 20:03:00 +09:00 · 2024-02-25 20:01:37 +09:00 · 2024-02-25 09:39:53 +09:00 · 2024-02-25 08:58:27 +09:00
72 changed files with 6309 additions and 4696 deletions
--- a/.github/workflows/typos.yml
+++ b/.github/workflows/typos.yml
@@ -18,4 +18,4 @@ jobs:
      - uses: actions/checkout@v4

      - name: typos-action
-        uses: crate-ci/typos@v1.24.3
+        uses: crate-ci/typos@v1.16.26
--- a/README-ja.md
+++ b/README-ja.md
@@ -1,12 +1,12 @@
+SDXLがサポートされました。sdxlブランチはmainブランチにマージされました。リポジトリを更新したときにはUpgradeの手順を実行してください。また accelerate のバージョンが上がっていますので、accelerate config を再度実行してください。
+
+SDXL学習については[こちら](./README.md#sdxl-training)をご覧ください（英語です）。
+
 ## リポジトリについて
 Stable Diffusionの学習、画像生成、その他のスクリプトを入れたリポジトリです。

 [README in English](./README.md) ←更新情報はこちらにあります

-開発中のバージョンはdevブランチにあります。最新の変更点はdevブランチをご確認ください。
-
-FLUX.1およびSD3/SD3.5対応はsd3ブランチで行っています。それらの学習を行う場合はsd3ブランチをご利用ください。
-
 GUIやPowerShellスクリプトなど、より使いやすくする機能が[bmaltais氏のリポジトリ](https://github.com/bmaltais/kohya_ss)で提供されています（英語です）のであわせてご覧ください。bmaltais氏に感謝します。

 以下のスクリプトがあります。
@@ -21,7 +21,6 @@ GUIやPowerShellスクリプトなど、より使いやすくする機能が[bma

 * [学習について、共通編](./docs/train_README-ja.md) : データ整備やオプションなど
    * [データセット設定](./docs/config_README-ja.md)
-* [SDXL学習](./docs/train_SDXL-en.md) （英語版）
 * [DreamBoothの学習について](./docs/train_db_README-ja.md)
 * [fine-tuningのガイド](./docs/fine_tune_README_ja.md):
 * [LoRAの学習について](./docs/train_network_README-ja.md)
@@ -36,8 +35,6 @@ Python 3.10.6およびGitが必要です。
 - Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
 - git: https://git-scm.com/download/win

-Python 3.10.x、3.11.x、3.12.xでも恐らく動作しますが、3.10.6でテストしています。
-
 PowerShellを使う場合、venvを使えるようにするためには以下の手順でセキュリティ設定を変更してください。
 （venvに限らずスクリプトの実行が可能になりますので注意してください。）

@@ -47,7 +44,9 @@ PowerShellを使う場合、venvを使えるようにするためには以下の

 ## Windows環境でのインストール

-スクリプトはPyTorch 2.1.2でテストしています。PyTorch 2.2以降でも恐らく動作します。
+スクリプトはPyTorch 2.0.1でテストしています。PyTorch 1.12.1でも動作すると思われます。
+
+以下の例ではPyTorchは2.0.1／CUDA 11.8版をインストールします。CUDA 11.6版やPyTorch 1.12.1を使う場合は適宜書き換えください。

 （なお、python -m venv～の行で「python」とだけ表示された場合、py -m venv～のようにpythonをpyに変更してください。）

@@ -60,23 +59,21 @@ cd sd-scripts
 python -m venv venv
 .\venv\Scripts\activate

-pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
+pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --index-url https://download.pytorch.org/whl/cu118
 pip install --upgrade -r requirements.txt
-pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118
+pip install xformers==0.0.20

 accelerate config
 ```

 コマンドプロンプトでも同一です。

-注：`bitsandbytes==0.44.0`、`prodigyopt==1.0`、`lion-pytorch==0.0.6` は `requirements.txt` に含まれるようになりました。他のバージョンを使う場合は適宜インストールしてください。
-
-この例では PyTorch および xfomers は2.1.2／CUDA 11.8版をインストールします。CUDA 12.1版やPyTorch 1.12.1を使う場合は適宜書き換えください。たとえば CUDA 12.1版の場合は `pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121` および `pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu121` としてください。
-
-PyTorch 2.2以降を用いる場合は、`torch==2.1.2` と `torchvision==0.16.2` 、および `xformers==0.0.23.post1` を適宜変更してください。
+（注:``python -m venv venv`` のほうが ``python -m venv --system-site-packages venv`` より安全そうなため書き換えました。globalなpythonにパッケージがインストールしてあると、後者だといろいろと問題が起きます。）

 accelerate configの質問には以下のように答えてください。（bf16で学習する場合、最後の質問にはbf16と答えてください。）

+※0.15.0から日本語環境では選択のためにカーソルキーを押すと落ちます（……）。数字キーの0、1、2……で選択できますので、そちらを使ってください。
+
 ```txt
 - This machine
 - No distributed training
@@ -90,6 +87,41 @@ accelerate configの質問には以下のように答えてください。（bf1
 ※場合によって ``ValueError: fp16 mixed precision requires a GPU`` というエラーが出ることがあるようです。この場合、6番目の質問（
 ``What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:``）に「0」と答えてください。（id `0`のGPUが使われます。）

+### オプション：`bitsandbytes`（8bit optimizer）を使う
+
+`bitsandbytes`はオプションになりました。Linuxでは通常通りpipでインストールできます（0.41.1または以降のバージョンを推奨）。
+
+Windowsでは0.35.0または0.41.1を推奨します。
+
+- `bitsandbytes` 0.35.0: 安定しているとみられるバージョンです。AdamW8bitは使用できますが、他のいくつかの8bit optimizer、学習時の`full_bf16`オプションは使用できません。
+- `bitsandbytes` 0.41.1: Lion8bit、PagedAdamW8bit、PagedLion8bitをサポートします。`full_bf16`が使用できます。
+
+注：`bitsandbytes` 0.35.0から0.41.0までのバージョンには問題があるようです。 https://github.com/TimDettmers/bitsandbytes/issues/659
+
+以下の手順に従い、`bitsandbytes`をインストールしてください。
+
+### 0.35.0を使う場合
+
+PowerShellの例です。コマンドプロンプトではcpの代わりにcopyを使ってください。
+
+```powershell
+cd sd-scripts
+.\venv\Scripts\activate
+pip install bitsandbytes==0.35.0
+
+cp .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
+cp .\bitsandbytes_windows\cextension.py .\venv\Lib\site-packages\bitsandbytes\cextension.py
+cp .\bitsandbytes_windows\main.py .\venv\Lib\site-packages\bitsandbytes\cuda_setup\main.py
+```
+
+### 0.41.1を使う場合
+
+jllllll氏の配布されている[こちら](https://github.com/jllllll/bitsandbytes-windows-webui) または他の場所から、Windows用のwhlファイルをインストールしてください。
+
+```powershell
+python -m pip install bitsandbytes==0.41.1 --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui
+```
+
 ## アップグレード

 新しいリリースがあった場合、以下のコマンドで更新できます。
@@ -119,47 +151,4 @@ Conv2d 3x3への拡大は [cloneofsimo氏](https://github.com/cloneofsimo/lora)

 [BLIP](https://github.com/salesforce/BLIP): BSD-3-Clause

-## その他の情報

-### LoRAの名称について
-
-`train_network.py` がサポートするLoRAについて、混乱を避けるため名前を付けました。ドキュメントは更新済みです。以下は当リポジトリ内の独自の名称です。
-
-1. __LoRA-LierLa__ : (LoRA for __Li__ n __e__ a __r__  __La__ yers、リエラと読みます)
-
-    Linear 層およびカーネルサイズ 1x1 の Conv2d 層に適用されるLoRA
-
-2. __LoRA-C3Lier__ : (LoRA for __C__ olutional layers with __3__ x3 Kernel and  __Li__ n __e__ a __r__ layers、セリアと読みます)
-
-    1.に加え、カーネルサイズ 3x3 の Conv2d 層に適用されるLoRA
-
-デフォルトではLoRA-LierLaが使われます。LoRA-C3Lierを使う場合は `--network_args` に `conv_dim` を指定してください。
-
-<!-- 
-LoRA-LierLa は[Web UI向け拡張](https://github.com/kohya-ss/sd-webui-additional-networks)、またはAUTOMATIC1111氏のWeb UIのLoRA機能で使用することができます。
-
-LoRA-C3Lierを使いWeb UIで生成するには拡張を使用してください。
-->
-
-### 学習中のサンプル画像生成
-
-プロンプトファイルは例えば以下のようになります。
-
-```
-# prompt 1
-masterpiece, best quality, (1girl), in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
-
-# prompt 2
-masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n (low quality, worst quality), bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
-```
-
-  `#` で始まる行はコメントになります。`--n` のように「ハイフン二個＋英小文字」の形でオプションを指定できます。以下が使用可能できます。
-
-  * `--n` Negative prompt up to the next option.
-  * `--w` Specifies the width of the generated image.
-  * `--h` Specifies the height of the generated image.
-  * `--d` Specifies the seed of the generated image.
-  * `--l` Specifies the CFG scale of the generated image.
-  * `--s` Specifies the number of steps in the generation.
-
-  `( )` や `[ ]` などの重みづけも動作します。
--- a/README.md
+++ b/README.md
@@ -1,17 +1,183 @@
+# Training Stable Cascade Stage C 
+
+This is an experimental feature. There may be bugs.
+
+__Feb 25, 2024 Update:__ Fixed a bug that prevented the trained LoRA weights from being loaded in ComfyUI. If you still have a problem, please let me know.
+
+__Feb 25, 2024 Update:__ Fixed a bug where Stage C training with mixed precision behaved as if `--full_bf16` (or `--full_fp16`) were specified, regardless of whether it actually was.
+
+The cause was that the Stage C weights were loaded in bf16/fp16. With this fix, memory usage increases when `--full_bf16` (or `--full_fp16`) is not specified, so specify it if you need to reduce memory usage.
+
+__Feb 22, 2024 Update:__ Fixed a bug where LoRA was not applied to some modules (to_q/k/v and to_out) in Attention. Also, the model structure of Stage C has been changed, and you can now choose between xformers and SDPA (previously SDPA was always used). Please specify the `--sdpa` or `--xformers` option.
+
+__Feb 20, 2024 Update:__ There was a problem with the preprocessing of the EfficientNetEncoder that made the latents invalid (the saturation of the training results decreased). If you have cached `_sc_latents.npz` files with `--cache_latents_to_disk`, please delete them before training.
+
+## Usage
+
+Training is run with `stable_cascade_train_stage_c.py`.
+
+The main options are the same as `sdxl_train.py`. The following options have been added.
+
+- `--effnet_checkpoint_path`: Specifies the path to the EfficientNetEncoder weights.
+- `--stage_c_checkpoint_path`: Specifies the path to the Stage C weights.
+- `--text_model_checkpoint_path`: Specifies the path to the Text Encoder weights. If omitted, the model from Hugging Face will be used.
+- `--save_text_model`: Saves the model downloaded from Hugging Face to `--text_model_checkpoint_path`.
+- `--previewer_checkpoint_path`: Specifies the path to the Previewer weights. Used to generate sample images during training.
+- `--adaptive_loss_weight`: Uses [Adaptive Loss Weight](https://github.com/Stability-AI/StableCascade/blob/master/gdf/loss_weights.py). If omitted, P2LossWeight is used. The official settings use Adaptive Loss Weight.
+
+The learning rate is set to 1e-4 in the official settings.
+
+On the first run, specify `--text_model_checkpoint_path` and `--save_text_model` to save the Text Encoder weights. On subsequent runs, specify only `--text_model_checkpoint_path` to load the saved weights.
+
+Sample image generation during training is done with the Previewer. The Previewer is a simple decoder that converts EfficientNetEncoder latents to images.
+
+Some of the options for SDXL are simply ignored or cause an error (especially noise-related options such as `--noise_offset`). `--vae_batch_size` and `--no_half_vae` are applied directly to the EfficientNetEncoder (when `bf16` is specified for mixed precision, `--no_half_vae` is not necessary).
+
+Options for latents and Text Encoder output caches can be used as is, but since the EfficientNetEncoder is much lighter than the VAE, you may not need to use the cache unless memory is particularly tight.
+
+The memory-saving options `--gradient_checkpointing`, `--full_bf16`, and `--full_fp16` (untested) can be used as is.
+
+A scale of about 4 is suitable for sample image generation.
+
+Since the official settings use `bf16` for training, training with `fp16` may be unstable.
+
+The code for training the Text Encoder is also written, but it is untested.
+
+### Command line sample
+
+```batch
+accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py --mixed_precision bf16 --save_precision bf16 --max_data_loader_n_workers 2 --persistent_data_loader_workers --gradient_checkpointing --learning_rate 1e-4 --optimizer_type adafactor --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" --max_train_epochs 10 --save_every_n_epochs 1 --output_dir ../output --output_name sc_test --stage_c_checkpoint_path ../models/stage_c_bf16.safetensors --effnet_checkpoint_path ../models/effnet_encoder.safetensors --previewer_checkpoint_path ../models/previewer.safetensors --dataset_config ../dataset/config_bs1.toml --sample_every_n_epochs 1 --sample_prompts ../dataset/prompts.txt --adaptive_loss_weight
+```
+
+### About the dataset for fine tuning
+
+If latents cache files for SD/SDXL exist (extension `*.npz`), they will be read and cause an error during training. Please move them to another location in advance.
+
+After that, run `finetune/prepare_buckets_latents.py` with the `--stable_cascade` option to create latents cache files for Stable Cascade (suffix `_sc_latents.npz` is added).
+
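Note on the dataset preparation step above: moving the old SD/SDXL caches aside can be scripted. The following is a minimal sketch, assuming the caches sit directly in one dataset folder and that Stable Cascade caches are recognised only by the `_sc_latents.npz` suffix mentioned in the README. The function name `move_non_sc_caches` and the backup folder are hypothetical and are not part of the repository.

```python
import shutil
from pathlib import Path

# Suffix the README gives for Stable Cascade latents caches.
SC_SUFFIX = "_sc_latents.npz"


def move_non_sc_caches(dataset_dir: str, backup_dir: str) -> list:
    """Move *.npz files that are not Stable Cascade caches into backup_dir.

    Only the top level of dataset_dir is scanned (an assumption of this sketch).
    Returns the names of the files that were moved.
    """
    src = Path(dataset_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for npz in sorted(src.glob("*.npz")):
        if npz.name.endswith(SC_SUFFIX):
            continue  # keep Stable Cascade caches in place
        shutil.move(str(npz), str(dst / npz.name))
        moved.append(npz.name)
    return moved
```

Calling `move_non_sc_caches("../dataset/images", "../dataset/old_caches")` (both paths are placeholders) would leave images, captions and `_sc_latents.npz` files untouched.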
+## LoRA training
+
+`stable_cascade_train_c_network.py` is used for LoRA training. The main options are the same as `train_network.py`, and the same options as `stable_cascade_train_stage_c.py` have been added.
+
+__This is an experimental feature, so the format of the saved weights may change in the future and become incompatible.__
+
+The weights are not compatible with the official LoRA, and the Text Encoder embedding training (Pivotal Tuning) found in the official implementation is not implemented here.
+
+Text Encoder LoRA training is implemented, but untested.
+
+## Image generation
+
+Basic image generation functionality is available in `stable_cascade_gen_img.py`. See `--help` for usage.
+
+When using LoRA, specify `--network_module networks.lora --network_mul 1 --network_weights lora_weights.safetensors`.
+
+The following prompt options are available.
+
+  * `--n` Negative prompt up to the next option.
+  * `--w` Specifies the width of the generated image.
+  * `--h` Specifies the height of the generated image.
+  * `--d` Specifies the seed of the generated image.
+  * `--l` Specifies the CFG scale of the generated image.
+  * `--s` Specifies the number of steps in the generation.
+  * `--t` Specifies the t_start of the generation.
+  * `--f` Specifies the shift of the generation.
+
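Note on the prompt options listed above: a prompt line carries its options inline, so it can help to see how such a line might be split. This is a minimal sketch only, not the parser used by `stable_cascade_gen_img.py`. The names `parse_prompt_line` and `OPTION_TYPES` are hypothetical, and the value types (for example an integer `t_start`) are assumptions.

```python
import re

# Option letters come from the README list; the value types are assumptions.
OPTION_TYPES = {
    "n": str,    # negative prompt
    "w": int,    # width
    "h": int,    # height
    "d": int,    # seed
    "l": float,  # CFG scale
    "s": int,    # number of steps
    "t": int,    # t_start
    "f": float,  # shift
}


def parse_prompt_line(line: str) -> dict:
    """Split 'prompt --x value ...' into the prompt text and its options."""
    # A leading space lets a line that starts with an option match as well.
    # The capture group keeps each option letter in the split result.
    parts = re.split(r"\s--([a-z])\s+", " " + line.strip())
    result = {"prompt": parts[0].strip()}
    for key, value in zip(parts[1::2], parts[2::2]):
        caster = OPTION_TYPES.get(key)
        if caster is not None:  # unknown letters are silently skipped
            result[key] = caster(value.strip())
    return result
```

Each option value runs up to the next ` --x ` marker, which matches the README's description of `--n` as "up to the next option".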
+# Stable Cascade Stage C の学習
+
+実験的機能です。不具合があるかもしれません。
+
+__2024/2/25 追記:__ 学習される LoRA の重みが ComfyUI で読み込めるよう修正しました。依然として不具合がある場合にはご連絡ください。
+
+__2024/2/25 追記:__ Mixed precision 時のStage C の学習が、 `--full_bf16` (fp16) の指定に関わらず `--full_bf16` (fp16) 指定時と同じ動作となる（と思われる）不具合を修正しました。
+
+Stage C の重みを bf16/fp16 で読み込んでいたためです。この修正により `--full_bf16` (fp16) 未指定時のメモリ使用量が増えますので、必要に応じて `--full_bf16` (fp16) を指定してください。
+
+__2024/2/22 追記:__ LoRA が一部のモジュール（Attention の to_q/k/v および to_out）に適用されない不具合を修正しました。また Stage C のモデル構造を変更し xformers と SDPA を選べるようになりました（今までは SDPA が使用されていました）。`--sdpa` または `--xformers` オプションを指定してください。
+
+__2024/2/20 追記:__ EfficientNetEncoder の前処理に不具合があり、latents が不正になっていました（学習結果の彩度が低下する現象が起きます）。`--cache_latents_to_disk` でキャッシュした `_sc_latents.npz` がある場合、いったん削除してから学習してください。
+
+## 使い方
+
+学習は `stable_cascade_train_stage_c.py` で行います。
+
+主なオプションは `sdxl_train.py` と同様です。以下のオプションが追加されています。
+
+- `--effnet_checkpoint_path` : EfficientNetEncoder の重みのパスを指定します。
+- `--stage_c_checkpoint_path` : Stage C の重みのパスを指定します。
+- `--text_model_checkpoint_path` : Text Encoder の重みのパスを指定します。省略時は Hugging Face のモデルを使用します。
+- `--save_text_model` : `--text_model_checkpoint_path` にHugging Face からダウンロードしたモデルを保存します。
+- `--previewer_checkpoint_path` : Previewer の重みのパスを指定します。学習中のサンプル画像生成に使用します。
+- `--adaptive_loss_weight` :  [Adaptive Loss Weight](https://github.com/Stability-AI/StableCascade/blob/master/gdf/loss_weights.py) を用います。省略時は P2LossWeight が使用されます。公式では Adaptive Loss Weight が使用されているようです。
+
+学習率は、公式の設定では 1e-4 のようです。
+
+初回は `--text_model_checkpoint_path` と `--save_text_model` を指定して、Text Encoder の重みを保存すると良いでしょう。次からは `--text_model_checkpoint_path` を指定して、保存した重みを読み込むことができます。
+
+学習中のサンプル画像生成は Perviewer で行われます。Previewer は EfficientNetEncoder の latents を画像に変換する簡易的な decoder です。
+
+SDXL の向けの一部のオプションは単に無視されるか、エラーになります（特に `--noise_offset` などのノイズ関係）。`--vae_batch_size` および `--no_half_vae` はそのまま EfficientNetEncoder に適用されます（mixed precision に `bf16` 指定時は `--no_half_vae` は不要のようです）。
+
+latents および Text Encoder 出力キャッシュのためのオプションはそのまま使用できますが、EfficientNetEncoder は VAE よりもかなり軽量のため、メモリが特に厳しい場合以外はキャッシュを使用する必要はないかもしれません。
+
+メモリ消費を抑えるための `--gradient_checkpointing` 、`--full_bf16`、`--full_fp16`（未テスト）はそのまま使用できます。
+
+サンプル画像生成時の Scale には 4 程度が適しているようです。
+
+公式の設定では学習に `bf16` を用いているため、`fp16` での学習は不安定かもしれません。
+
+Text Encoder 学習のコードも書いてありますが、未テストです。
+
+### コマンドラインのサンプル
+
+[Command-line-sample](#command-line-sample)を参照してください。
+
+
+###  fine tuning方式のデータセットについて
+
+SD/SDXL 向けの latents キャッシュファイル（拡張子 `*.npz`）が存在するとそれを読み込んでしまい学習時にエラーになります。あらかじめ他の場所に退避しておいてください。
+
+その後、`finetune/prepare_buckets_latents.py` をオプション `--stable_cascade` を指定して実行すると、Stable Cascade 向けの latents キャッシュファイル（接尾辞 `_sc_latents.npz` が付きます）が作成されます。
+
+
+## LoRA 等の学習
+
+LoRA の学習は `stable_cascade_train_c_network.py` で行います。主なオプションは `train_network.py` と同様で、`stable_cascade_train_stage_c.py` と同様のオプションが追加されています。
+
+__実験的機能のため、保存される重みのフォーマットは将来的に変更され、互換性がなくなる可能性があります。__
+
+公式の LoRA と重みの互換性はありません。また公式で実装されている Text Encoder の embedding 学習（Pivotal Tuning）も実装されていません。
+
+Text Encoder の LoRA 学習は実装してありますが、未テストです。
+
+## 画像生成
+
+最低限の画像生成機能が `stable_cascade_gen_img.py` にあります。使用法は `--help` を参照してください。
+
+LoRA 使用時は `--network_module networks.lora --network_mul 1 --network_weights lora_weights.safetensors` のように指定します。
+
+プロンプトオプションとして以下が使用できます。
+
+  * `--n` Negative prompt up to the next option.
+  * `--w` Specifies the width of the generated image.
+  * `--h` Specifies the height of the generated image.
+  * `--d` Specifies the seed of the generated image.
+  * `--l` Specifies the CFG scale of the generated image.
+  * `--s` Specifies the number of steps in the generation.
+  * `--t` Specifies the t_start of the generation.
+  * `--f` Specifies the shift of the generation.
+
+
+---  
+
+__SDXL is now supported. The sdxl branch has been merged into the main branch. If you update the repository, please follow the upgrade instructions. Also, the version of accelerate has been updated, so please run accelerate config again.__ The documentation for SDXL training is [here](./README.md#sdxl-training).
+
 This repository contains training, generation and utility scripts for Stable Diffusion.

 [__Change History__](#change-history) is moved to the bottom of the page. 
 更新履歴は[ページ末尾](#change-history)に移しました。

-Latest update: 2025-03-21 (Version 0.9.1)
-
 [日本語版READMEはこちら](./README-ja.md)

-The development version is in the `dev` branch. Please check the dev branch for the latest changes.
-
-FLUX.1 and SD3/SD3.5 support is done in the `sd3` branch. If you want to train them, please use the sd3 branch.
-
-
 For easier use (GUI and PowerShell scripts etc...), please visit [the repository maintained by bmaltais](https://github.com/bmaltais/kohya_ss). Thanks to @bmaltais!

 This repository contains the scripts for:
@@ -25,9 +191,9 @@ This repository contains the scripts for:

 ## About requirements.txt

-The file does not contain requirements for PyTorch. Because the version of PyTorch depends on the environment, it is not included in the file. Please install PyTorch first according to the environment. See installation instructions below.
+`requirements.txt` does not include PyTorch, because the required version depends on your environment. Please install PyTorch first (see the installation guide below).

-The scripts are tested with Pytorch 2.1.2. PyTorch 2.2 or later will work. Please install the appropriate version of PyTorch and xformers.
+The scripts are tested with PyTorch 2.0.1. PyTorch 1.12.1 is not tested but should work.

 ## Links to usage documentation

@@ -37,13 +203,11 @@ Most of the documents are written in Japanese.

 * [Training guide - common](./docs/train_README-ja.md) : data preparation, options etc... 
  * [Chinese version](./docs/train_README-zh.md)
-* [SDXL training](./docs/train_SDXL-en.md) (English version)
 * [Dataset config](./docs/config_README-ja.md) 
-  * [English version](./docs/config_README-en.md)
 * [DreamBooth training guide](./docs/train_db_README-ja.md)
 * [Step by Step fine-tuning guide](./docs/fine_tune_README_ja.md):
-* [Training LoRA](./docs/train_network_README-ja.md)
-* [Training Textual Inversion](./docs/train_ti_README-ja.md)
+* [training LoRA](./docs/train_network_README-ja.md)
+* [training Textual Inversion](./docs/train_ti_README-ja.md)
 * [Image generation](./docs/gen_img_README-ja.md)
 * note.com [Model conversion](https://note.com/kohya_ss/n/n374f316fe4ad)

@@ -54,8 +218,6 @@ Python 3.10.6 and Git:
 - Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
 - git: https://git-scm.com/download/win

-Python 3.10.x, 3.11.x, and 3.12.x will work but not tested.
-
 Give unrestricted script access to powershell so venv can work:

 - Open an administrator powershell window
@@ -73,20 +235,14 @@ cd sd-scripts
 python -m venv venv
 .\venv\Scripts\activate

-pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
+pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --index-url https://download.pytorch.org/whl/cu118
 pip install --upgrade -r requirements.txt
-pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118
+pip install xformers==0.0.20

 accelerate config
 ```

-If `python -m venv` shows only `python`, change `python` to `py`.
-
-Note: Now `bitsandbytes==0.44.0`, `prodigyopt==1.0` and `lion-pytorch==0.0.6` are included in the requirements.txt. If you'd like to use the another version, please install it manually.
-
-This installation is for CUDA 11.8. If you use a different version of CUDA, please install the appropriate version of PyTorch and xformers. For example, if you use CUDA 12, please install `pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121` and `pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu121`.
-
-If you use PyTorch 2.2 or later, please change `torch==2.1.2` and `torchvision==0.16.2` and `xformers==0.0.23.post1` to the appropriate version.
+__Note:__ `bitsandbytes` is now optional. Please install whichever version you need. Installation instructions are in the following section.

 <!-- 
 cp .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
@@ -105,13 +261,48 @@ Answers to accelerate config:
 - fp16
 ```

-If you'd like to use bf16, please answer `bf16` to the last question.
-
-Note: Some user reports ``ValueError: fp16 mixed precision requires a GPU`` is occurred in training. In this case, answer `0` for the 6th question: 
+Note: Some users report that ``ValueError: fp16 mixed precision requires a GPU`` occurs during training. In this case, answer `0` to the 6th question: 
 ``What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:`` 

 (Single GPU with id `0` will be used.)

+### Optional: Use `bitsandbytes` (8bit optimizer)
+
+To use the 8-bit optimizers, you need to install `bitsandbytes`. On Linux, install `bitsandbytes` with pip as usual (0.41.1 or later is recommended).
+
+For Windows, there are several versions of `bitsandbytes`:
+
+- `bitsandbytes` 0.35.0: Stable version. AdamW8bit is available. `full_bf16` is not available.
+- `bitsandbytes` 0.41.1: Lion8bit, PagedAdamW8bit and PagedLion8bit are available. `full_bf16` is available.
+
+Note: `bitsandbytes` versions after 0.35.0 and up to 0.41.0 seem to have an issue: https://github.com/TimDettmers/bitsandbytes/issues/659
+
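Note on the version range above: the range is easy to misread, so here is a minimal sketch of a check that encodes it. It assumes plain dotted version strings such as `0.41.1`. The function names are hypothetical and nothing like this is shipped with the scripts.

```python
def parse_version(version: str) -> tuple:
    """Turn '0.41.1' into (0, 41, 1); assumes plain dotted integers."""
    return tuple(int(part) for part in version.split("."))


def is_in_reported_problem_range(version: str) -> bool:
    """True for versions after 0.35.0 and up to 0.41.0, per the README note."""
    v = parse_version(version)
    # Tuples compare element by element, so this is a numeric range check.
    return (0, 35, 0) < v <= (0, 41, 0)
```

The installed version could be obtained with the standard library call `importlib.metadata.version("bitsandbytes")` and passed to this check.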
+Follow the instructions below to install `bitsandbytes` for Windows.
+
+### bitsandbytes 0.35.0 for Windows
+
+Open a regular PowerShell terminal and run the following:
+
+```powershell
+cd sd-scripts
+.\venv\Scripts\activate
+pip install bitsandbytes==0.35.0
+
+cp .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
+cp .\bitsandbytes_windows\cextension.py .\venv\Lib\site-packages\bitsandbytes\cextension.py
+cp .\bitsandbytes_windows\main.py .\venv\Lib\site-packages\bitsandbytes\cuda_setup\main.py
+```
+
+This will install `bitsandbytes` 0.35.0 and copy the necessary files to the `bitsandbytes` directory.
+
+### bitsandbytes 0.41.1 for Windows
+
+Install the Windows whl file from [here](https://github.com/jllllll/bitsandbytes-windows-webui) or another source, for example:
+
+```powershell
+python -m pip install bitsandbytes==0.41.1 --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui
+```
+
 ## Upgrade

 When a new release comes out you can upgrade your repo with the following command:
@@ -125,10 +316,6 @@ pip install --use-pep517 --upgrade -r requirements.txt

 Once the commands have completed successfully you should be ready to use the new version.

-### Upgrade PyTorch
-
-If you want to upgrade PyTorch, you can upgrade it with `pip install` command in [Windows Installation](#windows-installation) section. `xformers` is also required to be upgraded when PyTorch is upgraded.
-
 ## Credits

 The implementation for LoRA is based on [cloneofsimo's repo](https://github.com/cloneofsimo/lora). Thank you for great work!
@@ -146,414 +333,209 @@ The majority of scripts is licensed under ASL 2.0 (including codes from Diffuser
 [BLIP](https://github.com/salesforce/BLIP): BSD-3-Clause


+## SDXL training
+
+The documentation in this section will be moved to a separate document later.
+
+### Training scripts for SDXL
+
+- `sdxl_train.py` is a script for SDXL fine-tuning. The usage is almost the same as `fine_tune.py`, but it also supports DreamBooth dataset.
+  - `--full_bf16` option is added. Thanks to KohakuBlueleaf!
+    - This option enables the full bfloat16 training (includes gradients). This option is useful to reduce the GPU memory usage. 
+    - The full bfloat16 training might be unstable. Please use it at your own risk.
+  - Different learning rates for each U-Net block are now supported in `sdxl_train.py`. Specify them with the `--block_lr` option as 23 comma-separated values, like `--block_lr 1e-3,1e-3 ... 1e-3`.
+    - 23 values correspond to `0: time/label embed, 1-9: input blocks 0-8, 10-12: mid blocks 0-2, 13-21: output blocks 0-8, 22: out`.
+- `prepare_buckets_latents.py` now supports SDXL fine-tuning.
+
+- `sdxl_train_network.py` is a script for LoRA training for SDXL. The usage is almost the same as `train_network.py`.
+
+- Both scripts have the following additional options:
+  - `--cache_text_encoder_outputs` and `--cache_text_encoder_outputs_to_disk`: Cache the outputs of the text encoders. This option is useful to reduce the GPU memory usage. This option cannot be used with options for shuffling or dropping the captions.
+  - `--no_half_vae`: Disable the half-precision (mixed-precision) VAE. VAE for SDXL seems to produce NaNs in some cases. This option is useful to avoid the NaNs.
+
+- The `--weighted_captions` option is not yet supported by either script.
+
+- `sdxl_train_textual_inversion.py` is a script for Textual Inversion training for SDXL. The usage is almost the same as `train_textual_inversion.py`.
+  - `--cache_text_encoder_outputs` is not supported.
+  - There are two options for captions:
+    1. Training with captions. All captions must include the token string. The token string is replaced with multiple tokens.
+    2. Use `--use_object_template` or `--use_style_template` option. The captions are generated from the template. The existing captions are ignored.
+  - See below for the format of the embeddings.
+
+- `--min_timestep` and `--max_timestep` options are added to each training script. These options can be used to train U-Net with different timesteps. The default values are 0 and 1000.
+
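Note on `--block_lr` above: typing 23 values by hand is error-prone, so the argument string can be generated. This is a minimal sketch under the index layout quoted in the README. `BLOCK_NAMES` and `build_block_lr` are hypothetical names and are not part of `sdxl_train.py`.

```python
# Index layout taken from the README: 23 entries in total.
BLOCK_NAMES = (
    ["time/label embed"]                         # index 0
    + [f"input block {i}" for i in range(9)]     # indices 1-9
    + [f"mid block {i}" for i in range(3)]       # indices 10-12
    + [f"output block {i}" for i in range(9)]    # indices 13-21
    + ["out"]                                    # index 22
)


def build_block_lr(default_lr: float, overrides: dict = None) -> str:
    """Return a comma-separated string of 23 learning rates for --block_lr.

    overrides maps a block index (0-22) to a learning rate for that block.
    """
    rates = [default_lr] * len(BLOCK_NAMES)
    for index, lr in (overrides or {}).items():
        rates[index] = lr
    return ",".join(f"{lr:g}" for lr in rates)
```

For example, `build_block_lr(1e-3, {22: 1e-4})` gives every block the default rate except the final `out` block.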
+### Utility scripts for SDXL
+
+- `tools/cache_latents.py` is added. This script can be used to cache the latents to disk in advance. 
+  - The options are almost the same as `sdxl_train.py`. See the help message for the usage.
+  - Please launch the script as follows:
+    `accelerate launch  --num_cpu_threads_per_process 1 tools/cache_latents.py ...`
+  - This script should work with multi-GPU, but it is not tested in my environment.
+
+- `tools/cache_text_encoder_outputs.py` is added. This script can be used to cache the text encoder outputs to disk in advance. 
+  - The options are almost the same as `cache_latents.py` and `sdxl_train.py`. See the help message for the usage.
+
+- `sdxl_gen_img.py` is added. This script can be used to generate images with SDXL, including LoRA, Textual Inversion and ControlNet-LLLite. See the help message for the usage.
+
+### Tips for SDXL training
+
+- The default resolution of SDXL is 1024x1024.
+- Fine-tuning can be done with 24GB of GPU memory at a batch size of 1. The following options are recommended __for fine-tuning with 24GB of GPU memory__:
+  - Train U-Net only.
+  - Use gradient checkpointing.
+  - Use `--cache_text_encoder_outputs` option and caching latents.
+  - Use Adafactor optimizer. RMSprop 8bit or Adagrad 8bit may work. AdamW 8bit doesn't seem to work.
+- The LoRA training can be done with 8GB GPU memory (10GB recommended). For reducing the GPU memory usage, the following options are recommended:
+  - Train U-Net only.
+  - Use gradient checkpointing.
+  - Use `--cache_text_encoder_outputs` option and caching latents.
+  - Use one of 8bit optimizers or Adafactor optimizer.
+  - Use lower dim (4 to 8 for 8GB GPU).
+- The `--network_train_unet_only` option is highly recommended for SDXL LoRA. Because SDXL has two text encoders, training them may produce unexpected results.
+- PyTorch 2 seems to use slightly less GPU memory than PyTorch 1.
+- `--bucket_reso_steps` can be set to 32 instead of the default value 64. Smaller values than 32 will not work for SDXL training.
+
+Example of the optimizer settings for Adafactor with the fixed learning rate:
+```toml
+optimizer_type = "adafactor"
+optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False" ]
+lr_scheduler = "constant_with_warmup"
+lr_warmup_steps = 100
+learning_rate = 4e-7 # SDXL original learning rate
+```
+
+### Format of Textual Inversion embeddings for SDXL
+
+```python
+from safetensors.torch import save_file
+
+state_dict = {"clip_g": embs_for_text_encoder_1280, "clip_l": embs_for_text_encoder_768}
+save_file(state_dict, file)
+```
+
+### ControlNet-LLLite
+
+ControlNet-LLLite, a novel method for ControlNet with SDXL, is added. See [documentation](./docs/train_lllite_README.md) for details.
+
+
 ## Change History

-### Mar 21, 2025 /  2025-03-21 Version 0.9.1
+### Feb 24, 2024 / 2024/2/24: v0.8.4
+
+- The log output has been improved. PR [#905](https://github.com/kohya-ss/sd-scripts/pull/905) Thanks to shirayu!
+  - The log is formatted by default. The `rich` library is required. Please see [Upgrade](#upgrade) and update the library.
+  - If `rich` is not installed, the log output will be the same as before.
+  - The following options are available in each training script:
+    - `--console_log_simple` option can be used to switch to the previous log output.
+    - `--console_log_level` option can be used to specify the log level. The default is `INFO`.
+    - `--console_log_file` option can be used to output the log to a file. The default is `None` (output to the console).
+- The sample image generation during multi-GPU training is now done with multiple GPUs. PR [#1061](https://github.com/kohya-ss/sd-scripts/pull/1061) Thanks to DKnight54!
+- The support for mps devices is improved. PR [#1054](https://github.com/kohya-ss/sd-scripts/pull/1054) Thanks to akx! If mps device exists instead of CUDA, the mps device is used automatically.
+- The `--new_conv_rank` option to specify the new rank of Conv2d is added to `networks/resize_lora.py`. PR [#1102](https://github.com/kohya-ss/sd-scripts/pull/1102) Thanks to mgz-dev!
+- An option `--highvram` to disable the optimization for environments with little VRAM is added to the training scripts. If you specify it when there is enough VRAM, the operation will be faster.
+  - Currently, only the cache part of latents is optimized.
+- The IPEX support is improved. PR [#1086](https://github.com/kohya-ss/sd-scripts/pull/1086) Thanks to Disty0!
+- Fixed a bug where `svd_merge_lora.py` crashed in some cases. PR [#1087](https://github.com/kohya-ss/sd-scripts/pull/1087) Thanks to mgz-dev!
+- DyLoRA is fixed to work with SDXL. PR [#1126](https://github.com/kohya-ss/sd-scripts/pull/1126) Thanks to tamlog06!
+- The common image generation script `gen_img.py` for SD 1/2 and SDXL is added. The basic functions are the same as the scripts for SD 1/2 and SDXL, but some new features are added.
+  - External scripts that generate prompts are supported. They can be called with the `--from_module` option. (The documentation will be added later)
+  - The normalization method after prompt weighting can be specified with `--emb_normalize_mode` option. `original` is the original method, `abs` is the normalization with the average of the absolute values, `none` is no normalization.
+- Gradual Latent Hires fix is added to each generation script. See [here](./docs/gen_img_README-ja.md#about-gradual-latent) for details.
+
+- ログ出力が改善されました。 PR [#905](https://github.com/kohya-ss/sd-scripts/pull/905) shirayu 氏に感謝します。
+  - デフォルトでログが成形されます。`rich` ライブラリが必要なため、[Upgrade](#upgrade) を参照し更新をお願いします。
+  - `rich` がインストールされていない場合は、従来のログ出力になります。
+  - 各学習スクリプトでは以下のオプションが有効です。
+  - `--console_log_simple` オプションで従来のログ出力に切り替えられます。
+  - `--console_log_level` でログレベルを指定できます。デフォルトは `INFO` です。
+  - `--console_log_file` でログファイルを出力できます。デフォルトは `None`（コンソールに出力） です。
+- 複数 GPU 学習時に学習中のサンプル画像生成を複数 GPU で行うようになりました。 PR [#1061](https://github.com/kohya-ss/sd-scripts/pull/1061) DKnight54 氏に感謝します。
+- mps デバイスのサポートが改善されました。 PR [#1054](https://github.com/kohya-ss/sd-scripts/pull/1054) akx 氏に感謝します。CUDA ではなく mps が存在する場合には自動的に mps デバイスを使用します。
+- `networks/resize_lora.py` に Conv2d の新しいランクを指定するオプション `--new_conv_rank` が追加されました。 PR [#1102](https://github.com/kohya-ss/sd-scripts/pull/1102) mgz-dev 氏に感謝します。
+- 学習スクリプトに VRAMが少ない環境向け最適化を無効にするオプション `--highvram` を追加しました。VRAM に余裕がある場合に指定すると動作が高速化されます。
+  - 現在は latents のキャッシュ部分のみ高速化されます。
+- IPEX サポートが改善されました。 PR [#1086](https://github.com/kohya-ss/sd-scripts/pull/1086) Disty0 氏に感謝します。
+- `svd_merge_lora.py` が場合によってエラーになる不具合が修正されました。 PR [#1087](https://github.com/kohya-ss/sd-scripts/pull/1087) mgz-dev 氏に感謝します。
+- DyLoRA が SDXL で動くよう修正されました。PR [#1126](https://github.com/kohya-ss/sd-scripts/pull/1126) tamlog06 氏に感謝します。
+- SD 1/2 および SDXL 共通の生成スクリプト `gen_img.py` を追加しました。基本的な機能は SD 1/2、SDXL 向けスクリプトと同じですが、いくつかの新機能が追加されています。
+  - プロンプトを動的に生成する外部スクリプトをサポートしました。 `--from_module` で呼び出せます。（ドキュメントはのちほど追加します）
+  - プロンプト重みづけ後の正規化方法を `--emb_normalize_mode` で指定できます。`original` は元の方法、`abs` は絶対値の平均値で正規化、`none` は正規化を行いません。
+- Gradual Latent Hires fix を各生成スクリプトに追加しました。詳細は [こちら](./docs/gen_img_README-ja.md#about-gradual-latent)。
+
+
+### Jan 27, 2024 / 2024/1/27: v0.8.3
+
+- Fixed a bug where training crashes when `--fp8_base` is specified with `--save_state`. PR [#1079](https://github.com/kohya-ss/sd-scripts/pull/1079) Thanks to feffy380!
+  - `safetensors` is updated. Please see [Upgrade](#upgrade) and update the library.
+- Fixed a bug where training crashes when `network_multiplier` is specified with multi-GPU training. PR [#1084](https://github.com/kohya-ss/sd-scripts/pull/1084) Thanks to fireicewolf!
+- Fixed a bug where training crashes when training ControlNet-LLLite.
+
+- `--fp8_base` 指定時に `--save_state` での保存がエラーになる不具合が修正されました。 PR [#1079](https://github.com/kohya-ss/sd-scripts/pull/1079) feffy380 氏に感謝します。
+  - `safetensors` がバージョンアップされていますので、[Upgrade](#upgrade) を参照し更新をお願いします。
+- 複数 GPU での学習時に `network_multiplier` を指定するとクラッシュする不具合が修正されました。 PR [#1084](https://github.com/kohya-ss/sd-scripts/pull/1084) fireicewolf 氏に感謝します。
+- ControlNet-LLLite の学習がエラーになる不具合を修正しました。 
+
+### Jan 23, 2024 / 2024/1/23: v0.8.2
+
+- [Experimental] The `--fp8_base` option is added to the training scripts for LoRA etc. The base model (U-Net, and Text Encoder when training modules for Text Encoder) can be trained with fp8. PR [#1057](https://github.com/kohya-ss/sd-scripts/pull/1057) Thanks to KohakuBlueleaf!
+  - Please specify `--fp8_base` in `train_network.py` or `sdxl_train_network.py`.
+  - PyTorch 2.1 or later is required.
+  - If you use xformers with PyTorch 2.1, please see [xformers repository](https://github.com/facebookresearch/xformers) and install the appropriate version according to your CUDA version.
+  - The sample image generation during training consumes a lot of memory. It is recommended to turn it off.
+
+- [Experimental] The network multiplier can be specified for each dataset in the training scripts for LoRA etc.
+  - This is an experimental option and may be removed or changed in the future.
+  - For example, if you train with state A as `1.0` and state B as `-1.0`, you may be able to generate by switching between state A and B depending on the LoRA application rate.
+  - Also, if you prepare five states and train them as `0.2`, `0.4`, `0.6`, `0.8`, and `1.0`, you may be able to generate by switching the states smoothly depending on the application rate.
+  - Please specify `network_multiplier` in `[[datasets]]` in `.toml` file.
+- Some options are added to `networks/extract_lora_from_models.py` to reduce the memory usage.
+  - `--load_precision` option can be used to specify the precision when loading the model. If the model is saved in fp16, you can reduce the memory usage by specifying `--load_precision fp16` without losing precision.
+  - `--load_original_model_to` option can be used to specify the device to load the original model. `--load_tuned_model_to` option can be used to specify the device to load the derived model. The default is `cpu` for both options, but you can specify `cuda` etc. You can reduce the memory usage by loading one of them to GPU. This option is available only for SDXL.
+
+- The gradient synchronization in LoRA training with multi-GPU is improved. PR [#1064](https://github.com/kohya-ss/sd-scripts/pull/1064) Thanks to KohakuBlueleaf!
+- The code for Intel IPEX support is improved. PR [#1060](https://github.com/kohya-ss/sd-scripts/pull/1060) Thanks to akx!
+- Fixed a bug in multi-GPU Textual Inversion training.
+
+- （実験的）　LoRA等の学習スクリプトで、ベースモデル（U-Net、および Text Encoder のモジュール学習時は Text Encoder も）の重みを fp8 にして学習するオプションが追加されました。 PR [#1057](https://github.com/kohya-ss/sd-scripts/pull/1057) KohakuBlueleaf 氏に感謝します。
+  - `train_network.py` または `sdxl_train_network.py` で `--fp8_base` を指定してください。
+  - PyTorch 2.1 以降が必要です。
+  - PyTorch 2.1 で xformers を使用する場合は、[xformers のリポジトリ](https://github.com/facebookresearch/xformers) を参照し、CUDA バージョンに応じて適切なバージョンをインストールしてください。
+  - 学習中のサンプル画像生成はメモリを大量に消費するため、オフにすることをお勧めします。
+- (実験的)　LoRA 等の学習で、データセットごとに異なるネットワーク適用率を指定できるようになりました。 
+  - 実験的オプションのため、将来的に削除または仕様変更される可能性があります。
+  - たとえば状態 A を `1.0`、状態 B を `-1.0` として学習すると、LoRA の適用率に応じて状態 A と B を切り替えつつ生成できるかもしれません。
+  - また、五段階の状態を用意し、それぞれ `0.2`、`0.4`、`0.6`、`0.8`、`1.0` として学習すると、適用率でなめらかに状態を切り替えて生成できるかもしれません。 
+  - `.toml` ファイルで `[[datasets]]` に `network_multiplier` を指定してください。
+- `networks/extract_lora_from_models.py` に使用メモリ量を削減するいくつかのオプションを追加しました。 
+  - `--load_precision` で読み込み時の精度を指定できます。モデルが fp16 で保存されている場合は `--load_precision fp16` を指定して精度を変えずにメモリ量を削減できます。
+  - `--load_original_model_to` で元モデルを読み込むデバイスを、`--load_tuned_model_to` で派生モデルを読み込むデバイスを指定できます。デフォルトは両方とも `cpu` ですがそれぞれ `cuda` 等を指定できます。片方を GPU に読み込むことでメモリ量を削減できます。SDXL の場合のみ有効です。
+- マルチ GPU での LoRA 等の学習時に勾配の同期が改善されました。 PR [#1064](https://github.com/kohya-ss/sd-scripts/pull/1064) KohakuBlueleaf 氏に感謝します。
+- Intel IPEX サポートのコードが改善されました。PR [#1060](https://github.com/kohya-ss/sd-scripts/pull/1060) akx 氏に感謝します。
+- マルチ GPU での Textual Inversion 学習の不具合を修正しました。
+
+- `.toml` example for network multiplier / ネットワーク適用率の `.toml` の記述例
+
+```toml
+[general]
+[[datasets]]
+resolution = 512
+batch_size = 8
+network_multiplier = 1.0
+
+# ... subset settings ...
+
+[[datasets]]
+resolution = 512
+batch_size = 8
+network_multiplier = -1.0
+
+# ... subset settings ...
+```

- Fixed a bug where some of LoRA modules for CLIP Text Encoder were not trained. Thank you Nekotekina for PR [#1964](https://github.com/kohya-ss/sd-scripts/pull/1964)
-  - The LoRA modules for CLIP Text Encoder are now 264 modules, which is the same as before. Only 88 modules were trained in the previous version. 
-
-### Jan 17, 2025 /  2025-01-17 Version 0.9.0
-
- __important__ The dependent libraries are updated. Please see [Upgrade](#upgrade) and update the libraries.
-  - bitsandbytes, transformers, accelerate and huggingface_hub are updated. 
-  - If you encounter any issues, please report them.
-
- The dev branch is merged into main. The documentation is delayed, and I apologize for that. I will gradually improve it.
- The state just before the merge is released as Version 0.8.8, so please use it if you encounter any issues.
- The following changes are included.
-
-#### Changes
-
- Fixed a bug where the loss weight was incorrect when `--debiased_estimation_loss` was specified with `--v_parameterization`. PR [#1715](https://github.com/kohya-ss/sd-scripts/pull/1715) Thanks to catboxanon! See [the PR](https://github.com/kohya-ss/sd-scripts/pull/1715) for details.
-  - Removed the warning when `--v_parameterization` is specified in SDXL and SD1.5. PR [#1717](https://github.com/kohya-ss/sd-scripts/pull/1717)
-
- There was a bug where the min_bucket_reso/max_bucket_reso in the dataset configuration did not create the correct resolution bucket if it was not divisible by bucket_reso_steps. These values are now warned and automatically rounded to a divisible value. Thanks to Maru-mee for raising the issue. Related PR [#1632](https://github.com/kohya-ss/sd-scripts/pull/1632)
-
- `bitsandbytes` is updated to 0.44.0. Now you can use `AdEMAMix8bit` and `PagedAdEMAMix8bit` in the training script. PR [#1640](https://github.com/kohya-ss/sd-scripts/pull/1640) Thanks to sdbds!
-  - There is no abbreviation, so please specify the full path like `--optimizer_type bitsandbytes.optim.AdEMAMix8bit` (not bnb but bitsandbytes).
-
- Fixed a bug in the cache of latents. When `flip_aug`, `alpha_mask`, and `random_crop` are different in multiple subsets in the dataset configuration file (.toml), the last subset is used instead of reflecting them correctly.
-
- Fixed an issue where the timesteps in the batch were the same when using Huber loss. PR [#1628](https://github.com/kohya-ss/sd-scripts/pull/1628) Thanks to recris!
-
- Improvements in OFT (Orthogonal Finetuning) Implementation
-  1. Optimization of Calculation Order:
-      - Changed the calculation order in the forward method from (Wx)R to W(xR).
-      - This has improved computational efficiency and processing speed.
-  2. Correction of Bias Application:
-      - In the previous implementation, R was incorrectly applied to the bias.
-      - The new implementation now correctly handles bias by using F.conv2d and F.linear.
-  3. Efficiency Enhancement in Matrix Operations:
-      - Introduced einsum in both the forward and merge_to methods.
-      - This has optimized matrix operations, resulting in further speed improvements.
-  4. Proper Handling of Data Types:
-      - Improved to use torch.float32 during calculations and convert results back to the original data type.
-      - This maintains precision while ensuring compatibility with the original model.
-  5. Unified Processing for Conv2d and Linear Layers:
-     - Implemented a consistent method for applying OFT to both layer types.
-  - These changes have made the OFT implementation more efficient and accurate, potentially leading to improved model performance and training stability.
-
-  - Additional Information
-    * Recommended α value for OFT constraint: We recommend using α values between 1e-4 and 1e-2. This differs slightly from the original implementation of "(α\*out_dim\*out_dim)". Our implementation uses "(α\*out_dim)", hence we recommend higher values than the 1e-5 suggested in the original implementation.
-
-    * Performance Improvement: Training speed has been improved by approximately 30%.
-
-    * Inference Environment: This implementation is compatible with and operates within Stable Diffusion web UI (SD1/2 and SDXL).
-
- The INVERSE_SQRT, COSINE_WITH_MIN_LR, and WARMUP_STABLE_DECAY learning rate schedules are now available in the transformers library. See PR [#1393](https://github.com/kohya-ss/sd-scripts/pull/1393) for details. Thanks to sdbds!
-  - See the [transformers documentation](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/optimizer_schedules#schedules) for details on each scheduler.
-  - `--lr_warmup_steps` and `--lr_decay_steps` can now be specified as a ratio of the number of training steps, not just the step value. Example: `--lr_warmup_steps=0.1` or `--lr_warmup_steps=10%`, etc.
-
- When enlarging images in the script (when the size of the training image is small and bucket_no_upscale is not specified), it has been changed to use Pillow's resize and LANCZOS interpolation instead of OpenCV2's resize and Lanczos4 interpolation. The quality of the image enlargement may be slightly improved. PR [#1426](https://github.com/kohya-ss/sd-scripts/pull/1426) Thanks to sdbds!
-
- Sample image generation during training now works on non-CUDA devices. PR [#1433](https://github.com/kohya-ss/sd-scripts/pull/1433) Thanks to millie-v!
-
- `--v_parameterization` is available in `sdxl_train.py`. The results are unpredictable, so use with caution. PR [#1505](https://github.com/kohya-ss/sd-scripts/pull/1505) Thanks to liesened!
-
- Fused optimizer is available for SDXL training. PR [#1259](https://github.com/kohya-ss/sd-scripts/pull/1259) Thanks to 2kpr!
-  - The memory usage during training is significantly reduced by integrating the optimizer's backward pass with step. The training results are the same as before, but if you have plenty of memory, the speed will be slower.
-  - Specify the `--fused_backward_pass` option in `sdxl_train.py`. At this time, only AdaFactor is supported. Gradient accumulation is not available.
-  - Setting mixed precision to `no` seems to use less memory than `fp16` or `bf16`.
-  - Training is possible with a memory usage of about 17GB with a batch size of 1 and fp32. If you specify the `--full_bf16` option, you can further reduce the memory usage (but the accuracy will be lower). With the same memory usage as before, you can increase the batch size.
-  - PyTorch 2.1 or later is required because it uses the new API `Tensor.register_post_accumulate_grad_hook(hook)`.
-  - Mechanism: Normally, backward -> step is performed for each parameter, so all gradients need to be temporarily stored in memory. "Fuse backward and step" reduces memory usage by performing backward/step for each parameter and reflecting the gradient immediately. The more parameters there are, the greater the effect, so it is not effective in other training scripts (LoRA, etc.) where the memory usage peak is elsewhere, and there are no plans to implement it in those training scripts.
-
- Optimizer groups feature is added to SDXL training. PR [#1319](https://github.com/kohya-ss/sd-scripts/pull/1319)
-  - Memory usage is reduced by the same principle as Fused optimizer. The training results and speed are the same as Fused optimizer.
-  - Specify the number of groups like `--fused_optimizer_groups 10` in `sdxl_train.py`. Increasing the number of groups reduces memory usage but slows down training. Since the effect is limited to a certain number, it is recommended to specify 4-10.
-  - Any optimizer can be used, but optimizers that automatically calculate the learning rate (such as D-Adaptation and Prodigy) cannot be used. Gradient accumulation is not available.
-  - `--fused_optimizer_groups` cannot be used with `--fused_backward_pass`. When using AdaFactor, the memory usage is slightly larger than with Fused optimizer. PyTorch 2.1 or later is required.
-  - Mechanism: While Fused optimizer performs backward/step for individual parameters within the optimizer, optimizer groups reduce memory usage by grouping parameters and creating multiple optimizers to perform backward/step for each group. Fused optimizer requires implementation on the optimizer side, while optimizer groups are implemented only on the training script side.
-
- LoRA+ is supported. PR [#1233](https://github.com/kohya-ss/sd-scripts/pull/1233) Thanks to rockerBOO!
-  - LoRA+ is a method to improve training speed by increasing the learning rate of the UP side (LoRA-B) of LoRA. Specify the multiple. The original paper recommends 16, but adjust as needed. Please see the PR for details.
-  - Specify `loraplus_lr_ratio` with `--network_args`. Example: `--network_args "loraplus_lr_ratio=16"`
-  - `loraplus_unet_lr_ratio` and `loraplus_text_encoder_lr_ratio` can be specified separately for U-Net and Text Encoder.
-    - Example: `--network_args "loraplus_unet_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4"` or `--network_args "loraplus_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4"` etc.
-  - `network_module` `networks.lora` and `networks.dylora` are available.
-
- The feature to use the transparency (alpha channel) of the image as a mask in the loss calculation has been added. PR [#1223](https://github.com/kohya-ss/sd-scripts/pull/1223) Thanks to u-haru!
-  - The transparent part is ignored during training. Specify the `--alpha_mask` option in the training script or specify `alpha_mask = true` in the dataset configuration file.
-  - See [About masked loss](./docs/masked_loss_README.md) for details.
-
- LoRA training in SDXL now supports block-wise learning rates and block-wise dim (rank). PR [#1331](https://github.com/kohya-ss/sd-scripts/pull/1331) 
-  - Specify the learning rate and dim (rank) for each block.
-  - See [Block-wise learning rates in LoRA](./docs/train_network_README-ja.md#階層別学習率) for details (Japanese only).
-
- Negative learning rates can now be specified during SDXL model training. PR [#1277](https://github.com/kohya-ss/sd-scripts/pull/1277) Thanks to Cauldrath!
-  - The model is trained to move away from the training images, so the model is easily collapsed. Use with caution. A value close to 0 is recommended.
-  - When specifying from the command line, use `=` like `--learning_rate=-1e-7`.
-
- Training scripts can now output training settings to wandb or Tensor Board logs. Specify the `--log_config` option. PR [#1285](https://github.com/kohya-ss/sd-scripts/pull/1285)  Thanks to ccharest93, plucked, rockerBOO, and VelocityRa!
-  - Some settings, such as API keys and directory specifications, are not output due to security issues.
-
- The ControlNet training script `train_controlnet.py` for SD1.5/2.x was not working, but it has been fixed. PR [#1284](https://github.com/kohya-ss/sd-scripts/pull/1284) Thanks to sdbds!
-
- `train_network.py` and `sdxl_train_network.py` now restore the order/position of data loading from DataSet when resuming training. PR [#1353](https://github.com/kohya-ss/sd-scripts/pull/1353) [#1359](https://github.com/kohya-ss/sd-scripts/pull/1359) Thanks to KohakuBlueleaf!
-  - This resolves the issue where the order of data loading from DataSet changes when resuming training.
-  - Specify the `--skip_until_initial_step` option to skip data loading until the specified step. If not specified, data loading starts from the beginning of the DataSet (same as before).
-  - If `--resume` is specified, the step saved in the state is used.
-  - Specify the `--initial_step` or `--initial_epoch` option to skip data loading until the specified step or epoch. Use these options in conjunction with `--skip_until_initial_step`. These options can be used without `--resume` (use them when resuming training with `--network_weights`).
-
- An option `--disable_mmap_load_safetensors` is added to disable memory mapping when loading the model's .safetensors in SDXL. PR [#1266](https://github.com/kohya-ss/sd-scripts/pull/1266) Thanks to Zovjsra!
-  - It seems that the model file loading is faster in the WSL environment etc.
-  - Available in `sdxl_train.py`, `sdxl_train_network.py`, `sdxl_train_textual_inversion.py`, and `sdxl_train_control_net_lllite.py`.
-
- When there is an error in the cached latents file on disk, the file name is now displayed. PR [#1278](https://github.com/kohya-ss/sd-scripts/pull/1278) Thanks to Cauldrath!
-
- Fixed an error that occurs when specifying `--max_dataloader_n_workers` in `tag_images_by_wd14_tagger.py` when Onnx is not used. PR [#1291](
-https://github.com/kohya-ss/sd-scripts/pull/1291) issue [#1290](
-https://github.com/kohya-ss/sd-scripts/pull/1290) Thanks to frodo821!
-
 - Fixed a bug where `caption_separator` could not be specified in the subset in the dataset settings .toml file. PR [#1312](https://github.com/kohya-ss/sd-scripts/pull/1312) and [#1313](https://github.com/kohya-ss/sd-scripts/pull/1313) Thanks to rockerBOO!
-
- Fixed a potential bug in ControlNet-LLLite training. PR [#1322](https://github.com/kohya-ss/sd-scripts/pull/1322) Thanks to aria1th!
-
- Fixed some bugs when using DeepSpeed. Related [#1247](https://github.com/kohya-ss/sd-scripts/pull/1247)
-
 - Added a prompt option `--f` to `gen_img.py` to specify the file name when saving. Also, Diffusers-based keys for LoRA weights are now supported.
-
-#### 変更点
-
- devブランチがmainにマージされました。ドキュメントの整備が遅れており申し訳ありません。少しずつ整備していきます。
- マージ直前の状態が Version 0.8.8 としてリリースされていますので、問題があればそちらをご利用ください。
- 以下の変更が含まれます。
-
- SDXL の学習時に Fused optimizer が使えるようになりました。PR [#1259](https://github.com/kohya-ss/sd-scripts/pull/1259) 2kpr 氏に感謝します。
-  - optimizer の backward pass に step を統合することで学習時のメモリ使用量を大きく削減します。学習結果は未適用時と同一ですが、メモリが潤沢にある場合は速度は遅くなります。
-  - `sdxl_train.py` に `--fused_backward_pass` オプションを指定してください。現時点では optimizer は AdaFactor のみ対応しています。また gradient accumulation は使えません。
-  - mixed precision は `no` のほうが `fp16` や `bf16` よりも使用メモリ量が少ないようです。
-  - バッチサイズ 1、fp32 で 17GB 程度で学習可能なようです。`--full_bf16` オプションを指定するとさらに削減できます（精度は劣ります）。以前と同じメモリ使用量ではバッチサイズを増やせます。
-  - PyTorch 2.1 以降の新 API `Tensor.register_post_accumulate_grad_hook(hook)` を使用しているため、PyTorch 2.1 以降が必要です。
-  - 仕組み：通常は backward -> step の順で行うためすべての勾配を一時的にメモリに保持する必要があります。「backward と step の統合」はパラメータごとに backward/step を行って、勾配をすぐ反映することでメモリ使用量を削減します。パラメータ数が多いほど効果が大きいため、SDXL の学習以外（LoRA 等）ではほぼ効果がなく（メモリ使用量のピークが他の場所にあるため）、それらの学習スクリプトへの実装予定もありません。
-
- SDXL の学習時に optimizer group 機能を追加しました。PR [#1319](https://github.com/kohya-ss/sd-scripts/pull/1319)
-  - Fused optimizer と同様の原理でメモリ使用量を削減します。学習結果や速度についても同様です。
-  - `sdxl_train.py` に `--fused_optimizer_groups 10` のようにグループ数を指定してください。グループ数を増やすとメモリ使用量が削減されますが、速度は遅くなります。ある程度の数までしか効果がないため、4~10 程度を指定すると良いでしょう。
-  - 任意の optimizer が使えますが、学習率を自動計算する optimizer （D-Adaptation や Prodigy など）は使えません。gradient accumulation は使えません。
-  - `--fused_optimizer_groups` は `--fused_backward_pass` と併用できません。AdaFactor 使用時は Fused optimizer よりも若干メモリ使用量は大きくなります。PyTorch 2.1 以降が必要です。
-  - 仕組み：Fused optimizer が optimizer 内で個別のパラメータについて backward/step を行っているのに対して、optimizer groups はパラメータをグループ化して複数の optimizer を作成し、それぞれ backward/step を行うことでメモリ使用量を削減します。Fused optimizer は optimizer 側の実装が必要ですが、optimizer groups は学習スクリプト側のみで実装されています。やはり SDXL の学習でのみ効果があります。
-
- LoRA+ がサポートされました。PR [#1233](https://github.com/kohya-ss/sd-scripts/pull/1233) rockerBOO 氏に感謝します。
-  - LoRA の UP 側（LoRA-B）の学習率を上げることで学習速度の向上を図る手法です。倍数で指定します。元の論文では 16 が推奨されていますが、データセット等にもよりますので、適宜調整してください。PR もあわせてご覧ください。
-  - `--network_args` で `loraplus_lr_ratio` を指定します。例：`--network_args "loraplus_lr_ratio=16"`
-  - `loraplus_unet_lr_ratio` と `loraplus_text_encoder_lr_ratio` で、U-Net および Text Encoder に個別の値を指定することも可能です。
-    - 例：`--network_args "loraplus_unet_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4"` または `--network_args "loraplus_lr_ratio=16" "loraplus_text_encoder_lr_ratio=4"` など
-  - `network_module` の `networks.lora` および `networks.dylora` で使用可能です。
-
- 画像の透明度（アルファチャネル）をロス計算時のマスクとして使用する機能が追加されました。PR [#1223](https://github.com/kohya-ss/sd-scripts/pull/1223) u-haru 氏に感謝します。
-  - 透明部分が学習時に無視されるようになります。学習スクリプトに `--alpha_mask` オプションを指定するか、データセット設定ファイルに `alpha_mask = true` を指定してください。
-  - 詳細は [マスクロスについて](./docs/masked_loss_README-ja.md) をご覧ください。
-
- SDXL の LoRA で階層別学習率、階層別 dim (rank) をサポートしました。PR [#1331](https://github.com/kohya-ss/sd-scripts/pull/1331) 
-  - ブロックごとに学習率および dim (rank) を指定することができます。
-  - 詳細は [LoRA の階層別学習率](./docs/train_network_README-ja.md#階層別学習率) をご覧ください。
-
- `sdxl_train.py` での SDXL モデル学習時に負の学習率が指定できるようになりました。PR [#1277](https://github.com/kohya-ss/sd-scripts/pull/1277) Cauldrath 氏に感謝します。
-  - 学習画像から離れるように学習するため、モデルは容易に崩壊します。注意して使用してください。0 に近い値を推奨します。
-  - コマンドラインから指定する場合、`--learning_rate=-1e-7` のように`=` を使ってください。
-
- 各学習スクリプトで学習設定を wandb や Tensor Board などのログに出力できるようになりました。`--log_config` オプションを指定してください。PR [#1285](https://github.com/kohya-ss/sd-scripts/pull/1285)  ccharest93 氏、plucked 氏、rockerBOO 氏および VelocityRa 氏に感謝します。
-  - API キーや各種ディレクトリ指定など、一部の設定はセキュリティ上の問題があるため出力されません。
-
- SD1.5/2.x 用の ControlNet 学習スクリプト `train_controlnet.py` が動作しなくなっていたのが修正されました。PR [#1284](https://github.com/kohya-ss/sd-scripts/pull/1284) sdbds 氏に感謝します。
-
- `train_network.py` および `sdxl_train_network.py` で、学習再開時に DataSet の読み込み順についても復元できるようになりました。PR [#1353](https://github.com/kohya-ss/sd-scripts/pull/1353) [#1359](https://github.com/kohya-ss/sd-scripts/pull/1359) KohakuBlueleaf 氏に感謝します。
-  - これにより、学習再開時に DataSet の読み込み順が変わってしまう問題が解消されます。
-  - `--skip_until_initial_step` オプションを指定すると、指定したステップまで DataSet 読み込みをスキップします。指定しない場合の動作は変わりません（DataSet の最初から読み込みます）
-  - `--resume` オプションを指定すると、state に保存されたステップ数が使用されます。
-  - `--initial_step` または `--initial_epoch` オプションを指定すると、指定したステップまたはエポックまで DataSet 読み込みをスキップします。これらのオプションは `--skip_until_initial_step` と併用してください。またこれらのオプションは `--resume` と併用しなくても使えます（`--network_weights` を用いた学習再開時などにお使いください ）。
-
- SDXL でモデルの .safetensors を読み込む際にメモリマッピングを無効化するオプション `--disable_mmap_load_safetensors` が追加されました。PR [#1266](https://github.com/kohya-ss/sd-scripts/pull/1266) Zovjsra 氏に感謝します。
-  - WSL 環境等でモデルファイルの読み込みが高速化されるようです。
-  - `sdxl_train.py`、`sdxl_train_network.py`、`sdxl_train_textual_inversion.py`、`sdxl_train_control_net_lllite.py` で使用可能です。
-
- ディスクにキャッシュされた latents ファイルに何らかのエラーがあったとき、そのファイル名が表示されるようになりました。 PR [#1278](https://github.com/kohya-ss/sd-scripts/pull/1278) Cauldrath 氏に感謝します。
-
- `tag_images_by_wd14_tagger.py` で Onnx 未使用時に `--max_dataloader_n_workers` を指定するとエラーになる不具合が修正されました。 PR [#1291](
-https://github.com/kohya-ss/sd-scripts/pull/1291) issue [#1290](
-https://github.com/kohya-ss/sd-scripts/pull/1290) frodo821 氏に感謝します。
-
- データセット設定の .toml ファイルで、`caption_separator` が subset に指定できない不具合が修正されました。 PR [#1312](https://github.com/kohya-ss/sd-scripts/pull/1312) および [#1313](https://github.com/kohya-ss/sd-scripts/pull/1313) rockerBOO 氏に感謝します。
-
- ControlNet-LLLite 学習時の潜在バグが修正されました。 PR [#1322](https://github.com/kohya-ss/sd-scripts/pull/1322) aria1th 氏に感謝します。
-
- DeepSpeed 使用時のいくつかのバグを修正しました。関連 [#1247](https://github.com/kohya-ss/sd-scripts/pull/1247)
-
 - `gen_img.py` のプロンプトオプションに、保存時のファイル名を指定する `--f` オプションを追加しました。また同スクリプトで Diffusers ベースのキーを持つ LoRA の重みに対応しました。
-
-
-### Oct 27, 2024 / 2024-10-27:
-
- `svd_merge_lora.py` VRAM usage has been reduced. However, main memory usage will increase (32GB is sufficient).
- This will be included in the next release.
- `svd_merge_lora.py` のVRAM使用量を削減しました。ただし、メインメモリの使用量は増加します（32GBあれば十分です）。
- これは次回リリースに含まれます。
-
-### Oct 26, 2024 / 2024-10-26: 
-
- Fixed a bug in `svd_merge_lora.py`, `sdxl_merge_lora.py`, and `resize_lora.py` where the hash value of LoRA metadata was not correctly calculated when the `save_precision` was different from the  `precision` used in the calculation. See issue [#1722](https://github.com/kohya-ss/sd-scripts/pull/1722) for details. Thanks to JujoHotaru for raising the issue.
- It will be included in the next release.
-
- `svd_merge_lora.py`、`sdxl_merge_lora.py`、`resize_lora.py`で、保存時の精度が計算時の精度と異なる場合、LoRAメタデータのハッシュ値が正しく計算されない不具合を修正しました。詳細は issue [#1722](https://github.com/kohya-ss/sd-scripts/pull/1722) をご覧ください。問題提起していただいた JujoHotaru 氏に感謝します。
- 以上は次回リリースに含まれます。
-
-### Sep 13, 2024 / 2024-09-13: 
-
- `sdxl_merge_lora.py` now supports OFT. Thanks to Maru-mee for the PR [#1580](https://github.com/kohya-ss/sd-scripts/pull/1580). 
- `svd_merge_lora.py` now supports LBW. Thanks to terracottahaniwa. See PR [#1575](https://github.com/kohya-ss/sd-scripts/pull/1575) for details.
- `sdxl_merge_lora.py` also supports LBW. 
- See [LoRA Block Weight](https://github.com/hako-mikan/sd-webui-lora-block-weight) by hako-mikan for details on LBW.
- These will be included in the next release.
-
- `sdxl_merge_lora.py` が OFT をサポートされました。PR [#1580](https://github.com/kohya-ss/sd-scripts/pull/1580) Maru-mee 氏に感謝します。
- `svd_merge_lora.py` で LBW がサポートされました。PR [#1575](https://github.com/kohya-ss/sd-scripts/pull/1575) terracottahaniwa 氏に感謝します。
- `sdxl_merge_lora.py` でも LBW がサポートされました。
- LBW の詳細は hako-mikan 氏の [LoRA Block Weight](https://github.com/hako-mikan/sd-webui-lora-block-weight) をご覧ください。
- 以上は次回リリースに含まれます。
-
-### Jun 23, 2024 / 2024-06-23: 
-
- Fixed `cache_latents.py` and `cache_text_encoder_outputs.py` not working. (Will be included in the next release.)
-
- `cache_latents.py` および `cache_text_encoder_outputs.py` が動作しなくなっていたのを修正しました。（次回リリースに含まれます。）
-
-### Apr 7, 2024 / 2024-04-07: v0.8.7
-
- The default value of `huber_schedule` in Scheduled Huber Loss is changed from `exponential` to `snr`, which is expected to give better results.
-
- Scheduled Huber Loss の `huber_schedule` のデフォルト値を `exponential` から、より良い結果が期待できる `snr` に変更しました。
-
-### Apr 7, 2024 / 2024-04-07: v0.8.6
-
-#### Highlights
-
- The dependent libraries are updated. Please see [Upgrade](#upgrade) and update the libraries.
-  - Especially `imagesize` is newly added, so if you cannot update the libraries immediately, please install with `pip install imagesize==1.4.1` separately.
-  - `bitsandbytes==0.43.0`, `prodigyopt==1.0`, `lion-pytorch==0.0.6` are included in the requirements.txt.
-    - `bitsandbytes` no longer requires complex procedures as it now officially supports Windows.  
-  - Also, the PyTorch version is updated to 2.1.2 (PyTorch does not need to be updated immediately). In the upgrade procedure, PyTorch is not updated, so please manually install or update torch, torchvision, xformers if necessary (see [Upgrade PyTorch](#upgrade-pytorch)).
- When logging to wandb is enabled, the entire command line is exposed. Therefore, it is recommended to write wandb API key and HuggingFace token in the configuration file (`.toml`). Thanks to bghira for raising the issue.
-  - A warning is displayed at the start of training if such information is included in the command line.
-  - Also, if there is an absolute path, the path may be exposed, so it is recommended to specify a relative path or write it in the configuration file. In such cases, an INFO log is displayed.
-  - See [#1123](https://github.com/kohya-ss/sd-scripts/pull/1123) and PR [#1240](https://github.com/kohya-ss/sd-scripts/pull/1240) for details.
- Colab seems to stop with log output. Try specifying `--console_log_simple` option in the training script to disable rich logging.
- Other improvements include the addition of masked loss, scheduled Huber Loss, DeepSpeed support, dataset settings improvements, and image tagging improvements. See below for details.
-
-#### Training scripts
-
- `train_network.py` and `sdxl_train_network.py` are modified to record some dataset settings in the metadata of the trained model (`caption_prefix`, `caption_suffix`, `keep_tokens_separator`, `secondary_separator`, `enable_wildcard`).
- Fixed a bug that U-Net and Text Encoders are included in the state in `train_network.py` and `sdxl_train_network.py`. The saving and loading of the state are faster, the file size is smaller, and the memory usage when loading is reduced.
- DeepSpeed is supported. PR [#1101](https://github.com/kohya-ss/sd-scripts/pull/1101)  and [#1139](https://github.com/kohya-ss/sd-scripts/pull/1139) Thanks to BootsofLagrangian! See PR [#1101](https://github.com/kohya-ss/sd-scripts/pull/1101) for details.
- The masked loss is supported in each training script. PR [#1207](https://github.com/kohya-ss/sd-scripts/pull/1207) See [Masked loss](#about-masked-loss) for details.
 - Scheduled Huber Loss has been introduced to each training script. PR [#1228](https://github.com/kohya-ss/sd-scripts/pull/1228/) Thanks to kabachuha for the PR and cheald, drhead, and others for the discussion! See the PR and [Scheduled Huber Loss](#about-scheduled-huber-loss) for details.
- The options `--noise_offset_random_strength` and `--ip_noise_gamma_random_strength` are added to each training script. These options can be used to vary the noise offset and ip noise gamma in the range of 0 to the specified value. PR [#1177](https://github.com/kohya-ss/sd-scripts/pull/1177) Thanks to KohakuBlueleaf!
 - The option `--save_state_on_train_end` is added to each training script. PR [#1168](https://github.com/kohya-ss/sd-scripts/pull/1168) Thanks to gesen2egee!
 - The options `--sample_every_n_epochs` and `--sample_every_n_steps` in each training script are now ignored, with a warning, when a number less than or equal to `0` is specified. Thanks to S-Del for raising the issue.
-
-#### Dataset settings
-
- The [English version of the dataset settings documentation](./docs/config_README-en.md) is added. PR [#1175](https://github.com/kohya-ss/sd-scripts/pull/1175) Thanks to darkstorm2150!
- The `.toml` file for the dataset config is now read in UTF-8 encoding. PR [#1167](https://github.com/kohya-ss/sd-scripts/pull/1167) Thanks to Horizon1704!
- Fixed a bug where the settings of the last subset were applied to all images when multiple subsets of regularization images were specified in the dataset settings. The settings for each subset are now correctly applied to each image. PR [#1205](https://github.com/kohya-ss/sd-scripts/pull/1205) Thanks to feffy380!
- Some features are added to the dataset subset settings.
-  - `secondary_separator` is added to specify a tag separator that is ignored when shuffling or dropping.
-    - Specify `secondary_separator=";;;"`. Tags joined by `secondary_separator` are kept together as one unit when shuffled or dropped.
-  - `enable_wildcard` is added. When set to `true`, the wildcard notation `{aaa|bbb|ccc}` can be used. The multi-line caption is also enabled.
-  - `keep_tokens_separator` is updated to be used twice in the caption. When you specify `keep_tokens_separator="|||"`, the part divided by the second `|||` is not shuffled or dropped and remains at the end.
-  - The existing features `caption_prefix` and `caption_suffix` can be used together. `caption_prefix` and `caption_suffix` are processed first, and then `enable_wildcard`, `keep_tokens_separator`, shuffling and dropping, and `secondary_separator` are processed in order.
-  - See [Dataset config](./docs/config_README-en.md) for details.
- The dataset with DreamBooth method supports caching image information (size, caption). PR [#1178](https://github.com/kohya-ss/sd-scripts/pull/1178) and [#1206](https://github.com/kohya-ss/sd-scripts/pull/1206) Thanks to KohakuBlueleaf! See [DreamBooth method specific options](./docs/config_README-en.md#dreambooth-specific-options) for details.
-
-#### Image tagging
-
- The support for v3 repositories is added to `tag_image_by_wd14_tagger.py` (`--onnx` option only). PR [#1192](https://github.com/kohya-ss/sd-scripts/pull/1192) Thanks to sdbds!
-  - Onnx may need to be updated. Onnx is not installed by default, so please install or update it with `pip install onnx==1.15.0 onnxruntime-gpu==1.17.1` etc. Please also check the comments in `requirements.txt`.
- The model is now saved in a subdirectory named after `--repo_id` in `tag_image_by_wd14_tagger.py`. This allows models from multiple repo_ids to be cached. Please delete unnecessary files under `--model_dir`.
- Some options are added to `tag_image_by_wd14_tagger.py`.
-  - Some are added in PR [#1216](https://github.com/kohya-ss/sd-scripts/pull/1216) Thanks to Disty0!
-  - Output rating tags `--use_rating_tags` and `--use_rating_tags_as_last_tag`
-  - Output character tags first `--character_tags_first`
-  - Expand character tags and series `--character_tag_expand`
-  - Specify tags to output first `--always_first_tags`
-  - Replace tags `--tag_replacement`
-  - See [Tagging documentation](./docs/wd14_tagger_README-en.md) for details.
- Fixed an error when specifying `--beam_search` and a value of 2 or more for `--num_beams` in `make_captions.py`.
-
-#### About Masked loss
-
-The masked loss is supported in each training script. To enable the masked loss, specify the `--masked_loss` option.
-
-The feature is not fully tested, so there may be bugs. If you find any issues, please open an Issue.
-
-A ControlNet dataset is used to specify the mask. The mask images should be RGB images. A pixel value of 255 in the R channel is treated as masked (the loss is calculated only for masked pixels), and 0 is treated as unmasked. Pixel values 0-255 are converted to 0-1 (i.e., a pixel value of 128 gives roughly half weight to the loss). See the [LLLite documentation](./docs/train_lllite_README.md#preparing-the-dataset) for details of the dataset specification.
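
As a rough illustration of the weighting described above, the R-channel values can be scaled to 0-1 and multiplied into a per-pixel squared error. This is a minimal sketch, not the repository's implementation: `mask_weights` and `masked_mse` are hypothetical helper names, and averaging over all pixels is an assumption.

```python
def mask_weights(r_channel):
    """Convert R-channel pixel values (0-255) to loss weights (0.0-1.0)."""
    return [[value / 255.0 for value in row] for row in r_channel]


def masked_mse(pred, target, r_channel):
    """Mean of the per-pixel squared error, weighted by the mask.

    Assumption: the weighted errors are averaged over all pixels; the real
    training scripts may reduce the loss differently.
    """
    weights = mask_weights(r_channel)
    total = 0.0
    count = 0
    for pred_row, target_row, weight_row in zip(pred, target, weights):
        for p, t, w in zip(pred_row, target_row, weight_row):
            total += w * (p - t) ** 2
            count += 1
    return total / count


# 2x2 toy example: fully masked, unmasked, and roughly half-weight pixels.
r_channel = [[255, 0], [128, 255]]
pred = [[1.0, 1.0], [1.0, 0.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
loss = masked_mse(pred, target, r_channel)
```

In this toy case the unmasked pixel (value 0) contributes nothing to the loss, and the pixel with value 128 contributes about half of its squared error.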
-
-#### About Scheduled Huber Loss
-
-Scheduled Huber Loss has been introduced to each training script. This is a method to improve robustness against outliers or anomalies (data corruption) in the training data.
-
-With the traditional MSE (L2) loss function, the impact of outliers could be significant, potentially leading to a degradation in the quality of generated images. On the other hand, while the Huber loss function can suppress the influence of outliers, it tends to compromise the reproduction of fine details in images.
-
-To address this, the proposed method employs a clever application of the Huber loss function. By scheduling the use of Huber loss in the early stages of training (when noise is high) and MSE in the later stages, it strikes a balance between outlier robustness and fine detail reproduction.
-
-Experimental results have confirmed that this method achieves higher accuracy on data containing outliers compared to pure Huber loss or MSE. The increase in computational cost is minimal.
-
-The newly added arguments `loss_type`, `huber_schedule`, and `huber_c` select the loss function type (Huber, smooth L1, MSE), the scheduling method (exponential, constant, SNR), and the Huber parameter. This allows the loss to be tuned to the characteristics of the dataset.
-
-See PR [#1228](https://github.com/kohya-ss/sd-scripts/pull/1228/) for details.
-
- `loss_type`: Specify the loss function type. Choose `huber` for Huber loss, `smooth_l1` for smooth L1 loss, and `l2` for MSE loss. The default is `l2`, which is the same as before.
- `huber_schedule`: Specify the scheduling method. Choose `exponential`, `constant`, or `snr`. The default is `snr`.
- `huber_c`: Specify the Huber parameter. The default is `0.1`.
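
The options above can be pictured with a small sketch. This is not the code from PR #1228: it assumes a pseudo-Huber form of the loss and an exponential decay of the Huber parameter over timesteps. The names `pseudo_huber` and `exponential_huber_c` and the value `num_train_timesteps=1000` are illustrative assumptions.

```python
import math


def pseudo_huber(pred, target, huber_c):
    """Pseudo-Huber loss for one element (assumed form, see the note above).

    Close to the squared error when the error is much smaller than huber_c,
    and growing only linearly for large errors, which limits outliers.
    """
    diff = pred - target
    return 2.0 * huber_c * (math.sqrt(diff * diff + huber_c * huber_c) - huber_c)


def exponential_huber_c(timestep, huber_c, num_train_timesteps=1000):
    """Assumed exponential schedule: 1.0 at timestep 0, huber_c at the last one."""
    alpha = -math.log(huber_c) / num_train_timesteps
    return math.exp(-alpha * timestep)


c_low_noise = exponential_huber_c(0, 0.1)       # start of the schedule
c_high_noise = exponential_huber_c(1000, 0.1)   # end of the schedule
small_error = pseudo_huber(0.01, 0.0, c_low_noise)   # behaves like squared error
outlier = pseudo_huber(10.0, 0.0, c_high_noise)      # far below the squared error of 100
```

Under these assumptions, high-noise timesteps use a small parameter (more Huber-like, more robust), while low-noise timesteps use a parameter near 1 (closer to MSE), which matches the balance described above.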

 Please read [Releases](https://github.com/kohya-ss/sd-scripts/releases) for recent updates.
-
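-#### Major changes
-
- Dependent libraries have been updated. See [Upgrade](./README-ja.md#アップグレード) and update the libraries.
-  - In particular, `imagesize` has been newly added, so if you cannot update the libraries right away, install it separately with `pip install imagesize==1.4.1`.
-  - `bitsandbytes==0.43.0`, `prodigyopt==1.0`, and `lion-pytorch==0.0.6` are now included in requirements.txt.
-    - `bitsandbytes` now officially supports Windows, so the complicated procedure is no longer necessary.
-  - The PyTorch version has also been updated to 2.1.2. PyTorch does not need to be updated immediately. The upgrade procedure does not update PyTorch, so when you do update, install torch, torchvision, and xformers manually.
- When logging to wandb is enabled, the entire command line is exposed. Therefore, if the command line contains a wandb API key, a HuggingFace token, or similar, writing them in the configuration file (`.toml`) is recommended. Thanks to bghira for raising the issue.
-  - In such cases, a warning is displayed at the start of training.
-  - Also, if an absolute path is specified, the path may be exposed, so specifying a relative path or writing it in the configuration file is recommended. In such cases, an INFO log is displayed.
-  - See [#1123](https://github.com/kohya-ss/sd-scripts/pull/1123) and PR [#1240](https://github.com/kohya-ss/sd-scripts/pull/1240) for details.
- When running on Colab, the script appears to stop at log output. Try specifying the `--console_log_simple` option in the training script to disable rich logging.
- Other improvements include the addition of masked loss, the addition of Scheduled Huber Loss, DeepSpeed support, dataset settings improvements, and image tagging improvements. See below for details.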
-
-#### Training scripts
-
- `train_network.py` and `sdxl_train_network.py` are modified to record some dataset settings in the metadata of the trained model (`caption_prefix`, `caption_suffix`, `keep_tokens_separator`, `secondary_separator`, `enable_wildcard`).
- Fixed a bug where U-Net and Text Encoder were included in the state in `train_network.py` and `sdxl_train_network.py`. Saving and loading the state are faster, the file size is smaller, and memory usage when loading is reduced.
- DeepSpeed is supported. PR [#1101](https://github.com/kohya-ss/sd-scripts/pull/1101), [#1139](https://github.com/kohya-ss/sd-scripts/pull/1139) Thanks to BootsofLagrangian. See PR [#1101](https://github.com/kohya-ss/sd-scripts/pull/1101) for details.
- Masked loss is supported in each training script. PR [#1207](https://github.com/kohya-ss/sd-scripts/pull/1207) See [About masked loss](#マスクロスについて) for details.
- Scheduled Huber Loss has been added to each training script. PR [#1228](https://github.com/kohya-ss/sd-scripts/pull/1228/) Thanks to kabachuha for the proposal, and to cheald, drhead, and others for deepening the discussion. See the PR and [About Scheduled Huber Loss](#scheduled-huber-loss-について) for details.
- The options `--noise_offset_random_strength` and `--ip_noise_gamma_random_strength`, which vary the noise offset and ip noise gamma in the range from 0 to the specified value, are added to each training script. PR [#1177](https://github.com/kohya-ss/sd-scripts/pull/1177) Thanks to KohakuBlueleaf.
- The option `--save_state_on_train_end`, which saves the state at the end of training, is added to each training script. PR [#1168](https://github.com/kohya-ss/sd-scripts/pull/1168) Thanks to gesen2egee.
- The options `--sample_every_n_epochs` and `--sample_every_n_steps` in each training script are now ignored, with a warning, when a number less than or equal to `0` is specified. Thanks to S-Del for raising the issue.
-
-#### Dataset settings
-
- The `.toml` file for the dataset config is now read in UTF-8 encoding. PR [#1167](https://github.com/kohya-ss/sd-scripts/pull/1167) Thanks to Horizon1704.
- Fixed a bug where the settings of the last subset were applied to the images of all subsets when multiple subsets of regularization images were specified in the dataset settings. The settings for each subset are now correctly applied to each image. PR [#1205](https://github.com/kohya-ss/sd-scripts/pull/1205) Thanks to feffy380.
- Some features are added to the dataset subset settings.
-  - `secondary_separator` is added to specify a tag separator that is ignored when shuffling. Specify it as `secondary_separator=";;;"`. Tags joined by `secondary_separator` are handled together as one unit when shuffled or dropped.
-  - `enable_wildcard` is added. When set to `true`, the wildcard notation `{aaa|bbb|ccc}` can be used. Multi-line captions are also enabled.
-  - `keep_tokens_separator` can now be used twice in a caption. For example, when `keep_tokens_separator="|||"` is specified and the caption is `1girl, hatsune miku, vocaloid ||| stage, mic ||| best quality, rating: general`, the part after the second `|||` is not shuffled or dropped and remains at the end.
-  - These can be used together with the existing features `caption_prefix` and `caption_suffix`. `caption_prefix` and `caption_suffix` are processed first, and then wildcards, `keep_tokens_separator`, shuffling and dropping, and `secondary_separator` are processed in that order.
-  - See [Dataset config](./docs/config_README-ja.md) for details.
- The DreamBooth-method dataset now supports caching image information (size, caption). PR [#1178](https://github.com/kohya-ss/sd-scripts/pull/1178), [#1206](https://github.com/kohya-ss/sd-scripts/pull/1206) Thanks to KohakuBlueleaf. See [Dataset config](./docs/config_README-ja.md#dreambooth-方式専用のオプション) for details.
- The [English version of the dataset settings documentation](./docs/config_README-en.md) is added. PR [#1175](https://github.com/kohya-ss/sd-scripts/pull/1175) Thanks to darkstorm2150.
-
-#### Image tagging
-
- Support for v3 repositories is added to `tag_image_by_wd14_tagger.py` (only when `--onnx` is specified). PR [#1192](https://github.com/kohya-ss/sd-scripts/pull/1192) Thanks to sdbds.
-  - Onnx may need to be updated. Onnx is not installed by default, so install or update it with `pip install onnx==1.15.0 onnxruntime-gpu==1.17.1` etc. Please also check the comments in `requirements.txt`.
- `tag_image_by_wd14_tagger.py` now saves the model in a subdirectory named after `--repo_id`. This allows multiple model files to be cached. Please delete unnecessary files directly under `--model_dir`.
- Some options are added to `tag_image_by_wd14_tagger.py`.
-  - Some were added in PR [#1216](https://github.com/kohya-ss/sd-scripts/pull/1216) Thanks to Disty0.
-  - Output rating tags `--use_rating_tags` and `--use_rating_tags_as_last_tag`
-  - Output character tags first `--character_tags_first`
-  - Expand character tags and series `--character_tag_expand`
-  - Specify tags to always output first `--always_first_tags`
-  - Replace tags `--tag_replacement`
-  - See [Tagging documentation](./docs/wd14_tagger_README-ja.md) for details.
- Fixed an error when specifying `--beam_search` and a value of 2 or more for `--num_beams` in `make_captions.py`.
-
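-#### About masked loss
-
-Masked loss is supported in each training script. To enable masked loss, specify the `--masked_loss` option.
-
-The feature is not fully tested, so there may be bugs. If you find any, please open an Issue.
-
-A ControlNet dataset is used to specify the mask. The mask images must be RGB images. A pixel value of 255 in the R channel is included in the loss calculation, and 0 is excluded. Values 0-255 are converted to the range 0-1 (i.e., a pixel value of 128 gives roughly half weight to the loss). See the [LLLite documentation](./docs/train_lllite_README-ja.md#データセットの準備) for details of the dataset.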
-
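-#### About Scheduled Huber Loss
-
-Scheduled Huber Loss, a method to improve robustness against anomalies and outliers (data corruption) in the training data, has been introduced to each training script.
-
-With the traditional MSE (L2) loss function, outliers could have a large impact and degrade the quality of generated images. The Huber loss function, on the other hand, suppresses the influence of outliers but tends to compromise the reproduction of fine details in images.
-
-This method adjusts how the Huber loss function is applied: it is scheduled so that Huber loss is used in the early stages of training (when noise is high) and MSE in the later stages, balancing outlier robustness and fine detail reproduction.
-
-Experimental results have confirmed that this method achieves higher accuracy on data containing outliers than pure Huber loss or MSE. The increase in computational cost is small.
-
-Specifically, the newly added arguments `loss_type`, `huber_schedule`, and `huber_c` select the loss function type (Huber, smooth L1, MSE) and the scheduling method (exponential, constant, SNR). This allows optimization according to the dataset.
-
-See PR [#1228](https://github.com/kohya-ss/sd-scripts/pull/1228/) for details.
-
- `loss_type` : Specify the loss function type. Choose `huber` for Huber loss, `smooth_l1` for smooth L1 loss, and `l2` for MSE loss. The default is `l2`, which is the same as before.
- `huber_schedule` : Specify the scheduling method. Choose `exponential` for exponential, `constant` for constant, or `snr` for scheduling based on the signal-to-noise ratio. The default is `snr`.
- `huber_c` : Specify the Huber loss parameter. The default is `0.1`.
-
-Several comparisons are shared in the PR. When trying this feature, it may be good to start with something like `--loss_type smooth_l1 --huber_schedule snr --huber_c 0.1`.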
-
 Please see [Release](https://github.com/kohya-ss/sd-scripts/releases) for recent updates.

-## Additional Information
-
 ### Naming of LoRA

 The LoRA supported by `train_network.py` has been named to avoid confusion. The documentation has been updated. The following are the names of LoRA types in this repository.
@@ -566,14 +548,27 @@ The LoRA supported by `train_network.py` has been named to avoid confusion. The

    In addition to 1., LoRA for Conv2d layers with 3x3 kernel 
    
-LoRA-LierLa is the default LoRA type for `train_network.py` (without `conv_dim` network arg). 
-<!-- 
-LoRA-LierLa can be used with [our extension](https://github.com/kohya-ss/sd-webui-additional-networks) for AUTOMATIC1111's Web UI, or with the built-in LoRA feature of the Web UI.
+LoRA-LierLa is the default LoRA type for `train_network.py` (without `conv_dim` network arg). LoRA-LierLa can be used with [our extension](https://github.com/kohya-ss/sd-webui-additional-networks) for AUTOMATIC1111's Web UI, or with the built-in LoRA feature of the Web UI.

-To use LoRA-C3Lier with Web UI, please use our extension. 
-->
+To use LoRA-C3Lier with Web UI, please use our extension.

-### Sample image generation during training
+### Naming of LoRA
+
+The LoRA types supported by `train_network.py` have been named to avoid confusion. The documentation has been updated. The following names are specific to this repository.
+
+1. __LoRA-LierLa__ : (LoRA for __Li__ n __e__ a __r__  __La__ yers, read as リエラ "riera")
+
+    LoRA applied to Linear layers and Conv2d layers with a 1x1 kernel
+
+2. __LoRA-C3Lier__ : (LoRA for __C__ onvolutional layers with __3__ x3 Kernel and  __Li__ n __e__ a __r__ layers, read as セリア "seria")
+
+    In addition to 1., LoRA applied to Conv2d layers with a 3x3 kernel
+
+LoRA-LierLa can be used with the [extension for Web UI](https://github.com/kohya-ss/sd-webui-additional-networks), or with the LoRA feature of AUTOMATIC1111's Web UI.
+
+To generate images with LoRA-C3Lier in the Web UI, use the extension.
+
+## Sample image generation during training
  A prompt file might look like this, for example

 ```
@@ -594,3 +589,26 @@ masterpiece, best quality, 1boy, in business suit, standing at street, looking b
  * `--s` Specifies the number of steps in the generation.

  Prompt weighting such as `( )` and `[ ]` also works.
+
+## Sample image generation
+A prompt file looks like this, for example.
+
+```
+# prompt 1
+masterpiece, best quality, (1girl), in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
+
+# prompt 2
+masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n (low quality, worst quality), bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
+```
+
+  Lines beginning with `#` are comments. Options can be specified in the form "two hyphens + one lowercase letter", such as `--n`. The following options are available.
+
+  * `--n` Negative prompt up to the next option.
+  * `--w` Specifies the width of the generated image.
+  * `--h` Specifies the height of the generated image.
+  * `--d` Specifies the seed of the generated image.
+  * `--l` Specifies the CFG scale of the generated image.
+  * `--s` Specifies the number of steps in the generation.
+
+  Prompt weighting such as `( )` and `[ ]` also works.
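
To make the prompt-file format concrete, here is a minimal parsing sketch. It is not the parser used by the training scripts: `parse_prompt_line` and the `OPTION_NAMES` mapping are hypothetical, and values are kept as strings rather than converted to numbers.

```python
import re

# Option letters listed above; the descriptive names are for illustration only.
OPTION_NAMES = {
    "n": "negative_prompt",
    "w": "width",
    "h": "height",
    "d": "seed",
    "l": "cfg_scale",
    "s": "steps",
}


def parse_prompt_line(line):
    """Split one prompt-file line into the prompt and its `--x value` options."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    # Split at each " --x " where x is a single lowercase letter.
    parts = re.split(r"\s+--([a-z])\s+", line)
    result = {"prompt": parts[0].strip()}
    # With one capture group, re.split yields [prompt, key1, value1, key2, value2, ...].
    for key, value in zip(parts[1::2], parts[2::2]):
        result[OPTION_NAMES.get(key, key)] = value.strip()
    return result


parsed = parse_prompt_line(
    "masterpiece, best quality, 1boy, in business suit "
    "--n (low quality, worst quality), bad anatomy --w 576 --h 832 --d 2 --l 5.5 --s 40"
)
```

Because the negative prompt runs up to the next option, everything between `--n` and `--w` ends up in `negative_prompt`, commas and parentheses included.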
+
--- a/_typos.toml
+++ b/_typos.toml
@@ -2,7 +2,6 @@
 # Instruction:  https://github.com/marketplace/actions/typos-action#getting-started

 [default.extend-identifiers]
-ddPn08="ddPn08"

 [default.extend-words]
 NIN="NIN"
@@ -28,7 +27,6 @@ rik="rik"
 koo="koo"
 yos="yos"
 wn="wn"
-hime="hime"


 [files]
--- a/docs/config_README-en.md
+++ b/docs/config_README-en.md
@@ -1,386 +0,0 @@
-Original Source by kohya-ss
-
-First version:
-A.I Translation by Model: NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO, editing by Darkstorm2150
-
-Some parts are manually added.
-
-# Config Readme
-
-This README is about the configuration files that can be passed with the `--dataset_config` option.
-
-## Overview
-
-By passing a configuration file, users can make detailed settings.
-
-* Multiple datasets can be configured
-   * For example, by setting `resolution` for each dataset, they can be mixed and trained.
-   * In training methods that support both the DreamBooth approach and the fine-tuning approach, datasets of the DreamBooth method and the fine-tuning method can be mixed.
-* Settings can be changed for each subset
-   * A subset is a partition of the dataset by image directory or metadata. Several subsets make up a dataset.
-   * Options such as `keep_tokens` and `flip_aug` can be set for each subset. On the other hand, options such as `resolution` and `batch_size` can be set for each dataset, and their values are common among subsets belonging to the same dataset. More details will be provided later.
-
-The configuration file format can be JSON or TOML. Considering the ease of writing, it is recommended to use [TOML](https://toml.io/ja/v1.0.0-rc.2). The following explanation assumes the use of TOML.
-
-
-Here is an example of a configuration file written in TOML.
-
-```toml
-[general]
-shuffle_caption = true
-caption_extension = '.txt'
-keep_tokens = 1
-
-# This is a DreamBooth-style dataset
-[[datasets]]
-resolution = 512
-batch_size = 4
-keep_tokens = 2
-
-  [[datasets.subsets]]
-  image_dir = 'C:\hoge'
-  class_tokens = 'hoge girl'
-  # This subset uses keep_tokens = 2 (the value of the parent datasets)
-
-  [[datasets.subsets]]
-  image_dir = 'C:\fuga'
-  class_tokens = 'fuga boy'
-  keep_tokens = 3
-
-  [[datasets.subsets]]
-  is_reg = true
-  image_dir = 'C:\reg'
-  class_tokens = 'human'
-  keep_tokens = 1
-
-# This is a fine-tuning dataset
-[[datasets]]
-resolution = [768, 768]
-batch_size = 2
-
-  [[datasets.subsets]]
-  image_dir = 'C:\piyo'
-  metadata_file = 'C:\piyo\piyo_md.json'
-  # This subset uses keep_tokens = 1 (the value of [general])
-```
-
-In this example, three directories are trained as a DreamBooth-style dataset at 512x512 (batch size 4), and one directory is trained as a fine-tuning dataset at 768x768 (batch size 2).
-
-## Settings for datasets and subsets
-
-Settings for datasets and subsets are divided into several registration locations.
-
-* `[general]`
-    * This is where options that apply to all datasets or all subsets are specified.
-    * If there are options with the same name in the dataset-specific or subset-specific settings, the dataset-specific or subset-specific settings take precedence.
-* `[[datasets]]`
-    * `datasets` is where settings for datasets are registered. This is where options that apply individually to each dataset are specified.
-    * If there are subset-specific settings, the subset-specific settings take precedence.
-* `[[datasets.subsets]]`
-    * `datasets.subsets` is where settings for subsets are registered. This is where options that apply individually to each subset are specified.
-
-Here is an image showing the correspondence between image directories and registration locations in the previous example.
-
-```
-C:\
-├─ hoge  ->  [[datasets.subsets]] No.1  ┐                        ┐
-├─ fuga  ->  [[datasets.subsets]] No.2  |->  [[datasets]] No.1   |->  [general]
-├─ reg   ->  [[datasets.subsets]] No.3  ┘                        |
-└─ piyo  ->  [[datasets.subsets]] No.4  -->  [[datasets]] No.2   ┘
-```
-
-The image directory corresponds to each `[[datasets.subsets]]`. Then, multiple `[[datasets.subsets]]` are combined to form one `[[datasets]]`. All `[[datasets]]` and `[[datasets.subsets]]` belong to `[general]`.
-
-The available options for each registration location may differ, but if the same option is specified, the value in the lower registration location will take precedence. You can check how the `keep_tokens` option is handled in the previous example for better understanding.
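
The precedence rule can be sketched as a simple lookup from the most specific location to the least specific one. This is an illustration only, not the repository's configuration loader: `resolve_option` is a hypothetical helper, and the dictionaries are hand-copied from the example configuration above.

```python
def resolve_option(name, general, dataset, subset, default=None):
    """Look up an option, preferring the subset, then the dataset, then [general]."""
    for scope in (subset, dataset, general):
        if name in scope:
            return scope[name]
    return default


# Values taken from the example configuration above.
general = {"shuffle_caption": True, "caption_extension": ".txt", "keep_tokens": 1}
dreambooth_dataset = {"resolution": 512, "batch_size": 4, "keep_tokens": 2}
fine_tuning_dataset = {"resolution": [768, 768], "batch_size": 2}

hoge = {"image_dir": r"C:\hoge", "class_tokens": "hoge girl"}
fuga = {"image_dir": r"C:\fuga", "class_tokens": "fuga boy", "keep_tokens": 3}
piyo = {"image_dir": r"C:\piyo", "metadata_file": r"C:\piyo\piyo_md.json"}

keep_tokens = [
    resolve_option("keep_tokens", general, dreambooth_dataset, hoge),   # from [[datasets]]
    resolve_option("keep_tokens", general, dreambooth_dataset, fuga),   # from the subset
    resolve_option("keep_tokens", general, fine_tuning_dataset, piyo),  # from [general]
]
```

The three lookups reproduce the comments in the example: `hoge` inherits the dataset value, `fuga` overrides it, and `piyo` falls back to `[general]`.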
-
-Additionally, the available options may vary depending on the methods that the training script supports.
-
-* Options specific to the DreamBooth method
-* Options specific to the fine-tuning method
-* Options available when using the caption dropout technique
-
-The DreamBooth method and the fine-tuning method can be used together with a training script that supports both.
-Note that the method is determined per dataset, so DreamBooth-method subsets and fine-tuning-method subsets cannot be mixed within the same dataset.
-In other words, to use both methods together, assign subsets of different methods to different datasets.
-
-In terms of program behavior, if the `metadata_file` option exists, it is determined to be a subset of fine-tuning. Therefore, for subsets belonging to the same dataset, as long as they are either "all have the `metadata_file` option" or "all have no `metadata_file` option," there is no problem.
-
-Below, the available options will be explained. For options with the same name as the command-line argument, the explanation will be omitted in principle. Please refer to other READMEs.
-
-### Common options for all learning methods
-
-These are options that can be specified regardless of the learning method.
-
-#### Dataset-specific options
-
-These are options related to the configuration of the dataset. They cannot be described in `datasets.subsets`.
-
-
-| Option Name | Example Setting | `[general]` | `[[datasets]]` |
-| ---- | ---- | ---- | ---- |
-| `batch_size` | `1` | o | o |
-| `bucket_no_upscale` | `true` | o | o |
-| `bucket_reso_steps` | `64` | o | o |
-| `enable_bucket` | `true` | o | o |
-| `max_bucket_reso` | `1024` | o | o |
-| `min_bucket_reso` | `128` | o | o |
-| `resolution` | `256`, `[512, 512]` | o | o |
-
-* `batch_size`
-    * This corresponds to the command-line argument `--train_batch_size`.
-* `max_bucket_reso`, `min_bucket_reso`
-    * Specify the maximum and minimum resolutions of the bucket. They must be divisible by `bucket_reso_steps`.
-
-These settings are fixed per dataset. That means that subsets belonging to the same dataset will share these settings. For example, if you want to prepare datasets with different resolutions, you can define them as separate datasets as shown in the example above, and set different resolutions for each.
-
-#### Options for Subsets
-
-These options are related to subset configuration.
-
-| Option Name | Example | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
-| ---- | ---- | ---- | ---- | ---- |
-| `color_aug` | `false` | o | o | o |
-| `face_crop_aug_range` | `[1.0, 3.0]` | o | o | o |
-| `flip_aug` | `true` | o | o | o |
-| `keep_tokens` | `2` | o | o | o |
-| `num_repeats` | `10` | o | o | o |
-| `random_crop` | `false` | o | o | o |
-| `shuffle_caption` | `true` | o | o | o |
-| `caption_prefix` | `"masterpiece, best quality, "` | o | o | o |
-| `caption_suffix` | `", from side"` | o | o | o |
-| `caption_separator` |  (not specified) | o | o | o |
-| `keep_tokens_separator` | `"\|\|\|"` | o | o | o |
-| `secondary_separator` | `";;;"` | o | o | o |
-| `enable_wildcard` | `true` | o | o | o |
-
-* `num_repeats`
-    * Specifies the number of repeats for images in a subset. This is equivalent to `--dataset_repeats` in fine-tuning but can be specified for any training method.
-* `caption_prefix`, `caption_suffix`
-    * Specifies the strings to be added before and after the captions. Shuffling is performed with these strings included. Be careful when combining them with `keep_tokens`.
-* `caption_separator`
-    * Specifies the string to separate the tags. The default is `,`. This option is usually not necessary to set.
-* `keep_tokens_separator`
-    * Specifies the string that separates the parts of the caption to be kept fixed. For example, if the caption is `aaa, bbb ||| ccc, ddd, eee, fff ||| ggg, hhh`, the parts `aaa, bbb` and `ggg, hhh` stay in place, and the rest is shuffled and dropped. No comma is needed next to the separator. As a result, the prompt will be `aaa, bbb, eee, ccc, fff, ggg, hhh` or `aaa, bbb, fff, ccc, eee, ggg, hhh`, etc.
-* `secondary_separator`
-    * Specifies an additional separator. The part separated by this separator is treated as one tag and is shuffled and dropped. It is then replaced by `caption_separator`. For example, if you specify `aaa;;;bbb;;;ccc`, it will be replaced by `aaa,bbb,ccc` or dropped together.
-* `enable_wildcard`
-    * Enables wildcard notation. This will be explained later.
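
The interaction of `keep_tokens_separator`, `secondary_separator`, and `caption_separator` can be sketched as follows. This is a simplified illustration, not the dataset code: `process_caption` is a hypothetical function, tag dropping is left out, and a caption without `keep_tokens_separator` is assumed to be shuffled as a whole.

```python
import random


def process_caption(caption, keep_tokens_separator="|||", secondary_separator=";;;",
                    caption_separator=",", rng=random):
    """Shuffle only the middle part of a caption; keep the head and tail fixed."""
    parts = [part.strip() for part in caption.split(keep_tokens_separator)]
    if len(parts) == 1:
        head, middle, tail = "", parts[0], ""
    else:
        head, middle = parts[0], parts[1]
        tail = parts[2] if len(parts) > 2 else ""

    def split_tags(text):
        return [tag.strip() for tag in text.split(caption_separator) if tag.strip()]

    # A group such as "ddd;;;eee" is still one element here, so it moves as a unit.
    middle_tags = split_tags(middle)
    rng.shuffle(middle_tags)

    tags = split_tags(head) + middle_tags + split_tags(tail)
    joined = (caption_separator + " ").join(tags)
    # Only after shuffling is secondary_separator replaced by caption_separator.
    return joined.replace(secondary_separator, caption_separator)


result = process_caption(
    "aaa, bbb ||| ccc, ddd;;;eee, fff ||| ggg, hhh", rng=random.Random(0)
)
```

Whatever order the shuffle produces, `result` starts with `aaa, bbb`, ends with `ggg, hhh`, and keeps `ddd,eee` adjacent.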
-
-### DreamBooth-specific options
-
-DreamBooth-specific options exist only as subset-specific options.
-
-#### Subset-specific options
-
-Options related to the configuration of DreamBooth subsets.
-
-| Option Name | Example Setting | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
-| ---- | ---- | ---- | ---- | ---- |
-| `image_dir` | `'C:\hoge'` | - | - | o (required) |
-| `caption_extension` | `".txt"` | o | o | o |
-| `class_tokens` | `"sks girl"` | - | - | o |
-| `cache_info` | `false` | o | o | o |
-| `is_reg` | `false` | - | - | o |
-
-First, note that the image files must be located directly in the directory specified by `image_dir`. This differs from the conventional DreamBooth layout, in which images are placed in subdirectories, and the two are not compatible. Also, naming a folder something like `5_cat` does not set the number of repeats or the class name. To set these individually, specify them explicitly with `num_repeats` and `class_tokens`.
-
-* `image_dir`
-    * Specifies the path to the image directory. This is a required option.
-    * Images must be placed directly under the directory.
-* `class_tokens`
-    * Sets the class tokens.
-    * Only used during training when a corresponding caption file does not exist. The determination of whether or not to use it is made on a per-image basis. If `class_tokens` is not specified and a caption file is not found, an error will occur.
-* `cache_info`
-    * Specifies whether to cache the image size and caption. If not specified, it is set to `false`. The cache is saved in `metadata_cache.json` in `image_dir`.
-    * Caching speeds up the loading of the dataset after the first time. It is effective when dealing with thousands of images or more.
-* `is_reg`
-    * Specifies whether the subset images are regularization images. If not specified, it is set to `false`, meaning that the images are not regularization images.
-
-### Fine-tuning method specific options
-
-The options for the fine-tuning method only exist for subset-specific options.
-
-#### Subset-specific options
-
-These options are related to the configuration of the fine-tuning method's subsets.
-
-| Option name | Example setting | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
-| ---- | ---- | ---- | ---- | ---- |
-| `image_dir` | `'C:\hoge'` | - | - | o |
-| `metadata_file` | `'C:\piyo\piyo_md.json'` | - | - | o (required) |
-
-* `image_dir`
-    * Specify the path to the image directory. Unlike the DreamBooth method, specifying it is not mandatory, but it is recommended to do so.
-        * It can be omitted when `--full_path` was added to the command line when generating the metadata file.
-    * The images must be placed directly under the directory.
-* `metadata_file`
-    * Specify the path to the metadata file used for the subset. This is a required option.
-        * It is equivalent to the command-line argument `--in_json`.
-    * Because a metadata file must be specified for each subset, avoid combining images from different directories into a single metadata file. It is strongly recommended to prepare a separate metadata file for each image directory and register each as a separate subset.
-
-### Options available when caption dropout method can be used
-
-The options available when the caption dropout method can be used exist only for subsets. Regardless of whether it's the DreamBooth method or fine-tuning method, if it supports caption dropout, it can be specified.
-
-#### Subset-specific options
-
-Options for subsets that support caption dropout.
-
-| Option Name | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
-| ---- | ---- | ---- | ---- |
-| `caption_dropout_every_n_epochs` | o | o | o |
-| `caption_dropout_rate` | o | o | o |
-| `caption_tag_dropout_rate` | o | o | o |
-
-## Behavior when there are duplicate subsets
-
-In the case of the DreamBooth dataset, if there are multiple `image_dir` directories with the same content, they are considered to be duplicate subsets. For the fine-tuning dataset, if there are multiple `metadata_file` files with the same content, they are considered to be duplicate subsets. If duplicate subsets exist in the dataset, subsequent subsets will be ignored.
-
-However, if they belong to different datasets, they are not considered duplicates. For example, if you have subsets with the same `image_dir` in different datasets, they will not be considered duplicates. This is useful when you want to train with the same image but with different resolutions.
-
-```toml
-# If data sets exist separately, they are not considered duplicates and are both used for training.
-
-[[datasets]]
-resolution = 512
-
-  [[datasets.subsets]]
-  image_dir = 'C:\hoge'
-
-[[datasets]]
-resolution = 768
-
-  [[datasets.subsets]]
-  image_dir = 'C:\hoge'
-```
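
The duplicate rule can be pictured as a per-dataset filter. This is a sketch under assumptions, not the repository's logic: `drop_duplicate_subsets` is a hypothetical helper, and duplicates are detected here by comparing the path strings exactly.

```python
def drop_duplicate_subsets(datasets, key="image_dir"):
    """Within each dataset, keep only the first subset for each `key` value."""
    cleaned = []
    for dataset in datasets:
        seen = set()
        subsets = []
        for subset in dataset["subsets"]:
            if subset[key] in seen:
                continue  # a later duplicate inside the same dataset is ignored
            seen.add(subset[key])
            subsets.append(subset)
        cleaned.append({**dataset, "subsets": subsets})
    return cleaned


datasets = [
    {"resolution": 512, "subsets": [{"image_dir": r"C:\hoge"}, {"image_dir": r"C:\hoge"}]},
    {"resolution": 768, "subsets": [{"image_dir": r"C:\hoge"}]},
]
cleaned = drop_duplicate_subsets(datasets)
```

The second `C:\hoge` entry in the first dataset is dropped, while the same directory in the second dataset is kept because the `seen` set is reset for every dataset.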
-
-## Command Line Argument and Configuration File
-
-There are options in the configuration file that have overlapping roles with command line argument options.
-
-The following command line argument options are ignored if a configuration file is passed:
-
-* `--train_data_dir`
-* `--reg_data_dir`
-* `--in_json`
-
-The following command line argument options are given priority over the configuration file options if both are specified simultaneously. In most cases, they have the same names as the corresponding options in the configuration file.
-
-| Command Line Argument Option   | Prioritized Configuration File Option |
-| ------------------------------- | ------------------------------------- |
-| `--bucket_no_upscale`           |                                       |
-| `--bucket_reso_steps`           |                                       |
-| `--caption_dropout_every_n_epochs` |                                       |
-| `--caption_dropout_rate`        |                                       |
-| `--caption_extension`           |                                       |
-| `--caption_tag_dropout_rate`    |                                       |
-| `--color_aug`                   |                                       |
-| `--dataset_repeats`             | `num_repeats`                          |
-| `--enable_bucket`               |                                       |
-| `--face_crop_aug_range`         |                                       |
-| `--flip_aug`                    |                                       |
-| `--keep_tokens`                 |                                       |
-| `--min_bucket_reso`              |                                       |
-| `--random_crop`                 |                                       |
-| `--resolution`                  |                                       |
-| `--shuffle_caption`             |                                       |
-| `--train_batch_size`            | `batch_size`                           |
-
-## Error Guide
-
-Currently, an external library is used to check whether the configuration file is written correctly, but this part is not yet fully developed and the error messages are hard to understand. We plan to improve this in the future.
-
-As a temporary measure, common errors and their solutions are listed below. If you get an error even though the file should be correct, or if you cannot understand the error message, please contact us, as it may be a bug.
-
-* `voluptuous.error.MultipleInvalid: required key not provided @ ...`: This error occurs when a required option is not provided. It is highly likely that you forgot to specify the option or misspelled the option name.
-  * The error location is indicated by `...` in the error message. For example, if you encounter an error like `voluptuous.error.MultipleInvalid: required key not provided @ data['datasets'][0]['subsets'][0]['image_dir']`, it means that the `image_dir` option does not exist in the 0th `subsets` of the 0th `datasets` setting.
-* `voluptuous.error.MultipleInvalid: expected int for dictionary value @ ...`: This error occurs when the value given for an option has the wrong type or format. The `int` part changes depending on the target option. The example configurations in this README may be helpful.
-* `voluptuous.error.MultipleInvalid: extra keys not allowed @ ...`: This error occurs when there is an option name that is not supported. It is highly likely that you misspelled the option name or mistakenly included it.
-
-## Miscellaneous
-
-### Multi-line captions
-
-By setting `enable_wildcard = true`, multiple-line captions are also enabled. If the caption file consists of multiple lines, one line is randomly selected as the caption. 
-
-```txt
-1girl, hatsune miku, vocaloid, upper body, looking at viewer, microphone, stage
-a girl with a microphone standing on a stage
-detailed digital art of a girl with a microphone on a stage
-```
-
-It can be combined with wildcard notation.
-
-In metadata files, you can also specify multiple-line captions. In the `.json` metadata file, use `\n` to represent a line break. If the caption file consists of multiple lines, `merge_captions_to_metadata.py` will create a metadata file in this format.
-
-The tags in the metadata (`tags`) are added to each line of the caption.
-
-```json
-{
-    "/path/to/image.png": {
-        "caption": "a cartoon of a frog with the word frog on it\ntest multiline caption1\ntest multiline caption2",
-        "tags": "open mouth, simple background, standing, no humans, animal, black background, frog, animal costume, animal focus"
-    },
-    ...
-}
-```
-
-In this case, the actual caption will be `a cartoon of a frog with the word frog on it, open mouth, simple background ...`, `test multiline caption1, open mouth, simple background ...`, `test multiline caption2, open mouth, simple background ...`, etc.
-
-### Example of configuration file : `secondary_separator`, wildcard notation, `keep_tokens_separator`, etc.
-
-```toml
-[general]
-flip_aug = true
-color_aug = false
-resolution = [1024, 1024]
-
-[[datasets]]
-batch_size = 6
-enable_bucket = true
-bucket_no_upscale = true
-caption_extension = ".txt"
-keep_tokens_separator= "|||"
-shuffle_caption = true
-caption_tag_dropout_rate = 0.1
-secondary_separator = ";;;" # subset 側に書くこともできます / can be written in the subset side
-enable_wildcard = true # 同上 / same as above
-
-  [[datasets.subsets]]
-  image_dir = "/path/to/image_dir"
-  num_repeats = 1
-
-  # ||| の前後はカンマは不要です（自動的に追加されます） / No comma is required before and after ||| (it is added automatically)
-  caption_prefix = "1girl, hatsune miku, vocaloid |||" 
-  
-  # ||| の後はシャッフル、drop されず残ります / After |||, it is not shuffled or dropped and remains
-  # 単純に文字列として連結されるので、カンマなどは自分で入れる必要があります / It is simply concatenated as a string, so you need to put commas yourself
-  caption_suffix = ", anime screencap ||| masterpiece, rating: general"
-```
-
-### Example of caption, secondary_separator notation: `secondary_separator = ";;;"`
-
-```txt
-1girl, hatsune miku, vocaloid, upper body, looking at viewer, sky;;;cloud;;;day, outdoors
-```
-The part `sky;;;cloud;;;day` is replaced with `sky,cloud,day` without shuffling or dropping. When shuffling and dropping are enabled, it is processed as a whole (as one tag). For example, it becomes `vocaloid, 1girl, upper body, sky,cloud,day, outdoors, hatsune miku` (shuffled) or `vocaloid, 1girl, outdoors, looking at viewer, upper body, hatsune miku` (dropped).
-
-### Example of caption, enable_wildcard notation: `enable_wildcard = true`
-
-```txt
-1girl, hatsune miku, vocaloid, upper body, looking at viewer, {simple|white} background
-```
-`simple` or `white` is randomly selected, and it becomes `simple background` or `white background`.
-
-```txt
-1girl, hatsune miku, vocaloid, {{retro style}}
-```
-If you want to include `{` or `}` in the tag string, double them like `{{` or `}}` (in this example, the actual caption used for training is `{retro style}`).
-
-### Example of caption, `keep_tokens_separator` notation: `keep_tokens_separator = "|||"`
-
-```txt
-1girl, hatsune miku, vocaloid ||| stage, microphone, white shirt, smile ||| best quality, rating: general
-```
-It becomes `1girl, hatsune miku, vocaloid, microphone, stage, white shirt, best quality, rating: general` or `1girl, hatsune miku, vocaloid, white shirt, smile, stage, microphone, best quality, rating: general` etc.
-
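The caption notations described in the removed English README above (`enable_wildcard`, `secondary_separator`, `keep_tokens_separator`) can be illustrated with a minimal Python sketch. This is not the repository's implementation: the function names `expand_wildcards` and `process_caption`, the order of the steps, and the handling of a caption with only one `|||` are all assumptions made for illustration, and tag dropping is left out.

```python
import random
import re


def expand_wildcards(text, rng):
    """Pick one alternative from each {a|b} group; {{ and }} become literal braces."""
    # Protect doubled braces first so they are not treated as wildcard groups.
    text = text.replace("{{", "\x00").replace("}}", "\x01")
    text = re.sub(r"\{([^{}]*)\}", lambda m: rng.choice(m.group(1).split("|")), text)
    return text.replace("\x00", "{").replace("\x01", "}")


def process_caption(caption, rng, keep_sep="|||", secondary_sep=";;;", shuffle=True):
    """Hypothetical helper: fixed head ||| shuffled middle ||| fixed tail."""
    caption = expand_wildcards(caption, rng)
    parts = [p.strip() for p in caption.split(keep_sep, 2)]
    head = parts[0] if len(parts) > 1 else ""
    middle = parts[1] if len(parts) > 1 else parts[0]
    tail = parts[2] if len(parts) > 2 else ""

    # Only the middle part is split into tags and shuffled.
    tags = [t.strip() for t in middle.split(",") if t.strip()]
    if shuffle:
        rng.shuffle(tags)
    # A group joined by the secondary separator moves as one tag,
    # then the separator is replaced with a comma.
    tags = [t.replace(secondary_sep, ",") for t in tags]

    return ", ".join(s for s in (head, ", ".join(tags), tail) if s)


rng = random.Random()
print(process_caption(
    "1girl, hatsune miku ||| stage, sky;;;cloud;;;day, {simple|white} background ||| best quality",
    rng,
))
```

Passing a seeded `random.Random` instance makes the shuffle and the wildcard choice reproducible, which is convenient when checking how a caption file would be expanded.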
--- a/docs/config_README-ja.md
+++ b/docs/config_README-ja.md
@@ -1,3 +1,5 @@
+For non-Japanese speakers: this README is currently available only in Japanese. Sorry for the inconvenience. We will provide an English version in the near future.
+
 `--dataset_config` で渡すことができる設定ファイルに関する説明です。

 ## 概要
@@ -118,8 +120,6 @@ DreamBooth の手法と fine tuning の手法の両方とも利用可能な学

 * `batch_size`
    * コマンドライン引数の `--train_batch_size` と同等です。
-* `max_bucket_reso`, `min_bucket_reso`
-    * bucketの最大、最小解像度を指定します。`bucket_reso_steps` で割り切れる必要があります。

 これらの設定はデータセットごとに固定です。
 つまり、データセットに所属するサブセットはこれらの設定を共有することになります。
@@ -140,28 +140,12 @@ DreamBooth の手法と fine tuning の手法の両方とも利用可能な学
 | `shuffle_caption` | `true` | o | o | o |
 | `caption_prefix` | `“masterpiece, best quality, ”` | o | o | o |
 | `caption_suffix` | `“, from side”` | o | o | o |
-| `caption_separator` | （通常は設定しません） | o | o | o |
-| `keep_tokens_separator` | `“|||”` | o | o | o |
-| `secondary_separator` | `“;;;”` | o | o | o |
-| `enable_wildcard` | `true` | o | o | o |

 * `num_repeats`
    * サブセットの画像の繰り返し回数を指定します。fine tuning における `--dataset_repeats` に相当しますが、`num_repeats` はどの学習方法でも指定可能です。
 * `caption_prefix`, `caption_suffix`
    * キャプションの前、後に付与する文字列を指定します。シャッフルはこれらの文字列を含めた状態で行われます。`keep_tokens` を指定する場合には注意してください。

-* `caption_separator`
-    * タグを区切る文字列を指定します。デフォルトは `,` です。このオプションは通常は設定する必要はありません。
-
-* `keep_tokens_separator`
-    *  キャプションで固定したい部分を区切る文字列を指定します。たとえば `aaa, bbb ||| ccc, ddd, eee, fff ||| ggg, hhh` のように指定すると、`aaa, bbb` と `ggg, hhh` の部分はシャッフル、drop されず残ります。間のカンマは不要です。結果としてプロンプトは `aaa, bbb, eee, ccc, fff, ggg, hhh` や `aaa, bbb, fff, ccc, eee, ggg, hhh` などになります。
-
-* `secondary_separator`
-    * 追加の区切り文字を指定します。この区切り文字で区切られた部分は一つのタグとして扱われ、シャッフル、drop されます。その後、`caption_separator` に置き換えられます。たとえば `aaa;;;bbb;;;ccc` のように指定すると、`aaa,bbb,ccc` に置き換えられるか、まとめて drop されます。
-
-* `enable_wildcard`
-    * ワイルドカード記法および複数行キャプションを有効にします。ワイルドカード記法、複数行キャプションについては後述します。
-
 ### DreamBooth 方式専用のオプション

 DreamBooth 方式のオプションは、サブセット向けオプションのみ存在します。
@@ -175,7 +159,6 @@ DreamBooth 方式のサブセットの設定に関わるオプションです。
 | `image_dir` | `‘C:\hoge’` | - | - | o（必須） |
 | `caption_extension` | `".txt"` | o | o | o |
 | `class_tokens` | `“sks girl”` | - | - | o |
-| `cache_info` | `false` | o | o | o | 
 | `is_reg` | `false` | - | - | o |

 まず注意点として、 `image_dir` には画像ファイルが直下に置かれているパスを指定する必要があります。従来の DreamBooth の手法ではサブディレクトリに画像を置く必要がありましたが、そちらとは仕様に互換性がありません。また、`5_cat` のようなフォルダ名にしても、画像の繰り返し回数とクラス名は反映されません。これらを個別に設定したい場合、`num_repeats` と `class_tokens` で明示的に指定する必要があることに注意してください。
@@ -186,9 +169,6 @@ DreamBooth 方式のサブセットの設定に関わるオプションです。
 * `class_tokens`
    * クラストークンを設定します。
    * 画像に対応する caption ファイルが存在しない場合にのみ学習時に利用されます。利用するかどうかの判定は画像ごとに行います。`class_tokens` を指定しなかった場合に caption ファイルも見つからなかった場合にはエラーになります。
-* `cache_info`
-    * 画像サイズ、キャプションをキャッシュするかどうかを指定します。指定しなかった場合は `false` になります。キャッシュは `image_dir` に `metadata_cache.json` というファイル名で保存されます。
-    * キャッシュを行うと、二回目以降のデータセット読み込みが高速化されます。数千枚以上の画像を扱う場合には有効です。
 * `is_reg`
    * サブセットの画像が正規化用かどうかを指定します。指定しなかった場合は `false` として、つまり正規化画像ではないとして扱います。

@@ -300,89 +280,4 @@ resolution = 768
 * `voluptuous.error.MultipleInvalid: expected int for dictionary value @ ...`: 指定する値の形式が不正というエラーです。値の形式が間違っている可能性が高いです。`int` の部分は対象となるオプションによって変わります。この README に載っているオプションの「設定例」が役立つかもしれません。
 * `voluptuous.error.MultipleInvalid: extra keys not allowed @ ...`: 対応していないオプション名が存在している場合に発生するエラーです。オプション名を間違って記述しているか、誤って紛れ込んでいる可能性が高いです。

-## その他

-### 複数行キャプション
-
-`enable_wildcard = true` を設定することで、複数行キャプションも同時に有効になります。キャプションファイルが複数の行からなる場合、ランダムに一つの行が選ばれてキャプションとして利用されます。
-
-```txt
-1girl, hatsune miku, vocaloid, upper body, looking at viewer, microphone, stage
-a girl with a microphone standing on a stage
-detailed digital art of a girl with a microphone on a stage
-```
-
-ワイルドカード記法と組み合わせることも可能です。
-
-メタデータファイルでも同様に複数行キャプションを指定することができます。メタデータの .json 内には、`\n` を使って改行を表現してください。キャプションファイルが複数行からなる場合、`merge_captions_to_metadata.py` を使うと、この形式でメタデータファイルが作成されます。
-
-メタデータのタグ (`tags`) は、キャプションの各行に追加されます。
-
-```json
-{
-    "/path/to/image.png": {
-        "caption": "a cartoon of a frog with the word frog on it\ntest multiline caption1\ntest multiline caption2",
-        "tags": "open mouth, simple background, standing, no humans, animal, black background, frog, animal costume, animal focus"
-    },
-    ...
-}
-```
-
-この場合、実際のキャプションは `a cartoon of a frog with the word frog on it, open mouth, simple background ...` または `test multiline caption1, open mouth, simple background ...`、 `test multiline caption2, open mouth, simple background ...` 等になります。
-
-### 設定ファイルの記述例：追加の区切り文字、ワイルドカード記法、`keep_tokens_separator` 等
-
-```toml
-[general]
-flip_aug = true
-color_aug = false
-resolution = [1024, 1024]
-
-[[datasets]]
-batch_size = 6
-enable_bucket = true
-bucket_no_upscale = true
-caption_extension = ".txt"
-keep_tokens_separator= "|||"
-shuffle_caption = true
-caption_tag_dropout_rate = 0.1
-secondary_separator = ";;;" # subset 側に書くこともできます / can be written in the subset side
-enable_wildcard = true # 同上 / same as above
-
-  [[datasets.subsets]]
-  image_dir = "/path/to/image_dir"
-  num_repeats = 1
-
-  # ||| の前後はカンマは不要です（自動的に追加されます） / No comma is required before and after ||| (it is added automatically)
-  caption_prefix = "1girl, hatsune miku, vocaloid |||" 
-  
-  # ||| の後はシャッフル、drop されず残ります / After |||, it is not shuffled or dropped and remains
-  # 単純に文字列として連結されるので、カンマなどは自分で入れる必要があります / It is simply concatenated as a string, so you need to put commas yourself
-  caption_suffix = ", anime screencap ||| masterpiece, rating: general"
-```
-
-### キャプション記述例、secondary_separator 記法：`secondary_separator = ";;;"` の場合
-
-```txt
-1girl, hatsune miku, vocaloid, upper body, looking at viewer, sky;;;cloud;;;day, outdoors
-```
-`sky;;;cloud;;;day` の部分はシャッフル、drop されず `sky,cloud,day` に置換されます。シャッフル、drop が有効な場合、まとめて（一つのタグとして）処理されます。つまり `vocaloid, 1girl, upper body, sky,cloud,day, outdoors, hatsune miku` （シャッフル）や `vocaloid, 1girl, outdoors, looking at viewer, upper body, hatsune miku` （drop されたケース）などになります。
-
-### キャプション記述例、ワイルドカード記法： `enable_wildcard = true` の場合
-
-```txt
-1girl, hatsune miku, vocaloid, upper body, looking at viewer, {simple|white} background
-```
-ランダムに `simple` または `white` が選ばれ、`simple background` または `white background` になります。
-
-```txt
-1girl, hatsune miku, vocaloid, {{retro style}}
-```
-タグ文字列に `{` や `}` そのものを含めたい場合は `{{` や `}}` のように二つ重ねてください（この例では実際に学習に用いられるキャプションは `{retro style}` になります）。
-
-### キャプション記述例、`keep_tokens_separator` 記法： `keep_tokens_separator = "|||"` の場合
-
-```txt
-1girl, hatsune miku, vocaloid ||| stage, microphone, white shirt, smile ||| best quality, rating: general
-```
-`1girl, hatsune miku, vocaloid, microphone, stage, white shirt, best quality, rating: general` や `1girl, hatsune miku, vocaloid, white shirt, smile, stage, microphone, best quality, rating: general` などになります。
--- a/docs/masked_loss_README-ja.md
+++ b/docs/masked_loss_README-ja.md
@@ -1,57 +0,0 @@
-## マスクロスについて
-
-マスクロスは、入力画像のマスクで指定された部分だけ損失計算することで、画像の一部分だけを学習することができる機能です。
-たとえばキャラクタを学習したい場合、キャラクタ部分だけをマスクして学習することで、背景を無視して学習することができます。
-
-マスクロスのマスクには、二種類の指定方法があります。
-
- マスク画像を用いる方法
- 透明度（アルファチャネル）を使用する方法
-
-なお、サンプルは [ずんずんPJイラスト/3Dデータ](https://zunko.jp/con_illust.html) の「AI画像モデル用学習データ」を使用しています。
-
-### マスク画像を用いる方法
-
-学習画像それぞれに対応するマスク画像を用意する方法です。学習画像と同じファイル名のマスク画像を用意し、それを学習画像と別のディレクトリに保存します。
-
- 学習画像
-  ![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/607c5116-5f62-47de-8b66-9c4a597f0441)
- マスク画像
-  ![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/53e9b0f8-a4bf-49ed-882d-4026f84e8450)
-
-```.toml
-[[datasets.subsets]]
-image_dir = "/path/to/a_zundamon"
-caption_extension = ".txt"
-conditioning_data_dir = "/path/to/a_zundamon_mask"
-num_repeats = 8
-```
-
-マスク画像は、学習画像と同じサイズで、学習する部分を白、無視する部分を黒で描画します。グレースケールにも対応しています（127 ならロス重みが 0.5 になります）。なお、正確にはマスク画像の R チャネルが用いられます。
-
-DreamBooth 方式の dataset で、`conditioning_data_dir` で指定したディレクトリにマスク画像を保存してください。ControlNet のデータセットと同じですので、詳細は [ControlNet-LLLite](train_lllite_README-ja.md#データセットの準備) を参照してください。
-
-### 透明度（アルファチャネル）を使用する方法
-
-学習画像の透明度（アルファチャネル）がマスクとして使用されます。透明度が 0 の部分は無視され、255 の部分は学習されます。半透明の場合は、その透明度に応じてロス重みが変化します（127 ならおおむね 0.5）。
-
-![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/0baa129b-446a-4aac-b98c-7208efb0e75e)
-
-※それぞれの画像は透過PNG
-
-学習時のスクリプトのオプションに `--alpha_mask` を指定するか、dataset の設定ファイルの subset で、`alpha_mask` を指定してください。たとえば、以下のようになります。
-
-```toml
-[[datasets.subsets]]
-image_dir = "/path/to/image/dir"
-caption_extension = ".txt"
-num_repeats = 8
-alpha_mask = true
-```
-
-## 学習時の注意事項
-
- 現時点では DreamBooth 方式の dataset のみ対応しています。
- マスクは latents のサイズ、つまり 1/8 に縮小されてから適用されます。そのため、細かい部分（たとえばアホ毛やイヤリングなど）はうまく学習できない可能性があります。マスクをわずかに拡張するなどの工夫が必要かもしれません。
- マスクロスを用いる場合、学習対象外の部分をキャプションに含める必要はないかもしれません。（要検証）
- `alpha_mask` の場合、マスクの有無を切り替えると latents キャッシュが自動的に再生成されます。
--- a/docs/masked_loss_README.md
+++ b/docs/masked_loss_README.md
@@ -1,56 +0,0 @@
-## Masked Loss
-
-Masked loss is a feature that allows you to train only part of an image by calculating the loss only for the part specified by the mask of the input image. For example, if you want to train a character, you can train only the character part by masking it, ignoring the background.
-
-There are two ways to specify the mask for masked loss.
-
- Using a mask image
- Using transparency (alpha channel) of the image
-
-The sample uses the "AI image model training data" from [ZunZunPJ Illustration/3D Data](https://zunko.jp/con_illust.html).
-
-### Using a mask image
-
-This is a method of preparing a mask image corresponding to each training image. Prepare a mask image with the same file name as the training image and save it in a different directory from the training image.
-
- Training image
-  ![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/607c5116-5f62-47de-8b66-9c4a597f0441)
- Mask image
-  ![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/53e9b0f8-a4bf-49ed-882d-4026f84e8450)
-
-```.toml
-[[datasets.subsets]]
-image_dir = "/path/to/a_zundamon"
-caption_extension = ".txt"
-conditioning_data_dir = "/path/to/a_zundamon_mask"
-num_repeats = 8
-```
-
-The mask image must be the same size as the training image, with the part to be trained drawn in white and the part to be ignored drawn in black. Grayscale is also supported (a value of 127 gives a loss weight of about 0.5). Strictly speaking, the R channel of the mask image is what is used.
-
-Use the dataset in the DreamBooth method, and save the mask image in the directory specified by `conditioning_data_dir`. It is the same as the ControlNet dataset, so please refer to [ControlNet-LLLite](train_lllite_README.md#Preparing-the-dataset) for details.
-
-### Using transparency (alpha channel) of the image
-
-The transparency (alpha channel) of the training image is used as a mask. The part with transparency 0 is ignored, the part with transparency 255 is trained. For semi-transparent parts, the loss weight changes according to the transparency (127 gives a weight of about 0.5).
-
-![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/0baa129b-446a-4aac-b98c-7208efb0e75e)
-
-※Each image is a transparent PNG
-
-Specify `--alpha_mask` in the training script options or specify `alpha_mask` in the subset of the dataset configuration file. For example, it will look like this.
-
-```toml
-[[datasets.subsets]]
-image_dir = "/path/to/image/dir"
-caption_extension = ".txt"
-num_repeats = 8
-alpha_mask = true
-```
-
-## Notes on training
-
- At the moment, only the dataset in the DreamBooth method is supported.
- The mask is applied after it is reduced to 1/8 size, which is the size of the latents. Therefore, fine details (such as ahoge or earrings) may not be learned well. Slightly dilating the mask may help.
- If using masked loss, it may not be necessary to include parts that are not to be trained in the caption. (To be verified)
- In the case of `alpha_mask`, the latents cache is automatically regenerated when the enable/disable state of the mask is switched.
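The masked-loss description in the removed README above (mask values scaled to loss weights, mask reduced to 1/8 size to match the latents) can be sketched in plain Python. This is an illustration only, not the repository's code: the helper names `mask_to_latent_weights` and `masked_mse` are hypothetical, and averaging each 8x8 block is an assumed downscaling method.

```python
def mask_to_latent_weights(mask, factor=8):
    """Average-pool an 8-bit mask (rows of 0-255 values) by `factor` and scale to 0..1."""
    height, width = len(mask) // factor, len(mask[0]) // factor
    weights = []
    for y in range(height):
        row = []
        for x in range(width):
            block = [
                mask[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            # 255 -> weight 1.0, 0 -> weight 0.0, 127 -> roughly 0.5
            row.append(sum(block) / (len(block) * 255.0))
        weights.append(row)
    return weights


def masked_mse(pred, target, weights):
    """Mean of the per-element squared error multiplied by the mask weight."""
    total, count = 0.0, 0
    for pred_row, target_row, weight_row in zip(pred, target, weights):
        for p, t, w in zip(pred_row, target_row, weight_row):
            total += w * (p - t) ** 2
            count += 1
    return total / count


# An 8x16 mask: left half fully trained (255), right half half-weighted (127).
mask = [[255] * 8 + [127] * 8 for _ in range(8)]
weights = mask_to_latent_weights(mask)
print(weights, masked_mse([[1.0, 1.0]], [[0.0, 0.0]], weights))
```

Because a thin structure can cover only a small fraction of an 8x8 block, its pooled weight ends up small, which is one way to see why fine details may be learned poorly.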
--- a/docs/train_README-ja.md
+++ b/docs/train_README-ja.md
@@ -648,7 +648,7 @@ masterpiece, best quality, 1boy, in business suit, standing at street, looking b

    詳細については各自お調べください。

-    任意のスケジューラを使う場合、任意のオプティマイザと同様に、`--lr_scheduler_args`でオプション引数を指定してください。
+    任意のスケジューラを使う場合、任意のオプティマイザと同様に、`--scheduler_args`でオプション引数を指定してください。

 ### オプティマイザの指定について

--- a/docs/train_README-zh.md
+++ b/docs/train_README-zh.md
@@ -582,7 +582,7 @@ masterpiece, best quality, 1boy, in business suit, standing at street, looking b

    有关详细信息，请自行研究。

-    要使用任何调度程序，请像使用任何优化器一样使用“--lr_scheduler_args”指定可选参数。
+    要使用任何调度程序，请像使用任何优化器一样使用“--scheduler_args”指定可选参数。
 ### 关于指定优化器

 使用 --optimizer_args 选项指定优化器选项参数。可以以key=value的格式指定多个值。此外，您可以指定多个值，以逗号分隔。例如，要指定 AdamW 优化器的参数，``--optimizer_args weight_decay=0.01 betas=.9,.999``。
--- a/docs/train_SDXL-en.md
+++ b/docs/train_SDXL-en.md
@@ -1,84 +0,0 @@
-## SDXL training
-
-The documentation will be moved to the training documentation in the future. The following is a brief explanation of the training scripts for SDXL.
-
-### Training scripts for SDXL
-
- `sdxl_train.py` is a script for SDXL fine-tuning. The usage is almost the same as `fine_tune.py`, but it also supports DreamBooth dataset.
-  - `--full_bf16` option is added. Thanks to KohakuBlueleaf!
-    - This option enables the full bfloat16 training (includes gradients). This option is useful to reduce the GPU memory usage. 
-    - The full bfloat16 training might be unstable. Please use it at your own risk.
-  - The different learning rates for each U-Net block are now supported in sdxl_train.py. Specify with `--block_lr` option. Specify 23 values separated by commas like `--block_lr 1e-3,1e-3 ... 1e-3`.
-    - 23 values correspond to `0: time/label embed, 1-9: input blocks 0-8, 10-12: mid blocks 0-2, 13-21: output blocks 0-8, 22: out`.
- `prepare_buckets_latents.py` now supports SDXL fine-tuning.
-
- `sdxl_train_network.py` is a script for LoRA training for SDXL. The usage is almost the same as `train_network.py`.
-
- Both scripts have the following additional options:
-  - `--cache_text_encoder_outputs` and `--cache_text_encoder_outputs_to_disk`: Cache the outputs of the text encoders. This option is useful to reduce the GPU memory usage. This option cannot be used with options for shuffling or dropping the captions.
-  - `--no_half_vae`: Disable the half-precision (mixed-precision) VAE. VAE for SDXL seems to produce NaNs in some cases. This option is useful to avoid the NaNs.
-
- `--weighted_captions` option is not supported yet for both scripts.
-
- `sdxl_train_textual_inversion.py` is a script for Textual Inversion training for SDXL. The usage is almost the same as `train_textual_inversion.py`.
-  - `--cache_text_encoder_outputs` is not supported.
-  - There are two options for captions:
-    1. Training with captions. All captions must include the token string. The token string is replaced with multiple tokens.
-    2. Use `--use_object_template` or `--use_style_template` option. The captions are generated from the template. The existing captions are ignored.
-  - See below for the format of the embeddings.
-
- `--min_timestep` and `--max_timestep` options are added to each training script. These options can be used to train U-Net with different timesteps. The default values are 0 and 1000.
-
-### Utility scripts for SDXL
-
- `tools/cache_latents.py` is added. This script can be used to cache the latents to disk in advance. 
-  - The options are almost the same as `sdxl_train.py`. See the help message for the usage.
-  - Please launch the script as follows:
-    `accelerate launch  --num_cpu_threads_per_process 1 tools/cache_latents.py ...`
-  - This script should work with multi-GPU, but it is not tested in my environment.
-
- `tools/cache_text_encoder_outputs.py` is added. This script can be used to cache the text encoder outputs to disk in advance. 
-  - The options are almost the same as `cache_latents.py` and `sdxl_train.py`. See the help message for the usage.
-
- `sdxl_gen_img.py` is added. This script can be used to generate images with SDXL, including LoRA, Textual Inversion and ControlNet-LLLite. See the help message for the usage.
-
-### Tips for SDXL training
-
- The default resolution of SDXL is 1024x1024.
- Fine-tuning can be done with 24GB of GPU memory at a batch size of 1. The following options are recommended __for fine-tuning with 24GB GPU memory__:
-  - Train U-Net only.
-  - Use gradient checkpointing.
-  - Use `--cache_text_encoder_outputs` option and caching latents.
-  - Use Adafactor optimizer. RMSprop 8bit or Adagrad 8bit may work. AdamW 8bit doesn't seem to work.
- The LoRA training can be done with 8GB GPU memory (10GB recommended). For reducing the GPU memory usage, the following options are recommended:
-  - Train U-Net only.
-  - Use gradient checkpointing.
-  - Use `--cache_text_encoder_outputs` option and caching latents.
-  - Use one of 8bit optimizers or Adafactor optimizer.
-  - Use lower dim (4 to 8 for 8GB GPU).
- `--network_train_unet_only` option is highly recommended for SDXL LoRA. Because SDXL has two text encoders, training them as well may give unexpected results.
- PyTorch 2 seems to use slightly less GPU memory than PyTorch 1.
- `--bucket_reso_steps` can be set to 32 instead of the default value 64. Smaller values than 32 will not work for SDXL training.
-
-Example of the optimizer settings for Adafactor with the fixed learning rate:
-```toml
-optimizer_type = "adafactor"
-optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False" ]
-lr_scheduler = "constant_with_warmup"
-lr_warmup_steps = 100
-learning_rate = 4e-7 # SDXL original learning rate
-```
-
-### Format of Textual Inversion embeddings for SDXL
-
-```python
-from safetensors.torch import save_file
-
-state_dict = {"clip_g": embs_for_text_encoder_1280, "clip_l": embs_for_text_encoder_768}
-save_file(state_dict, file)
-```
-
-### ControlNet-LLLite
-
-ControlNet-LLLite, a novel method for ControlNet with SDXL, is added. See [documentation](./docs/train_lllite_README.md) for details.
-
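The 23-value layout for `--block_lr` quoted in the removed SDXL document above (`0: time/label embed, 1-9: input blocks 0-8, 10-12: mid blocks 0-2, 13-21: output blocks 0-8, 22: out`) is easy to get wrong by hand. The following is a small sketch for building the comma-separated string; the helper `build_block_lr` and its group names are hypothetical and not part of the repository, and it is assumed that the option accepts any standard float notation.

```python
# Index groups taken from the layout quoted in the document.
BLOCK_GROUPS = {
    "embed": range(0, 1),     # 0: time/label embed
    "input": range(1, 10),    # 1-9: input blocks 0-8
    "mid": range(10, 13),     # 10-12: mid blocks 0-2
    "output": range(13, 22),  # 13-21: output blocks 0-8
    "out": range(22, 23),     # 22: out
}


def build_block_lr(base_lr, overrides=None):
    """Return 23 comma-separated learning rates, one per U-Net block index."""
    values = [base_lr] * 23
    for group, lr in (overrides or {}).items():
        for index in BLOCK_GROUPS[group]:
            values[index] = lr
    return ",".join(f"{v:g}" for v in values)


# Example: halve the learning rate of the mid blocks only.
print("--block_lr", build_block_lr(1e-4, {"mid": 5e-5}))
```

Naming the groups rather than typing 23 numbers keeps the count fixed and makes it clear which blocks an override touches.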
--- a/docs/train_lllite_README-ja.md
+++ b/docs/train_lllite_README-ja.md
@@ -21,13 +21,9 @@ ComfyUIのカスタムノードを用意しています。: https://github.com/k
 ## モデルの学習

 ### データセットの準備
-DreamBooth 方式の dataset で、`conditioning_data_dir` で指定したディレクトリにconditioning imageを格納してください。
+通常のdatasetに加え、`conditioning_data_dir` で指定したディレクトリにconditioning imageを格納してください。conditioning imageは学習用画像と同じbasenameを持つ必要があります。また、conditioning imageは学習用画像と同じサイズに自動的にリサイズされます。conditioning imageにはキャプションファイルは不要です。

-（finetuning 方式の dataset はサポートしていません。）
-
-conditioning imageは学習用画像と同じbasenameを持つ必要があります。また、conditioning imageは学習用画像と同じサイズに自動的にリサイズされます。conditioning imageにはキャプションファイルは不要です。
-
-たとえば、キャプションにフォルダ名ではなくキャプションファイルを用いる場合の設定ファイルは以下のようになります。
+たとえば DreamBooth 方式でキャプションファイルを用いる場合の設定ファイルは以下のようになります。

 ```toml
 [[datasets.subsets]]
--- a/docs/train_lllite_README.md
+++ b/docs/train_lllite_README.md
@@ -26,9 +26,7 @@ Due to the limitations of the inference environment, only CrossAttention (attn1

 ### Preparing the dataset

-In addition to the normal DreamBooth method dataset, please store the conditioning image in the directory specified by `conditioning_data_dir`. The conditioning image must have the same basename as the training image. The conditioning image will be automatically resized to the same size as the training image. The conditioning image does not require a caption file.
-
-(We do not support the finetuning method dataset.)
+In addition to the normal dataset, please store the conditioning image in the directory specified by `conditioning_data_dir`. The conditioning image must have the same basename as the training image. The conditioning image will be automatically resized to the same size as the training image. The conditioning image does not require a caption file.

 ```toml
 [[datasets.subsets]]
--- a/docs/train_network_README-ja.md
+++ b/docs/train_network_README-ja.md
@@ -102,8 +102,6 @@ accelerate launch --num_cpu_threads_per_process 1 train_network.py
  * Text Encoderに関連するLoRAモジュールに、通常の学習率（--learning_rateオプションで指定）とは異なる学習率を使う時に指定します。Text Encoderのほうを若干低めの学習率（5e-5など）にしたほうが良い、という話もあるようです。
 * `--network_args`
  * 複数の引数を指定できます。後述します。
-* `--alpha_mask`
-  * 画像のアルファ値をマスクとして使用します。透過画像を学習する際に使用します。[PR #1223](https://github.com/kohya-ss/sd-scripts/pull/1223)

 `--network_train_unet_only` と `--network_train_text_encoder_only` の両方とも未指定時（デフォルト）はText EncoderとU-Netの両方のLoRAモジュールを有効にします。

@@ -183,16 +181,16 @@ python networks\extract_lora_from_dylora.py --model "foldername/dylora-model.saf

 詳細は[PR #355](https://github.com/kohya-ss/sd-scripts/pull/355) をご覧ください。

-フルモデルの25個のブロックの重みを指定できます。最初のブロックに該当するLoRAは存在しませんが、階層別LoRA適用等との互換性のために25個としています。またconv2d3x3に拡張しない場合も一部のブロックにはLoRAが存在しませんが、記述を統一するため常に25個の値を指定してください。
+SDXLは現在サポートしていません。

-SDXL では down/up 9 個、middle 3 個の値を指定してください。
+フルモデルの25個のブロックの重みを指定できます。最初のブロックに該当するLoRAは存在しませんが、階層別LoRA適用等との互換性のために25個としています。またconv2d3x3に拡張しない場合も一部のブロックにはLoRAが存在しませんが、記述を統一するため常に25個の値を指定してください。

 `--network_args` で以下の引数を指定してください。

 - `down_lr_weight` : U-Netのdown blocksの学習率の重みを指定します。以下が指定可能です。
-  - ブロックごとの重み : `"down_lr_weight=0,0,0,0,0,0,1,1,1,1,1,1"` のように12個（SDXL では 9 個）の数値を指定します。
+  - ブロックごとの重み : `"down_lr_weight=0,0,0,0,0,0,1,1,1,1,1,1"` のように12個の数値を指定します。
  - プリセットからの指定 : `"down_lr_weight=sine"` のように指定します（サインカーブで重みを指定します）。sine, cosine, linear, reverse_linear, zeros が指定可能です。また `"down_lr_weight=cosine+.25"` のように `+数値` を追加すると、指定した数値を加算します（0.25~1.25になります）。
- `mid_lr_weight` : U-Netのmid blockの学習率の重みを指定します。`"mid_lr_weight=0.5"` のように数値を一つだけ指定します（SDXL の場合は 3 個）。
+- `mid_lr_weight` : U-Netのmid blockの学習率の重みを指定します。`"mid_lr_weight=0.5"` のように数値を一つだけ指定します。
 - `up_lr_weight` : U-Netのup blocksの学習率の重みを指定します。down_lr_weightと同様です。
 - 指定を省略した部分は1.0として扱われます。また重みを0にするとそのブロックのLoRAモジュールは作成されません。
 - `block_lr_zero_threshold` : 重みがこの値以下の場合、LoRAモジュールを作成しません。デフォルトは0です。
@@ -217,9 +215,6 @@ network_args = [ "block_lr_zero_threshold=0.1", "down_lr_weight=sine+.5", "mid_l

 フルモデルの25個のブロックのdim (rank)を指定できます。階層別学習率と同様に一部のブロックにはLoRAが存在しない場合がありますが、常に25個の値を指定してください。

-SDXL では 23 個の値を指定してください。一部のブロックにはLoRA が存在しませんが、`sdxl_train.py` の[階層別学習率](./train_SDXL-en.md) との互換性のためです。
-対応は、`0: time/label embed, 1-9: input blocks 0-8, 10-12: mid blocks 0-2, 13-21: output blocks 0-8, 22: out` です。
-
 `--network_args` で以下の引数を指定してください。

 - `block_dims` : 各ブロックのdim (rank)を指定します。`"block_dims=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2"` のように25個の数値を指定します。
--- a/docs/train_network_README-zh.md
+++ b/docs/train_network_README-zh.md
@@ -101,8 +101,6 @@ LoRA的模型将会被保存在通过`--output_dir`选项指定的文件夹中
  * 当在Text Encoder相关的LoRA模块中使用与常规学习率（由`--learning_rate`选项指定）不同的学习率时，应指定此选项。可能最好将Text Encoder的学习率稍微降低（例如5e-5）。
 * `--network_args`
  * 可以指定多个参数。将在下面详细说明。
-* `--alpha_mask`
-  * 使用图像的 Alpha 值作为遮罩。这在学习透明图像时使用。[PR #1223](https://github.com/kohya-ss/sd-scripts/pull/1223)

 当未指定`--network_train_unet_only`和`--network_train_text_encoder_only`时（默认情况），将启用Text Encoder和U-Net的两个LoRA模块。

--- a/docs/wd14_tagger_README-en.md
+++ b/docs/wd14_tagger_README-en.md
@@ -1,88 +0,0 @@
-# Image Tagging using WD14Tagger
-
-This document is based on the information from this github page (https://github.com/toriato/stable-diffusion-webui-wd14-tagger#mrsmilingwolfs-model-aka-waifu-diffusion-14-tagger).
-
-Using onnx for inference is recommended. Please install onnx with the following command:
-
-```powershell
-pip install onnx==1.15.0 onnxruntime-gpu==1.17.1  
-```
-
-The model weights will be automatically downloaded from Hugging Face.
-
-# Usage
-
-Run the script to perform tagging.
-
-```powershell
-python finetune/tag_images_by_wd14_tagger.py --onnx --repo_id <model repo id> --batch_size <batch size> <training data folder>
-```
-
-For example, if using the repository `SmilingWolf/wd-swinv2-tagger-v3` with a batch size of 4, and the training data is located in the parent folder `train_data`, it would be:
-
-```powershell
-python tag_images_by_wd14_tagger.py --onnx --repo_id SmilingWolf/wd-swinv2-tagger-v3 --batch_size 4 ..\train_data
-```
-
-On the first run, the model files will be automatically downloaded to the `wd14_tagger_model` folder (the folder can be changed with an option). 
-
-Tag files will be created in the same directory as the training data images, with the same filename and a `.txt` extension.
-
-![Generated tag files](https://user-images.githubusercontent.com/52813779/208910534-ea514373-1185-4b7d-9ae3-61eb50bc294e.png)
-
-![Tags and image](https://user-images.githubusercontent.com/52813779/208910599-29070c15-7639-474f-b3e4-06bd5a3df29e.png)
-
-## Example
-
-To output in the Animagine XL 3.1 format, it would be as follows (enter on a single line in practice):
-
-```
-python tag_images_by_wd14_tagger.py --onnx --repo_id SmilingWolf/wd-swinv2-tagger-v3 
-    --batch_size 4  --remove_underscore --undesired_tags "PUT,YOUR,UNDESIRED,TAGS" --recursive 
-    --use_rating_tags_as_last_tag --character_tags_first --character_tag_expand 
-    --always_first_tags "1girl,1boy"  ..\train_data
-```
-
-## Available Repository IDs
-
-[SmilingWolf's V2 and V3 models](https://huggingface.co/SmilingWolf) are available for use. Specify them in the format like `SmilingWolf/wd-vit-tagger-v3`. The default when omitted is `SmilingWolf/wd-v1-4-convnext-tagger-v2`.
-
-# Options 
-
-## General Options
-
- `--onnx`: Use ONNX for inference. If not specified, TensorFlow will be used. If using TensorFlow, please install TensorFlow separately. 
- `--batch_size`: Number of images to process at once. Default is 1. Adjust according to VRAM capacity.
- `--caption_extension`: File extension for caption files. Default is `.txt`.
- `--max_data_loader_n_workers`: Maximum number of workers for DataLoader. Specifying a value of 1 or more will use DataLoader to speed up image loading. If unspecified, DataLoader will not be used.
- `--thresh`: Confidence threshold for outputting tags. Default is 0.35. Lowering the value will assign more tags but accuracy will decrease. 
- `--general_threshold`: Confidence threshold for general tags. If omitted, same as `--thresh`.
- `--character_threshold`: Confidence threshold for character tags. If omitted, same as `--thresh`.
- `--recursive`: If specified, subfolders within the specified folder will also be processed recursively.
- `--append_tags`: Append tags to existing tag files.
- `--frequency_tags`: Output tag frequencies.  
- `--debug`: Debug mode. Outputs debug information if specified.
-
-## Model Download
-
- `--model_dir`: Folder to save model files. Default is `wd14_tagger_model`.  
- `--force_download`: Re-download model files if specified.
-
-## Tag Editing
-
- `--remove_underscore`: Remove underscores from output tags.
- `--undesired_tags`: Specify tags not to output. Multiple tags can be specified, separated by commas. For example, `black eyes,black hair`.
- `--use_rating_tags`: Output rating tags at the beginning of the tags.
- `--use_rating_tags_as_last_tag`: Add rating tags at the end of the tags.
- `--character_tags_first`: Output character tags first.
- `--character_tag_expand`: Expand character tag series names. For example, split the tag `chara_name_(series)` into `chara_name, series`.  
- `--always_first_tags`: Specify tags to always output first when a certain tag appears in an image. Multiple tags can be specified, separated by commas. For example, `1girl,1boy`.
- `--caption_separator`: Separate tags with this string in the output file. Default is `, `.
- `--tag_replacement`: Perform tag replacement. Specify in the format `tag1,tag2;tag3,tag4`. If using `,` and `;`, escape them with `\`. \
-    For example, specify `aira tsubase,aira tsubase (uniform)` (when you want to train a specific costume), `aira tsubase,aira tsubase\, heir of shadows` (when the series name is not included in the tag).
-
-When using `tag_replacement`, it is applied after `character_tag_expand`.
-
-When specifying `remove_underscore`, specify `undesired_tags`, `always_first_tags`, and `tag_replacement` without including underscores.
-
-When specifying `caption_separator`, separate `undesired_tags` and `always_first_tags` with `caption_separator`. Always separate `tag_replacement` with `,`.
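The `--tag_replacement` format described in the removed README (`tag1,tag2;tag3,tag4`, with `\` escaping `,` and `;`) can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not the tagger script's actual implementation: the function names `parse_tag_replacement` and `apply_replacements` are hypothetical, and rejecting malformed pairs with an error is an illustrative choice.

```python
def parse_tag_replacement(spec: str) -> dict:
    """Parse 'src,dst;src2,dst2' into {src: dst}, honouring '\\,' and '\\;' escapes."""
    pairs = [[""]]  # each pair is a list of comma-separated fields
    i = 0
    while i < len(spec):
        ch = spec[i]
        # A backslash before ',' or ';' makes the separator a literal character.
        if ch == "\\" and i + 1 < len(spec) and spec[i + 1] in ",;":
            pairs[-1][-1] += spec[i + 1]
            i += 2
            continue
        if ch == ";":
            pairs.append([""])      # start the next replacement pair
        elif ch == ",":
            pairs[-1].append("")    # start the next field of the current pair
        else:
            pairs[-1][-1] += ch
        i += 1

    replacements = {}
    for fields in pairs:
        if fields == [""]:
            continue  # ignore empty segments such as a trailing ';'
        if len(fields) != 2:
            raise ValueError(f"expected 'source,target', got {fields!r}")
        source, target = fields
        replacements[source] = target
    return replacements


def apply_replacements(tags, replacements):
    """Return tags with each exact match replaced; unmatched tags pass through."""
    return [replacements.get(tag, tag) for tag in tags]


if __name__ == "__main__":
    spec = r"aira tsubase,aira tsubase\, heir of shadows"
    print(apply_replacements(["1girl", "aira tsubase"], parse_tag_replacement(spec)))
```

Parsing both separators in one pass keeps an escaped `,` inside a tag from being mistaken for the source/target delimiter, which is the situation in the README's `heir of shadows` example.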
--- a/docs/wd14_tagger_README-ja.md
+++ b/docs/wd14_tagger_README-ja.md
@@ -1,88 +0,0 @@
-# WD14Taggerによるタグ付け
-
-こちらのgithubページ（https://github.com/toriato/stable-diffusion-webui-wd14-tagger#mrsmilingwolfs-model-aka-waifu-diffusion-14-tagger ）の情報を参考にさせていただきました。
-
-onnx を用いた推論を推奨します。以下のコマンドで onnx をインストールしてください。
-
-```powershell
-pip install onnx==1.15.0 onnxruntime-gpu==1.17.1
-```
-
-モデルの重みはHugging Faceから自動的にダウンロードしてきます。
-
-# 使い方
-
-スクリプトを実行してタグ付けを行います。
-```
-python fintune/tag_images_by_wd14_tagger.py --onnx --repo_id <モデルのrepo id> --batch_size <バッチサイズ> <教師データフォルダ>
-```
-
-レポジトリに `SmilingWolf/wd-swinv2-tagger-v3` を使用し、バッチサイズを4にして、教師データを親フォルダの `train_data`に置いた場合、以下のようになります。
-
-```
-python tag_images_by_wd14_tagger.py --onnx --repo_id SmilingWolf/wd-swinv2-tagger-v3 --batch_size 4 ..\train_data
-```
-
-初回起動時にはモデルファイルが `wd14_tagger_model` フォルダに自動的にダウンロードされます（フォルダはオプションで変えられます）。
-
-タグファイルが教師データ画像と同じディレクトリに、同じファイル名、拡張子.txtで作成されます。
-
-![生成されたタグファイル](https://user-images.githubusercontent.com/52813779/208910534-ea514373-1185-4b7d-9ae3-61eb50bc294e.png)
-
-![タグと画像](https://user-images.githubusercontent.com/52813779/208910599-29070c15-7639-474f-b3e4-06bd5a3df29e.png)
-
-## 記述例
-
-Animagine XL 3.1 方式で出力する場合、以下のようになります（実際には 1 行で入力してください）。
-
-```
-python tag_images_by_wd14_tagger.py --onnx --repo_id SmilingWolf/wd-swinv2-tagger-v3 
-    --batch_size 4  --remove_underscore --undesired_tags "PUT,YOUR,UNDESIRED,TAGS" --recursive 
-    --use_rating_tags_as_last_tag --character_tags_first --character_tag_expand 
-    --always_first_tags "1girl,1boy"  ..\train_data
-```
-
-## 使用可能なリポジトリID
-
-[SmilingWolf 氏の V2、V3 のモデル](https://huggingface.co/SmilingWolf)が使用可能です。`SmilingWolf/wd-vit-tagger-v3` のように指定してください。省略時のデフォルトは `SmilingWolf/wd-v1-4-convnext-tagger-v2` です。
-
-# オプション
-
-## 一般オプション
-
- `--onnx` : ONNX を使用して推論します。指定しない場合は TensorFlow を使用します。TensorFlow 使用時は別途 TensorFlow をインストールしてください。
- `--batch_size` : 一度に処理する画像の数。デフォルトは1です。VRAMの容量に応じて増減してください。
- `--caption_extension` : キャプションファイルの拡張子。デフォルトは `.txt` です。
- `--max_data_loader_n_workers` : DataLoader の最大ワーカー数です。このオプションに 1 以上の数値を指定すると、DataLoader を用いて画像読み込みを高速化します。未指定時は DataLoader を用いません。
- `--thresh` : 出力するタグの信頼度の閾値。デフォルトは0.35です。値を下げるとより多くのタグが付与されますが、精度は下がります。
- `--general_threshold` : 一般タグの信頼度の閾値。省略時は `--thresh` と同じです。
- `--character_threshold` : キャラクタータグの信頼度の閾値。省略時は `--thresh` と同じです。
- `--recursive` : 指定すると、指定したフォルダ内のサブフォルダも再帰的に処理します。
- `--append_tags` : 既存のタグファイルにタグを追加します。
- `--frequency_tags` : タグの頻度を出力します。
- `--debug` : デバッグモード。指定するとデバッグ情報を出力します。
-
-## モデルのダウンロード
-
- `--model_dir` : モデルファイルの保存先フォルダ。デフォルトは `wd14_tagger_model` です。
- `--force_download` : 指定するとモデルファイルを再ダウンロードします。
-
-## タグ編集関連
-
- `--remove_underscore` : 出力するタグからアンダースコアを削除します。
- `--undesired_tags` : 出力しないタグを指定します。カンマ区切りで複数指定できます。たとえば `black eyes,black hair` のように指定します。
- `--use_rating_tags` : タグの最初にレーティングタグを出力します。
- `--use_rating_tags_as_last_tag` : タグの最後にレーティングタグを追加します。
- `--character_tags_first` : キャラクタータグを最初に出力します。
- `--character_tag_expand` : キャラクタータグのシリーズ名を展開します。たとえば `chara_name_(series)` のタグを `chara_name, series` に分割します。
- `--always_first_tags` : あるタグが画像に出力されたとき、そのタグを最初に出力するタグを指定します。カンマ区切りで複数指定できます。たとえば `1girl,1boy` のように指定します。
- `--caption_separator` : 出力するファイルでタグをこの文字列で区切ります。デフォルトは `, ` です。
- `--tag_replacement` : タグの置換を行います。`tag1,tag2;tag3,tag4` のように指定します。`,` および `;` を使う場合は `\` でエスケープしてください。\
-    たとえば `aira tsubase,aira tsubase (uniform)` （特定の衣装を学習させたいとき）、`aira tsubase,aira tsubase\, heir of shadows` （シリーズ名がタグに含まれないとき）のように指定します。
-
-`tag_replacement` は `character_tag_expand` の後に適用されます。
-
-`remove_underscore` 指定時は、`undesired_tags`、`always_first_tags`、`tag_replacement` はアンダースコアを含めずに指定してください。
-
-`caption_separator` 指定時は、`undesired_tags`、`always_first_tags` は `caption_separator`  で区切ってください。`tag_replacement` は必ず `,` で区切ってください。
-
--- a/fine_tune.py
+++ b/fine_tune.py
@@ -10,9 +10,7 @@ import toml
 from tqdm import tqdm

 import torch
-from library import deepspeed_utils
 from library.device_utils import init_ipex, clean_memory_on_device
-
 init_ipex()

 from accelerate.utils import set_seed
@@ -44,7 +42,6 @@ from library.custom_train_functions import (
 def train(args):
    train_util.verify_training_args(args)
    train_util.prepare_dataset_args(args, True)
-    deepspeed_utils.prepare_deepspeed_args(args)
    setup_logging(args, reset=True)

    cache_latents = args.cache_latents
@@ -91,8 +88,6 @@ def train(args):
    ds_for_collator = train_dataset_group if args.max_data_loader_n_workers == 0 else None
    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)

-    train_dataset_group.verify_bucket_reso_steps(64)
-
    if args.debug_dataset:
        train_util.debug_dataset(train_dataset_group)
        return
@@ -113,7 +108,6 @@ def train(args):

    # mixed precisionに対応した型を用意しておき適宜castする
    weight_dtype, save_dtype = train_util.prepare_dtype(args)
-    vae_dtype = torch.float32 if args.no_half_vae else weight_dtype

    # モデルを読み込む
    text_encoder, vae, unet, load_stable_diffusion_format = train_util.load_target_model(args, weight_dtype, accelerator)
@@ -164,7 +158,7 @@ def train(args):

    # 学習を準備する
    if cache_latents:
-        vae.to(accelerator.device, dtype=vae_dtype)
+        vae.to(accelerator.device, dtype=weight_dtype)
        vae.requires_grad_(False)
        vae.eval()
        with torch.no_grad():
@@ -197,7 +191,7 @@ def train(args):
    if not cache_latents:
        vae.requires_grad_(False)
        vae.eval()
-        vae.to(accelerator.device, dtype=vae_dtype)
+        vae.to(accelerator.device, dtype=weight_dtype)

    for m in training_models:
        m.requires_grad_(True)
@@ -252,23 +246,13 @@ def train(args):
        unet.to(weight_dtype)
        text_encoder.to(weight_dtype)

-    if args.deepspeed:
-        if args.train_text_encoder:
-            ds_model = deepspeed_utils.prepare_deepspeed_model(args, unet=unet, text_encoder=text_encoder)
-        else:
-            ds_model = deepspeed_utils.prepare_deepspeed_model(args, unet=unet)
-        ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
-            ds_model, optimizer, train_dataloader, lr_scheduler
+    # acceleratorがなんかよろしくやってくれるらしい
+    if args.train_text_encoder:
+        unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+            unet, text_encoder, optimizer, train_dataloader, lr_scheduler
        )
-        training_models = [ds_model]
    else:
-        # acceleratorがなんかよろしくやってくれるらしい
-        if args.train_text_encoder:
-            unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
-                unet, text_encoder, optimizer, train_dataloader, lr_scheduler
-            )
-        else:
-            unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, optimizer, train_dataloader, lr_scheduler)
+        unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, optimizer, train_dataloader, lr_scheduler)

    # 実験的機能：勾配も含めたfp16学習を行う　PyTorchにパッチを当ててfp16でのgrad scaleを有効にする
    if args.full_fp16:
@@ -312,11 +296,7 @@ def train(args):
            init_kwargs["wandb"] = {"name": args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
-        accelerator.init_trackers(
-            "finetuning" if args.log_tracker_name is None else args.log_tracker_name,
-            config=train_util.get_sanitized_config_or_none(args),
-            init_kwargs=init_kwargs,
-        )
+        accelerator.init_trackers("finetuning" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs)

    # For --sample_at_first
    train_util.sample_images(accelerator, args, 0, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
@@ -331,13 +311,13 @@ def train(args):

        for step, batch in enumerate(train_dataloader):
            current_step.value = global_step
-            with accelerator.accumulate(*training_models):
+            with accelerator.accumulate(training_models[0]):  # 複数モデルに対応していない模様だがとりあえずこうしておく
                with torch.no_grad():
                    if "latents" in batch and batch["latents"] is not None:
-                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                        latents = batch["latents"].to(accelerator.device)  # .to(dtype=weight_dtype)
                    else:
                        # latentに変換
-                        latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample().to(weight_dtype)
+                        latents = vae.encode(batch["images"].to(dtype=weight_dtype)).latent_dist.sample()
                    latents = latents * 0.18215
                b_size = latents.shape[0]

@@ -360,9 +340,7 @@ def train(args):

                # Sample noise, sample a random timestep for each image, and add noise to the latents,
                # with noise offset and/or multires noise if specified
-                noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(
-                    args, noise_scheduler, latents
-                )
+                noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)

                # Predict the noise residual
                with accelerator.autocast():
@@ -376,9 +354,7 @@ def train(args):

                if args.min_snr_gamma or args.scale_v_pred_loss_like_noise_pred or args.debiased_estimation_loss:
                    # do not mean over batch dimension for snr weight or scale v-pred loss
-                    loss = train_util.conditional_loss(
-                        noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c
-                    )
+                    loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                    loss = loss.mean([1, 2, 3])

                    if args.min_snr_gamma:
@@ -386,13 +362,11 @@ def train(args):
                    if args.scale_v_pred_loss_like_noise_pred:
                        loss = scale_v_prediction_loss_like_noise_prediction(loss, timesteps, noise_scheduler)
                    if args.debiased_estimation_loss:
-                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler)

                    loss = loss.mean()  # mean over batch dimension
                else:
-                    loss = train_util.conditional_loss(
-                        noise_pred.float(), target.float(), reduction="mean", loss_type=args.loss_type, huber_c=huber_c
-                    )
+                    loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="mean")

                accelerator.backward(loss)
                if accelerator.sync_gradients and args.max_grad_norm != 0.0:
@@ -483,7 +457,7 @@ def train(args):

    accelerator.end_training()

-    if is_main_process and (args.save_state or args.save_state_on_train_end):
+    if args.save_state and is_main_process:
        train_util.save_state_on_train_end(args, accelerator)

    del accelerator  # この後メモリを使うのでこれは消す
@@ -503,7 +477,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, False, True, True)
    train_util.add_training_arguments(parser, False)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_sd_saving_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
@@ -519,11 +492,6 @@ def setup_parser() -> argparse.ArgumentParser:
        default=None,
        help="learning rate for text encoder, default is same as unet / Text Encoderの学習率、デフォルトはunetと同じ",
    )
-    parser.add_argument(
-        "--no_half_vae",
-        action="store_true",
-        help="do not use fp16/bf16 VAE in mixed precision (use float VAE) / mixed precisionでも fp16/bf16 VAEを使わずfloat VAEを使う",
-    )

    return parser

@@ -532,7 +500,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    train(args)
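The loss branch in the hunks above keeps one loss value per sample (`reduction="none"` followed by `mean([1, 2, 3])`) so that timestep-dependent weights such as min-SNR can be applied before averaging over the batch. The arithmetic can be sketched without PyTorch as follows. This is a simplified illustration, assuming each sample is flattened to a list of floats; `per_sample_mse`, `weighted_batch_loss`, and the weight values are hypothetical names and numbers, not part of `fine_tune.py`.

```python
def per_sample_mse(pred, target):
    """Mean squared error for each sample; a sample is a flat list of floats."""
    return [
        sum((p - t) ** 2 for p, t in zip(sample_pred, sample_target)) / len(sample_pred)
        for sample_pred, sample_target in zip(pred, target)
    ]


def weighted_batch_loss(pred, target, weights=None):
    """Apply optional per-sample weights, then average over the batch dimension."""
    losses = per_sample_mse(pred, target)
    if weights is not None:
        # Weights stand in for a per-timestep factor (for example an SNR-based one).
        losses = [loss * weight for loss, weight in zip(losses, weights)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    pred = [[1.0, 2.0], [0.0, 0.0]]
    target = [[0.0, 0.0], [0.0, 2.0]]
    print(per_sample_mse(pred, target))                      # per-sample losses
    print(weighted_batch_loss(pred, target))                 # plain batch mean
    print(weighted_batch_loss(pred, target, [1.0, 0.5]))     # weighted batch mean
```

With no weights the result equals an ordinary mean-reduced MSE when all samples have the same size, which is why the `else` branch in the diff can use `reduction="mean"` directly.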
--- a/finetune/blip/blip.py
+++ b/finetune/blip/blip.py
@@ -134,9 +134,8 @@ class BLIP_Decoder(nn.Module):
    def generate(self, image, sample=False, num_beams=3, max_length=30, min_length=10, top_p=0.9, repetition_penalty=1.0):
        image_embeds = self.visual_encoder(image)

-        # recent version of transformers seems to do repeat_interleave automatically
-        # if not sample:
-        #     image_embeds = image_embeds.repeat_interleave(num_beams,dim=0)
+        if not sample:
+            image_embeds = image_embeds.repeat_interleave(num_beams,dim=0)
            
        image_atts = torch.ones(image_embeds.size()[:-1],dtype=torch.long).to(image.device)
        model_kwargs = {"encoder_hidden_states": image_embeds, "encoder_attention_mask":image_atts}
--- a/finetune/merge_captions_to_metadata.py
+++ b/finetune/merge_captions_to_metadata.py
@@ -6,95 +6,75 @@ from tqdm import tqdm
 import library.train_util as train_util
 import os
 from library.utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

-
 def main(args):
-    assert not args.recursive or (
-        args.recursive and args.full_path
-    ), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"
+  assert not args.recursive or (args.recursive and args.full_path), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"

-    train_data_dir_path = Path(args.train_data_dir)
-    image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
-    logger.info(f"found {len(image_paths)} images.")
+  train_data_dir_path = Path(args.train_data_dir)
+  image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+  logger.info(f"found {len(image_paths)} images.")

-    if args.in_json is None and Path(args.out_json).is_file():
-        args.in_json = args.out_json
+  if args.in_json is None and Path(args.out_json).is_file():
+    args.in_json = args.out_json

-    if args.in_json is not None:
-        logger.info(f"loading existing metadata: {args.in_json}")
-        metadata = json.loads(Path(args.in_json).read_text(encoding="utf-8"))
-        logger.warning("captions for existing images will be overwritten / 既存の画像のキャプションは上書きされます")
-    else:
-        logger.info("new metadata will be created / 新しいメタデータファイルが作成されます")
-        metadata = {}
+  if args.in_json is not None:
+    logger.info(f"loading existing metadata: {args.in_json}")
+    metadata = json.loads(Path(args.in_json).read_text(encoding='utf-8'))
+    logger.warning("captions for existing images will be overwritten / 既存の画像のキャプションは上書きされます")
+  else:
+    logger.info("new metadata will be created / 新しいメタデータファイルが作成されます")
+    metadata = {}

-    logger.info("merge caption texts to metadata json.")
-    for image_path in tqdm(image_paths):
-        caption_path = image_path.with_suffix(args.caption_extension)
-        caption = caption_path.read_text(encoding="utf-8").strip()
+  logger.info("merge caption texts to metadata json.")
+  for image_path in tqdm(image_paths):
+    caption_path = image_path.with_suffix(args.caption_extension)
+    caption = caption_path.read_text(encoding='utf-8').strip()

-        if not os.path.exists(caption_path):
-            caption_path = os.path.join(image_path, args.caption_extension)
+    if not os.path.exists(caption_path):
+      caption_path = os.path.join(image_path, args.caption_extension)

-        image_key = str(image_path) if args.full_path else image_path.stem
-        if image_key not in metadata:
-            metadata[image_key] = {}
+    image_key = str(image_path) if args.full_path else image_path.stem
+    if image_key not in metadata:
+      metadata[image_key] = {}

-        metadata[image_key]["caption"] = caption
-        if args.debug:
-            logger.info(f"{image_key} {caption}")
+    metadata[image_key]['caption'] = caption
+    if args.debug:
+      logger.info(f"{image_key} {caption}")

-    # metadataを書き出して終わり
-    logger.info(f"writing metadata: {args.out_json}")
-    Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding="utf-8")
-    logger.info("done!")
+  # metadataを書き出して終わり
+  logger.info(f"writing metadata: {args.out_json}")
+  Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding='utf-8')
+  logger.info("done!")


 def setup_parser() -> argparse.ArgumentParser:
-    parser = argparse.ArgumentParser()
-    parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
-    parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
-    parser.add_argument(
-        "--in_json",
-        type=str,
-        help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）",
-    )
-    parser.add_argument(
-        "--caption_extention",
-        type=str,
-        default=None,
-        help="extension of caption file (for backward compatibility) / 読み込むキャプションファイルの拡張子（スペルミスしていたのを残してあります）",
-    )
-    parser.add_argument(
-        "--caption_extension", type=str, default=".caption", help="extension of caption file / 読み込むキャプションファイルの拡張子"
-    )
-    parser.add_argument(
-        "--full_path",
-        action="store_true",
-        help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）",
-    )
-    parser.add_argument(
-        "--recursive",
-        action="store_true",
-        help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す",
-    )
-    parser.add_argument("--debug", action="store_true", help="debug mode")
+  parser = argparse.ArgumentParser()
+  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+  parser.add_argument("--in_json", type=str,
+                      help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）")
+  parser.add_argument("--caption_extention", type=str, default=None,
+                      help="extension of caption file (for backward compatibility) / 読み込むキャプションファイルの拡張子（スペルミスしていたのを残してあります）")
+  parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 読み込むキャプションファイルの拡張子")
+  parser.add_argument("--full_path", action="store_true",
+                      help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）")
+  parser.add_argument("--recursive", action="store_true",
+                      help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す")
+  parser.add_argument("--debug", action="store_true", help="debug mode")

-    return parser
+  return parser


-if __name__ == "__main__":
-    parser = setup_parser()
+if __name__ == '__main__':
+  parser = setup_parser()

-    args = parser.parse_args()
+  args = parser.parse_args()

-    # スペルミスしていたオプションを復元する
-    if args.caption_extention is not None:
-        args.caption_extension = args.caption_extention
+  # スペルミスしていたオプションを復元する
+  if args.caption_extention is not None:
+    args.caption_extension = args.caption_extention

-    main(args)
+  main(args)
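The merge loop in `merge_captions_to_metadata.py` maps each image to a sibling caption file via `with_suffix` and stores the text under an image key (the file stem, or the full path with `--full_path`). A self-contained sketch of that mapping is shown below. It is an illustration only: `merge_captions` is a hypothetical helper, and skipping images that have no caption file is an assumption made for the example rather than the script's behavior.

```python
from pathlib import Path


def merge_captions(image_paths, caption_extension=".caption", full_path=False, metadata=None):
    """Return metadata with a 'caption' entry for every image that has a caption file."""
    metadata = {} if metadata is None else metadata
    for image_path in map(Path, image_paths):
        # "dir/a.png" with ".caption" becomes "dir/a.caption".
        caption_path = image_path.with_suffix(caption_extension)
        if not caption_path.is_file():
            continue  # assumption for this sketch: skip images without captions
        image_key = str(image_path) if full_path else image_path.stem
        caption = caption_path.read_text(encoding="utf-8").strip()
        # Existing entries are kept; only the 'caption' field is overwritten.
        metadata.setdefault(image_key, {})["caption"] = caption
    return metadata


if __name__ == "__main__":
    import json
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.png").touch()
        (root / "a.caption").write_text(" a cat on a sofa \n", encoding="utf-8")
        print(json.dumps(merge_captions([root / "a.png"]), indent=2))
```

Keying by stem is compact but collides when two directories contain files with the same name, which is why the script requires `--full_path` together with `--recursive`.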
--- a/finetune/merge_dd_tags_to_metadata.py
+++ b/finetune/merge_dd_tags_to_metadata.py
@@ -6,88 +6,70 @@ from tqdm import tqdm
 import library.train_util as train_util
 import os
 from library.utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

-
 def main(args):
-    assert not args.recursive or (
-        args.recursive and args.full_path
-    ), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"
+  assert not args.recursive or (args.recursive and args.full_path), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"

-    train_data_dir_path = Path(args.train_data_dir)
-    image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
-    logger.info(f"found {len(image_paths)} images.")
+  train_data_dir_path = Path(args.train_data_dir)
+  image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+  logger.info(f"found {len(image_paths)} images.")

-    if args.in_json is None and Path(args.out_json).is_file():
-        args.in_json = args.out_json
+  if args.in_json is None and Path(args.out_json).is_file():
+    args.in_json = args.out_json

-    if args.in_json is not None:
-        logger.info(f"loading existing metadata: {args.in_json}")
-        metadata = json.loads(Path(args.in_json).read_text(encoding="utf-8"))
-        logger.warning("tags data for existing images will be overwritten / 既存の画像のタグは上書きされます")
-    else:
-        logger.info("new metadata will be created / 新しいメタデータファイルが作成されます")
-        metadata = {}
+  if args.in_json is not None:
+    logger.info(f"loading existing metadata: {args.in_json}")
+    metadata = json.loads(Path(args.in_json).read_text(encoding='utf-8'))
+    logger.warning("tags data for existing images will be overwritten / 既存の画像のタグは上書きされます")
+  else:
+    logger.info("new metadata will be created / 新しいメタデータファイルが作成されます")
+    metadata = {}

-    logger.info("merge tags to metadata json.")
-    for image_path in tqdm(image_paths):
-        tags_path = image_path.with_suffix(args.caption_extension)
-        tags = tags_path.read_text(encoding="utf-8").strip()
+  logger.info("merge tags to metadata json.")
+  for image_path in tqdm(image_paths):
+    tags_path = image_path.with_suffix(args.caption_extension)
+    tags = tags_path.read_text(encoding='utf-8').strip()

-        if not os.path.exists(tags_path):
-            tags_path = os.path.join(image_path, args.caption_extension)
+    if not os.path.exists(tags_path):
+      tags_path = os.path.join(image_path, args.caption_extension)

-        image_key = str(image_path) if args.full_path else image_path.stem
-        if image_key not in metadata:
-            metadata[image_key] = {}
+    image_key = str(image_path) if args.full_path else image_path.stem
+    if image_key not in metadata:
+      metadata[image_key] = {}

-        metadata[image_key]["tags"] = tags
-        if args.debug:
-            logger.info(f"{image_key} {tags}")
+    metadata[image_key]['tags'] = tags
+    if args.debug:
+      logger.info(f"{image_key} {tags}")

-    # metadataを書き出して終わり
-    logger.info(f"writing metadata: {args.out_json}")
-    Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding="utf-8")
+  # metadataを書き出して終わり
+  logger.info(f"writing metadata: {args.out_json}")
+  Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding='utf-8')

-    logger.info("done!")
+  logger.info("done!")


 def setup_parser() -> argparse.ArgumentParser:
-    parser = argparse.ArgumentParser()
-    parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
-    parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
-    parser.add_argument(
-        "--in_json",
-        type=str,
-        help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）",
-    )
-    parser.add_argument(
-        "--full_path",
-        action="store_true",
-        help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）",
-    )
-    parser.add_argument(
-        "--recursive",
-        action="store_true",
-        help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す",
-    )
-    parser.add_argument(
-        "--caption_extension",
-        type=str,
-        default=".txt",
-        help="extension of caption (tag) file / 読み込むキャプション（タグ）ファイルの拡張子",
-    )
-    parser.add_argument("--debug", action="store_true", help="debug mode, print tags")
+  parser = argparse.ArgumentParser()
+  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+  parser.add_argument("--in_json", type=str,
+                      help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）")
+  parser.add_argument("--full_path", action="store_true",
+                      help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）")
+  parser.add_argument("--recursive", action="store_true",
+                      help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す")
+  parser.add_argument("--caption_extension", type=str, default=".txt",
+                      help="extension of caption (tag) file / 読み込むキャプション（タグ）ファイルの拡張子")
+  parser.add_argument("--debug", action="store_true", help="debug mode, print tags")

-    return parser
+  return parser


-if __name__ == "__main__":
-    parser = setup_parser()
+if __name__ == '__main__':
+  parser = setup_parser()

-    args = parser.parse_args()
-    main(args)
+  args = parser.parse_args()
+  main(args)
--- a/finetune/prepare_buckets_latents.py
+++ b/finetune/prepare_buckets_latents.py
@@ -17,6 +17,7 @@ init_ipex()
 from torchvision import transforms

 import library.model_util as model_util
+import library.stable_cascade_utils as sc_utils
 import library.train_util as train_util
 from library.utils import setup_logging

@@ -45,7 +46,7 @@ def collate_fn_remove_corrupted(batch):
    return batch


-def get_npz_filename(data_dir, image_key, is_full_path, recursive):
+def get_npz_filename(data_dir, image_key, is_full_path, recursive, stable_cascade):
    if is_full_path:
        base_name = os.path.splitext(os.path.basename(image_key))[0]
        relative_path = os.path.relpath(os.path.dirname(image_key), data_dir)
@@ -53,10 +54,11 @@ def get_npz_filename(data_dir, image_key, is_full_path, recursive):
        base_name = image_key
        relative_path = ""

+    ext = ".npz" if not stable_cascade else train_util.STABLE_CASCADE_LATENTS_CACHE_SUFFIX
    if recursive and relative_path:
-        return os.path.join(data_dir, relative_path, base_name) + ".npz"
+        return os.path.join(data_dir, relative_path, base_name) + ext
    else:
-        return os.path.join(data_dir, base_name) + ".npz"
+        return os.path.join(data_dir, base_name) + ext


 def main(args):
@@ -86,7 +88,12 @@ def main(args):
    elif args.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16

-    vae = model_util.load_vae(args.model_name_or_path, weight_dtype)
+    if not args.stable_cascade:
+        vae = model_util.load_vae(args.model_name_or_path, weight_dtype)
+        divisor = 8
+    else:
+        vae = sc_utils.load_effnet(args.model_name_or_path, DEVICE)
+        divisor = 32
    vae.eval()
    vae.to(DEVICE, dtype=weight_dtype)

@@ -112,7 +119,7 @@ def main(args):
    def process_batch(is_last):
        for bucket in bucket_manager.buckets:
            if (is_last and len(bucket) > 0) or len(bucket) >= args.batch_size:
-                train_util.cache_batch_latents(vae, True, bucket, args.flip_aug, args.alpha_mask, False)
+                train_util.cache_batch_latents(vae, True, bucket, args.flip_aug, False)
                bucket.clear()

    # 読み込みの高速化のためにDataLoaderを使うオプション
@@ -159,6 +166,10 @@ def main(args):
        # メタデータに記録する解像度はlatent単位とするので、8単位で切り捨て
        metadata[image_key]["train_resolution"] = (reso[0] - reso[0] % 8, reso[1] - reso[1] % 8)

+        # 追加情報を記録
+        metadata[image_key]["original_size"] = (image.width, image.height)
+        metadata[image_key]["train_resized_size"] = resized_size
+
        if not args.bucket_no_upscale:
            # upscaleを行わないときには、resize後のサイズは、bucketのサイズと、縦横どちらかが同じであることを確認する
            assert (
@@ -173,9 +184,9 @@ def main(args):
        ), f"internal error resized size is small: {resized_size}, {reso}"

        # 既に存在するファイルがあればshape等を確認して同じならskipする
-        npz_file_name = get_npz_filename(args.train_data_dir, image_key, args.full_path, args.recursive)
+        npz_file_name = get_npz_filename(args.train_data_dir, image_key, args.full_path, args.recursive, args.stable_cascade)
        if args.skip_existing:
-            if train_util.is_disk_cached_latents_is_expected(reso, npz_file_name, args.flip_aug):
+            if train_util.is_disk_cached_latents_is_expected(reso, npz_file_name, args.flip_aug, divisor):
                continue

        # バッチへ追加
@@ -213,6 +224,11 @@ def setup_parser() -> argparse.ArgumentParser:
    parser.add_argument("in_json", type=str, help="metadata file to input / 読み込むメタデータファイル")
    parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
    parser.add_argument("model_name_or_path", type=str, help="model name or path to encode latents / latentを取得するためのモデル")
+    parser.add_argument(
+        "--stable_cascade",
+        action="store_true",
+        help="prepare EffNet latents for stable cascade / stable cascade用のEffNetのlatentsを準備する",
+    )
    parser.add_argument(
        "--v2", action="store_true", help="not used (for backward compatibility) / 使用されません（互換性のため残してあります）"
    )
@@ -259,12 +275,6 @@ def setup_parser() -> argparse.ArgumentParser:
        action="store_true",
        help="flip augmentation, save latents for flipped images / 左右反転した画像もlatentを取得、保存する",
    )
-    parser.add_argument(
-        "--alpha_mask",
-        type=str,
-        default="",
-        help="save alpha mask for images for loss calculation / 損失計算用に画像のアルファマスクを保存する",
-    )
    parser.add_argument(
        "--skip_existing",
        action="store_true",
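The `--stable_cascade` changes in `prepare_buckets_latents.py` alter two things: the cache file suffix and the downscale factor (8 for the VAE, 32 for EffNet). The sketch below mirrors the path logic from the diff so the resulting file names can be inspected. Several parts are assumptions: `HYPOTHETICAL_SC_SUFFIX` is a placeholder because the value of `train_util.STABLE_CASCADE_LATENTS_CACHE_SUFFIX` is not shown here, and `latent_size` only illustrates integer division by the divisor, not the exact check performed by `is_disk_cached_latents_is_expected`.

```python
import os

# Placeholder only: the real suffix is defined in library.train_util and is
# not visible in this diff.
HYPOTHETICAL_SC_SUFFIX = "_sc_latents.npz"


def cache_filename(data_dir, image_key, is_full_path, recursive, stable_cascade):
    """Build the latents cache path for an image key, following the diff's logic."""
    if is_full_path:
        base_name = os.path.splitext(os.path.basename(image_key))[0]
        relative_path = os.path.relpath(os.path.dirname(image_key), data_dir)
    else:
        base_name = image_key
        relative_path = ""

    ext = HYPOTHETICAL_SC_SUFFIX if stable_cascade else ".npz"
    if recursive and relative_path:
        return os.path.join(data_dir, relative_path, base_name) + ext
    return os.path.join(data_dir, base_name) + ext


def latent_size(reso, divisor):
    """Illustrative only: spatial size after downscaling a bucket resolution."""
    return (reso[0] // divisor, reso[1] // divisor)


if __name__ == "__main__":
    key = os.path.join("data", "sub", "img.png")
    print(cache_filename("data", key, True, True, stable_cascade=False))
    print(cache_filename("data", key, True, True, stable_cascade=True))
    print(latent_size((1024, 768), 8), latent_size((1024, 768), 32))
```

Using a distinct suffix lets both kinds of cached latents sit next to the same image without overwriting each other.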
--- a/finetune/tag_images_by_wd14_tagger.py
+++ b/finetune/tag_images_by_wd14_tagger.py
@@ -11,11 +11,9 @@ from PIL import Image
 from tqdm import tqdm

 import library.train_util as train_util
-from library.utils import setup_logging, pil_resize
-
+from library.utils import setup_logging
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

 # from wd14 tagger
@@ -42,10 +40,8 @@ def preprocess_image(image):
    pad_t = pad_y // 2
    image = np.pad(image, ((pad_t, pad_y - pad_t), (pad_l, pad_x - pad_l), (0, 0)), mode="constant", constant_values=255)

-    if size > IMAGE_SIZE:
-        image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE), cv2.INTER_AREA)
-    else:
-        image = pil_resize(image, (IMAGE_SIZE, IMAGE_SIZE))
+    interp = cv2.INTER_AREA if size > IMAGE_SIZE else cv2.INTER_LANCZOS4
+    image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE), interpolation=interp)

    image = image.astype(np.float32)
    return image
@@ -64,12 +60,12 @@ class ImageLoadingPrepDataset(torch.utils.data.Dataset):
        try:
            image = Image.open(img_path).convert("RGB")
            image = preprocess_image(image)
-            # tensor = torch.tensor(image) # no need to convert this to a Tensor... (;･∀･)
+            tensor = torch.tensor(image)
        except Exception as e:
            logger.error(f"Could not load image path / 画像を読み込めません: {img_path}, error: {e}")
            return None

-        return (image, img_path)
+        return (tensor, img_path)


 def collate_fn_remove_corrupted(batch):
@@ -83,41 +79,34 @@ def collate_fn_remove_corrupted(batch):


 def main(args):
-    # model location is model_dir + repo_id
-    # repo id may be like "user/repo" or "user/repo/branch", so we need to remove slash
-    model_location = os.path.join(args.model_dir, args.repo_id.replace("/", "_"))
-
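    # using hf_hub_download as-is reportedly causes symlink-related problems, so work around it by specifying the cache directory and force_filename
    # this triggers a deprecated warning; deal with it when the option is actually removed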
    # https://github.com/toriato/stable-diffusion-webui-wd14-tagger/issues/22
-    if not os.path.exists(model_location) or args.force_download:
-        os.makedirs(args.model_dir, exist_ok=True)
+    if not os.path.exists(args.model_dir) or args.force_download:
        logger.info(f"downloading wd14 tagger model from hf_hub. id: {args.repo_id}")
        files = FILES
        if args.onnx:
-            files = ["selected_tags.csv"]
            files += FILES_ONNX
-        else:
-            for file in SUB_DIR_FILES:
-                hf_hub_download(
-                    args.repo_id,
-                    file,
-                    subfolder=SUB_DIR,
-                    cache_dir=os.path.join(model_location, SUB_DIR),
-                    force_download=True,
-                    force_filename=file,
-                )
        for file in files:
-            hf_hub_download(args.repo_id, file, cache_dir=model_location, force_download=True, force_filename=file)
+            hf_hub_download(args.repo_id, file, cache_dir=args.model_dir, force_download=True, force_filename=file)
+        for file in SUB_DIR_FILES:
+            hf_hub_download(
+                args.repo_id,
+                file,
+                subfolder=SUB_DIR,
+                cache_dir=os.path.join(args.model_dir, SUB_DIR),
+                force_download=True,
+                force_filename=file,
+            )
    else:
        logger.info("using existing wd14 tagger model")

-    # load the model
+    # load images
    if args.onnx:
        import onnx
        import onnxruntime as ort

-        onnx_path = f"{model_location}/model.onnx"
+        onnx_path = f"{args.model_dir}/model.onnx"
        logger.info("Running wd14 tagger with onnx")
        logger.info(f"loading onnx model: {onnx_path}")

@@ -131,10 +120,10 @@ def main(args):
        input_name = model.graph.input[0].name
        try:
            batch_size = model.graph.input[0].type.tensor_type.shape.dim[0].dim_value
-        except Exception:
+        except:
            batch_size = model.graph.input[0].type.tensor_type.shape.dim[0].dim_param

-        if args.batch_size != batch_size and not isinstance(batch_size, str) and batch_size > 0:
+        if args.batch_size != batch_size and type(batch_size) != str:
            # some rebatch model may use 'N' as dynamic axes
            logger.warning(
                f"Batch size {args.batch_size} doesn't match onnx model batch size {batch_size}, use model batch size {batch_size}"
@@ -143,79 +132,32 @@ def main(args):

        del model

-        if "OpenVINOExecutionProvider" in ort.get_available_providers():
-            # requires provider options for gpu support
-            # fp16 causes nonsense outputs
-            ort_sess = ort.InferenceSession(
-                onnx_path,
-                providers=(["OpenVINOExecutionProvider"]),
-                provider_options=[{'device_type' : "GPU_FP32"}],
-            )
-        else:
-            ort_sess = ort.InferenceSession(
-                onnx_path,
-                providers=(
-                    ["CUDAExecutionProvider"] if "CUDAExecutionProvider" in ort.get_available_providers() else
-                    ["ROCMExecutionProvider"] if "ROCMExecutionProvider" in ort.get_available_providers() else
-                    ["CPUExecutionProvider"]
-                ),
-            )
+        ort_sess = ort.InferenceSession(
+            onnx_path,
+            providers=["CUDAExecutionProvider"]
+            if "CUDAExecutionProvider" in ort.get_available_providers()
+            else ["CPUExecutionProvider"],
+        )
    else:
        from tensorflow.keras.models import load_model

-        model = load_model(f"{model_location}")
+        model = load_model(f"{args.model_dir}")

    # label_names = pd.read_csv("2022_0000_0899_6549/selected_tags.csv")
    # read the CSV by hand to avoid adding another dependency

-    with open(os.path.join(model_location, CSV_FILE), "r", encoding="utf-8") as f:
+    with open(os.path.join(args.model_dir, CSV_FILE), "r", encoding="utf-8") as f:
        reader = csv.reader(f)
-        line = [row for row in reader]
-        header = line[0]  # tag_id,name,category,count
-        rows = line[1:]
+        l = [row for row in reader]
+        header = l[0]  # tag_id,name,category,count
+        rows = l[1:]
    assert header[0] == "tag_id" and header[1] == "name" and header[2] == "category", f"unexpected csv format: {header}"

-    rating_tags = [row[1] for row in rows[0:] if row[2] == "9"]
-    general_tags = [row[1] for row in rows[0:] if row[2] == "0"]
-    character_tags = [row[1] for row in rows[0:] if row[2] == "4"]
-
-    # preprocess tags in advance
-    if args.character_tag_expand:
-        for i, tag in enumerate(character_tags):
-            if tag.endswith(")"):
-                # chara_name_(series) -> chara_name, series
-                # chara_name_(costume)_(series) -> chara_name_(costume), series
-                tags = tag.split("(")
-                character_tag = "(".join(tags[:-1])
-                if character_tag.endswith("_"):
-                    character_tag = character_tag[:-1]
-                series_tag = tags[-1].replace(")", "")
-                character_tags[i] = character_tag + args.caption_separator + series_tag
-
-    if args.remove_underscore:
-        rating_tags = [tag.replace("_", " ") if len(tag) > 3 else tag for tag in rating_tags]
-        general_tags = [tag.replace("_", " ") if len(tag) > 3 else tag for tag in general_tags]
-        character_tags = [tag.replace("_", " ") if len(tag) > 3 else tag for tag in character_tags]
-
-    if args.tag_replacement is not None:
-        # escape , and ; in tag_replacement: wd14 tag names may contain , and ;
-        escaped_tag_replacements = args.tag_replacement.replace("\\,", "@@@@").replace("\\;", "####")
-        tag_replacements = escaped_tag_replacements.split(";")
-        for tag_replacement in tag_replacements:
-            tags = tag_replacement.split(",")  # source, target
-            assert len(tags) == 2, f"tag replacement must be in the format of `source,target` / タグの置換は `置換元,置換先` の形式で指定してください: {args.tag_replacement}"
-
-            source, target = [tag.replace("@@@@", ",").replace("####", ";") for tag in tags]
-            logger.info(f"replacing tag: {source} -> {target}")
-
-            if source in general_tags:
-                general_tags[general_tags.index(source)] = target
-            elif source in character_tags:
-                character_tags[character_tags.index(source)] = target
-            elif source in rating_tags:
-                rating_tags[rating_tags.index(source)] = target
+    general_tags = [row[1] for row in rows[1:] if row[2] == "0"]
+    character_tags = [row[1] for row in rows[1:] if row[2] == "4"]

    # load images
+
    train_data_dir_path = Path(args.train_data_dir)
    image_paths = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
    logger.info(f"found {len(image_paths)} images.")
@@ -224,19 +166,14 @@ def main(args):

    caption_separator = args.caption_separator
    stripped_caption_separator = caption_separator.strip()
-    undesired_tags = args.undesired_tags.split(stripped_caption_separator)
-    undesired_tags = set([tag.strip() for tag in undesired_tags if tag.strip() != ""])
-
-    always_first_tags = None
-    if args.always_first_tags is not None:
-        always_first_tags = [tag for tag in args.always_first_tags.split(stripped_caption_separator) if tag.strip() != ""]
+    undesired_tags = set(args.undesired_tags.split(stripped_caption_separator))

    def run_batch(path_imgs):
        imgs = np.array([im for _, im in path_imgs])

        if args.onnx:
-            # if len(imgs) < args.batch_size:
-            #     imgs = np.concatenate([imgs, np.zeros((args.batch_size - len(imgs), IMAGE_SIZE, IMAGE_SIZE, 3))], axis=0)
+            if len(imgs) < args.batch_size:
+                imgs = np.concatenate([imgs, np.zeros((args.batch_size - len(imgs), IMAGE_SIZE, IMAGE_SIZE, 3))], axis=0)
            probs = ort_sess.run(None, {input_name: imgs})[0]  # onnx output numpy
            probs = probs[: len(path_imgs)]
        else:
@@ -244,16 +181,22 @@ def main(args):
            probs = probs.numpy()

        for (image_path, _), prob in zip(path_imgs, probs):
-            combined_tags = []
-            rating_tag_text = ""
-            character_tag_text = ""
-            general_tag_text = ""
+            # the first 4 labels are ratings, so ignore them
+            # # First 4 labels are actually ratings: pick one with argmax
+            # ratings_names = label_names[:4]
+            # rating_index = ratings_names["probs"].argmax()
+            # found_rating = ratings_names[rating_index: rating_index + 1][["name", "probs"]]

-            # 最初の4つ以降はタグなのでconfidenceがthreshold以上のものを追加する
-            # First 4 labels are ratings, the rest are tags: pick any where prediction confidence >= threshold
+            # それ以降はタグなのでconfidenceがthresholdより高いものを追加する
+            # Everything else is tags: pick any where prediction confidence > threshold
+            combined_tags = []
+            general_tag_text = ""
+            character_tag_text = ""
            for i, p in enumerate(prob[4:]):
                if i < len(general_tags) and p >= args.general_threshold:
                    tag_name = general_tags[i]
+                    if args.remove_underscore and len(tag_name) > 3:  # ignore emoji tags like >_< and ^_^
+                        tag_name = tag_name.replace("_", " ")

                    if tag_name not in undesired_tags:
                        tag_freq[tag_name] = tag_freq.get(tag_name, 0) + 1
@@ -261,37 +204,13 @@ def main(args):
                        combined_tags.append(tag_name)
                elif i >= len(general_tags) and p >= args.character_threshold:
                    tag_name = character_tags[i - len(general_tags)]
+                    if args.remove_underscore and len(tag_name) > 3:
+                        tag_name = tag_name.replace("_", " ")

                    if tag_name not in undesired_tags:
                        tag_freq[tag_name] = tag_freq.get(tag_name, 0) + 1
                        character_tag_text += caption_separator + tag_name
-                        if args.character_tags_first: # insert to the beginning
-                            combined_tags.insert(0, tag_name)
-                        else:
-                            combined_tags.append(tag_name)
-
-            # 最初の4つはratingなのでargmaxで選ぶ
-            # First 4 labels are actually ratings: pick one with argmax
-            if args.use_rating_tags or args.use_rating_tags_as_last_tag:
-                ratings_probs = prob[:4]
-                rating_index = ratings_probs.argmax()
-                found_rating = rating_tags[rating_index]
-
-                if found_rating not in undesired_tags:
-                    tag_freq[found_rating] = tag_freq.get(found_rating, 0) + 1
-                    rating_tag_text = found_rating
-                    if args.use_rating_tags:
-                        combined_tags.insert(0, found_rating) # insert to the beginning
-                    else:
-                        combined_tags.append(found_rating)
-
-            # 一番最初に置くタグを指定する
-            # Always put some tags at the beginning
-            if always_first_tags is not None:
-                for tag in always_first_tags:
-                    if tag in combined_tags:
-                        combined_tags.remove(tag)
-                        combined_tags.insert(0, tag)
+                        combined_tags.append(tag_name)

            # remove the leading comma
            if len(general_tag_text) > 0:
@@ -324,7 +243,6 @@ def main(args):
                if args.debug:
                    logger.info("")
                    logger.info(f"{image_path}:")
-                    logger.info(f"\tRating tags: {rating_tag_text}")
                    logger.info(f"\tCharacter tags: {character_tag_text}")
                    logger.info(f"\tGeneral tags: {general_tag_text}")

@@ -349,7 +267,9 @@ def main(args):
                continue

            image, image_path = data
-            if image is None:
+            if image is not None:
+                image = image.detach().numpy()
+            else:
                try:
                    image = Image.open(image_path)
                    if image.mode != "RGB":
@@ -380,9 +300,7 @@ def main(args):

 def setup_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ"
-    )
+    parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
    parser.add_argument(
        "--repo_id",
        type=str,
@@ -396,13 +314,9 @@ def setup_parser() -> argparse.ArgumentParser:
        help="directory to store wd14 tagger model / wd14 taggerのモデルを格納するディレクトリ",
    )
    parser.add_argument(
-        "--force_download",
-        action="store_true",
-        help="force downloading wd14 tagger models / wd14 taggerのモデルを再ダウンロードします",
-    )
-    parser.add_argument(
-        "--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ"
+        "--force_download", action="store_true", help="force downloading wd14 tagger models / wd14 taggerのモデルを再ダウンロードします"
    )
+    parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
    parser.add_argument(
        "--max_data_loader_n_workers",
        type=int,
@@ -415,12 +329,8 @@ def setup_parser() -> argparse.ArgumentParser:
        default=None,
        help="extension of caption file (for backward compatibility) / 出力されるキャプションファイルの拡張子（スペルミスしていたのを残してあります）",
    )
-    parser.add_argument(
-        "--caption_extension", type=str, default=".txt", help="extension of caption file / 出力されるキャプションファイルの拡張子"
-    )
-    parser.add_argument(
-        "--thresh", type=float, default=0.35, help="threshold of confidence to add a tag / タグを追加するか判定する閾値"
-    )
+    parser.add_argument("--caption_extension", type=str, default=".txt", help="extension of caption file / 出力されるキャプションファイルの拡張子")
+    parser.add_argument("--thresh", type=float, default=0.35, help="threshold of confidence to add a tag / タグを追加するか判定する閾値")
    parser.add_argument(
        "--general_threshold",
        type=float,
@@ -433,67 +343,28 @@ def setup_parser() -> argparse.ArgumentParser:
        default=None,
        help="threshold of confidence to add a tag for character category, same as --thres if omitted / characterカテゴリのタグを追加するための確信度の閾値、省略時は --thresh と同じ",
    )
-    parser.add_argument(
-        "--recursive", action="store_true", help="search for images in subfolders recursively / サブフォルダを再帰的に検索する"
-    )
+    parser.add_argument("--recursive", action="store_true", help="search for images in subfolders recursively / サブフォルダを再帰的に検索する")
    parser.add_argument(
        "--remove_underscore",
        action="store_true",
        help="replace underscores with spaces in the output tags / 出力されるタグのアンダースコアをスペースに置き換える",
    )
-    parser.add_argument(
-        "--debug", action="store_true", help="debug mode"
-    )
+    parser.add_argument("--debug", action="store_true", help="debug mode")
    parser.add_argument(
        "--undesired_tags",
        type=str,
        default="",
        help="comma-separated list of undesired tags to remove from the output / 出力から除外したいタグのカンマ区切りのリスト",
    )
-    parser.add_argument(
-        "--frequency_tags", action="store_true", help="Show frequency of tags for images / タグの出現頻度を表示する"
-    )
-    parser.add_argument(
-        "--onnx", action="store_true", help="use onnx model for inference / onnxモデルを推論に使用する"
-    )
-    parser.add_argument(
-        "--append_tags", action="store_true", help="Append captions instead of overwriting / 上書きではなくキャプションを追記する"
-    )
-    parser.add_argument(
-        "--use_rating_tags", action="store_true", help="Adds rating tags as the first tag / レーティングタグを最初のタグとして追加する",
-    )
-    parser.add_argument(
-        "--use_rating_tags_as_last_tag", action="store_true", help="Adds rating tags as the last tag / レーティングタグを最後のタグとして追加する",
-    )
-    parser.add_argument(
-        "--character_tags_first", action="store_true", help="Always inserts character tags before the general tags / characterタグを常にgeneralタグの前に出力する",
-    )
-    parser.add_argument(
-        "--always_first_tags",
-        type=str,
-        default=None,
-        help="comma-separated list of tags to always put at the beginning, e.g. `1girl,1boy`"
-        + " / 必ず先頭に置くタグのカンマ区切りリスト、例 : `1girl,1boy`",
-    )
+    parser.add_argument("--frequency_tags", action="store_true", help="Show frequency of tags for images / 画像ごとのタグの出現頻度を表示する")
+    parser.add_argument("--onnx", action="store_true", help="use onnx model for inference / onnxモデルを推論に使用する")
+    parser.add_argument("--append_tags", action="store_true", help="Append captions instead of overwriting / 上書きではなくキャプションを追記する")
    parser.add_argument(
        "--caption_separator",
        type=str,
        default=", ",
        help="Separator for captions, include space if needed / キャプションの区切り文字、必要ならスペースを含めてください",
    )
-    parser.add_argument(
-        "--tag_replacement",
-        type=str,
-        default=None,
-        help="tag replacement in the format of `source1,target1;source2,target2; ...`. Escape `,` and `;` with `\`. e.g. `tag1,tag2;tag3,tag4`"
-        + " / タグの置換を `置換元1,置換先1;置換元2,置換先2; ...`で指定する。`\` で `,` と `;` をエスケープできる。例: `tag1,tag2;tag3,tag4`",
-    )
-    parser.add_argument(
-        "--character_tag_expand",
-        action="store_true",
-        help="expand tag tail parenthesis to another tag for character tags. `chara_name_(series)` becomes `chara_name, series`"
-        + " / キャラクタタグの末尾の括弧を別のタグに展開する。`chara_name_(series)` は `chara_name, series` になる",
-    )

    return parser

--- a/gen_img.py
+++ b/gen_img.py
@@ -86,8 +86,7 @@ CLIP_VISION_MODEL = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
 """


-# def replace_unet_modules(unet: diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, mem_eff_attn, xformers, sdpa):
-def replace_unet_modules(unet, mem_eff_attn, xformers, sdpa):
+def replace_unet_modules(unet: diffusers.models.unet_2d_condition.UNet2DConditionModel, mem_eff_attn, xformers, sdpa):
    if mem_eff_attn:
        logger.info("Enable memory efficient attention for U-Net")

@@ -1436,7 +1435,6 @@ class BatchDataBase(NamedTuple):
    clip_prompt: str
    guide_image: Any
    raw_prompt: str
-    file_name: Optional[str]


 class BatchDataExt(NamedTuple):
@@ -1495,6 +1493,8 @@ def main(args):
    highres_fix = args.highres_fix_scale is not None
    # assert not highres_fix or args.image_path is None, f"highres_fix doesn't work with img2img / highres_fixはimg2imgと同時に使えません"

+    if args.v_parameterization and not args.v2:
+        logger.warning("v_parameterization should be with v2 / v1でv_parameterizationを使用することは想定されていません")
    if args.v2 and args.clip_skip is not None:
        logger.warning("v2 with clip_skip will be unexpected / v2でclip_skipを使用することは想定されていません")

@@ -2316,7 +2316,7 @@ def main(args):
            # extract the information for this batch
            (
                return_latents,
-                (step_first, _, _, _, init_image, mask_image, _, guide_image, _, _),
+                (step_first, _, _, _, init_image, mask_image, _, guide_image, _),
                (
                    width,
                    height,
@@ -2339,7 +2339,6 @@ def main(args):
            prompts = []
            negative_prompts = []
            raw_prompts = []
-            filenames = []
            start_code = torch.zeros((batch_size, *noise_shape), device=device, dtype=dtype)
            noises = [
                torch.zeros((batch_size, *noise_shape), device=device, dtype=dtype)
@@ -2372,7 +2371,7 @@ def main(args):
            all_guide_images_are_same = True
            for i, (
                _,
-                (_, prompt, negative_prompt, seed, init_image, mask_image, clip_prompt, guide_image, raw_prompt, filename),
+                (_, prompt, negative_prompt, seed, init_image, mask_image, clip_prompt, guide_image, raw_prompt),
                _,
            ) in enumerate(batch):
                prompts.append(prompt)
@@ -2380,7 +2379,6 @@ def main(args):
                seeds.append(seed)
                clip_prompts.append(clip_prompt)
                raw_prompts.append(raw_prompt)
-                filenames.append(filename)

                if init_image is not None:
                    init_images.append(init_image)
@@ -2480,8 +2478,8 @@ def main(args):
            # save image
            highres_prefix = ("0" if highres_1st else "1") if highres_fix else ""
            ts_str = time.strftime("%Y%m%d%H%M%S", time.localtime())
-            for i, (image, prompt, negative_prompts, seed, clip_prompt, raw_prompt, filename) in enumerate(
-                zip(images, prompts, negative_prompts, seeds, clip_prompts, raw_prompts, filenames)
+            for i, (image, prompt, negative_prompts, seed, clip_prompt, raw_prompt) in enumerate(
+                zip(images, prompts, negative_prompts, seeds, clip_prompts, raw_prompts)
            ):
                if highres_fix:
                    seed -= 1  # record original seed
@@ -2507,23 +2505,17 @@ def main(args):
                    metadata.add_text("crop-top", str(crop_top))
                    metadata.add_text("crop-left", str(crop_left))

-                if filename is not None:
-                    fln = filename
-                else:
-                    if args.use_original_file_name and init_images is not None:
-                        if type(init_images) is list:
-                            fln = os.path.splitext(os.path.basename(init_images[i % len(init_images)].filename))[0] + ".png"
-                        else:
-                            fln = os.path.splitext(os.path.basename(init_images.filename))[0] + ".png"
-                    elif args.sequential_file_name:
-                        fln = f"im_{highres_prefix}{step_first + i + 1:06d}.png"
+                if args.use_original_file_name and init_images is not None:
+                    if type(init_images) is list:
+                        fln = os.path.splitext(os.path.basename(init_images[i % len(init_images)].filename))[0] + ".png"
                    else:
-                        fln = f"im_{ts_str}_{highres_prefix}{i:03d}_{seed}.png"
-
-                if fln.endswith(".webp"):
-                    image.save(os.path.join(args.outdir, fln), pnginfo=metadata, quality=100)  # lossy
+                        fln = os.path.splitext(os.path.basename(init_images.filename))[0] + ".png"
+                elif args.sequential_file_name:
+                    fln = f"im_{highres_prefix}{step_first + i + 1:06d}.png"
                else:
-                    image.save(os.path.join(args.outdir, fln), pnginfo=metadata)
+                    fln = f"im_{ts_str}_{highres_prefix}{i:03d}_{seed}.png"
+
+                image.save(os.path.join(args.outdir, fln), pnginfo=metadata)

            if not args.no_preview and not highres_1st and args.interactive:
                try:
@@ -2570,7 +2562,6 @@ def main(args):
            # repeat prompt
            for pi in range(args.images_per_prompt if len(raw_prompts) == 1 else len(raw_prompts)):
                raw_prompt = raw_prompts[pi] if len(raw_prompts) > 1 else raw_prompts[0]
-                filename = None

                if pi == 0 or len(raw_prompts) > 1:
                    # parse prompt: if prompt is not changed, skip parsing
@@ -2792,12 +2783,6 @@ def main(args):
                                logger.info(f"gradual latent unsharp params: {gl_unsharp_params}")
                                continue

-                            m = re.match(r"f (.+)", parg, re.IGNORECASE)
-                            if m:  # filename
-                                filename = m.group(1)
-                                logger.info(f"filename: {filename}")
-                                continue
-
                        except ValueError as ex:
                            logger.error(f"Exception in parsing / 解析エラー: {parg}")
                            logger.error(f"{ex}")
@@ -2888,16 +2873,7 @@ def main(args):
                b1 = BatchData(
                    False,
                    BatchDataBase(
-                        global_step,
-                        prompt,
-                        negative_prompt,
-                        seed,
-                        init_image,
-                        mask_image,
-                        clip_prompt,
-                        guide_image,
-                        raw_prompt,
-                        filename,
+                        global_step, prompt, negative_prompt, seed, init_image, mask_image, clip_prompt, guide_image, raw_prompt
                    ),
                    BatchDataExt(
                        width,
@@ -2940,7 +2916,7 @@ def setup_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()

    add_logging_arguments(parser)
-
+    
    parser.add_argument(
        "--sdxl", action="store_true", help="load Stable Diffusion XL model / Stable Diffusion XLのモデルを読み込む"
    )
--- a/gen_img_diffusers.py
+++ b/gen_img_diffusers.py
@@ -2216,6 +2216,8 @@ def main(args):
    highres_fix = args.highres_fix_scale is not None
    # assert not highres_fix or args.image_path is None, f"highres_fix doesn't work with img2img / highres_fixはimg2imgと同時に使えません"

+    if args.v_parameterization and not args.v2:
+        logger.warning("v_parameterization should be with v2 / v1でv_parameterizationを使用することは想定されていません")
    if args.v2 and args.clip_skip is not None:
        logger.warning("v2 with clip_skip will be unexpected / v2でclip_skipを使用することは想定されていません")

--- a/library/adafactor_fused.py
+++ b/library/adafactor_fused.py
@@ -1,106 +0,0 @@
-import math
-import torch
-from transformers import Adafactor
-
-@torch.no_grad()
-def adafactor_step_param(self, p, group):
-    if p.grad is None:
-        return
-    grad = p.grad
-    if grad.dtype in {torch.float16, torch.bfloat16}:
-        grad = grad.float()
-    if grad.is_sparse:
-        raise RuntimeError("Adafactor does not support sparse gradients.")
-
-    state = self.state[p]
-    grad_shape = grad.shape
-
-    factored, use_first_moment = Adafactor._get_options(group, grad_shape)
-    # State Initialization
-    if len(state) == 0:
-        state["step"] = 0
-
-        if use_first_moment:
-            # Exponential moving average of gradient values
-            state["exp_avg"] = torch.zeros_like(grad)
-        if factored:
-            state["exp_avg_sq_row"] = torch.zeros(grad_shape[:-1]).to(grad)
-            state["exp_avg_sq_col"] = torch.zeros(grad_shape[:-2] + grad_shape[-1:]).to(grad)
-        else:
-            state["exp_avg_sq"] = torch.zeros_like(grad)
-
-        state["RMS"] = 0
-    else:
-        if use_first_moment:
-            state["exp_avg"] = state["exp_avg"].to(grad)
-        if factored:
-            state["exp_avg_sq_row"] = state["exp_avg_sq_row"].to(grad)
-            state["exp_avg_sq_col"] = state["exp_avg_sq_col"].to(grad)
-        else:
-            state["exp_avg_sq"] = state["exp_avg_sq"].to(grad)
-
-    p_data_fp32 = p
-    if p.dtype in {torch.float16, torch.bfloat16}:
-        p_data_fp32 = p_data_fp32.float()
-
-    state["step"] += 1
-    state["RMS"] = Adafactor._rms(p_data_fp32)
-    lr = Adafactor._get_lr(group, state)
-
-    beta2t = 1.0 - math.pow(state["step"], group["decay_rate"])
-    update = (grad ** 2) + group["eps"][0]
-    if factored:
-        exp_avg_sq_row = state["exp_avg_sq_row"]
-        exp_avg_sq_col = state["exp_avg_sq_col"]
-
-        exp_avg_sq_row.mul_(beta2t).add_(update.mean(dim=-1), alpha=(1.0 - beta2t))
-        exp_avg_sq_col.mul_(beta2t).add_(update.mean(dim=-2), alpha=(1.0 - beta2t))
-
-        # Approximation of exponential moving average of square of gradient
-        update = Adafactor._approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col)
-        update.mul_(grad)
-    else:
-        exp_avg_sq = state["exp_avg_sq"]
-
-        exp_avg_sq.mul_(beta2t).add_(update, alpha=(1.0 - beta2t))
-        update = exp_avg_sq.rsqrt().mul_(grad)
-
-    update.div_((Adafactor._rms(update) / group["clip_threshold"]).clamp_(min=1.0))
-    update.mul_(lr)
-
-    if use_first_moment:
-        exp_avg = state["exp_avg"]
-        exp_avg.mul_(group["beta1"]).add_(update, alpha=(1 - group["beta1"]))
-        update = exp_avg
-
-    if group["weight_decay"] != 0:
-        p_data_fp32.add_(p_data_fp32, alpha=(-group["weight_decay"] * lr))
-
-    p_data_fp32.add_(-update)
-
-    if p.dtype in {torch.float16, torch.bfloat16}:
-        p.copy_(p_data_fp32)
-
-
-@torch.no_grad()
-def adafactor_step(self, closure=None):
-    """
-    Performs a single optimization step
-
-    Arguments:
-        closure (callable, optional): A closure that reevaluates the model
-            and returns the loss.
-    """
-    loss = None
-    if closure is not None:
-        loss = closure()
-
-    for group in self.param_groups:
-        for p in group["params"]:
-            adafactor_step_param(self, p, group)
-
-    return loss
-
-def patch_adafactor_fused(optimizer: Adafactor):
-    optimizer.step_param = adafactor_step_param.__get__(optimizer)
-    optimizer.step = adafactor_step.__get__(optimizer)
--- a/library/config_util.py
+++ b/library/config_util.py
@@ -41,17 +41,12 @@ from .train_util import (
    DatasetGroup,
 )
 from .utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

-
 def add_config_arguments(parser: argparse.ArgumentParser):
-    parser.add_argument(
-        "--dataset_config", type=Path, default=None, help="config file for detail settings / 詳細な設定用の設定ファイル"
-    )
+    parser.add_argument("--dataset_config", type=Path, default=None, help="config file for detail settings / 詳細な設定用の設定ファイル")


 # TODO: inherit Params class in Subset, Dataset
@@ -65,8 +60,6 @@ class BaseSubsetParams:
    caption_separator: str = (",",)
    keep_tokens: int = 0
    keep_tokens_separator: str = (None,)
-    secondary_separator: Optional[str] = None
-    enable_wildcard: bool = False
    color_aug: bool = False
    flip_aug: bool = False
    face_crop_aug_range: Optional[Tuple[float, float]] = None
@@ -85,21 +78,17 @@ class DreamBoothSubsetParams(BaseSubsetParams):
    is_reg: bool = False
    class_tokens: Optional[str] = None
    caption_extension: str = ".caption"
-    cache_info: bool = False
-    alpha_mask: bool = False


@dataclass
 class FineTuningSubsetParams(BaseSubsetParams):
    metadata_file: Optional[str] = None
-    alpha_mask: bool = False


@dataclass
 class ControlNetSubsetParams(BaseSubsetParams):
    conditioning_data_dir: str = None
    caption_extension: str = ".caption"
-    cache_info: bool = False


@dataclass
@@ -192,9 +181,6 @@ class ConfigSanitizer:
        "shuffle_caption": bool,
        "keep_tokens": int,
        "keep_tokens_separator": str,
-        "secondary_separator": str,
-        "caption_separator": str,
-        "enable_wildcard": bool,
        "token_warmup_min": int,
        "token_warmup_step": Any(float, int),
        "caption_prefix": str,
@@ -210,22 +196,18 @@ class ConfigSanitizer:
    DB_SUBSET_ASCENDABLE_SCHEMA = {
        "caption_extension": str,
        "class_tokens": str,
-        "cache_info": bool,
    }
    DB_SUBSET_DISTINCT_SCHEMA = {
        Required("image_dir"): str,
        "is_reg": bool,
-        "alpha_mask": bool,
    }
    # FT means FineTuning
    FT_SUBSET_DISTINCT_SCHEMA = {
        Required("metadata_file"): str,
        "image_dir": str,
-        "alpha_mask": bool,
    }
    CN_SUBSET_ASCENDABLE_SCHEMA = {
        "caption_extension": str,
-        "cache_info": bool,
    }
    CN_SUBSET_DISTINCT_SCHEMA = {
        Required("image_dir"): str,
@@ -262,10 +244,9 @@ class ConfigSanitizer:
    }

    def __init__(self, support_dreambooth: bool, support_finetuning: bool, support_controlnet: bool, support_dropout: bool) -> None:
-        assert support_dreambooth or support_finetuning or support_controlnet, (
-            "Neither DreamBooth mode nor fine tuning mode nor controlnet mode specified. Please specify one mode or more."
-            + " / DreamBooth モードか fine tuning モードか controlnet モードのどれも指定されていません。1つ以上指定してください。"
-        )
+        assert (
+            support_dreambooth or support_finetuning or support_controlnet
+        ), "Neither DreamBooth mode nor fine tuning mode nor controlnet mode specified. Please specify one mode or more. / DreamBooth モードか fine tuning モードか controlnet モードのどれも指定されていません。1つ以上指定してください。"

        self.db_subset_schema = self.__merge_dict(
            self.SUBSET_ASCENDABLE_SCHEMA,
@@ -332,10 +313,7 @@ class ConfigSanitizer:

            self.dataset_schema = validate_flex_dataset
        elif support_dreambooth:
-            if support_controlnet:
-                self.dataset_schema = self.cn_dataset_schema
-            else:
-                self.dataset_schema = self.db_dataset_schema
+            self.dataset_schema = self.db_dataset_schema
        elif support_finetuning:
            self.dataset_schema = self.ft_dataset_schema
        elif support_controlnet:
@@ -380,9 +358,7 @@ class ConfigSanitizer:
            return self.argparse_config_validator(argparse_namespace)
        except MultipleInvalid:
            # XXX: this should be a bug
-            logger.error(
-                "Invalid cmdline parsed arguments. This should be a bug. / コマンドラインのパース結果が正しくないようです。プログラムのバグの可能性が高いです。"
-            )
+            logger.error("Invalid cmdline parsed arguments. This should be a bug. / コマンドラインのパース結果が正しくないようです。プログラムのバグの可能性が高いです。")
            raise

    # NOTE: value would be overwritten by latter dict if there is already the same key
@@ -528,9 +504,6 @@ def generate_dataset_group_by_blueprint(dataset_group_blueprint: DatasetGroupBlu
          shuffle_caption: {subset.shuffle_caption}
          keep_tokens: {subset.keep_tokens}
          keep_tokens_separator: {subset.keep_tokens_separator}
-          caption_separator: {subset.caption_separator}
-          secondary_separator: {subset.secondary_separator}
-          enable_wildcard: {subset.enable_wildcard}
          caption_dropout_rate: {subset.caption_dropout_rate}
          caption_dropout_every_n_epoches: {subset.caption_dropout_every_n_epochs}
          caption_tag_dropout_rate: {subset.caption_tag_dropout_rate}
@@ -542,7 +515,6 @@ def generate_dataset_group_by_blueprint(dataset_group_blueprint: DatasetGroupBlu
          random_crop: {subset.random_crop}
          token_warmup_min: {subset.token_warmup_min},
          token_warmup_step: {subset.token_warmup_step},
-          alpha_mask: {subset.alpha_mask},
      """
                ),
                "  ",
@@ -569,11 +541,11 @@ def generate_dataset_group_by_blueprint(dataset_group_blueprint: DatasetGroupBlu
                    "    ",
                )

-    logger.info(f"{info}")
+    logger.info(f'{info}')

    # make buckets first because it determines the length of dataset
    # and set the same seed for all datasets
-    seed = random.randint(0, 2**31)  # actual seed is seed + epoch_no
+    seed = random.randint(0, 2**31) # actual seed is seed + epoch_no
    for i, dataset in enumerate(datasets):
        logger.info(f"[Dataset {i}]")
        dataset.make_buckets()
@@ -660,17 +632,13 @@ def load_user_config(file: str) -> dict:
            with open(file, "r") as f:
                config = json.load(f)
        except Exception:
-            logger.error(
-                f"Error on parsing JSON config file. Please check the format. / JSON 形式の設定ファイルの読み込みに失敗しました。文法が正しいか確認してください。: {file}"
-            )
+            logger.error(f"Error on parsing JSON config file. Please check the format. / JSON 形式の設定ファイルの読み込みに失敗しました。文法が正しいか確認してください。: {file}")
            raise
    elif file.name.lower().endswith(".toml"):
        try:
            config = toml.load(file)
        except Exception:
-            logger.error(
-                f"Error on parsing TOML config file. Please check the format. / TOML 形式の設定ファイルの読み込みに失敗しました。文法が正しいか確認してください。: {file}"
-            )
+            logger.error(f"Error on parsing TOML config file. Please check the format. / TOML 形式の設定ファイルの読み込みに失敗しました。文法が正しいか確認してください。: {file}")
            raise
    else:
        raise ValueError(f"not supported config file format / 対応していない設定ファイルの形式です: {file}")
@@ -697,13 +665,13 @@ if __name__ == "__main__":
    train_util.prepare_dataset_args(argparse_namespace, config_args.support_finetuning)

    logger.info("[argparse_namespace]")
-    logger.info(f"{vars(argparse_namespace)}")
+    logger.info(f'{vars(argparse_namespace)}')

    user_config = load_user_config(config_args.dataset_config)

    logger.info("")
    logger.info("[user_config]")
-    logger.info(f"{user_config}")
+    logger.info(f'{user_config}')

    sanitizer = ConfigSanitizer(
        config_args.support_dreambooth, config_args.support_finetuning, config_args.support_controlnet, config_args.support_dropout
@@ -712,10 +680,10 @@ if __name__ == "__main__":

    logger.info("")
    logger.info("[sanitized_user_config]")
-    logger.info(f"{sanitized_user_config}")
+    logger.info(f'{sanitized_user_config}')

    blueprint = BlueprintGenerator(sanitizer).generate(user_config, argparse_namespace)

    logger.info("")
    logger.info("[blueprint]")
-    logger.info(f"{blueprint}")
+    logger.info(f'{blueprint}')
--- a/library/custom_train_functions.py
+++ b/library/custom_train_functions.py
@@ -3,14 +3,11 @@ import argparse
 import random
 import re
 from typing import List, Optional, Union
-from .utils import setup_logging
-
+from .utils import setup_logging
 setup_logging()
-import logging
-
+import logging
 logger = logging.getLogger(__name__)

-
 def prepare_scheduler_for_custom_training(noise_scheduler, device):
    if hasattr(noise_scheduler, "all_snr"):
        return
@@ -67,7 +64,7 @@ def apply_snr_weight(loss, timesteps, noise_scheduler, gamma, v_prediction=False
    snr = torch.stack([noise_scheduler.all_snr[t] for t in timesteps])
    min_snr_gamma = torch.minimum(snr, torch.full_like(snr, gamma))
    if v_prediction:
-        snr_weight = torch.div(min_snr_gamma, snr + 1).float().to(loss.device)
+        snr_weight = torch.div(min_snr_gamma, snr+1).float().to(loss.device)
    else:
        snr_weight = torch.div(min_snr_gamma, snr).float().to(loss.device)
    loss = loss * snr_weight
@@ -95,18 +92,13 @@ def add_v_prediction_like_loss(loss, timesteps, noise_scheduler, v_pred_like_los
    loss = loss + loss / scale * v_pred_like_loss
    return loss

-
-def apply_debiased_estimation(loss, timesteps, noise_scheduler, v_prediction=False):
+def apply_debiased_estimation(loss, timesteps, noise_scheduler):
    snr_t = torch.stack([noise_scheduler.all_snr[t] for t in timesteps])  # batch_size
    snr_t = torch.minimum(snr_t, torch.ones_like(snr_t) * 1000)  # if timestep is 0, snr_t is inf, so limit it to 1000
-    if v_prediction:
-        weight = 1 / (snr_t + 1)
-    else:
-        weight = 1 / torch.sqrt(snr_t)
+    weight = 1/torch.sqrt(snr_t)
    loss = weight * loss
    return loss

-
 # TODO train_utilと分散しているのでどちらかに寄せる


@@ -482,25 +474,6 @@ def apply_noise_offset(latents, noise, noise_offset, adaptive_noise_scale):
    return noise


-def apply_masked_loss(loss, batch):
-    if "conditioning_images" in batch:
-        # conditioning image is -1 to 1. we need to convert it to 0 to 1
-        mask_image = batch["conditioning_images"].to(dtype=loss.dtype)[:, 0].unsqueeze(1)  # use R channel
-        mask_image = mask_image / 2 + 0.5
-        # print(f"conditioning_image: {mask_image.shape}")
-    elif "alpha_masks" in batch and batch["alpha_masks"] is not None:
-        # alpha mask is 0 to 1
-        mask_image = batch["alpha_masks"].to(dtype=loss.dtype).unsqueeze(1) # add channel dimension
-        # print(f"mask_image: {mask_image.shape}, {mask_image.mean()}")
-    else:
-        return loss
-
-    # resize to the same size as the loss
-    mask_image = torch.nn.functional.interpolate(mask_image, size=loss.shape[2:], mode="area")
-    loss = loss * mask_image
-    return loss
-
-
 """
 ##########################################
 # Perlin Noise
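
The hunks above touch two loss-weighting helpers: `apply_snr_weight` and `apply_debiased_estimation`, whose `v_prediction` branch is removed here. The sketch below restates both formulas on plain floats so the difference is easy to check by hand. The function names and the scalar form are assumptions for illustration; the repository code operates on tensors.

```python
import math


def min_snr_weight(snr, gamma, v_prediction=False):
    # min(snr, gamma) / snr for epsilon prediction,
    # min(snr, gamma) / (snr + 1) for v-prediction.
    clipped = min(snr, gamma)
    return clipped / (snr + 1) if v_prediction else clipped / snr


def debiased_weight(snr, v_prediction=False, snr_cap=1000.0):
    # SNR is infinite at timestep 0, so it is capped first.
    snr = min(snr, snr_cap)
    # The v_prediction branch is the one this diff removes.
    return 1 / (snr + 1) if v_prediction else 1 / math.sqrt(snr)


w_eps = min_snr_weight(10.0, 5.0)
w_v = min_snr_weight(4.0, 5.0, v_prediction=True)
d_eps = debiased_weight(4.0)
d_v = debiased_weight(3.0, v_prediction=True)
d_inf = debiased_weight(float("inf"))
```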
--- a/library/deepspeed_utils.py
+++ b/library/deepspeed_utils.py
@@ -1,139 +0,0 @@
-import os
-import argparse
-import torch
-from accelerate import DeepSpeedPlugin, Accelerator
-
-from .utils import setup_logging
-
-setup_logging()
-import logging
-
-logger = logging.getLogger(__name__)
-
-
-def add_deepspeed_arguments(parser: argparse.ArgumentParser):
-    # DeepSpeed Arguments. https://huggingface.co/docs/accelerate/usage_guides/deepspeed
-    parser.add_argument("--deepspeed", action="store_true", help="enable deepspeed training")
-    parser.add_argument("--zero_stage", type=int, default=2, choices=[0, 1, 2, 3], help="Possible options are 0,1,2,3.")
-    parser.add_argument(
-        "--offload_optimizer_device",
-        type=str,
-        default=None,
-        choices=[None, "cpu", "nvme"],
-        help="Possible options are none|cpu|nvme. Only applicable with ZeRO Stages 2 and 3.",
-    )
-    parser.add_argument(
-        "--offload_optimizer_nvme_path",
-        type=str,
-        default=None,
-        help="Possible options are /nvme|/local_nvme. Only applicable with ZeRO Stage 3.",
-    )
-    parser.add_argument(
-        "--offload_param_device",
-        type=str,
-        default=None,
-        choices=[None, "cpu", "nvme"],
-        help="Possible options are none|cpu|nvme. Only applicable with ZeRO Stage 3.",
-    )
-    parser.add_argument(
-        "--offload_param_nvme_path",
-        type=str,
-        default=None,
-        help="Possible options are /nvme|/local_nvme. Only applicable with ZeRO Stage 3.",
-    )
-    parser.add_argument(
-        "--zero3_init_flag",
-        action="store_true",
-        help="Flag to indicate whether to enable `deepspeed.zero.Init` for constructing massive models."
-        "Only applicable with ZeRO Stage-3.",
-    )
-    parser.add_argument(
-        "--zero3_save_16bit_model",
-        action="store_true",
-        help="Flag to indicate whether to save 16-bit model. Only applicable with ZeRO Stage-3.",
-    )
-    parser.add_argument(
-        "--fp16_master_weights_and_gradients",
-        action="store_true",
-        help="fp16_master_and_gradients requires optimizer to support keeping fp16 master and gradients while keeping the optimizer states in fp32.",
-    )
-
-
-def prepare_deepspeed_args(args: argparse.Namespace):
-    if not args.deepspeed:
-        return
-
-    # To avoid RuntimeError: DataLoader worker exited unexpectedly with exit code 1.
-    args.max_data_loader_n_workers = 1
-
-
-def prepare_deepspeed_plugin(args: argparse.Namespace):
-    if not args.deepspeed:
-        return None
-
-    try:
-        import deepspeed
-    except ImportError as e:
-        logger.error(
-            "deepspeed is not installed. please install deepspeed in your environment with following command. DS_BUILD_OPS=0 pip install deepspeed"
-        )
-        exit(1)
-
-    deepspeed_plugin = DeepSpeedPlugin(
-        zero_stage=args.zero_stage,
-        gradient_accumulation_steps=args.gradient_accumulation_steps,
-        gradient_clipping=args.max_grad_norm,
-        offload_optimizer_device=args.offload_optimizer_device,
-        offload_optimizer_nvme_path=args.offload_optimizer_nvme_path,
-        offload_param_device=args.offload_param_device,
-        offload_param_nvme_path=args.offload_param_nvme_path,
-        zero3_init_flag=args.zero3_init_flag,
-        zero3_save_16bit_model=args.zero3_save_16bit_model,
-    )
-    deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"] = args.train_batch_size
-    deepspeed_plugin.deepspeed_config["train_batch_size"] = (
-        args.train_batch_size * args.gradient_accumulation_steps * int(os.environ["WORLD_SIZE"])
-    )
-    deepspeed_plugin.set_mixed_precision(args.mixed_precision)
-    if args.mixed_precision.lower() == "fp16":
-        deepspeed_plugin.deepspeed_config["fp16"]["initial_scale_power"] = 0  # preventing overflow.
-    if args.full_fp16 or args.fp16_master_weights_and_gradients:
-        if args.offload_optimizer_device == "cpu" and args.zero_stage == 2:
-            deepspeed_plugin.deepspeed_config["fp16"]["fp16_master_weights_and_grads"] = True
-            logger.info("[DeepSpeed] full fp16 enable.")
-        else:
-            logger.info(
-                "[DeepSpeed]full fp16, fp16_master_weights_and_grads currently only supported using ZeRO-Offload with DeepSpeedCPUAdam on ZeRO-2 stage."
-            )
-
-    if args.offload_optimizer_device is not None:
-        logger.info("[DeepSpeed] start to manually build cpu_adam.")
-        deepspeed.ops.op_builder.CPUAdamBuilder().load()
-        logger.info("[DeepSpeed] building cpu_adam done.")
-
-    return deepspeed_plugin
-
-
-# Accelerate library does not support multiple models for deepspeed. So, we need to wrap multiple models into a single model.
-def prepare_deepspeed_model(args: argparse.Namespace, **models):
-    # remove None from models
-    models = {k: v for k, v in models.items() if v is not None}
-
-    class DeepSpeedWrapper(torch.nn.Module):
-        def __init__(self, **kw_models) -> None:
-            super().__init__()
-            self.models = torch.nn.ModuleDict()
-
-            for key, model in kw_models.items():
-                if isinstance(model, list):
-                    model = torch.nn.ModuleList(model)
-                assert isinstance(
-                    model, torch.nn.Module
-                ), f"model must be an instance of torch.nn.Module, but got {key} is {type(model)}"
-                self.models.update(torch.nn.ModuleDict({key: model}))
-
-        def get_models(self):
-            return self.models
-
-    ds_model = DeepSpeedWrapper(**models)
-    return ds_model
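
The removed `prepare_deepspeed_plugin` derives two batch-size entries for the DeepSpeed config from the training arguments and `WORLD_SIZE`. A minimal sketch of that arithmetic follows; `deepspeed_batch_sizes` is a hypothetical helper, and `world_size` is passed in directly instead of being read from the environment.

```python
def deepspeed_batch_sizes(train_batch_size, gradient_accumulation_steps, world_size):
    # The per-GPU micro batch is the plain training batch size.
    micro = train_batch_size
    # The global batch multiplies in accumulation steps and process count.
    total = train_batch_size * gradient_accumulation_steps * world_size
    return {
        "train_micro_batch_size_per_gpu": micro,
        "train_batch_size": total,
    }


sizes = deepspeed_batch_sizes(4, 2, 8)
```

With a batch size of 4, 2 accumulation steps, and 8 processes, the global batch size is 64.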
--- a/library/device_utils.py
+++ b/library/device_utils.py
@@ -3,6 +3,11 @@ import gc

 import torch

+from .utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)
+
 try:
    HAS_CUDA = torch.cuda.is_available()
 except Exception:
@@ -59,7 +64,7 @@ def get_preferred_device() -> torch.device:
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
-    print(f"get_preferred_device() -> {device}")
+    logger.info(f"get_preferred_device() -> {device}")
    return device


@@ -77,8 +82,8 @@ def init_ipex():

            is_initialized, error_message = ipex_init()
            if not is_initialized:
-                print("failed to initialize ipex:", error_message)
+                logger.error(f"failed to initialize ipex: {error_message}")
        else:
            return
    except Exception as e:
-        print("failed to initialize ipex:", e)
+        logger.error(f"failed to initialize ipex: {e}")
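
When a `print(...)` call with several arguments is converted to a logger call, braces in the message are substituted only if the string is an f-string, or if the value is passed as a `%s` argument. A small sketch of the difference, using a made-up `error_message` value:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

error_message = "device not found"  # made-up value for illustration

# Without the f prefix the braces are kept literally.
plain = "failed to initialize ipex: {error_message}"
# With the f prefix the variable is interpolated.
formatted = f"failed to initialize ipex: {error_message}"

logger.error(formatted)
# Alternative: let logging perform the substitution lazily.
logger.error("failed to initialize ipex: %s", error_message)
```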
--- a/library/ipex/init.py
+++ b/library/ipex/init.py
@@ -32,7 +32,6 @@ def ipex_init(): # pylint: disable=too-many-statements
            torch.cuda.FloatTensor = torch.xpu.FloatTensor
            torch.Tensor.cuda = torch.Tensor.xpu
            torch.Tensor.is_cuda = torch.Tensor.is_xpu
-            torch.nn.Module.cuda = torch.nn.Module.xpu
            torch.UntypedStorage.cuda = torch.UntypedStorage.xpu
            torch.cuda._initialization_lock = torch.xpu.lazy_init._initialization_lock
            torch.cuda._initialized = torch.xpu.lazy_init._initialized
@@ -148,9 +147,9 @@ def ipex_init(): # pylint: disable=too-many-statements

            # C
            torch._C._cuda_getCurrentRawStream = ipex._C._getCurrentStream
-            ipex._C._DeviceProperties.multi_processor_count = ipex._C._DeviceProperties.gpu_subslice_count
-            ipex._C._DeviceProperties.major = 2024
-            ipex._C._DeviceProperties.minor = 0
+            ipex._C._DeviceProperties.multi_processor_count = ipex._C._DeviceProperties.gpu_eu_count
+            ipex._C._DeviceProperties.major = 2023
+            ipex._C._DeviceProperties.minor = 2

            # Fix functions with ipex:
            torch.cuda.mem_get_info = lambda device=None: [(torch.xpu.get_device_properties(device).total_memory - torch.xpu.memory_reserved(device)), torch.xpu.get_device_properties(device).total_memory]
--- a/library/ipex/attention.py
+++ b/library/ipex/attention.py
@@ -5,7 +5,7 @@ from functools import cache

 # pylint: disable=protected-access, missing-function-docstring, line-too-long

-# ARC GPUs can't allocate more than 4GB to a single block so we slice the attention layers
+# ARC GPUs can't allocate more than 4GB to a single block so we slice the attetion layers

 sdpa_slice_trigger_rate = float(os.environ.get('IPEX_SDPA_SLICE_TRIGGER_RATE', 4))
 attention_slice_rate = float(os.environ.get('IPEX_ATTENTION_SLICE_RATE', 4))
@@ -122,15 +122,15 @@ def torch_bmm_32_bit(input, mat2, *, out=None):
                    mat2[start_idx:end_idx],
                    out=out
                )
-        torch.xpu.synchronize(input.device)
    else:
        return original_torch_bmm(input, mat2, out=out)
+    torch.xpu.synchronize(input.device)
    return hidden_states

 original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention
-def scaled_dot_product_attention_32_bit(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
+def scaled_dot_product_attention_32_bit(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False):
    if query.device.type != "xpu":
-        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
+        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)
    do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_sdpa_slice_sizes(query.shape, query.element_size())

    # Slice SDPA
@@ -153,7 +153,7 @@ def scaled_dot_product_attention_32_bit(query, key, value, attn_mask=None, dropo
                                key[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
                                value[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
                                attn_mask=attn_mask[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] if attn_mask is not None else attn_mask,
-                                dropout_p=dropout_p, is_causal=is_causal, **kwargs
+                                dropout_p=dropout_p, is_causal=is_causal
                            )
                    else:
                        hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = original_scaled_dot_product_attention(
@@ -161,7 +161,7 @@ def scaled_dot_product_attention_32_bit(query, key, value, attn_mask=None, dropo
                            key[start_idx:end_idx, start_idx_2:end_idx_2],
                            value[start_idx:end_idx, start_idx_2:end_idx_2],
                            attn_mask=attn_mask[start_idx:end_idx, start_idx_2:end_idx_2] if attn_mask is not None else attn_mask,
-                            dropout_p=dropout_p, is_causal=is_causal, **kwargs
+                            dropout_p=dropout_p, is_causal=is_causal
                        )
            else:
                hidden_states[start_idx:end_idx] = original_scaled_dot_product_attention(
@@ -169,9 +169,9 @@ def scaled_dot_product_attention_32_bit(query, key, value, attn_mask=None, dropo
                    key[start_idx:end_idx],
                    value[start_idx:end_idx],
                    attn_mask=attn_mask[start_idx:end_idx] if attn_mask is not None else attn_mask,
-                    dropout_p=dropout_p, is_causal=is_causal, **kwargs
+                    dropout_p=dropout_p, is_causal=is_causal
                )
-        torch.xpu.synchronize(query.device)
    else:
-        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
+        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)
+    torch.xpu.synchronize(query.device)
    return hidden_states
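
The comment in `attention.py` says a single allocation on ARC GPUs cannot exceed 4GB, so attention is computed in slices. The sketch below shows one way to plan slices along the batch dimension under such a limit. It is not the repository's `find_sdpa_slice_sizes`; `plan_slices` and its parameters are hypothetical.

```python
def plan_slices(batch, bytes_per_item, limit_bytes=4 * 1024**3):
    """Return (start, end) index pairs so each slice stays within limit_bytes."""
    if batch * bytes_per_item <= limit_bytes:
        # Everything fits in one block, so no slicing is needed.
        return [(0, batch)]
    # Largest number of items that fits in one block (at least one).
    per_slice = max(1, limit_bytes // bytes_per_item)
    return [(start, min(start + per_slice, batch)) for start in range(0, batch, per_slice)]


one_gib = 1024**3
small = plan_slices(2, one_gib)
large = plan_slices(10, one_gib)
```

Each `(start, end)` pair would then index the query, key, and value tensors, with the partial results written back into the matching slice of the output.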
--- a/library/ipex/hijacks.py
+++ b/library/ipex/hijacks.py
@@ -12,7 +12,7 @@ device_supports_fp64 = torch.xpu.has_fp64_dtype()
 class DummyDataParallel(torch.nn.Module): # pylint: disable=missing-class-docstring, unused-argument, too-few-public-methods
    def __new__(cls, module, device_ids=None, output_device=None, dim=0): # pylint: disable=unused-argument
        if isinstance(device_ids, list) and len(device_ids) > 1:
-            print("IPEX backend doesn't support DataParallel on multiple XPU devices")
+            logger.error("IPEX backend doesn't support DataParallel on multiple XPU devices")
        return module.to("xpu")

 def return_null_context(*args, **kwargs): # pylint: disable=unused-argument
@@ -42,7 +42,7 @@ def autocast_init(self, device_type, dtype=None, enabled=True, cache_enabled=Non
 original_interpolate = torch.nn.functional.interpolate
@wraps(torch.nn.functional.interpolate)
 def interpolate(tensor, size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None, antialias=False): # pylint: disable=too-many-arguments
-    if antialias or align_corners is not None or mode == 'bicubic':
+    if antialias or align_corners is not None:
        return_device = tensor.device
        return_dtype = tensor.dtype
        return original_interpolate(tensor.to("cpu", dtype=torch.float32), size=size, scale_factor=scale_factor, mode=mode,
@@ -190,16 +190,6 @@ def Tensor_cuda(self, device=None, *args, **kwargs):
    else:
        return original_Tensor_cuda(self, device, *args, **kwargs)

-original_Tensor_pin_memory = torch.Tensor.pin_memory
-@wraps(torch.Tensor.pin_memory)
-def Tensor_pin_memory(self, device=None, *args, **kwargs):
-    if device is None:
-        device = "xpu"
-    if check_device(device):
-        return original_Tensor_pin_memory(self, return_xpu(device), *args, **kwargs)
-    else:
-        return original_Tensor_pin_memory(self, device, *args, **kwargs)
-
 original_UntypedStorage_init = torch.UntypedStorage.__init__
@wraps(torch.UntypedStorage.__init__)
 def UntypedStorage_init(*args, device=None, **kwargs):
@@ -226,9 +216,7 @@ def torch_empty(*args, device=None, **kwargs):

 original_torch_randn = torch.randn
@wraps(torch.randn)
-def torch_randn(*args, device=None, dtype=None, **kwargs):
-    if dtype == bytes:
-        dtype = None
+def torch_randn(*args, device=None, **kwargs):
    if check_device(device):
        return original_torch_randn(*args, device=return_xpu(device), **kwargs)
    else:
@@ -268,13 +256,11 @@ def torch_Generator(device=None):

 original_torch_load = torch.load
@wraps(torch.load)
-def torch_load(f, map_location=None, *args, **kwargs):
-    if map_location is None:
-        map_location = "xpu"
+def torch_load(f, map_location=None, pickle_module=None, *, weights_only=False, mmap=None, **kwargs):
    if check_device(map_location):
-        return original_torch_load(f, *args, map_location=return_xpu(map_location), **kwargs)
+        return original_torch_load(f, map_location=return_xpu(map_location), pickle_module=pickle_module, weights_only=weights_only, mmap=mmap, **kwargs)
    else:
-        return original_torch_load(f, *args, map_location=map_location, **kwargs)
+        return original_torch_load(f, map_location=map_location, pickle_module=pickle_module, weights_only=weights_only, mmap=mmap, **kwargs)


 # Hijack Functions:
@@ -282,7 +268,6 @@ def ipex_hijacks():
    torch.tensor = torch_tensor
    torch.Tensor.to = Tensor_to
    torch.Tensor.cuda = Tensor_cuda
-    torch.Tensor.pin_memory = Tensor_pin_memory
    torch.UntypedStorage.__init__ = UntypedStorage_init
    torch.UntypedStorage.cuda = UntypedStorage_cuda
    torch.empty = torch_empty
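
The hijacks in this file follow one pattern: keep a reference to the original function, wrap it with `functools.wraps`, and rewrite the device argument before delegating. The sketch below reproduces the pattern without torch. `original_make` is a stand-in for a factory such as `torch.randn`, and the bodies of `check_device` and `return_xpu` are assumptions, since their real implementations are not shown in this diff.

```python
from functools import wraps


def check_device(device):
    # Assumed behaviour: any "cuda" device string should be redirected.
    return isinstance(device, str) and device.startswith("cuda")


def return_xpu(device):
    # Assumed behaviour: "cuda:1" -> "xpu:1", "cuda" -> "xpu".
    return device.replace("cuda", "xpu", 1)


def original_make(shape, device=None):
    """Stand-in for a torch factory function."""
    return {"shape": shape, "device": device}


@wraps(original_make)
def hijacked_make(*args, device=None, **kwargs):
    # Redirect CUDA devices to XPU, pass everything else through unchanged.
    if check_device(device):
        return original_make(*args, device=return_xpu(device), **kwargs)
    return original_make(*args, device=device, **kwargs)


on_xpu = hijacked_make((2,), device="cuda:1")
on_cpu = hijacked_make((2,), device="cpu")
```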
--- a/library/sai_model_spec.py
+++ b/library/sai_model_spec.py
@@ -6,8 +6,10 @@ import os
 from typing import List, Optional, Tuple, Union
 import safetensors
 from library.utils import setup_logging
+
 setup_logging()
 import logging
+
 logger = logging.getLogger(__name__)

 r"""
@@ -55,11 +57,13 @@ ARCH_SD_V1 = "stable-diffusion-v1"
 ARCH_SD_V2_512 = "stable-diffusion-v2-512"
 ARCH_SD_V2_768_V = "stable-diffusion-v2-768-v"
 ARCH_SD_XL_V1_BASE = "stable-diffusion-xl-v1-base"
+ARCH_STABLE_CASCADE = "stable-cascade"

 ADAPTER_LORA = "lora"
 ADAPTER_TEXTUAL_INVERSION = "textual-inversion"

 IMPL_STABILITY_AI = "https://github.com/Stability-AI/generative-models"
+IMPL_STABILITY_AI_STABLE_CASCADE = "https://github.com/Stability-AI/StableCascade"
 IMPL_DIFFUSERS = "diffusers"

 PRED_TYPE_EPSILON = "epsilon"
@@ -113,6 +117,7 @@ def build_metadata(
    merged_from: Optional[str] = None,
    timesteps: Optional[Tuple[int, int]] = None,
    clip_skip: Optional[int] = None,
+    stable_cascade: Optional[bool] = None,
 ):
    # if state_dict is None, hash is not calculated

@@ -124,7 +129,9 @@ def build_metadata(
    # hash = precalculate_safetensors_hashes(state_dict)
    # metadata["modelspec.hash_sha256"] = hash

-    if sdxl:
+    if stable_cascade:
+        arch = ARCH_STABLE_CASCADE
+    elif sdxl:
        arch = ARCH_SD_XL_V1_BASE
    elif v2:
        if v_parameterization:
@@ -142,9 +149,11 @@ def build_metadata(
    metadata["modelspec.architecture"] = arch

    if not lora and not textual_inversion and is_stable_diffusion_ckpt is None:
-        is_stable_diffusion_ckpt = True # default is stable diffusion ckpt if not lora and not textual_inversion
+        is_stable_diffusion_ckpt = True  # default is stable diffusion ckpt if not lora and not textual_inversion

-    if (lora and sdxl) or textual_inversion or is_stable_diffusion_ckpt:
+    if stable_cascade:
+        impl = IMPL_STABILITY_AI_STABLE_CASCADE
+    elif (lora and sdxl) or textual_inversion or is_stable_diffusion_ckpt:
        # Stable Diffusion ckpt, TI, SDXL LoRA
        impl = IMPL_STABILITY_AI
    else:
@@ -236,7 +245,7 @@ def build_metadata(
    # assert all([v is not None for v in metadata.values()]), metadata
    if not all([v is not None for v in metadata.values()]):
        logger.error(f"Internal error: some metadata values are None: {metadata}")
-    
+
    return metadata


@@ -250,7 +259,7 @@ def get_title(metadata: dict) -> Optional[str]:
 def load_metadata_from_safetensors(model: str) -> dict:
    if not model.endswith(".safetensors"):
        return {}
-    
+
    with safetensors.safe_open(model, framework="pt") as f:
        metadata = f.metadata()
    if metadata is None:
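
In `build_metadata`, the new `stable_cascade` flag is checked before `sdxl` and the other flags, so it takes precedence for the implementation field. A minimal sketch of that precedence follows. `pick_impl` is a hypothetical helper, and the `IMPL_DIFFUSERS` fallback is an assumption because the body of the final `else` branch is not visible in this hunk.

```python
IMPL_STABILITY_AI = "https://github.com/Stability-AI/generative-models"
IMPL_STABILITY_AI_STABLE_CASCADE = "https://github.com/Stability-AI/StableCascade"
IMPL_DIFFUSERS = "diffusers"


def pick_impl(stable_cascade, sdxl, lora, textual_inversion, is_stable_diffusion_ckpt):
    if stable_cascade:
        # Checked first, so it wins even if sdxl or lora is also set.
        return IMPL_STABILITY_AI_STABLE_CASCADE
    if (lora and sdxl) or textual_inversion or is_stable_diffusion_ckpt:
        # Stable Diffusion ckpt, TI, SDXL LoRA
        return IMPL_STABILITY_AI
    # Assumed fallback; the else branch body is not shown in the diff.
    return IMPL_DIFFUSERS


cascade = pick_impl(True, True, True, False, False)
sdxl_lora = pick_impl(False, True, True, False, False)
other = pick_impl(False, False, True, False, False)
```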
--- a/library/sdxl_model_util.py
+++ b/library/sdxl_model_util.py
@@ -1,5 +1,4 @@
 import torch
-import safetensors
 from accelerate import init_empty_weights
 from accelerate.utils.modeling import set_module_tensor_to_device
 from safetensors.torch import load_file, save_file
@@ -9,10 +8,8 @@ from diffusers import AutoencoderKL, EulerDiscreteScheduler, UNet2DConditionMode
 from library import model_util
 from library import sdxl_original_unet
 from .utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

 VAE_SCALE_FACTOR = 0.13025
@@ -166,20 +163,17 @@ def _load_state_dict_on_device(model, state_dict, device, dtype=None):
    raise RuntimeError("Error(s) in loading state_dict for {}:\n\t{}".format(model.__class__.__name__, "\n\t".join(error_msgs)))


-def load_models_from_sdxl_checkpoint(model_version, ckpt_path, map_location, dtype=None, disable_mmap=False):
+def load_models_from_sdxl_checkpoint(model_version, ckpt_path, map_location, dtype=None):
    # model_version is reserved for future use
    # dtype is used for full_fp16/bf16 integration. Text Encoder will remain fp32, because it runs on CPU when caching

    # Load the state dict
    if model_util.is_safetensors(ckpt_path):
        checkpoint = None
-        if disable_mmap:
-            state_dict = safetensors.torch.load(open(ckpt_path, "rb").read())
-        else:
-            try:
-                state_dict = load_file(ckpt_path, device=map_location)
-            except:
-                state_dict = load_file(ckpt_path)  # prevent device invalid Error
+        try:
+            state_dict = load_file(ckpt_path, device=map_location)
+        except:
+            state_dict = load_file(ckpt_path)  # prevent device invalid Error
        epoch = None
        global_step = None
    else:
--- a/library/sdxl_original_unet.py
+++ b/library/sdxl_original_unet.py
@@ -31,10 +31,8 @@ from torch import nn
 from torch.nn import functional as F
 from einops import rearrange
 from .utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

 IN_CHANNELS: int = 4
@@ -1076,7 +1074,7 @@ class SdxlUNet2DConditionModel(nn.Module):
        timesteps = timesteps.expand(x.shape[0])

        hs = []
-        t_emb = get_timestep_embedding(timesteps, self.model_channels, downscale_freq_shift=0)  # , repeat_only=False)
+        t_emb = get_timestep_embedding(timesteps, self.model_channels)  # , repeat_only=False)
        t_emb = t_emb.to(x.dtype)
        emb = self.time_embed(t_emb)

@@ -1134,7 +1132,7 @@ class InferSdxlUNet2DConditionModel:
    # call original model's methods
    def __getattr__(self, name):
        return getattr(self.delegate, name)
-
+    
    def __call__(self, *args, **kwargs):
        return self.delegate(*args, **kwargs)

@@ -1166,7 +1164,7 @@ class InferSdxlUNet2DConditionModel:
        timesteps = timesteps.expand(x.shape[0])

        hs = []
-        t_emb = get_timestep_embedding(timesteps, _self.model_channels, downscale_freq_shift=0)  # , repeat_only=False)
+        t_emb = get_timestep_embedding(timesteps, _self.model_channels)  # , repeat_only=False)
        t_emb = t_emb.to(x.dtype)
        emb = _self.time_embed(t_emb)

--- a/library/sdxl_train_util.py
+++ b/library/sdxl_train_util.py
@@ -5,7 +5,6 @@ from typing import Optional

 import torch
 from library.device_utils import init_ipex, clean_memory_on_device
-
 init_ipex()

 from accelerate import init_empty_weights
@@ -14,10 +13,8 @@ from transformers import CLIPTokenizer
 from library import model_util, sdxl_model_util, train_util, sdxl_original_unet
 from library.sdxl_lpw_stable_diffusion import SdxlStableDiffusionLongPromptWeightingPipeline
 from .utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

 TOKENIZER1_PATH = "openai/clip-vit-large-patch14"
@@ -27,6 +24,7 @@ TOKENIZER2_PATH = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"


 def load_target_model(args, accelerator, model_version: str, weight_dtype):
+    # load models for each process
    model_dtype = match_mixed_precision(args, weight_dtype)  # prepare fp16/bf16
    for pi in range(accelerator.state.num_processes):
        if pi == accelerator.state.local_process_index:
@@ -47,7 +45,6 @@ def load_target_model(args, accelerator, model_version: str, weight_dtype):
                weight_dtype,
                accelerator.device if args.lowram else "cpu",
                model_dtype,
-                args.disable_mmap_load_safetensors,
            )

            # work on low-ram device
@@ -64,7 +61,7 @@ def load_target_model(args, accelerator, model_version: str, weight_dtype):


 def _load_target_model(
-    name_or_path: str, vae_path: Optional[str], model_version: str, weight_dtype, device="cpu", model_dtype=None, disable_mmap=False
+    name_or_path: str, vae_path: Optional[str], model_version: str, weight_dtype, device="cpu", model_dtype=None
 ):
    # model_dtype only work with full fp16/bf16
    name_or_path = os.readlink(name_or_path) if os.path.islink(name_or_path) else name_or_path
@@ -79,7 +76,7 @@ def _load_target_model(
            unet,
            logit_scale,
            ckpt_info,
-        ) = sdxl_model_util.load_models_from_sdxl_checkpoint(model_version, name_or_path, device, model_dtype, disable_mmap)
+        ) = sdxl_model_util.load_models_from_sdxl_checkpoint(model_version, name_or_path, device, model_dtype)
    else:
        # Diffusers model is loaded to CPU
        from diffusers import StableDiffusionXLPipeline
@@ -336,11 +333,6 @@ def add_sdxl_training_arguments(parser: argparse.ArgumentParser):
        action="store_true",
        help="cache text encoder outputs to disk / text encoderの出力をディスクにキャッシュする",
    )
-    parser.add_argument(
-        "--disable_mmap_load_safetensors",
-        action="store_true",
-        help="disable mmap load for safetensors. Speed up model loading in WSL environment / safetensorsのmmapロードを無効にする。WSL環境等でモデル読み込みを高速化できる",
-    )


 def verify_sdxl_training_args(args: argparse.Namespace, supportTextEncoderCaching: bool = True):
--- a/library/stable_cascade.py
+++ b/library/stable_cascade.py
--- a/library/stable_cascade_utils.py
+++ b/library/stable_cascade_utils.py
@@ -0,0 +1,668 @@
+import argparse
+import json
+import math
+import os
+import time
+from typing import List
+import numpy as np
+import toml
+
+import torch
+import torchvision
+from safetensors.torch import load_file, save_file
+from tqdm import tqdm
+from transformers import CLIPTokenizer, CLIPTextModelWithProjection, CLIPTextConfig
+from accelerate import init_empty_weights, Accelerator, PartialState
+from PIL import Image
+
+from library import stable_cascade as sc
+
+from library.sdxl_model_util import _load_state_dict_on_device
+from library.device_utils import clean_memory_on_device
+from library.train_util import (
+    save_sd_model_on_epoch_end_or_stepwise_common,
+    save_sd_model_on_train_end_common,
+    line_to_prompt_dict,
+    get_hidden_states_stable_cascade,
+)
+from library import sai_model_spec
+
+
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+CLIP_TEXT_MODEL_NAME: str = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
+
+TEXT_ENCODER_OUTPUTS_CACHE_SUFFIX = "_sc_te_outputs.npz"
+
+
+def calculate_latent_sizes(height=1024, width=1024, batch_size=4, compression_factor_b=42.67, compression_factor_a=4.0):
+    # Stage C latent: image size divided by compression_factor_b, rounded up
+    latent_height = math.ceil(height / compression_factor_b)
+    latent_width = math.ceil(width / compression_factor_b)
+    stage_c_latent_shape = (batch_size, 16, latent_height, latent_width)
+
+    latent_height = math.ceil(height / compression_factor_a)
+    latent_width = math.ceil(width / compression_factor_a)
+    stage_b_latent_shape = (batch_size, 4, latent_height, latent_width)
+
+    return stage_c_latent_shape, stage_b_latent_shape
+
+
+# region load and save
+
+
+def load_effnet(effnet_checkpoint_path, loading_device="cpu") -> sc.EfficientNetEncoder:
+    logger.info(f"Loading EfficientNet encoder from {effnet_checkpoint_path}")
+    effnet = sc.EfficientNetEncoder()
+    effnet_checkpoint = load_file(effnet_checkpoint_path)
+    info = effnet.load_state_dict(effnet_checkpoint if "state_dict" not in effnet_checkpoint else effnet_checkpoint["state_dict"])
+    logger.info(info)
+    del effnet_checkpoint
+    return effnet
+
+
+def load_tokenizer(args: argparse.Namespace):
+    # TODO: share this logic with sdxl_train_util.load_tokenizers
+    logger.info("prepare tokenizers")
+
+    original_paths = [CLIP_TEXT_MODEL_NAME]
+    tokenizers = []
+    for i, original_path in enumerate(original_paths):
+        tokenizer: CLIPTokenizer = None
+        if args.tokenizer_cache_dir:
+            local_tokenizer_path = os.path.join(args.tokenizer_cache_dir, original_path.replace("/", "_"))
+            if os.path.exists(local_tokenizer_path):
+                logger.info(f"load tokenizer from cache: {local_tokenizer_path}")
+                tokenizer = CLIPTokenizer.from_pretrained(local_tokenizer_path)
+
+        if tokenizer is None:
+            tokenizer = CLIPTokenizer.from_pretrained(original_path)
+
+        if args.tokenizer_cache_dir and not os.path.exists(local_tokenizer_path):
+            logger.info(f"save Tokenizer to cache: {local_tokenizer_path}")
+            tokenizer.save_pretrained(local_tokenizer_path)
+
+        tokenizers.append(tokenizer)
+
+    if hasattr(args, "max_token_length") and args.max_token_length is not None:
+        logger.info(f"update token length: {args.max_token_length}")
+
+    return tokenizers[0]
+
+
+def load_stage_c_model(stage_c_checkpoint_path, dtype=None, device="cpu") -> sc.StageC:
+    # Generator
+    logger.info("Instantiating Stage C generator")
+    with init_empty_weights():
+        generator_c = sc.StageC()
+    logger.info(f"Loading Stage C generator from {stage_c_checkpoint_path}")
+    stage_c_checkpoint = load_file(stage_c_checkpoint_path)
+
+    stage_c_checkpoint = convert_state_dict_mha_to_normal_attn(stage_c_checkpoint)
+
+    logger.info("Loading state dict")
+    info = _load_state_dict_on_device(generator_c, stage_c_checkpoint, device, dtype=dtype)
+    logger.info(info)
+    return generator_c
+
+
+def load_stage_b_model(stage_b_checkpoint_path, dtype=None, device="cpu") -> sc.StageB:
+    logger.info("Instantiating Stage B generator")
+    with init_empty_weights():
+        generator_b = sc.StageB()
+    logger.info(f"Loading Stage B generator from {stage_b_checkpoint_path}")
+    stage_b_checkpoint = load_file(stage_b_checkpoint_path)
+
+    stage_b_checkpoint = convert_state_dict_mha_to_normal_attn(stage_b_checkpoint)
+
+    logger.info("Loading state dict")
+    info = _load_state_dict_on_device(generator_b, stage_b_checkpoint, device, dtype=dtype)
+    logger.info(info)
+    return generator_b
+
+
+def load_clip_text_model(text_model_checkpoint_path, dtype=None, device="cpu", save_text_model=False):
+    # CLIP encoders
+    logger.info("Loading CLIP text model")
+    if save_text_model or text_model_checkpoint_path is None:
+        logger.info(f"Loading CLIP text model from {CLIP_TEXT_MODEL_NAME}")
+        text_model = CLIPTextModelWithProjection.from_pretrained(CLIP_TEXT_MODEL_NAME)
+
+        if save_text_model:
+            sd = text_model.state_dict()
+            logger.info(f"Saving CLIP text model to {text_model_checkpoint_path}")
+            save_file(sd, text_model_checkpoint_path)
+    else:
+        logger.info(f"Loading CLIP text model from {text_model_checkpoint_path}")
+
+        # copy from sdxl_model_util.py
+        text_model2_cfg = CLIPTextConfig(
+            vocab_size=49408,
+            hidden_size=1280,
+            intermediate_size=5120,
+            num_hidden_layers=32,
+            num_attention_heads=20,
+            max_position_embeddings=77,
+            hidden_act="gelu",
+            layer_norm_eps=1e-05,
+            dropout=0.0,
+            attention_dropout=0.0,
+            initializer_range=0.02,
+            initializer_factor=1.0,
+            pad_token_id=1,
+            bos_token_id=0,
+            eos_token_id=2,
+            model_type="clip_text_model",
+            projection_dim=1280,
+            # torch_dtype="float32",
+            # transformers_version="4.25.0.dev0",
+        )
+        with init_empty_weights():
+            text_model = CLIPTextModelWithProjection(text_model2_cfg)
+
+        text_model_checkpoint = load_file(text_model_checkpoint_path)
+        info = _load_state_dict_on_device(text_model, text_model_checkpoint, device, dtype=dtype)
+        logger.info(info)
+
+    return text_model
+
+
+def load_stage_a_model(stage_a_checkpoint_path, dtype=None, device="cpu") -> sc.StageA:
+    logger.info(f"Loading Stage A vqGAN from {stage_a_checkpoint_path}")
+    stage_a = sc.StageA().to(device)
+    stage_a_checkpoint = load_file(stage_a_checkpoint_path)
+    info = stage_a.load_state_dict(
+        stage_a_checkpoint if "state_dict" not in stage_a_checkpoint else stage_a_checkpoint["state_dict"]
+    )
+    logger.info(info)
+    return stage_a
+
+
+def load_previewer_model(previewer_checkpoint_path, dtype=None, device="cpu") -> sc.Previewer:
+    logger.info(f"Loading Previewer from {previewer_checkpoint_path}")
+    previewer = sc.Previewer().to(device)
+    previewer_checkpoint = load_file(previewer_checkpoint_path)
+    info = previewer.load_state_dict(
+        previewer_checkpoint if "state_dict" not in previewer_checkpoint else previewer_checkpoint["state_dict"]
+    )
+    logger.info(info)
+    return previewer
+
+
+def convert_state_dict_mha_to_normal_attn(state_dict):
+    # convert nn.MultiheadAttention to to_q/k/v and out_proj
+    logger.info("convert_state_dict_mha_to_normal_attn")
+    for key in list(state_dict.keys()):
+        if "attention.attn." in key:
+            if "in_proj_bias" in key:
+                value = state_dict.pop(key)
+                qkv = torch.chunk(value, 3, dim=0)
+                state_dict[key.replace("in_proj_bias", "to_q.bias")] = qkv[0]
+                state_dict[key.replace("in_proj_bias", "to_k.bias")] = qkv[1]
+                state_dict[key.replace("in_proj_bias", "to_v.bias")] = qkv[2]
+            elif "in_proj_weight" in key:
+                value = state_dict.pop(key)
+                qkv = torch.chunk(value, 3, dim=0)
+                state_dict[key.replace("in_proj_weight", "to_q.weight")] = qkv[0]
+                state_dict[key.replace("in_proj_weight", "to_k.weight")] = qkv[1]
+                state_dict[key.replace("in_proj_weight", "to_v.weight")] = qkv[2]
+            elif "out_proj.bias" in key:
+                value = state_dict.pop(key)
+                state_dict[key.replace("out_proj.bias", "out_proj.bias")] = value
+            elif "out_proj.weight" in key:
+                value = state_dict.pop(key)
+                state_dict[key.replace("out_proj.weight", "out_proj.weight")] = value
+    return state_dict
+
+
+def convert_state_dict_normal_attn_to_mha(state_dict):
+    # convert to_q/k/v and out_proj to nn.MultiheadAttention
+    for key in list(state_dict.keys()):
+        if "attention.attn." in key:
+            if "to_q.bias" in key:
+                q = state_dict.pop(key)
+                k = state_dict.pop(key.replace("to_q.bias", "to_k.bias"))
+                v = state_dict.pop(key.replace("to_q.bias", "to_v.bias"))
+                state_dict[key.replace("to_q.bias", "in_proj_bias")] = torch.cat([q, k, v])
+            elif "to_q.weight" in key:
+                q = state_dict.pop(key)
+                k = state_dict.pop(key.replace("to_q.weight", "to_k.weight"))
+                v = state_dict.pop(key.replace("to_q.weight", "to_v.weight"))
+                state_dict[key.replace("to_q.weight", "in_proj_weight")] = torch.cat([q, k, v])
+            elif "out_proj.bias" in key:
+                v = state_dict.pop(key)
+                state_dict[key.replace("out_proj.bias", "out_proj.bias")] = v
+            elif "out_proj.weight" in key:
+                v = state_dict.pop(key)
+                state_dict[key.replace("out_proj.weight", "out_proj.weight")] = v
+    return state_dict
+
+
+def get_sai_model_spec(args, lora=False):
+    timestamp = time.time()
+
+    reso = args.resolution
+
+    title = args.metadata_title if args.metadata_title is not None else args.output_name
+
+    if args.min_timestep is not None or args.max_timestep is not None:
+        min_time_step = args.min_timestep if args.min_timestep is not None else 0
+        max_time_step = args.max_timestep if args.max_timestep is not None else 1000
+        timesteps = (min_time_step, max_time_step)
+    else:
+        timesteps = None
+
+    metadata = sai_model_spec.build_metadata(
+        None,
+        False,
+        False,
+        False,
+        lora,
+        False,
+        timestamp,
+        title=title,
+        reso=reso,
+        is_stable_diffusion_ckpt=False,
+        author=args.metadata_author,
+        description=args.metadata_description,
+        license=args.metadata_license,
+        tags=args.metadata_tags,
+        timesteps=timesteps,
+        clip_skip=args.clip_skip,  # None or int
+        stable_cascade=True,
+    )
+    return metadata
+
+
+def stage_c_saver_common(ckpt_file, stage_c, text_model, save_dtype, sai_metadata):
+    state_dict = stage_c.state_dict()
+    if save_dtype is not None:
+        state_dict = {k: v.to(save_dtype) for k, v in state_dict.items()}
+
+    state_dict = convert_state_dict_normal_attn_to_mha(state_dict)
+
+    save_file(state_dict, ckpt_file, metadata=sai_metadata)
+
+    # save text model
+    if text_model is not None:
+        text_model_sd = text_model.state_dict()
+
+        if save_dtype is not None:
+            text_model_sd = {k: v.to(save_dtype) for k, v in text_model_sd.items()}
+
+        text_model_ckpt_file = os.path.splitext(ckpt_file)[0] + "_text_model.safetensors"
+        save_file(text_model_sd, text_model_ckpt_file)
+
+
+def save_stage_c_model_on_epoch_end_or_stepwise(
+    args: argparse.Namespace,
+    on_epoch_end: bool,
+    accelerator,
+    save_dtype: torch.dtype,
+    epoch: int,
+    num_train_epochs: int,
+    global_step: int,
+    stage_c,
+    text_model,
+):
+    def stage_c_saver(ckpt_file, epoch_no, global_step):
+        sai_metadata = get_sai_model_spec(args)
+        stage_c_saver_common(ckpt_file, stage_c, text_model, save_dtype, sai_metadata)
+
+    save_sd_model_on_epoch_end_or_stepwise_common(
+        args, on_epoch_end, accelerator, True, True, epoch, num_train_epochs, global_step, stage_c_saver, None
+    )
+
+
+def save_stage_c_model_on_end(
+    args: argparse.Namespace,
+    save_dtype: torch.dtype,
+    epoch: int,
+    global_step: int,
+    stage_c,
+    text_model,
+):
+    def stage_c_saver(ckpt_file, epoch_no, global_step):
+        sai_metadata = get_sai_model_spec(args)
+        stage_c_saver_common(ckpt_file, stage_c, text_model, save_dtype, sai_metadata)
+
+    save_sd_model_on_train_end_common(args, True, True, epoch, global_step, stage_c_saver, None)
+
+
+# endregion
+
+# region sample generation
+
+
+def sample_images(
+    accelerator: Accelerator,
+    args: argparse.Namespace,
+    epoch,
+    steps,
+    previewer,
+    tokenizer,
+    text_encoder,
+    stage_c,
+    gdf,
+    prompt_replacement=None,
+):
+    if steps == 0:
+        if not args.sample_at_first:
+            return
+    else:
+        if args.sample_every_n_steps is None and args.sample_every_n_epochs is None:
+            return
+        if args.sample_every_n_epochs is not None:
+            # sample_every_n_steps is ignored
+            if epoch is None or epoch % args.sample_every_n_epochs != 0:
+                return
+        else:
+            if steps % args.sample_every_n_steps != 0 or epoch is not None:  # steps is not divisible or end of epoch
+                return
+
+    logger.info("")
+    logger.info(f"generating sample images at step / サンプル画像生成 ステップ: {steps}")
+    if not os.path.isfile(args.sample_prompts):
+        logger.error(f"No prompt file / プロンプトファイルがありません: {args.sample_prompts}")
+        return
+
+    distributed_state = PartialState()  # for multi gpu distributed inference. this is a singleton, so it's safe to use it here
+
+    # unwrap unet and text_encoder(s)
+    stage_c = accelerator.unwrap_model(stage_c)
+    text_encoder = accelerator.unwrap_model(text_encoder)
+
+    # read prompts
+    if args.sample_prompts.endswith(".txt"):
+        with open(args.sample_prompts, "r", encoding="utf-8") as f:
+            lines = f.readlines()
+        prompts = [line.strip() for line in lines if len(line.strip()) > 0 and line[0] != "#"]
+    elif args.sample_prompts.endswith(".toml"):
+        with open(args.sample_prompts, "r", encoding="utf-8") as f:
+            data = toml.load(f)
+        prompts = [dict(**data["prompt"], **subset) for subset in data["prompt"]["subset"]]
+    elif args.sample_prompts.endswith(".json"):
+        with open(args.sample_prompts, "r", encoding="utf-8") as f:
+            prompts = json.load(f)
+
+    save_dir = os.path.join(args.output_dir, "sample")
+    os.makedirs(save_dir, exist_ok=True)
+
+    # preprocess prompts
+    for i in range(len(prompts)):
+        prompt_dict = prompts[i]
+        if isinstance(prompt_dict, str):
+            prompt_dict = line_to_prompt_dict(prompt_dict)
+            prompts[i] = prompt_dict
+        assert isinstance(prompt_dict, dict)
+
+        # Adds an enumerator to the dict based on prompt position. Used later to name image files. Also cleanup of extra data in original prompt dict.
+        prompt_dict["enum"] = i
+        prompt_dict.pop("subset", None)
+
+    # save random state to restore later
+    rng_state = torch.get_rng_state()
+    cuda_rng_state = None
+    try:
+        cuda_rng_state = torch.cuda.get_rng_state() if torch.cuda.is_available() else None
+    except Exception:
+        pass
+
+    if distributed_state.num_processes <= 1:
+        # If only one device is available, just use the original prompt list. We don't need to care about the distribution of prompts.
+        with torch.no_grad():
+            for prompt_dict in prompts:
+                sample_image_inference(
+                    accelerator,
+                    args,
+                    tokenizer,
+                    text_encoder,
+                    stage_c,
+                    previewer,
+                    gdf,
+                    save_dir,
+                    prompt_dict,
+                    epoch,
+                    steps,
+                    prompt_replacement,
+                )
+    else:
+        # Creating list with N elements, where each element is a list of prompt_dicts, and N is the number of processes available (number of devices available)
+        # prompt_dicts are assigned to lists based on order of processes, to attempt to time the image creation time to match enum order. Probably only works when steps and sampler are identical.
+        per_process_prompts = []  # list of lists
+        for i in range(distributed_state.num_processes):
+            per_process_prompts.append(prompts[i :: distributed_state.num_processes])
+
+        with torch.no_grad():
+            with distributed_state.split_between_processes(per_process_prompts) as prompt_dict_lists:
+                for prompt_dict in prompt_dict_lists[0]:
+                    sample_image_inference(
+                        accelerator,
+                        args,
+                        tokenizer,
+                        text_encoder,
+                        stage_c,
+                        previewer,
+                        gdf,
+                        save_dir,
+                        prompt_dict,
+                        epoch,
+                        steps,
+                        prompt_replacement,
+                    )
+
+    # I'm not sure which of these is the correct way to clear the memory, but accelerator's device is used in the pipeline, so I'm using it here.
+    # with torch.cuda.device(torch.cuda.current_device()):
+    #     torch.cuda.empty_cache()
+    clean_memory_on_device(accelerator.device)
+
+    torch.set_rng_state(rng_state)
+    if cuda_rng_state is not None:
+        torch.cuda.set_rng_state(cuda_rng_state)
+
+
+def sample_image_inference(
+    accelerator: Accelerator,
+    args: argparse.Namespace,
+    tokenizer,
+    text_model,
+    stage_c,
+    previewer,
+    gdf,
+    save_dir,
+    prompt_dict,
+    epoch,
+    steps,
+    prompt_replacement,
+):
+    assert isinstance(prompt_dict, dict)
+    negative_prompt = prompt_dict.get("negative_prompt")
+    sample_steps = prompt_dict.get("sample_steps", 20)
+    width = prompt_dict.get("width", 1024)
+    height = prompt_dict.get("height", 1024)
+    scale = prompt_dict.get("scale", 4)
+    seed = prompt_dict.get("seed")
+    # controlnet_image = prompt_dict.get("controlnet_image")
+    prompt: str = prompt_dict.get("prompt", "")
+    # sampler_name: str = prompt_dict.get("sample_sampler", args.sample_sampler)
+
+    if prompt_replacement is not None:
+        prompt = prompt.replace(prompt_replacement[0], prompt_replacement[1])
+        if negative_prompt is not None:
+            negative_prompt = negative_prompt.replace(prompt_replacement[0], prompt_replacement[1])
+
+    if seed is not None:
+        torch.manual_seed(seed)
+        torch.cuda.manual_seed(seed)
+    else:
+        # True random sample image generation
+        torch.seed()
+        torch.cuda.seed()
+
+    height = max(64, height - height % 8)  # round to divisible by 8
+    width = max(64, width - width % 8)  # round to divisible by 8
+    logger.info(f"prompt: {prompt}")
+    logger.info(f"negative_prompt: {negative_prompt}")
+    logger.info(f"height: {height}")
+    logger.info(f"width: {width}")
+    logger.info(f"sample_steps: {sample_steps}")
+    logger.info(f"scale: {scale}")
+    # logger.info(f"sample_sampler: {sampler_name}")
+    if seed is not None:
+        logger.info(f"seed: {seed}")
+
+    negative_prompt = "" if negative_prompt is None else negative_prompt
+    cfg = scale
+    timesteps = sample_steps
+    shift = 2
+    t_start = 1.0
+
+    stage_c_latent_shape, _ = calculate_latent_sizes(height, width, batch_size=1)
+
+    # PREPARE CONDITIONS
+    input_ids = tokenizer(
+        [prompt], truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt"
+    )["input_ids"].to(text_model.device)
+    cond_text, cond_pooled = get_hidden_states_stable_cascade(tokenizer.model_max_length, input_ids, tokenizer, text_model)
+
+    input_ids = tokenizer(
+        [negative_prompt], truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt"
+    )["input_ids"].to(text_model.device)
+    uncond_text, uncond_pooled = get_hidden_states_stable_cascade(tokenizer.model_max_length, input_ids, tokenizer, text_model)
+
+    device = accelerator.device
+    dtype = stage_c.dtype
+    cond_text = cond_text.to(device, dtype=dtype)
+    cond_pooled = cond_pooled.unsqueeze(1).to(device, dtype=dtype)
+
+    uncond_text = uncond_text.to(device, dtype=dtype)
+    uncond_pooled = uncond_pooled.unsqueeze(1).to(device, dtype=dtype)
+
+    zero_img_emb = torch.zeros(1, 768, device=device)
+
+    # I would rather not use a dict here, but changing everything downstream of GDF is tedious, so keep it as a dict for now
+    conditions = {"clip_text_pooled": cond_pooled, "clip": cond_pooled, "clip_text": cond_text, "clip_img": zero_img_emb}
+    unconditions = {"clip_text_pooled": uncond_pooled, "clip": uncond_pooled, "clip_text": uncond_text, "clip_img": zero_img_emb}
+
+    with torch.no_grad():  # , torch.cuda.amp.autocast(dtype=dtype):
+        sampling_c = gdf.sample(
+            stage_c,
+            conditions,
+            stage_c_latent_shape,
+            unconditions,
+            device=device,
+            cfg=cfg,
+            shift=shift,
+            timesteps=timesteps,
+            t_start=t_start,
+        )
+        for sampled_c, _, _ in tqdm(sampling_c, total=timesteps):
+            pass  # run the sampler to completion; sampled_c keeps the final latent
+
+    sampled_c = sampled_c.to(previewer.device, dtype=previewer.dtype)
+    image = previewer(sampled_c)[0]
+    image = torch.clamp(image, 0, 1)
+    image = image.cpu().numpy().transpose(1, 2, 0)
+    image = image * 255
+    image = image.astype(np.uint8)
+    image = Image.fromarray(image)
+
+    # adding accelerator.wait_for_everyone() here should sync up and ensure that sample images are saved in the same order as the original prompt list
+    # but adding 'enum' to the filename should be enough
+
+    ts_str = time.strftime("%Y%m%d%H%M%S", time.localtime())
+    num_suffix = f"e{epoch:06d}" if epoch is not None else f"{steps:06d}"
+    seed_suffix = "" if seed is None else f"_{seed}"
+    i: int = prompt_dict["enum"]
+    img_filename = f"{'' if args.output_name is None else args.output_name + '_'}{num_suffix}_{i:02d}_{ts_str}{seed_suffix}.png"
+    image.save(os.path.join(save_dir, img_filename))
+
+    # send the log only when wandb is enabled
+    try:
+        wandb_tracker = accelerator.get_tracker("wandb")
+        try:
+            import wandb
+        except ImportError:  # checked once beforehand, so this should not fail here
+            raise ImportError("No wandb / wandb がインストールされていないようです")
+
+        wandb_tracker.log({f"sample_{i}": wandb.Image(image)})
+    except Exception:  # wandb is disabled
+        pass
+
+
+# endregion
+
+
+def add_effnet_arguments(parser):
+    parser.add_argument(
+        "--effnet_checkpoint_path",
+        type=str,
+        required=True,
+        help="path to EfficientNet checkpoint / EfficientNetのチェックポイントのパス",
+    )
+    return parser
+
+
+def add_text_model_arguments(parser):
+    parser.add_argument(
+        "--text_model_checkpoint_path",
+        type=str,
+        help="path to CLIP text model checkpoint / CLIPテキストモデルのチェックポイントのパス",
+    )
+    parser.add_argument("--save_text_model", action="store_true", help="if specified, save text model to corresponding path")
+    return parser
+
+
+def add_stage_a_arguments(parser):
+    parser.add_argument(
+        "--stage_a_checkpoint_path",
+        type=str,
+        required=True,
+        help="path to Stage A checkpoint / Stage Aのチェックポイントのパス",
+    )
+    return parser
+
+
+def add_stage_b_arguments(parser):
+    parser.add_argument(
+        "--stage_b_checkpoint_path",
+        type=str,
+        required=True,
+        help="path to Stage B checkpoint / Stage Bのチェックポイントのパス",
+    )
+    return parser
+
+
+def add_stage_c_arguments(parser):
+    parser.add_argument(
+        "--stage_c_checkpoint_path",
+        type=str,
+        required=True,
+        help="path to Stage C checkpoint / Stage Cのチェックポイントのパス",
+    )
+    return parser
+
+
+def add_previewer_arguments(parser):
+    parser.add_argument(
+        "--previewer_checkpoint_path",
+        type=str,
+        required=False,
+        help="path to previewer checkpoint / previewerのチェックポイントのパス",
+    )
+    return parser
+
+
+def add_training_arguments(parser):
+    parser.add_argument(
+        "--adaptive_loss_weight",
+        action="store_true",
+        help="if specified, use adaptive loss weight. if not, use P2 loss weight"
+        + " / Adaptive Loss Weightを使用する。指定しない場合はP2 Loss Weightを使用する",
+    )
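Review note on `calculate_latent_sizes` in `library/stable_cascade_utils.py`: the function divides the image height and width by each stage's compression factor and rounds up. A minimal standalone sketch of that arithmetic, assuming the defaults shown in the diff (42.67 for Stage C, 4.0 for Stage B); `latent_hw` is a hypothetical helper used only for illustration and is not part of the patch:

```python
import math


# Hypothetical helper mirroring the arithmetic in calculate_latent_sizes.
def latent_hw(height: int, width: int, compression_factor: float) -> tuple:
    """Return the (height, width) of a latent for a given compression factor."""
    return (
        math.ceil(height / compression_factor),
        math.ceil(width / compression_factor),
    )


# Defaults taken from the diff: 42.67 for Stage C, 4.0 for Stage B.
stage_c_hw = latent_hw(1024, 1024, 42.67)
stage_b_hw = latent_hw(1024, 1024, 4.0)
print(stage_c_hw, stage_b_hw)
```

With these defaults, a 1024x1024 image corresponds to a 24x24 Stage C latent (16 channels in the patch) and a 256x256 Stage B latent (4 channels in the patch).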
--- a/library/train_util.py
+++ b/library/train_util.py
--- a/library/utils.py
+++ b/library/utils.py
@@ -7,9 +7,6 @@ from typing import *
 from diffusers import EulerAncestralDiscreteScheduler
 import diffusers.schedulers.scheduling_euler_ancestral_discrete
 from diffusers.schedulers.scheduling_euler_ancestral_discrete import EulerAncestralDiscreteSchedulerOutput
-import cv2
-from PIL import Image
-import numpy as np


 def fire_in_thread(f, *args, **kwargs):
@@ -82,24 +79,6 @@ def setup_logging(args=None, log_level=None, reset=False):
        logger.info(msg_init)


-def pil_resize(image, size, interpolation=Image.LANCZOS):
-    has_alpha = image.shape[2] == 4 if len(image.shape) == 3 else False
-
-    if has_alpha:
-        pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGRA2RGBA))
-    else:
-        pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
-
-    resized_pil = pil_image.resize(size, interpolation)
-
-    # Convert back to cv2 format
-    if has_alpha:
-        resized_cv2 = cv2.cvtColor(np.array(resized_pil), cv2.COLOR_RGBA2BGRA)
-    else:
-        resized_cv2 = cv2.cvtColor(np.array(resized_pil), cv2.COLOR_RGB2BGR)
-
-    return resized_cv2
-

 # TODO make inf_utils.py

--- a/networks/check_lora_weights.py
+++ b/networks/check_lora_weights.py
@@ -18,7 +18,7 @@ def main(file):

    keys = list(sd.keys())
    for key in keys:
-        if "lora_up" in key or "lora_down" in key or "lora_A" in key or "lora_B" in key or "oft_" in key:
+        if "lora_up" in key or "lora_down" in key:
            values.append((key, sd[key]))
    print(f"number of LoRA modules: {len(values)}")

--- a/networks/control_net_lllite_for_train.py
+++ b/networks/control_net_lllite_for_train.py
@@ -7,10 +7,8 @@ from typing import Optional, List, Type
 import torch
 from library import sdxl_original_unet
 from library.utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

 # input_blocksに適用するかどうか / if True, input_blocks are not applied
@@ -105,15 +103,19 @@ class LLLiteLinear(ORIGINAL_LINEAR):
        add_lllite_modules(self, in_dim, depth, cond_emb_dim, mlp_dim)

        self.cond_image = None
+        self.cond_emb = None

    def set_cond_image(self, cond_image):
        self.cond_image = cond_image
+        self.cond_emb = None

    def forward(self, x):
        if not self.enabled:
            return super().forward(x)

-        cx = self.lllite_conditioning1(self.cond_image)  # make forward and backward compatible
+        if self.cond_emb is None:
+            self.cond_emb = self.lllite_conditioning1(self.cond_image)
+        cx = self.cond_emb

        # reshape / b,c,h,w -> b,h*w,c
        n, c, h, w = cx.shape
@@ -157,7 +159,9 @@ class LLLiteConv2d(ORIGINAL_CONV2D):
        if not self.enabled:
            return super().forward(x)

-        cx = self.lllite_conditioning1(self.cond_image)
+        if self.cond_emb is None:
+            self.cond_emb = self.lllite_conditioning1(self.cond_image)
+        cx = self.cond_emb

        cx = torch.cat([cx, self.down(x)], dim=1)
        cx = self.mid(cx)
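Review note on the LLLite change above: the conditioning embedding is now computed lazily and reused until `set_cond_image` resets it (the reset is visible in the `LLLiteLinear` hunk). A minimal pure-Python sketch of that cache-and-invalidate pattern; `CachedConditioning` and `embed` are hypothetical stand-ins for the module and for `lllite_conditioning1`, not code from the patch:

```python
class CachedConditioning:
    """Hypothetical stand-in for the caching added to the LLLite modules."""

    def __init__(self, embed):
        self.embed = embed  # stand-in for lllite_conditioning1
        self.cond_image = None
        self.cond_emb = None  # cached embedding; None means "not computed yet"

    def set_cond_image(self, cond_image):
        self.cond_image = cond_image
        self.cond_emb = None  # a new image invalidates the cached embedding

    def forward(self):
        if self.cond_emb is None:
            self.cond_emb = self.embed(self.cond_image)
        return self.cond_emb


calls = []


def embed(image):
    calls.append(image)  # record each real computation
    return f"emb({image})"


module = CachedConditioning(embed)
module.set_cond_image("img_a")
first = module.forward()
second = module.forward()  # served from the cache; embed is not called again
module.set_cond_image("img_b")
third = module.forward()  # cache was reset, so embed runs once more
```

In this sketch `embed` runs once per conditioning image rather than once per forward call.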
--- a/networks/dylora.py
+++ b/networks/dylora.py
@@ -18,13 +18,10 @@ from transformers import CLIPTextModel
 import torch
 from torch import nn
 from library.utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

-
 class DyLoRAModule(torch.nn.Module):
    """
    replaces forward method of the original Linear, instead of replacing the original Linear module.
@@ -198,7 +195,7 @@ def create_network(
            conv_alpha = 1.0
        else:
            conv_alpha = float(conv_alpha)
-
+            
    if unit is not None:
        unit = int(unit)
    else:
@@ -214,16 +211,6 @@ def create_network(
        unit=unit,
        varbose=True,
    )
-
-    loraplus_lr_ratio = kwargs.get("loraplus_lr_ratio", None)
-    loraplus_unet_lr_ratio = kwargs.get("loraplus_unet_lr_ratio", None)
-    loraplus_text_encoder_lr_ratio = kwargs.get("loraplus_text_encoder_lr_ratio", None)
-    loraplus_lr_ratio = float(loraplus_lr_ratio) if loraplus_lr_ratio is not None else None
-    loraplus_unet_lr_ratio = float(loraplus_unet_lr_ratio) if loraplus_unet_lr_ratio is not None else None
-    loraplus_text_encoder_lr_ratio = float(loraplus_text_encoder_lr_ratio) if loraplus_text_encoder_lr_ratio is not None else None
-    if loraplus_lr_ratio is not None or loraplus_unet_lr_ratio is not None or loraplus_text_encoder_lr_ratio is not None:
-        network.set_loraplus_lr_ratio(loraplus_lr_ratio, loraplus_unet_lr_ratio, loraplus_text_encoder_lr_ratio)
-
    return network


@@ -268,7 +255,7 @@ def create_network_from_weights(multiplier, file, vae, text_encoder, unet, weigh
 class DyLoRANetwork(torch.nn.Module):
    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
-    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
    LORA_PREFIX_UNET = "lora_unet"
    LORA_PREFIX_TEXT_ENCODER = "lora_te"

@@ -293,10 +280,6 @@ class DyLoRANetwork(torch.nn.Module):
        self.alpha = alpha
        self.apply_to_conv = apply_to_conv

-        self.loraplus_lr_ratio = None
-        self.loraplus_unet_lr_ratio = None
-        self.loraplus_text_encoder_lr_ratio = None
-
        if modules_dim is not None:
            logger.info("create LoRA network from weights")
        else:
@@ -337,9 +320,9 @@ class DyLoRANetwork(torch.nn.Module):
                            lora = module_class(lora_name, child_module, self.multiplier, dim, alpha, unit)
                            loras.append(lora)
            return loras
-
+        
        text_encoders = text_encoder if type(text_encoder) == list else [text_encoder]
-
+        
        self.text_encoder_loras = []
        for i, text_encoder in enumerate(text_encoders):
            if len(text_encoders) > 1:
@@ -348,7 +331,7 @@ class DyLoRANetwork(torch.nn.Module):
            else:
                index = None
                logger.info("create LoRA for Text Encoder")
-
+            
            text_encoder_loras = create_modules(False, text_encoder, DyLoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE)
            self.text_encoder_loras.extend(text_encoder_loras)

@@ -363,14 +346,6 @@ class DyLoRANetwork(torch.nn.Module):
        self.unet_loras = create_modules(True, unet, target_modules)
        logger.info(f"create LoRA for U-Net: {len(self.unet_loras)} modules.")

-    def set_loraplus_lr_ratio(self, loraplus_lr_ratio, loraplus_unet_lr_ratio, loraplus_text_encoder_lr_ratio):
-        self.loraplus_lr_ratio = loraplus_lr_ratio
-        self.loraplus_unet_lr_ratio = loraplus_unet_lr_ratio
-        self.loraplus_text_encoder_lr_ratio = loraplus_text_encoder_lr_ratio
-
-        logger.info(f"LoRA+ UNet LR Ratio: {self.loraplus_unet_lr_ratio or self.loraplus_lr_ratio}")
-        logger.info(f"LoRA+ Text Encoder LR Ratio: {self.loraplus_text_encoder_lr_ratio or self.loraplus_lr_ratio}")
-
    def set_multiplier(self, multiplier):
        self.multiplier = multiplier
        for lora in self.text_encoder_loras + self.unet_loras:
@@ -431,53 +406,27 @@ class DyLoRANetwork(torch.nn.Module):
        logger.info(f"weights are merged")
    """

-    # it might be nice to allow separate learning rates for the two Text Encoders
    def prepare_optimizer_params(self, text_encoder_lr, unet_lr, default_lr):
        self.requires_grad_(True)
        all_params = []

-        def assemble_params(loras, lr, ratio):
-            param_groups = {"lora": {}, "plus": {}}
-            for lora in loras:
-                for name, param in lora.named_parameters():
-                    if ratio is not None and "lora_B" in name:
-                        param_groups["plus"][f"{lora.lora_name}.{name}"] = param
-                    else:
-                        param_groups["lora"][f"{lora.lora_name}.{name}"] = param
-
+        def enumerate_params(loras):
            params = []
-            for key in param_groups.keys():
-                param_data = {"params": param_groups[key].values()}
-
-                if len(param_data["params"]) == 0:
-                    continue
-
-                if lr is not None:
-                    if key == "plus":
-                        param_data["lr"] = lr * ratio
-                    else:
-                        param_data["lr"] = lr
-
-                if param_data.get("lr", None) == 0 or param_data.get("lr", None) is None:
-                    continue
-
-                params.append(param_data)
-
+            for lora in loras:
+                params.extend(lora.parameters())
            return params

        if self.text_encoder_loras:
-            params = assemble_params(
-                self.text_encoder_loras,
-                text_encoder_lr if text_encoder_lr is not None else default_lr,
-                self.loraplus_text_encoder_lr_ratio or self.loraplus_lr_ratio,
-            )
-            all_params.extend(params)
+            param_data = {"params": enumerate_params(self.text_encoder_loras)}
+            if text_encoder_lr is not None:
+                param_data["lr"] = text_encoder_lr
+            all_params.append(param_data)

        if self.unet_loras:
-            params = assemble_params(
-                self.unet_loras, default_lr if unet_lr is None else unet_lr, self.loraplus_unet_lr_ratio or self.loraplus_lr_ratio
-            )
-            all_params.extend(params)
+            param_data = {"params": enumerate_params(self.unet_loras)}
+            if unet_lr is not None:
+                param_data["lr"] = unet_lr
+            all_params.append(param_data)

        return all_params

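Review note: the simplified `prepare_optimizer_params` above returns at most two parameter groups, one for the text-encoder LoRA modules and one for the U-Net modules, and sets `lr` only when a specific rate was passed. A group without `lr` falls back to the optimizer's default. A minimal sketch in plain Python; `FakeLoRA` and `build_param_groups` are hypothetical stand-ins, and strings take the place of tensors:

```python
from typing import Dict, List, Optional


class FakeLoRA:
    """Hypothetical stand-in for a LoRA module; parameters are just names."""

    def __init__(self, name: str) -> None:
        self.name = name

    def parameters(self) -> List[str]:
        return [f"{self.name}.lora_A", f"{self.name}.lora_B"]


def build_param_groups(
    text_encoder_loras: List[FakeLoRA],
    unet_loras: List[FakeLoRA],
    text_encoder_lr: Optional[float],
    unet_lr: Optional[float],
) -> List[Dict]:
    def enumerate_params(loras: List[FakeLoRA]) -> List[str]:
        # Flatten the parameters of every module into one list.
        params: List[str] = []
        for lora in loras:
            params.extend(lora.parameters())
        return params

    groups: List[Dict] = []
    for loras, lr in ((text_encoder_loras, text_encoder_lr), (unet_loras, unet_lr)):
        if not loras:
            continue  # no modules of this kind, so no group
        group: Dict = {"params": enumerate_params(loras)}
        if lr is not None:
            group["lr"] = lr  # otherwise the optimizer default applies
        groups.append(group)
    return groups
```

Compared with the removed `assemble_params`, this shape no longer splits `lora_B` parameters into a separate LoRA+ group, no longer drops groups whose learning rate is 0 or unset, and leaves `default_lr` unused.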
--- a/networks/lora.py
+++ b/networks/lora.py
@@ -12,7 +12,6 @@ import numpy as np
 import torch
 import re
 from library.utils import setup_logging
-from library.sdxl_original_unet import SdxlUNet2DConditionModel

 setup_logging()
 import logging
@@ -248,13 +247,14 @@ class LoRAInfModule(LoRAModule):
            area = x.size()[1]

        mask = self.network.mask_dic.get(area, None)
-        if mask is None or len(x.size()) == 2:
+        if mask is None:
+            # raise ValueError(f"mask is None for resolution {area}")
            # emb_layers in SDXL doesn't have mask
            # if "emb" not in self.lora_name:
            #     print(f"mask is None for resolution {self.lora_name}, {area}, {x.size()}")
            mask_size = (1, x.size()[1]) if len(x.size()) == 2 else (1, *x.size()[1:-1], 1)
            return torch.ones(mask_size, dtype=x.dtype, device=x.device) / self.network.num_sub_prompts
-        if len(x.size()) == 3:
+        if len(x.size()) != 4:
            mask = torch.reshape(mask, (1, -1, 1))
        return mask

@@ -386,14 +386,14 @@ class LoRAInfModule(LoRAModule):
        return out


-def parse_block_lr_kwargs(is_sdxl: bool, nw_kwargs: Dict) -> Optional[List[float]]:
+def parse_block_lr_kwargs(nw_kwargs):
    down_lr_weight = nw_kwargs.get("down_lr_weight", None)
    mid_lr_weight = nw_kwargs.get("mid_lr_weight", None)
    up_lr_weight = nw_kwargs.get("up_lr_weight", None)

    # if none of the above is set, treat block lr as disabled and return None
    if down_lr_weight is None and mid_lr_weight is None and up_lr_weight is None:
-        return None
+        return None, None, None

    # extract learning rate weight for each block
    if down_lr_weight is not None:
@@ -402,16 +402,18 @@ def parse_block_lr_kwargs(is_sdxl: bool, nw_kwargs: Dict) -> Optional[List[float
            down_lr_weight = [(float(s) if s else 0.0) for s in down_lr_weight.split(",")]

    if mid_lr_weight is not None:
-        mid_lr_weight = [(float(s) if s else 0.0) for s in mid_lr_weight.split(",")]
+        mid_lr_weight = float(mid_lr_weight)

    if up_lr_weight is not None:
        if "," in up_lr_weight:
            up_lr_weight = [(float(s) if s else 0.0) for s in up_lr_weight.split(",")]

-    return get_block_lr_weight(
-        is_sdxl, down_lr_weight, mid_lr_weight, up_lr_weight, float(nw_kwargs.get("block_lr_zero_threshold", 0.0))
+    down_lr_weight, mid_lr_weight, up_lr_weight = get_block_lr_weight(
+        down_lr_weight, mid_lr_weight, up_lr_weight, float(nw_kwargs.get("block_lr_zero_threshold", 0.0))
    )

+    return down_lr_weight, mid_lr_weight, up_lr_weight
+

 def create_network(
    multiplier: float,
@@ -423,9 +425,6 @@ def create_network(
    neuron_dropout: Optional[float] = None,
    **kwargs,
 ):
-    # if unet is an instance of SdxlUNet2DConditionModel or subclass, set is_sdxl to True
-    is_sdxl = unet is not None and issubclass(unet.__class__, SdxlUNet2DConditionModel)
-
    if network_dim is None:
        network_dim = 4  # default
    if network_alpha is None:
@@ -443,21 +442,21 @@ def create_network(

    # block dim/alpha/lr
    block_dims = kwargs.get("block_dims", None)
-    block_lr_weight = parse_block_lr_kwargs(is_sdxl, kwargs)
+    down_lr_weight, mid_lr_weight, up_lr_weight = parse_block_lr_kwargs(kwargs)

    # if any of the above is specified, enable per-block dim (rank)
-    if block_dims is not None or block_lr_weight is not None:
+    if block_dims is not None or down_lr_weight is not None or mid_lr_weight is not None or up_lr_weight is not None:
        block_alphas = kwargs.get("block_alphas", None)
        conv_block_dims = kwargs.get("conv_block_dims", None)
        conv_block_alphas = kwargs.get("conv_block_alphas", None)

        block_dims, block_alphas, conv_block_dims, conv_block_alphas = get_block_dims_and_alphas(
-            is_sdxl, block_dims, block_alphas, network_dim, network_alpha, conv_block_dims, conv_block_alphas, conv_dim, conv_alpha
+            block_dims, block_alphas, network_dim, network_alpha, conv_block_dims, conv_block_alphas, conv_dim, conv_alpha
        )

        # remove block dim/alpha without learning rate
        block_dims, block_alphas, conv_block_dims, conv_block_alphas = remove_block_dims_and_alphas(
-            is_sdxl, block_dims, block_alphas, conv_block_dims, conv_block_alphas, block_lr_weight
+            block_dims, block_alphas, conv_block_dims, conv_block_alphas, down_lr_weight, mid_lr_weight, up_lr_weight
        )

    else:
@@ -490,20 +489,10 @@ def create_network(
        conv_block_dims=conv_block_dims,
        conv_block_alphas=conv_block_alphas,
        varbose=True,
-        is_sdxl=is_sdxl,
    )

-    loraplus_lr_ratio = kwargs.get("loraplus_lr_ratio", None)
-    loraplus_unet_lr_ratio = kwargs.get("loraplus_unet_lr_ratio", None)
-    loraplus_text_encoder_lr_ratio = kwargs.get("loraplus_text_encoder_lr_ratio", None)
-    loraplus_lr_ratio = float(loraplus_lr_ratio) if loraplus_lr_ratio is not None else None
-    loraplus_unet_lr_ratio = float(loraplus_unet_lr_ratio) if loraplus_unet_lr_ratio is not None else None
-    loraplus_text_encoder_lr_ratio = float(loraplus_text_encoder_lr_ratio) if loraplus_text_encoder_lr_ratio is not None else None
-    if loraplus_lr_ratio is not None or loraplus_unet_lr_ratio is not None or loraplus_text_encoder_lr_ratio is not None:
-        network.set_loraplus_lr_ratio(loraplus_lr_ratio, loraplus_unet_lr_ratio, loraplus_text_encoder_lr_ratio)
-
-    if block_lr_weight is not None:
-        network.set_block_lr_weight(block_lr_weight)
+    if up_lr_weight is not None or mid_lr_weight is not None or down_lr_weight is not None:
+        network.set_block_lr_weight(up_lr_weight, mid_lr_weight, down_lr_weight)

    return network

@@ -513,13 +502,9 @@ def create_network(
 # block_dims and block_alphas are either both None or both set
 # conv_dim and conv_alpha are either both None or both set
 def get_block_dims_and_alphas(
-    is_sdxl, block_dims, block_alphas, network_dim, network_alpha, conv_block_dims, conv_block_alphas, conv_dim, conv_alpha
+    block_dims, block_alphas, network_dim, network_alpha, conv_block_dims, conv_block_alphas, conv_dim, conv_alpha
 ):
-    if not is_sdxl:
-        num_total_blocks = LoRANetwork.NUM_OF_BLOCKS * 2 + LoRANetwork.NUM_OF_MID_BLOCKS
-    else:
-        # 1+9+3+9+1=23, no LoRA for emb_layers (0)
-        num_total_blocks = 1 + LoRANetwork.SDXL_NUM_OF_BLOCKS * 2 + LoRANetwork.SDXL_NUM_OF_MID_BLOCKS + 1
+    num_total_blocks = LoRANetwork.NUM_OF_BLOCKS * 2 + 1

    def parse_ints(s):
        return [int(i) for i in s.split(",")]
@@ -530,10 +515,9 @@ def get_block_dims_and_alphas(
    # parse block_dims and block_alphas; both always end up with values
    if block_dims is not None:
        block_dims = parse_ints(block_dims)
-        assert len(block_dims) == num_total_blocks, (
-            f"block_dims must have {num_total_blocks} elements but {len(block_dims)} elements are given"
-            + f" / block_dimsは{num_total_blocks}個指定してください（指定された個数: {len(block_dims)}）"
-        )
+        assert (
+            len(block_dims) == num_total_blocks
+        ), f"block_dims must have {num_total_blocks} elements / block_dimsは{num_total_blocks}個指定してください"
    else:
        logger.warning(
            f"block_dims is not specified. all dims are set to {network_dim} / block_dimsが指定されていません。すべてのdimは{network_dim}になります"
@@ -584,25 +568,15 @@ def get_block_dims_and_alphas(
    return block_dims, block_alphas, conv_block_dims, conv_block_alphas


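-# define per-block learning-rate multipliers for block-wise learning rates; kept outside the class so it can be called externally
-# returns a list of per-block multipliers
+# define per-block learning-rate multipliers for block-wise learning rates; written with possible external callers in mind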
 def get_block_lr_weight(
-    is_sdxl,
-    down_lr_weight: Union[str, List[float]],
-    mid_lr_weight: List[float],
-    up_lr_weight: Union[str, List[float]],
-    zero_threshold: float,
-) -> Optional[List[float]]:
+    down_lr_weight, mid_lr_weight, up_lr_weight, zero_threshold
+) -> Tuple[List[float], List[float], List[float]]:
    # if no parameters are specified, do nothing and keep the previous behavior
    if up_lr_weight is None and mid_lr_weight is None and down_lr_weight is None:
-        return None
+        return None, None, None

-    if not is_sdxl:
-        max_len_for_down_or_up = LoRANetwork.NUM_OF_BLOCKS
-        max_len_for_mid = LoRANetwork.NUM_OF_MID_BLOCKS
-    else:
-        max_len_for_down_or_up = LoRANetwork.SDXL_NUM_OF_BLOCKS
-        max_len_for_mid = LoRANetwork.SDXL_NUM_OF_MID_BLOCKS
+    max_len = LoRANetwork.NUM_OF_BLOCKS  # number of up/down blocks, counted for the full model

    def get_list(name_with_suffix) -> List[float]:
        import math
@@ -612,18 +586,15 @@ def get_block_lr_weight(
        base_lr = float(tokens[1]) if len(tokens) > 1 else 0.0

        if name == "cosine":
-            return [
-                math.sin(math.pi * (i / (max_len_for_down_or_up - 1)) / 2) + base_lr
-                for i in reversed(range(max_len_for_down_or_up))
-            ]
+            return [math.sin(math.pi * (i / (max_len - 1)) / 2) + base_lr for i in reversed(range(max_len))]
        elif name == "sine":
-            return [math.sin(math.pi * (i / (max_len_for_down_or_up - 1)) / 2) + base_lr for i in range(max_len_for_down_or_up)]
+            return [math.sin(math.pi * (i / (max_len - 1)) / 2) + base_lr for i in range(max_len)]
        elif name == "linear":
-            return [i / (max_len_for_down_or_up - 1) + base_lr for i in range(max_len_for_down_or_up)]
+            return [i / (max_len - 1) + base_lr for i in range(max_len)]
        elif name == "reverse_linear":
-            return [i / (max_len_for_down_or_up - 1) + base_lr for i in reversed(range(max_len_for_down_or_up))]
+            return [i / (max_len - 1) + base_lr for i in reversed(range(max_len))]
        elif name == "zeros":
-            return [0.0 + base_lr] * max_len_for_down_or_up
+            return [0.0 + base_lr] * max_len
        else:
            logger.error(
                "Unknown lr_weight argument %s is used. Valid arguments:  / 不明なlr_weightの引数 %s が使われました。有効な引数:\n\tcosine, sine, linear, reverse_linear, zeros"
@@ -636,36 +607,20 @@ def get_block_lr_weight(
    if type(up_lr_weight) == str:
        up_lr_weight = get_list(up_lr_weight)

-    if (up_lr_weight != None and len(up_lr_weight) > max_len_for_down_or_up) or (
-        down_lr_weight != None and len(down_lr_weight) > max_len_for_down_or_up
-    ):
-        logger.warning("down_weight or up_weight is too long. Parameters after %d-th are ignored." % max_len_for_down_or_up)
-        logger.warning("down_weightもしくはup_weightが長すぎます。%d個目以降のパラメータは無視されます。" % max_len_for_down_or_up)
-        up_lr_weight = up_lr_weight[:max_len_for_down_or_up]
-        down_lr_weight = down_lr_weight[:max_len_for_down_or_up]
+    if (up_lr_weight != None and len(up_lr_weight) > max_len) or (down_lr_weight != None and len(down_lr_weight) > max_len):
+        logger.warning("down_weight or up_weight is too long. Parameters after %d-th are ignored." % max_len)
+        logger.warning("down_weightもしくはup_weightが長すぎます。%d個目以降のパラメータは無視されます。" % max_len)
+        up_lr_weight = up_lr_weight[:max_len]
+        down_lr_weight = down_lr_weight[:max_len]

-    if mid_lr_weight != None and len(mid_lr_weight) > max_len_for_mid:
-        logger.warning("mid_weight is too long. Parameters after %d-th are ignored." % max_len_for_mid)
-        logger.warning("mid_weightが長すぎます。%d個目以降のパラメータは無視されます。" % max_len_for_mid)
-        mid_lr_weight = mid_lr_weight[:max_len_for_mid]
+    if (up_lr_weight != None and len(up_lr_weight) < max_len) or (down_lr_weight != None and len(down_lr_weight) < max_len):
+        logger.warning("down_weight or up_weight is too short. Parameters after %d-th are filled with 1." % max_len)
+        logger.warning("down_weightもしくはup_weightが短すぎます。%d個目までの不足したパラメータは1で補われます。" % max_len)

-    if (up_lr_weight != None and len(up_lr_weight) < max_len_for_down_or_up) or (
-        down_lr_weight != None and len(down_lr_weight) < max_len_for_down_or_up
-    ):
-        logger.warning("down_weight or up_weight is too short. Parameters after %d-th are filled with 1." % max_len_for_down_or_up)
-        logger.warning(
-            "down_weightもしくはup_weightが短すぎます。%d個目までの不足したパラメータは1で補われます。" % max_len_for_down_or_up
-        )
-
-        if down_lr_weight != None and len(down_lr_weight) < max_len_for_down_or_up:
-            down_lr_weight = down_lr_weight + [1.0] * (max_len_for_down_or_up - len(down_lr_weight))
-        if up_lr_weight != None and len(up_lr_weight) < max_len_for_down_or_up:
-            up_lr_weight = up_lr_weight + [1.0] * (max_len_for_down_or_up - len(up_lr_weight))
-
-    if mid_lr_weight != None and len(mid_lr_weight) < max_len_for_mid:
-        logger.warning("mid_weight is too short. Parameters after %d-th are filled with 1." % max_len_for_mid)
-        logger.warning("mid_weightが短すぎます。%d個目までの不足したパラメータは1で補われます。" % max_len_for_mid)
-        mid_lr_weight = mid_lr_weight + [1.0] * (max_len_for_mid - len(mid_lr_weight))
+        if down_lr_weight != None and len(down_lr_weight) < max_len:
+            down_lr_weight = down_lr_weight + [1.0] * (max_len - len(down_lr_weight))
+        if up_lr_weight != None and len(up_lr_weight) < max_len:
+            up_lr_weight = up_lr_weight + [1.0] * (max_len - len(up_lr_weight))

    if (up_lr_weight != None) or (mid_lr_weight != None) or (down_lr_weight != None):
        logger.info("apply block learning rate / 階層別学習率を適用します。")
@@ -673,139 +628,78 @@ def get_block_lr_weight(
            down_lr_weight = [w if w > zero_threshold else 0 for w in down_lr_weight]
            logger.info(f"down_lr_weight (shallower -> deeper, 浅い層->深い層): {down_lr_weight}")
        else:
-            down_lr_weight = [1.0] * max_len_for_down_or_up
            logger.info("down_lr_weight: all 1.0, すべて1.0")

        if mid_lr_weight != None:
-            mid_lr_weight = [w if w > zero_threshold else 0 for w in mid_lr_weight]
+            mid_lr_weight = mid_lr_weight if mid_lr_weight > zero_threshold else 0
            logger.info(f"mid_lr_weight: {mid_lr_weight}")
        else:
-            mid_lr_weight = [1.0] * max_len_for_mid
-            logger.info("mid_lr_weight: all 1.0, すべて1.0")
+            logger.info("mid_lr_weight: 1.0")

        if up_lr_weight != None:
            up_lr_weight = [w if w > zero_threshold else 0 for w in up_lr_weight]
            logger.info(f"up_lr_weight (deeper -> shallower, 深い層->浅い層): {up_lr_weight}")
        else:
-            up_lr_weight = [1.0] * max_len_for_down_or_up
            logger.info("up_lr_weight: all 1.0, すべて1.0")

-    lr_weight = down_lr_weight + mid_lr_weight + up_lr_weight
-
-    if is_sdxl:
-        lr_weight = [1.0] + lr_weight + [1.0]  # add 1.0 for emb_layers and out
-
-    assert (not is_sdxl and len(lr_weight) == LoRANetwork.NUM_OF_BLOCKS * 2 + LoRANetwork.NUM_OF_MID_BLOCKS) or (
-        is_sdxl and len(lr_weight) == 1 + LoRANetwork.SDXL_NUM_OF_BLOCKS * 2 + LoRANetwork.SDXL_NUM_OF_MID_BLOCKS + 1
-    ), f"lr_weight length is invalid: {len(lr_weight)}"
-
-    return lr_weight
+    return down_lr_weight, mid_lr_weight, up_lr_weight


 # exclude blocks whose lr_weight is 0 from block_dims; written with possible external callers in mind
 def remove_block_dims_and_alphas(
-    is_sdxl, block_dims, block_alphas, conv_block_dims, conv_block_alphas, block_lr_weight: Optional[List[float]]
+    block_dims, block_alphas, conv_block_dims, conv_block_alphas, down_lr_weight, mid_lr_weight, up_lr_weight
 ):
-    if block_lr_weight is not None:
-        for i, lr in enumerate(block_lr_weight):
+    # set 0 to block dim without learning rate to remove the block
+    if down_lr_weight != None:
+        for i, lr in enumerate(down_lr_weight):
            if lr == 0:
                block_dims[i] = 0
                if conv_block_dims is not None:
                    conv_block_dims[i] = 0
+    if mid_lr_weight != None:
+        if mid_lr_weight == 0:
+            block_dims[LoRANetwork.NUM_OF_BLOCKS] = 0
+            if conv_block_dims is not None:
+                conv_block_dims[LoRANetwork.NUM_OF_BLOCKS] = 0
+    if up_lr_weight != None:
+        for i, lr in enumerate(up_lr_weight):
+            if lr == 0:
+                block_dims[LoRANetwork.NUM_OF_BLOCKS + 1 + i] = 0
+                if conv_block_dims is not None:
+                    conv_block_dims[LoRANetwork.NUM_OF_BLOCKS + 1 + i] = 0
+
    return block_dims, block_alphas, conv_block_dims, conv_block_alphas


 # written with possible external callers in mind
-def get_block_index(lora_name: str, is_sdxl: bool = False) -> int:
+def get_block_index(lora_name: str) -> int:
    block_idx = -1  # invalid lora name
-    if not is_sdxl:
-        m = RE_UPDOWN.search(lora_name)
-        if m:
-            g = m.groups()
-            i = int(g[1])
-            j = int(g[3])
-            if g[2] == "resnets":
-                idx = 3 * i + j
-            elif g[2] == "attentions":
-                idx = 3 * i + j
-            elif g[2] == "upsamplers" or g[2] == "downsamplers":
-                idx = 3 * i + 2

-            if g[0] == "down":
-                block_idx = 1 + idx  # no LoRA corresponds to index 0
-            elif g[0] == "up":
-                block_idx = LoRANetwork.NUM_OF_BLOCKS + 1 + idx
-        elif "mid_block_" in lora_name:
-            block_idx = LoRANetwork.NUM_OF_BLOCKS  # idx=12
-    else:
-        # copy from sdxl_train
-        if lora_name.startswith("lora_unet_"):
-            name = lora_name[len("lora_unet_") :]
-            if name.startswith("time_embed_") or name.startswith("label_emb_"):  # No LoRA
-                block_idx = 0  # 0
-            elif name.startswith("input_blocks_"):  # 1-9
-                block_idx = 1 + int(name.split("_")[2])
-            elif name.startswith("middle_block_"):  # 10-12
-                block_idx = 10 + int(name.split("_")[2])
-            elif name.startswith("output_blocks_"):  # 13-21
-                block_idx = 13 + int(name.split("_")[2])
-            elif name.startswith("out_"):  # 22, out, no LoRA
-                block_idx = 22
+    m = RE_UPDOWN.search(lora_name)
+    if m:
+        g = m.groups()
+        i = int(g[1])
+        j = int(g[3])
+        if g[2] == "resnets":
+            idx = 3 * i + j
+        elif g[2] == "attentions":
+            idx = 3 * i + j
+        elif g[2] == "upsamplers" or g[2] == "downsamplers":
+            idx = 3 * i + 2
+
+        if g[0] == "down":
+            block_idx = 1 + idx  # no LoRA corresponds to index 0
+        elif g[0] == "up":
+            block_idx = LoRANetwork.NUM_OF_BLOCKS + 1 + idx
+
+    elif "mid_block_" in lora_name:
+        block_idx = LoRANetwork.NUM_OF_BLOCKS  # idx=12

    return block_idx


-def convert_diffusers_to_sai_if_needed(weights_sd):
-    # only supports U-Net LoRA modules
-
-    found_up_down_blocks = False
-    for k in list(weights_sd.keys()):
-        if "down_blocks" in k:
-            found_up_down_blocks = True
-            break
-        if "up_blocks" in k:
-            found_up_down_blocks = True
-            break
-    if not found_up_down_blocks:
-        return
-
-    from library.sdxl_model_util import make_unet_conversion_map
-
-    unet_conversion_map = make_unet_conversion_map()
-    unet_conversion_map = {hf.replace(".", "_")[:-1]: sd.replace(".", "_")[:-1] for sd, hf in unet_conversion_map}
-
-    # # add extra conversion
-    # unet_conversion_map["up_blocks_1_upsamplers_0"] = "lora_unet_output_blocks_2_2_conv"
-
-    logger.info(f"Converting LoRA keys from Diffusers to SAI")
-    lora_unet_prefix = "lora_unet_"
-    for k in list(weights_sd.keys()):
-        if not k.startswith(lora_unet_prefix):
-            continue
-
-        unet_module_name = k[len(lora_unet_prefix) :].split(".")[0]
-
-        # search for conversion: this is slow because the algorithm is O(n^2), but the number of keys is small
-        for hf_module_name, sd_module_name in unet_conversion_map.items():
-            if hf_module_name in unet_module_name:
-                new_key = (
-                    lora_unet_prefix
-                    + unet_module_name.replace(hf_module_name, sd_module_name)
-                    + k[len(lora_unet_prefix) + len(unet_module_name) :]
-                )
-                weights_sd[new_key] = weights_sd.pop(k)
-                found = True
-                break
-
-        if not found:
-            logger.warning(f"Key {k} is not found in unet_conversion_map")
-
-
 # Create network from weights for inference, weights are not loaded here (because can be merged)
 def create_network_from_weights(multiplier, file, vae, text_encoder, unet, weights_sd=None, for_inference=False, **kwargs):
-    # if unet is an instance of SdxlUNet2DConditionModel or subclass, set is_sdxl to True
-    is_sdxl = unet is not None and issubclass(unet.__class__, SdxlUNet2DConditionModel)
-
    if weights_sd is None:
        if os.path.splitext(file)[1] == ".safetensors":
            from safetensors.torch import load_file, safe_open
@@ -814,10 +708,6 @@ def create_network_from_weights(multiplier, file, vae, text_encoder, unet, weigh
        else:
            weights_sd = torch.load(file, map_location="cpu")

-    # if keys are Diffusers based, convert to SAI based
-    if is_sdxl:
-        convert_diffusers_to_sai_if_needed(weights_sd)
-
    # get dim/alpha mapping
    modules_dim = {}
    modules_alpha = {}
@@ -841,32 +731,23 @@ def create_network_from_weights(multiplier, file, vae, text_encoder, unet, weigh
    module_class = LoRAInfModule if for_inference else LoRAModule

    network = LoRANetwork(
-        text_encoder,
-        unet,
-        multiplier=multiplier,
-        modules_dim=modules_dim,
-        modules_alpha=modules_alpha,
-        module_class=module_class,
-        is_sdxl=is_sdxl,
+        text_encoder, unet, multiplier=multiplier, modules_dim=modules_dim, modules_alpha=modules_alpha, module_class=module_class
    )

    # block lr
-    block_lr_weight = parse_block_lr_kwargs(is_sdxl, kwargs)
-    if block_lr_weight is not None:
-        network.set_block_lr_weight(block_lr_weight)
+    down_lr_weight, mid_lr_weight, up_lr_weight = parse_block_lr_kwargs(kwargs)
+    if up_lr_weight is not None or mid_lr_weight is not None or down_lr_weight is not None:
+        network.set_block_lr_weight(up_lr_weight, mid_lr_weight, down_lr_weight)

    return network, weights_sd


 class LoRANetwork(torch.nn.Module):
    NUM_OF_BLOCKS = 12  # number of up/down blocks, counted for the full model
-    NUM_OF_MID_BLOCKS = 1
-    SDXL_NUM_OF_BLOCKS = 9  # number of input/output blocks in the SDXL model; total = 1 (base) + 9 (input) + 3 (mid) + 9 (output) + 1 (out) = 23
-    SDXL_NUM_OF_MID_BLOCKS = 3

    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
-    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
    LORA_PREFIX_UNET = "lora_unet"
    LORA_PREFIX_TEXT_ENCODER = "lora_te"

@@ -894,7 +775,6 @@ class LoRANetwork(torch.nn.Module):
        modules_alpha: Optional[Dict[str, int]] = None,
        module_class: Type[object] = LoRAModule,
        varbose: Optional[bool] = False,
-        is_sdxl: Optional[bool] = False,
    ) -> None:
        """
        LoRA network: there are a lot of arguments, but the patterns are as follows
@@ -915,10 +795,6 @@ class LoRANetwork(torch.nn.Module):
        self.rank_dropout = rank_dropout
        self.module_dropout = module_dropout

-        self.loraplus_lr_ratio = None
-        self.loraplus_unet_lr_ratio = None
-        self.loraplus_text_encoder_lr_ratio = None
-
        if modules_dim is not None:
            logger.info(f"create LoRA network from weights")
        elif block_dims is not None:
@@ -965,9 +841,14 @@ class LoRANetwork(torch.nn.Module):
                        is_linear = child_module.__class__.__name__ == "Linear"
                        is_conv2d = child_module.__class__.__name__ == "Conv2d"
                        is_conv2d_1x1 = is_conv2d and child_module.kernel_size == (1, 1)
+                        is_group_conv2d = is_conv2d and child_module.groups > 1

-                        if is_linear or is_conv2d:
-                            lora_name = prefix + "." + name + "." + child_name
+                        # if is_group_conv2d:
+                        #     logger.info(f"skip group conv2d: {name}.{child_name}")
+                        #     continue
+
+                        if is_linear or (is_conv2d and not is_group_conv2d):
+                            lora_name = prefix + "." + name + ("." + child_name if child_name else "")
                            lora_name = lora_name.replace(".", "_")

                            dim = None
@@ -980,7 +861,7 @@ class LoRANetwork(torch.nn.Module):
                                    alpha = modules_alpha[lora_name]
                            elif is_unet and block_dims is not None:
                                # U-Net with block_dims specified
-                                block_idx = get_block_index(lora_name, is_sdxl)
+                                block_idx = get_block_index(lora_name)
                                if is_linear or is_conv2d_1x1:
                                    dim = block_dims[block_idx]
                                    alpha = block_alphas[block_idx]
@@ -1039,6 +920,11 @@ class LoRANetwork(torch.nn.Module):
        if modules_dim is not None or self.conv_lora_dim is not None or conv_block_dims is not None:
            target_modules += LoRANetwork.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3

+        # XXX temporary solution for Stable Cascade Stage C: replace all modules
+        if "StageC" in unet.__class__.__name__:
+            logger.info("replace all modules for Stable Cascade Stage C")
+            target_modules = ["Linear", "Conv2d"]
+
        self.unet_loras, skipped_un = create_modules(True, None, unet, target_modules)
        logger.info(f"create LoRA for U-Net: {len(self.unet_loras)} modules.")

@@ -1050,7 +936,9 @@ class LoRANetwork(torch.nn.Module):
            for name in skipped:
                logger.info(f"\t{name}")

-        self.block_lr_weight = None
+        self.up_lr_weight: List[float] = None
+        self.down_lr_weight: List[float] = None
+        self.mid_lr_weight: float = None
        self.block_lr = False

        # assertion
@@ -1081,12 +969,12 @@ class LoRANetwork(torch.nn.Module):

    def apply_to(self, text_encoder, unet, apply_text_encoder=True, apply_unet=True):
        if apply_text_encoder:
-            logger.info(f"enable LoRA for text encoder: {len(self.text_encoder_loras)} modules")
+            logger.info("enable LoRA for text encoder")
        else:
            self.text_encoder_loras = []

        if apply_unet:
-            logger.info(f"enable LoRA for U-Net: {len(self.unet_loras)} modules")
+            logger.info("enable LoRA for U-Net")
        else:
            self.unet_loras = []

@@ -1127,117 +1015,81 @@ class LoRANetwork(torch.nn.Module):
        logger.info(f"weights are merged")

    # Define per-block multipliers on the learning rate for block-wise learning rates. The argument order is reversed, but never mind for now.
-    def set_block_lr_weight(self, block_lr_weight: Optional[List[float]]):
+    def set_block_lr_weight(
+        self,
+        up_lr_weight: Optional[List[float]] = None,
+        mid_lr_weight: Optional[float] = None,
+        down_lr_weight: Optional[List[float]] = None,
+    ):
        self.block_lr = True
-        self.block_lr_weight = block_lr_weight
+        self.down_lr_weight = down_lr_weight
+        self.mid_lr_weight = mid_lr_weight
+        self.up_lr_weight = up_lr_weight

-    def get_lr_weight(self, block_idx: int) -> float:
-        if not self.block_lr or self.block_lr_weight is None:
-            return 1.0
-        return self.block_lr_weight[block_idx]
+    def get_lr_weight(self, lora: LoRAModule) -> float:
+        lr_weight = 1.0
+        block_idx = get_block_index(lora.lora_name)
+        if block_idx < 0:
+            return lr_weight

-    def set_loraplus_lr_ratio(self, loraplus_lr_ratio, loraplus_unet_lr_ratio, loraplus_text_encoder_lr_ratio):
-        self.loraplus_lr_ratio = loraplus_lr_ratio
-        self.loraplus_unet_lr_ratio = loraplus_unet_lr_ratio
-        self.loraplus_text_encoder_lr_ratio = loraplus_text_encoder_lr_ratio
+        if block_idx < LoRANetwork.NUM_OF_BLOCKS:
+            if self.down_lr_weight is not None:
+                lr_weight = self.down_lr_weight[block_idx]
+        elif block_idx == LoRANetwork.NUM_OF_BLOCKS:
+            if self.mid_lr_weight is not None:
+                lr_weight = self.mid_lr_weight
+        elif block_idx > LoRANetwork.NUM_OF_BLOCKS:
+            if self.up_lr_weight is not None:
+                lr_weight = self.up_lr_weight[block_idx - LoRANetwork.NUM_OF_BLOCKS - 1]

-        logger.info(f"LoRA+ UNet LR Ratio: {self.loraplus_unet_lr_ratio or self.loraplus_lr_ratio}")
-        logger.info(f"LoRA+ Text Encoder LR Ratio: {self.loraplus_text_encoder_lr_ratio or self.loraplus_lr_ratio}")
+        return lr_weight

    # It might be nice to allow separate learning rates for the two Text Encoders
    def prepare_optimizer_params(self, text_encoder_lr, unet_lr, default_lr):
-        # TODO warn if optimizer is not compatible with LoRA+ (but it will cause error so we don't need to check it here?)
-        # if (
-        #     self.loraplus_lr_ratio is not None
-        #     or self.loraplus_text_encoder_lr_ratio is not None
-        #     or self.loraplus_unet_lr_ratio is not None
-        # ):
-        #     assert (
-        #         optimizer_type.lower() != "prodigy" and "dadapt" not in optimizer_type.lower()
-        #     ), "LoRA+ and Prodigy/DAdaptation is not supported / LoRA+とProdigy/DAdaptationの組み合わせはサポートされていません"
-
        self.requires_grad_(True)
-
        all_params = []
-        lr_descriptions = []
-
-        def assemble_params(loras, lr, ratio):
-            param_groups = {"lora": {}, "plus": {}}
-            for lora in loras:
-                for name, param in lora.named_parameters():
-                    if ratio is not None and "lora_up" in name:
-                        param_groups["plus"][f"{lora.lora_name}.{name}"] = param
-                    else:
-                        param_groups["lora"][f"{lora.lora_name}.{name}"] = param

+        def enumerate_params(loras):
            params = []
-            descriptions = []
-            for key in param_groups.keys():
-                param_data = {"params": param_groups[key].values()}
-
-                if len(param_data["params"]) == 0:
-                    continue
-
-                if lr is not None:
-                    if key == "plus":
-                        param_data["lr"] = lr * ratio
-                    else:
-                        param_data["lr"] = lr
-
-                if param_data.get("lr", None) == 0 or param_data.get("lr", None) is None:
-                    logger.info("NO LR skipping!")
-                    continue
-
-                params.append(param_data)
-                descriptions.append("plus" if key == "plus" else "")
-
-            return params, descriptions
+            for lora in loras:
+                params.extend(lora.parameters())
+            return params

        if self.text_encoder_loras:
-            params, descriptions = assemble_params(
-                self.text_encoder_loras,
-                text_encoder_lr if text_encoder_lr is not None else default_lr,
-                self.loraplus_text_encoder_lr_ratio or self.loraplus_lr_ratio,
-            )
-            all_params.extend(params)
-            lr_descriptions.extend(["textencoder" + (" " + d if d else "") for d in descriptions])
+            param_data = {"params": enumerate_params(self.text_encoder_loras)}
+            if text_encoder_lr is not None:
+                param_data["lr"] = text_encoder_lr
+            all_params.append(param_data)

        if self.unet_loras:
            if self.block_lr:
-                is_sdxl = False
-                for lora in self.unet_loras:
-                    if "input_blocks" in lora.lora_name or "output_blocks" in lora.lora_name:
-                        is_sdxl = True
-                        break
-
                # Group the LoRA modules by block so that the learning rate can be graphed per block
                block_idx_to_lora = {}
                for lora in self.unet_loras:
-                    idx = get_block_index(lora.lora_name, is_sdxl)
+                    idx = get_block_index(lora.lora_name)
                    if idx not in block_idx_to_lora:
                        block_idx_to_lora[idx] = []
                    block_idx_to_lora[idx].append(lora)

                # Set the parameters for each block
                for idx, block_loras in block_idx_to_lora.items():
-                    params, descriptions = assemble_params(
-                        block_loras,
-                        (unet_lr if unet_lr is not None else default_lr) * self.get_lr_weight(idx),
-                        self.loraplus_unet_lr_ratio or self.loraplus_lr_ratio,
-                    )
-                    all_params.extend(params)
-                    lr_descriptions.extend([f"unet_block{idx}" + (" " + d if d else "") for d in descriptions])
+                    param_data = {"params": enumerate_params(block_loras)}
+
+                    if unet_lr is not None:
+                        param_data["lr"] = unet_lr * self.get_lr_weight(block_loras[0])
+                    elif default_lr is not None:
+                        param_data["lr"] = default_lr * self.get_lr_weight(block_loras[0])
+                    if ("lr" in param_data) and (param_data["lr"] == 0):
+                        continue
+                    all_params.append(param_data)

            else:
-                params, descriptions = assemble_params(
-                    self.unet_loras,
-                    unet_lr if unet_lr is not None else default_lr,
-                    self.loraplus_unet_lr_ratio or self.loraplus_lr_ratio,
-                )
-                all_params.extend(params)
-                lr_descriptions.extend(["unet" + (" " + d if d else "") for d in descriptions])
+                param_data = {"params": enumerate_params(self.unet_loras)}
+                if unet_lr is not None:
+                    param_data["lr"] = unet_lr
+                all_params.append(param_data)

-        return all_params, lr_descriptions
+        return all_params

    def enable_gradient_checkpointing(self):
        # not supported
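
Review note on the `get_lr_weight` change in `networks/lora.py` above: the block index returned by `get_block_index` is mapped onto three separate weight tables (down, mid, up) instead of one flat list. The following is a minimal standalone sketch of that lookup, not the repository's code. The function name `lr_weight_for_block` is hypothetical, and `NUM_OF_BLOCKS = 12` is an assumed value chosen for illustration, since the diff only references `LoRANetwork.NUM_OF_BLOCKS` without showing its definition.

```python
from typing import List, Optional

NUM_OF_BLOCKS = 12  # assumed for illustration; the diff does not show the real value


def lr_weight_for_block(
    block_idx: int,
    down_lr_weight: Optional[List[float]] = None,
    mid_lr_weight: Optional[float] = None,
    up_lr_weight: Optional[List[float]] = None,
    num_blocks: int = NUM_OF_BLOCKS,
) -> float:
    """Return the learning-rate multiplier for one block index (hypothetical helper)."""
    if block_idx < 0:
        # Modules that belong to no block keep the base learning rate.
        return 1.0
    if block_idx < num_blocks:
        # Indices 0 .. num_blocks-1 address the down blocks.
        return 1.0 if down_lr_weight is None else down_lr_weight[block_idx]
    if block_idx == num_blocks:
        # Exactly num_blocks addresses the single mid block.
        return 1.0 if mid_lr_weight is None else mid_lr_weight
    # Indices above num_blocks address the up blocks, offset by num_blocks + 1.
    return 1.0 if up_lr_weight is None else up_lr_weight[block_idx - num_blocks - 1]
```

In the diff, `prepare_optimizer_params` multiplies `unet_lr` (or `default_lr`) by this weight and skips any parameter group whose resulting learning rate is exactly 0, so a weight of 0 effectively freezes that block.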
--- a/networks/lora_diffusers.py
+++ b/networks/lora_diffusers.py
@@ -278,7 +278,7 @@ def merge_lora_weights(pipe, weights_sd: Dict, multiplier: float = 1.0):
 class LoRANetwork(torch.nn.Module):
    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
-    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
    LORA_PREFIX_UNET = "lora_unet"
    LORA_PREFIX_TEXT_ENCODER = "lora_te"

--- a/networks/lora_fa.py
+++ b/networks/lora_fa.py
@@ -755,7 +755,7 @@ class LoRANetwork(torch.nn.Module):

    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
-    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
    LORA_PREFIX_UNET = "lora_unet"
    LORA_PREFIX_TEXT_ENCODER = "lora_te"

--- a/networks/oft.py
+++ b/networks/oft.py
@@ -4,17 +4,13 @@ import math
 import os
 from typing import Dict, List, Optional, Tuple, Type, Union
 from diffusers import AutoencoderKL
-import einops
 from transformers import CLIPTextModel
 import numpy as np
 import torch
-import torch.nn.functional as F
 import re
 from library.utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

 RE_UPDOWN = re.compile(r"(up|down)_blocks_(\d+)_(resnets|upsamplers|downsamplers|attentions)_(\d+)_")
@@ -49,16 +45,11 @@ class OFTModule(torch.nn.Module):

        if type(alpha) == torch.Tensor:
            alpha = alpha.detach().numpy()
-        
-        # constraint in original paper is alpha * out_dim * out_dim, but we use alpha * out_dim for backward compatibility
-        # original alpha is 1e-6, so we use 1e-3 or 1e-4 for alpha
-        self.constraint = alpha * out_dim 
-        
+        self.constraint = alpha * out_dim
        self.register_buffer("alpha", torch.tensor(alpha))

        self.block_size = out_dim // self.num_blocks
        self.oft_blocks = torch.nn.Parameter(torch.zeros(self.num_blocks, self.block_size, self.block_size))
-        self.I = torch.eye(self.block_size).unsqueeze(0).repeat(self.num_blocks, 1, 1)  # cpu

        self.out_dim = out_dim
        self.shape = org_module.weight.shape
@@ -78,36 +69,27 @@ class OFTModule(torch.nn.Module):
        norm_Q = torch.norm(block_Q.flatten())
        new_norm_Q = torch.clamp(norm_Q, max=self.constraint)
        block_Q = block_Q * ((new_norm_Q + 1e-8) / (norm_Q + 1e-8))
+        I = torch.eye(self.block_size, device=self.oft_blocks.device).unsqueeze(0).repeat(self.num_blocks, 1, 1)
+        block_R = torch.matmul(I + block_Q, (I - block_Q).inverse())

-        if self.I.device != block_Q.device:
-            self.I = self.I.to(block_Q.device)
-        I = self.I
-        block_R = torch.matmul(I + block_Q, (I - block_Q).float().inverse())
-        block_R_weighted = self.multiplier * (block_R - I) + I
-        return block_R_weighted
+        block_R_weighted = self.multiplier * block_R + (1 - self.multiplier) * I
+        R = torch.block_diag(*block_R_weighted)
+
+        return R

    def forward(self, x, scale=None):
+        x = self.org_forward(x)
        if self.multiplier == 0.0:
-            return self.org_forward(x)
-        org_module = self.org_module[0]
-        org_dtype = x.dtype
+            return x

-        R = self.get_weight().to(torch.float32)
-        W = org_module.weight.to(torch.float32)
-
-        if len(W.shape) == 4:  # Conv2d
-            W_reshaped = einops.rearrange(W, "(k n) ... -> k n ...", k=self.num_blocks, n=self.block_size)
-            RW = torch.einsum("k n m, k n ... -> k m ...", R, W_reshaped)
-            RW = einops.rearrange(RW, "k m ... -> (k m) ...")
-            result = F.conv2d(
-                x, RW.to(org_dtype), org_module.bias, org_module.stride, org_module.padding, org_module.dilation, org_module.groups
-            )
-        else:  # Linear
-            W_reshaped = einops.rearrange(W, "(k n) m -> k n m", k=self.num_blocks, n=self.block_size)
-            RW = torch.einsum("k n m, k n p -> k m p", R, W_reshaped)
-            RW = einops.rearrange(RW, "k m p -> (k m) p")
-            result = F.linear(x, RW.to(org_dtype), org_module.bias)
-        return result
+        R = self.get_weight().to(x.device, dtype=x.dtype)
+        if x.dim() == 4:
+            x = x.permute(0, 2, 3, 1)
+            x = torch.matmul(x, R)
+            x = x.permute(0, 3, 1, 2)
+        else:
+            x = torch.matmul(x, R)
+        return x


 class OFTInfModule(OFTModule):
@@ -133,19 +115,18 @@ class OFTInfModule(OFTModule):
            return self.org_forward(x)
        return super().forward(x, scale)

-    def merge_to(self, multiplier=None):
+    def merge_to(self, multiplier=None, sign=1):
+        R = self.get_weight(multiplier) * sign
+
        # get org weight
        org_sd = self.org_module[0].state_dict()
-        org_weight = org_sd["weight"].to(torch.float32)
+        org_weight = org_sd["weight"]
+        R = R.to(org_weight.device, dtype=org_weight.dtype)

-        R = self.get_weight(multiplier).to(torch.float32)
-
-        weight = org_weight.reshape(self.num_blocks, self.block_size, -1)
-        weight = torch.einsum("k n m, k n ... -> k m ...", R, weight)
-        weight = weight.reshape(org_weight.shape)
-
-        # convert back to original dtype
-        weight = weight.to(org_sd["weight"].dtype)
+        if org_weight.dim() == 4:
+            weight = torch.einsum("oihw, op -> pihw", org_weight, R)
+        else:
+            weight = torch.einsum("oi, op -> pi", org_weight, R)

        # set weight to org_module
        org_sd["weight"] = weight
@@ -164,16 +145,8 @@ def create_network(
 ):
    if network_dim is None:
        network_dim = 4  # default
-    if network_alpha is None:  # should be set
-        logger.info(
-            "network_alpha is not set, use default value 1e-3 / network_alphaが設定されていないのでデフォルト値 1e-3 を使用します"
-        )
-        network_alpha = 1e-3
-    elif network_alpha >= 1:
-        logger.warning(
-            "network_alpha is too large (>=1, maybe default value is too large), please consider to set smaller value like 1e-3"
-            " / network_alphaが大きすぎるようです(>=1, デフォルト値が大きすぎる可能性があります)。1e-3のような小さな値を推奨"
-        )
+    if network_alpha is None:
+        network_alpha = 1.0

    enable_all_linear = kwargs.get("enable_all_linear", None)
    enable_conv = kwargs.get("enable_conv", None)
@@ -217,11 +190,12 @@ def create_network_from_weights(multiplier, file, vae, text_encoder, unet, weigh
        else:
            if dim is None:
                dim = param.size()[0]
-            if has_conv2d is None and "in_layers_2" in name:
+            if has_conv2d is None and param.dim() == 4:
                has_conv2d = True
-            if all_linear is None and "_ff_" in name:
-                all_linear = True
-        if dim is not None and alpha is not None and has_conv2d is not None and all_linear is not None:
+            if all_linear is None:
+                if param.dim() == 3 and "attn" not in name:
+                    all_linear = True
+        if dim is not None and alpha is not None and has_conv2d is not None:
            break
    if has_conv2d is None:
        has_conv2d = False
@@ -267,7 +241,7 @@ class OFTNetwork(torch.nn.Module):
        self.alpha = alpha

        logger.info(
-            f"create OFT network. num blocks: {self.dim}, constraint: {self.alpha}, multiplier: {self.multiplier}, enable_conv: {enable_conv}, enable_all_linear: {enable_all_linear}"
+            f"create OFT network. num blocks: {self.dim}, constraint: {self.alpha}, multiplier: {self.multiplier}, enable_conv: {enable_conv}"
        )

        # create module instances
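
Review note on `OFTModule.get_weight` in `networks/oft.py` above: the rotation is built with a Cayley transform, `R = (I + Q)(I - Q)^-1`, where `Q` is skew-symmetric, and is then blended with the identity as `multiplier * R + (1 - multiplier) * I`. The sketch below works through a single 2x2 block in plain Python so it runs without PyTorch. All helper names (`matmul2`, `inverse2`, `cayley_2x2`, `blend_with_identity`) are hypothetical and are not part of the repository.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def inverse2(m):
    """Invert a 2x2 matrix via the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def cayley_2x2(q):
    """Cayley transform of the skew-symmetric matrix Q = [[0, q], [-q, 0]]."""
    i_plus_q = [[1.0, q], [-q, 1.0]]
    i_minus_q = [[1.0, -q], [q, 1.0]]
    return matmul2(i_plus_q, inverse2(i_minus_q))


def blend_with_identity(r, multiplier):
    """Interpolate between the identity (multiplier 0) and R (multiplier 1)."""
    return [
        [multiplier * r[i][j] + (1 - multiplier) * (1.0 if i == j else 0.0) for j in range(2)]
        for i in range(2)
    ]
```

Because `Q` is skew-symmetric, `R` is orthogonal: `R @ R.T` equals the identity up to rounding. The blended matrix is generally not orthogonal for multipliers strictly between 0 and 1.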
--- a/networks/resize_lora.py
+++ b/networks/resize_lora.py
@@ -39,7 +39,12 @@ def load_state_dict(file_name, dtype):
    return sd, metadata


-def save_to_file(file_name, state_dict, metadata):
+def save_to_file(file_name, state_dict, dtype, metadata):
+    if dtype is not None:
+        for key in list(state_dict.keys()):
+            if type(state_dict[key]) == torch.Tensor:
+                state_dict[key] = state_dict[key].to(dtype)
+
    if model_util.is_safetensors(file_name):
        save_file(state_dict, file_name, metadata)
    else:
@@ -344,18 +349,12 @@ def resize(args):
        metadata["ss_network_dim"] = "Dynamic"
        metadata["ss_network_alpha"] = "Dynamic"

-    # cast to save_dtype before calculating hashes
-    for key in list(state_dict.keys()):
-        value = state_dict[key]
-        if type(value) == torch.Tensor and value.dtype.is_floating_point and value.dtype != save_dtype:
-            state_dict[key] = value.to(save_dtype)
-
    model_hash, legacy_hash = train_util.precalculate_safetensors_hashes(state_dict, metadata)
    metadata["sshs_model_hash"] = model_hash
    metadata["sshs_legacy_hash"] = legacy_hash

    logger.info(f"saving model to: {args.save_to}")
-    save_to_file(args.save_to, state_dict, metadata)
+    save_to_file(args.save_to, state_dict, save_dtype, metadata)


 def setup_parser() -> argparse.ArgumentParser:
--- a/networks/sdxl_merge_lora.py
+++ b/networks/sdxl_merge_lora.py
@@ -1,25 +1,18 @@
-import itertools
 import math
 import argparse
 import os
 import time
-import concurrent.futures
 import torch
 from safetensors.torch import load_file, save_file
 from tqdm import tqdm
 from library import sai_model_spec, sdxl_model_util, train_util
 import library.model_util as model_util
 import lora
-import oft
-from svd_merge_lora import format_lbws, get_lbw_block_index, LAYER26
 from library.utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

-
 def load_state_dict(file_name, dtype):
    if os.path.splitext(file_name)[1] == ".safetensors":
        sd = load_file(file_name)
@@ -35,58 +28,36 @@ def load_state_dict(file_name, dtype):
    return sd, metadata


-def save_to_file(file_name, model, metadata):
+def save_to_file(file_name, model, state_dict, dtype, metadata):
+    if dtype is not None:
+        for key in list(state_dict.keys()):
+            if type(state_dict[key]) == torch.Tensor:
+                state_dict[key] = state_dict[key].to(dtype)
+
    if os.path.splitext(file_name)[1] == ".safetensors":
        save_file(model, file_name, metadata=metadata)
    else:
        torch.save(model, file_name)


-def detect_method_from_training_model(models, dtype):
-    for model in models:
-        # TODO It is better to use key names to detect the method
-        lora_sd, _ = load_state_dict(model, dtype)
-        for key in tqdm(lora_sd.keys()):
-            if "lora_up" in key or "lora_down" in key:
-                return "LoRA"
-            elif "oft_blocks" in key:
-                return "OFT"
-
-
-def merge_to_sd_model(text_encoder1, text_encoder2, unet, models, ratios, lbws, merge_dtype):
+def merge_to_sd_model(text_encoder1, text_encoder2, unet, models, ratios, merge_dtype):
    text_encoder1.to(merge_dtype)
    text_encoder2.to(merge_dtype)
    unet.to(merge_dtype)

-    # detect the method: OFT or LoRA_module
-    method = detect_method_from_training_model(models, merge_dtype)
-    logger.info(f"method:{method}")
-
-    if lbws:
-        lbws, _, LBW_TARGET_IDX = format_lbws(lbws)
-    else:
-        LBW_TARGET_IDX = []
-
    # create module map
    name_to_module = {}
    for i, root_module in enumerate([text_encoder1, text_encoder2, unet]):
-        if method == "LoRA":
-            if i <= 1:
-                if i == 0:
-                    prefix = lora.LoRANetwork.LORA_PREFIX_TEXT_ENCODER1
-                else:
-                    prefix = lora.LoRANetwork.LORA_PREFIX_TEXT_ENCODER2
-                target_replace_modules = lora.LoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE
+        if i <= 1:
+            if i == 0:
+                prefix = lora.LoRANetwork.LORA_PREFIX_TEXT_ENCODER1
            else:
-                prefix = lora.LoRANetwork.LORA_PREFIX_UNET
-                target_replace_modules = (
-                    lora.LoRANetwork.UNET_TARGET_REPLACE_MODULE + lora.LoRANetwork.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3
-                )
-        elif method == "OFT":
-            prefix = oft.OFTNetwork.OFT_PREFIX_UNET
-            # ALL_LINEAR includes ATTN_ONLY, so we don't need to specify ATTN_ONLY
+                prefix = lora.LoRANetwork.LORA_PREFIX_TEXT_ENCODER2
+            target_replace_modules = lora.LoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE
+        else:
+            prefix = lora.LoRANetwork.LORA_PREFIX_UNET
            target_replace_modules = (
-                oft.OFTNetwork.UNET_TARGET_REPLACE_MODULE_ALL_LINEAR + oft.OFTNetwork.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3
+                lora.LoRANetwork.UNET_TARGET_REPLACE_MODULE + lora.LoRANetwork.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3
            )

        for name, module in root_module.named_modules():
@@ -97,172 +68,65 @@ def merge_to_sd_model(text_encoder1, text_encoder2, unet, models, ratios, lbws,
                        lora_name = lora_name.replace(".", "_")
                        name_to_module[lora_name] = child_module

-    for model, ratio, lbw in itertools.zip_longest(models, ratios, lbws):
+    for model, ratio in zip(models, ratios):
        logger.info(f"loading: {model}")
        lora_sd, _ = load_state_dict(model, merge_dtype)

        logger.info(f"merging...")
+        for key in tqdm(lora_sd.keys()):
+            if "lora_down" in key:
+                up_key = key.replace("lora_down", "lora_up")
+                alpha_key = key[: key.index("lora_down")] + "alpha"

-        if lbw:
-            lbw_weights = [1] * 26
-            for index, value in zip(LBW_TARGET_IDX, lbw):
-                lbw_weights[index] = value
-            logger.info(f"lbw: {dict(zip(LAYER26.keys(), lbw_weights))}")
-
-        if method == "LoRA":
-            for key in tqdm(lora_sd.keys()):
-                if "lora_down" in key:
-                    up_key = key.replace("lora_down", "lora_up")
-                    alpha_key = key[: key.index("lora_down")] + "alpha"
-
-                    # find original module for this lora
-                    module_name = ".".join(key.split(".")[:-2])  # remove trailing ".lora_down.weight"
-                    if module_name not in name_to_module:
-                        logger.info(f"no module found for LoRA weight: {key}")
-                        continue
-                    module = name_to_module[module_name]
-                    # logger.info(f"apply {key} to {module}")
-
-                    down_weight = lora_sd[key]
-                    up_weight = lora_sd[up_key]
-
-                    dim = down_weight.size()[0]
-                    alpha = lora_sd.get(alpha_key, dim)
-                    scale = alpha / dim
-
-                    if lbw:
-                        index = get_lbw_block_index(key, True)
-                        is_lbw_target = index in LBW_TARGET_IDX
-                        if is_lbw_target:
-                            scale *= lbw_weights[index]  # keyがlbwの対象であれば、lbwの重みを掛ける
-
-                    # W <- W + U * D
-                    weight = module.weight
-                    # logger.info(module_name, down_weight.size(), up_weight.size())
-                    if len(weight.size()) == 2:
-                        # linear
-                        weight = weight + ratio * (up_weight @ down_weight) * scale
-                    elif down_weight.size()[2:4] == (1, 1):
-                        # conv2d 1x1
-                        weight = (
-                            weight
-                            + ratio
-                            * (up_weight.squeeze(3).squeeze(2) @ down_weight.squeeze(3).squeeze(2)).unsqueeze(2).unsqueeze(3)
-                            * scale
-                        )
-                    else:
-                        # conv2d 3x3
-                        conved = torch.nn.functional.conv2d(down_weight.permute(1, 0, 2, 3), up_weight).permute(1, 0, 2, 3)
-                        # logger.info(conved.size(), weight.size(), module.stride, module.padding)
-                        weight = weight + ratio * conved * scale
-
-                    module.weight = torch.nn.Parameter(weight)
-
-        elif method == "OFT":
-
-            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
-            for key in tqdm(lora_sd.keys()):
-                if "oft_blocks" in key:
-                    oft_blocks = lora_sd[key]
-                    dim = oft_blocks.shape[0]
-                    break
-            for key in tqdm(lora_sd.keys()):
-                if "alpha" in key:
-                    oft_blocks = lora_sd[key]
-                    alpha = oft_blocks.item()
-                    break
-
-            def merge_to(key):
-                if "alpha" in key:
-                    return
-
-                # find original module for this OFT
-                module_name = ".".join(key.split(".")[:-1])
+                # find original module for this lora
+                module_name = ".".join(key.split(".")[:-2])  # remove trailing ".lora_down.weight"
                if module_name not in name_to_module:
-                    logger.info(f"no module found for OFT weight: {key}")
-                    return
+                    logger.info(f"no module found for LoRA weight: {key}")
+                    continue
                module = name_to_module[module_name]
-
                # logger.info(f"apply {key} to {module}")

-                oft_blocks = lora_sd[key]
+                down_weight = lora_sd[key]
+                up_weight = lora_sd[up_key]

-                if isinstance(module, torch.nn.Linear):
-                    out_dim = module.out_features
-                elif isinstance(module, torch.nn.Conv2d):
-                    out_dim = module.out_channels
+                dim = down_weight.size()[0]
+                alpha = lora_sd.get(alpha_key, dim)
+                scale = alpha / dim

-                num_blocks = dim
-                block_size = out_dim // dim
-                constraint = (0 if alpha is None else alpha) * out_dim
-
-                multiplier = 1
-                if lbw:
-                    index = get_lbw_block_index(key, False)
-                    is_lbw_target = index in LBW_TARGET_IDX
-                    if is_lbw_target:
-                        multiplier *= lbw_weights[index]
-
-                block_Q = oft_blocks - oft_blocks.transpose(1, 2)
-                norm_Q = torch.norm(block_Q.flatten())
-                new_norm_Q = torch.clamp(norm_Q, max=constraint)
-                block_Q = block_Q * ((new_norm_Q + 1e-8) / (norm_Q + 1e-8))
-                I = torch.eye(block_size, device=oft_blocks.device).unsqueeze(0).repeat(num_blocks, 1, 1)
-                block_R = torch.matmul(I + block_Q, (I - block_Q).inverse())
-                block_R_weighted = multiplier * block_R + (1 - multiplier) * I
-                R = torch.block_diag(*block_R_weighted)
-
-                # get org weight
-                org_sd = module.state_dict()
-                org_weight = org_sd["weight"].to(device)
-
-                R = R.to(org_weight.device, dtype=org_weight.dtype)
-
-                if org_weight.dim() == 4:
-                    weight = torch.einsum("oihw, op -> pihw", org_weight, R)
+                # W <- W + U * D
+                weight = module.weight
+                # logger.info(module_name, down_weight.size(), up_weight.size())
+                if len(weight.size()) == 2:
+                    # linear
+                    weight = weight + ratio * (up_weight @ down_weight) * scale
+                elif down_weight.size()[2:4] == (1, 1):
+                    # conv2d 1x1
+                    weight = (
+                        weight
+                        + ratio
+                        * (up_weight.squeeze(3).squeeze(2) @ down_weight.squeeze(3).squeeze(2)).unsqueeze(2).unsqueeze(3)
+                        * scale
+                    )
                else:
-                    weight = torch.einsum("oi, op -> pi", org_weight, R)
-
-                weight = weight.contiguous()  # Make Tensor contiguous; required due to ThreadPoolExecutor
+                    # conv2d 3x3
+                    conved = torch.nn.functional.conv2d(down_weight.permute(1, 0, 2, 3), up_weight).permute(1, 0, 2, 3)
+                    # logger.info(conved.size(), weight.size(), module.stride, module.padding)
+                    weight = weight + ratio * conved * scale

                module.weight = torch.nn.Parameter(weight)

-            # TODO multi-threading may cause OOM on CPU if cpu_count is too high and RAM is not enough
-            max_workers = 1 if device.type != "cpu" else None  # avoid OOM on GPU
-            with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
-                list(tqdm(executor.map(merge_to, lora_sd.keys()), total=len(lora_sd.keys())))

-
-def merge_lora_models(models, ratios, lbws, merge_dtype, concat=False, shuffle=False):
+def merge_lora_models(models, ratios, merge_dtype, concat=False, shuffle=False):
    base_alphas = {}  # alpha for merged model
    base_dims = {}

-    # detect the method: OFT or LoRA_module
-    method = detect_method_from_training_model(models, merge_dtype)
-    if method == "OFT":
-        raise ValueError(
-            "OFT model is not supported for merging OFT models. / OFTモデルはOFTモデル同士のマージには対応していません"
-        )
-
-    if lbws:
-        lbws, _, LBW_TARGET_IDX = format_lbws(lbws)
-    else:
-        LBW_TARGET_IDX = []
-
    merged_sd = {}
    v2 = None
    base_model = None
-    for model, ratio, lbw in itertools.zip_longest(models, ratios, lbws):
+    for model, ratio in zip(models, ratios):
        logger.info(f"loading: {model}")
        lora_sd, lora_metadata = load_state_dict(model, merge_dtype)

-        if lbw:
-            lbw_weights = [1] * 26
-            for index, value in zip(LBW_TARGET_IDX, lbw):
-                lbw_weights[index] = value
-            logger.info(f"lbw: {dict(zip(LAYER26.keys(), lbw_weights))}")
-
        if lora_metadata is not None:
            if v2 is None:
                v2 = lora_metadata.get(train_util.SS_METADATA_KEY_V2, None)  # returns string; SDXL has no v2, so this should be False
@@ -300,7 +164,7 @@ def merge_lora_models(models, ratios, lbws, merge_dtype, concat=False, shuffle=F
        for key in tqdm(lora_sd.keys()):
            if "alpha" in key:
                continue
-
+            
            if "lora_up" in key and concat:
                concat_dim = 1
            elif "lora_down" in key and concat:
@@ -314,14 +178,8 @@ def merge_lora_models(models, ratios, lbws, merge_dtype, concat=False, shuffle=F
            alpha = alphas[lora_module_name]

            scale = math.sqrt(alpha / base_alpha) * ratio
-            scale = abs(scale) if "lora_up" in key else scale  # マイナスの重みに対応する。
-
-            if lbw:
-                index = get_lbw_block_index(key, True)
-                is_lbw_target = index in LBW_TARGET_IDX
-                if is_lbw_target:
-                    scale *= lbw_weights[index]  # keyがlbwの対象であれば、lbwの重みを掛ける
-
+            scale = abs(scale) if "lora_up" in key else scale # マイナスの重みに対応する。
+            
            if key in merged_sd:
                assert (
                    merged_sd[key].size() == lora_sd[key].size() or concat_dim is not None
@@ -343,7 +201,7 @@ def merge_lora_models(models, ratios, lbws, merge_dtype, concat=False, shuffle=F
            dim = merged_sd[key_down].shape[0]
            perm = torch.randperm(dim)
            merged_sd[key_down] = merged_sd[key_down][perm]
-            merged_sd[key_up] = merged_sd[key_up][:, perm]
+            merged_sd[key_up] = merged_sd[key_up][:,perm]

    logger.info("merged model")
    logger.info(f"dim: {list(set(base_dims.values()))}, alpha: {list(set(base_alphas.values()))}")
@@ -371,15 +229,7 @@ def merge_lora_models(models, ratios, lbws, merge_dtype, concat=False, shuffle=F


 def merge(args):
-    assert len(args.models) == len(
-        args.ratios
-    ), f"number of models must be equal to number of ratios / モデルの数と重みの数は合わせてください"
-    if args.lbws:
-        assert len(args.models) == len(
-            args.lbws
-        ), f"number of models must be equal to number of ratios / モデルの数と層別適用率の数は合わせてください"
-    else:
-        args.lbws = []  # zip_longestで扱えるようにlbws未使用時には空のリストにしておく
+    assert len(args.models) == len(args.ratios), f"number of models must be equal to number of ratios / モデルの数と重みの数は合わせてください"

    def str_to_dtype(p):
        if p == "float":
@@ -407,7 +257,7 @@ def merge(args):
            ckpt_info,
        ) = sdxl_model_util.load_models_from_sdxl_checkpoint(sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0, args.sd_model, "cpu")

-        merge_to_sd_model(text_model1, text_model2, unet, args.models, args.ratios, args.lbws, merge_dtype)
+        merge_to_sd_model(text_model1, text_model2, unet, args.models, args.ratios, merge_dtype)

        if args.no_metadata:
            sai_metadata = None
@@ -423,13 +273,7 @@ def merge(args):
            args.save_to, text_model1, text_model2, unet, 0, 0, ckpt_info, vae, logit_scale, sai_metadata, save_dtype
        )
    else:
-        state_dict, metadata = merge_lora_models(args.models, args.ratios, args.lbws, merge_dtype, args.concat, args.shuffle)
-
-        # cast to save_dtype before calculating hashes
-        for key in list(state_dict.keys()):
-            value = state_dict[key]
-            if type(value) == torch.Tensor and value.dtype.is_floating_point and value.dtype != save_dtype:
-                state_dict[key] = value.to(save_dtype)
+        state_dict, metadata = merge_lora_models(args.models, args.ratios, merge_dtype, args.concat, args.shuffle)

        logger.info(f"calculating hashes and creating metadata...")

@@ -446,7 +290,7 @@ def merge(args):
            metadata.update(sai_metadata)

        logger.info(f"saving model to: {args.save_to}")
-        save_to_file(args.save_to, state_dict, metadata)
+        save_to_file(args.save_to, state_dict, state_dict, save_dtype, metadata)


 def setup_parser() -> argparse.ArgumentParser:
@@ -472,19 +316,12 @@ def setup_parser() -> argparse.ArgumentParser:
        help="Stable Diffusion model to load: ckpt or safetensors file, merge LoRA models if omitted / 読み込むモデル、ckptまたはsafetensors。省略時はLoRAモデル同士をマージする",
    )
    parser.add_argument(
-        "--save_to",
-        type=str,
-        default=None,
-        help="destination file name: ckpt or safetensors file / 保存先のファイル名、ckptまたはsafetensors",
+        "--save_to", type=str, default=None, help="destination file name: ckpt or safetensors file / 保存先のファイル名、ckptまたはsafetensors"
    )
    parser.add_argument(
-        "--models",
-        type=str,
-        nargs="*",
-        help="LoRA models to merge: ckpt or safetensors file / マージするLoRAモデル、ckptまたはsafetensors",
+        "--models", type=str, nargs="*", help="LoRA models to merge: ckpt or safetensors file / マージするLoRAモデル、ckptまたはsafetensors"
    )
    parser.add_argument("--ratios", type=float, nargs="*", help="ratios for each model / それぞれのLoRAモデルの比率")
-    parser.add_argument("--lbws", type=str, nargs="*", help="lbw for each model / それぞれのLoRAモデルの層別適用率")
    parser.add_argument(
        "--no_metadata",
        action="store_true",
@@ -500,7 +337,8 @@ def setup_parser() -> argparse.ArgumentParser:
    parser.add_argument(
        "--shuffle",
        action="store_true",
-        help="shuffle lora weight./ " + "LoRAの重みをシャッフルする",
+        help="shuffle lora weight./ "
+        + "LoRAの重みをシャッフルする",
    )

    return parser
--- a/networks/svd_merge_lora.py
+++ b/networks/svd_merge_lora.py
@@ -1,8 +1,5 @@
 import argparse
-import itertools
-import json
 import os
-import re
 import time
 import torch
 from safetensors.torch import load_file, save_file
@@ -11,195 +8,12 @@ from library import sai_model_spec, train_util
 import library.model_util as model_util
 import lora
 from library.utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

 CLAMP_QUANTILE = 0.99

-ACCEPTABLE = [12, 17, 20, 26]
-SDXL_LAYER_NUM = [12, 20]
-
-LAYER12 = {
-    "BASE": True,
-    "IN00": False,
-    "IN01": False,
-    "IN02": False,
-    "IN03": False,
-    "IN04": True,
-    "IN05": True,
-    "IN06": False,
-    "IN07": True,
-    "IN08": True,
-    "IN09": False,
-    "IN10": False,
-    "IN11": False,
-    "MID": True,
-    "OUT00": True,
-    "OUT01": True,
-    "OUT02": True,
-    "OUT03": True,
-    "OUT04": True,
-    "OUT05": True,
-    "OUT06": False,
-    "OUT07": False,
-    "OUT08": False,
-    "OUT09": False,
-    "OUT10": False,
-    "OUT11": False,
-}
-
-LAYER17 = {
-    "BASE": True,
-    "IN00": False,
-    "IN01": True,
-    "IN02": True,
-    "IN03": False,
-    "IN04": True,
-    "IN05": True,
-    "IN06": False,
-    "IN07": True,
-    "IN08": True,
-    "IN09": False,
-    "IN10": False,
-    "IN11": False,
-    "MID": True,
-    "OUT00": False,
-    "OUT01": False,
-    "OUT02": False,
-    "OUT03": True,
-    "OUT04": True,
-    "OUT05": True,
-    "OUT06": True,
-    "OUT07": True,
-    "OUT08": True,
-    "OUT09": True,
-    "OUT10": True,
-    "OUT11": True,
-}
-
-LAYER20 = {
-    "BASE": True,
-    "IN00": True,
-    "IN01": True,
-    "IN02": True,
-    "IN03": True,
-    "IN04": True,
-    "IN05": True,
-    "IN06": True,
-    "IN07": True,
-    "IN08": True,
-    "IN09": False,
-    "IN10": False,
-    "IN11": False,
-    "MID": True,
-    "OUT00": True,
-    "OUT01": True,
-    "OUT02": True,
-    "OUT03": True,
-    "OUT04": True,
-    "OUT05": True,
-    "OUT06": True,
-    "OUT07": True,
-    "OUT08": True,
-    "OUT09": False,
-    "OUT10": False,
-    "OUT11": False,
-}
-
-LAYER26 = {
-    "BASE": True,
-    "IN00": True,
-    "IN01": True,
-    "IN02": True,
-    "IN03": True,
-    "IN04": True,
-    "IN05": True,
-    "IN06": True,
-    "IN07": True,
-    "IN08": True,
-    "IN09": True,
-    "IN10": True,
-    "IN11": True,
-    "MID": True,
-    "OUT00": True,
-    "OUT01": True,
-    "OUT02": True,
-    "OUT03": True,
-    "OUT04": True,
-    "OUT05": True,
-    "OUT06": True,
-    "OUT07": True,
-    "OUT08": True,
-    "OUT09": True,
-    "OUT10": True,
-    "OUT11": True,
-}
-
-assert len([v for v in LAYER12.values() if v]) == 12
-assert len([v for v in LAYER17.values() if v]) == 17
-assert len([v for v in LAYER20.values() if v]) == 20
-assert len([v for v in LAYER26.values() if v]) == 26
-
-RE_UPDOWN = re.compile(r"(up|down)_blocks_(\d+)_(resnets|upsamplers|downsamplers|attentions)_(\d+)_")
-
-
-def get_lbw_block_index(lora_name: str, is_sdxl: bool = False) -> int:
-    # lbw block index is 0-based, but 0 for text encoder, so we return 0 for text encoder
-    if "text_model_encoder_" in lora_name:  # LoRA for text encoder
-        return 0
-
-    # lbw block index is 1-based for U-Net, and no "input_blocks.0" in CompVis SD, so "input_blocks.1" have index 2
-    block_idx = -1  # invalid lora name
-    if not is_sdxl:
-        NUM_OF_BLOCKS = 12  # up/down blocks
-        m = RE_UPDOWN.search(lora_name)
-        if m:
-            g = m.groups()
-            up_down = g[0]
-            i = int(g[1])
-            j = int(g[3])
-            if up_down == "down":
-                if g[2] == "resnets" or g[2] == "attentions":
-                    idx = 3 * i + j + 1
-                elif g[2] == "downsamplers":
-                    idx = 3 * (i + 1)
-                else:
-                    return block_idx  # invalid lora name
-            elif up_down == "up":
-                if g[2] == "resnets" or g[2] == "attentions":
-                    idx = 3 * i + j
-                elif g[2] == "upsamplers":
-                    idx = 3 * i + 2
-                else:
-                    return block_idx  # invalid lora name
-
-            if g[0] == "down":
-                block_idx = 1 + idx  # 1-based index, down block index
-            elif g[0] == "up":
-                block_idx = 1 + NUM_OF_BLOCKS + 1 + idx  # 1-based index, num blocks, mid block, up block index
-
-        elif "mid_block_" in lora_name:
-            block_idx = 1 + NUM_OF_BLOCKS  # 1-based index, num blocks, mid block
-    else:
-        # SDXL: some numbers are skipped
-        if lora_name.startswith("lora_unet_"):
-            name = lora_name[len("lora_unet_") :]
-            if name.startswith("time_embed_") or name.startswith("label_emb_"):  # 1, No LoRA in sd-scripts
-                block_idx = 1
-            elif name.startswith("input_blocks_"):  # 1-8 to 2-9
-                block_idx = 1 + int(name.split("_")[2])
-            elif name.startswith("middle_block_"):  # 13
-                block_idx = 13
-            elif name.startswith("output_blocks_"):  # 0-8 to 14-22
-                block_idx = 14 + int(name.split("_")[2])
-            elif name.startswith("out_"):  # 23, No LoRA in sd-scripts
-                block_idx = 23
-
-    return block_idx
-

 def load_state_dict(file_name, dtype):
    if os.path.splitext(file_name)[1] == ".safetensors":
@@ -216,53 +30,24 @@ def load_state_dict(file_name, dtype):
    return sd, metadata


-def save_to_file(file_name, state_dict, metadata):
+def save_to_file(file_name, state_dict, dtype, metadata):
+    if dtype is not None:
+        for key in list(state_dict.keys()):
+            if type(state_dict[key]) == torch.Tensor:
+                state_dict[key] = state_dict[key].to(dtype)
+
    if os.path.splitext(file_name)[1] == ".safetensors":
        save_file(state_dict, file_name, metadata=metadata)
    else:
        torch.save(state_dict, file_name)


-def format_lbws(lbws):
-    try:
-        # lbwは"[1,1,1,1,1,1,1,1,1,1,1,1]"のような文字列で与えられることを期待している
-        lbws = [json.loads(lbw) for lbw in lbws]
-    except Exception:
-        raise ValueError(f"format of lbws are must be json / 層別適用率はJSON形式で書いてください")
-    assert all(isinstance(lbw, list) for lbw in lbws), f"lbws are must be list / 層別適用率はリストにしてください"
-    assert len(set(len(lbw) for lbw in lbws)) == 1, "all lbws should have the same length  / 層別適用率は同じ長さにしてください"
-    assert all(
-        len(lbw) in ACCEPTABLE for lbw in lbws
-    ), f"length of lbw are must be in {ACCEPTABLE} / 層別適用率の長さは{ACCEPTABLE}のいずれかにしてください"
-    assert all(
-        all(isinstance(weight, (int, float)) for weight in lbw) for lbw in lbws
-    ), f"values of lbs are must be numbers / 層別適用率の値はすべて数値にしてください"
-
-    layer_num = len(lbws[0])
-    is_sdxl = True if layer_num in SDXL_LAYER_NUM else False
-    FLAGS = {
-        "12": LAYER12.values(),
-        "17": LAYER17.values(),
-        "20": LAYER20.values(),
-        "26": LAYER26.values(),
-    }[str(layer_num)]
-    LBW_TARGET_IDX = [i for i, flag in enumerate(FLAGS) if flag]
-    return lbws, is_sdxl, LBW_TARGET_IDX
-
-
-def merge_lora_models(models, ratios, lbws, new_rank, new_conv_rank, device, merge_dtype):
+def merge_lora_models(models, ratios, new_rank, new_conv_rank, device, merge_dtype):
    logger.info(f"new rank: {new_rank}, new conv rank: {new_conv_rank}")
    merged_sd = {}
-    v2 = None  # This is meaning LoRA Metadata v2, Not meaning SD2
+    v2 = None
    base_model = None
-
-    if lbws:
-        lbws, is_sdxl, LBW_TARGET_IDX = format_lbws(lbws)
-    else:
-        is_sdxl = False
-        LBW_TARGET_IDX = []
-
-    for model, ratio, lbw in itertools.zip_longest(models, ratios, lbws):
+    for model, ratio in zip(models, ratios):
        logger.info(f"loading: {model}")
        lora_sd, lora_metadata = load_state_dict(model, merge_dtype)

@@ -272,12 +57,6 @@ def merge_lora_models(models, ratios, lbws, new_rank, new_conv_rank, device, mer
            if base_model is None:
                base_model = lora_metadata.get(train_util.SS_METADATA_KEY_BASE_MODEL_VERSION, None)

-        if lbw:
-            lbw_weights = [1] * 26
-            for index, value in zip(LBW_TARGET_IDX, lbw):
-                lbw_weights[index] = value
-            logger.info(f"lbw: {dict(zip(LAYER26.keys(), lbw_weights))}")
-
        # merge
        logger.info(f"merging...")
        for key in tqdm(list(lora_sd.keys())):
@@ -301,10 +80,10 @@ def merge_lora_models(models, ratios, lbws, new_rank, new_conv_rank, device, mer
            # make original weight if not exist
            if lora_module_name not in merged_sd:
                weight = torch.zeros((out_dim, in_dim, *kernel_size) if conv2d else (out_dim, in_dim), dtype=merge_dtype)
+                if device:
+                    weight = weight.to(device)
            else:
                weight = merged_sd[lora_module_name]
-            if device:
-                weight = weight.to(device)

            # merge to weight
            if device:
@@ -314,12 +93,6 @@ def merge_lora_models(models, ratios, lbws, new_rank, new_conv_rank, device, mer
            # W <- W + U * D
            scale = alpha / network_dim

-            if lbw:
-                index = get_lbw_block_index(key, is_sdxl)
-                is_lbw_target = index in LBW_TARGET_IDX
-                if is_lbw_target:
-                    scale *= lbw_weights[index]  # keyがlbwの対象であれば、lbwの重みを掛ける
-
            if device:  # and isinstance(scale, torch.Tensor):
                scale = scale.to(device)

@@ -336,16 +109,13 @@ def merge_lora_models(models, ratios, lbws, new_rank, new_conv_rank, device, mer
                conved = torch.nn.functional.conv2d(down_weight.permute(1, 0, 2, 3), up_weight).permute(1, 0, 2, 3)
                weight = weight + ratio * conved * scale

-            merged_sd[lora_module_name] = weight.to("cpu")
+            merged_sd[lora_module_name] = weight

    # extract from merged weights
    logger.info("extract new lora...")
    merged_lora_sd = {}
    with torch.no_grad():
        for lora_module_name, mat in tqdm(list(merged_sd.items())):
-            if device:
-                mat = mat.to(device)
-
            conv2d = len(mat.size()) == 4
            kernel_size = None if not conv2d else mat.size()[2:4]
            conv2d_3x3 = conv2d and kernel_size != (1, 1)
@@ -384,7 +154,7 @@ def merge_lora_models(models, ratios, lbws, new_rank, new_conv_rank, device, mer

            merged_lora_sd[lora_module_name + ".lora_up.weight"] = up_weight.to("cpu").contiguous()
            merged_lora_sd[lora_module_name + ".lora_down.weight"] = down_weight.to("cpu").contiguous()
-            merged_lora_sd[lora_module_name + ".alpha"] = torch.tensor(module_new_rank, device="cpu")
+            merged_lora_sd[lora_module_name + ".alpha"] = torch.tensor(module_new_rank)

    # build minimum metadata
    dims = f"{new_rank}"
@@ -399,15 +169,7 @@ def merge_lora_models(models, ratios, lbws, new_rank, new_conv_rank, device, mer


 def merge(args):
-    assert len(args.models) == len(
-        args.ratios
-    ), f"number of models must be equal to number of ratios / モデルの数と重みの数は合わせてください"
-    if args.lbws:
-        assert len(args.models) == len(
-            args.lbws
-        ), f"number of models must be equal to number of ratios / モデルの数と層別適用率の数は合わせてください"
-    else:
-        args.lbws = []  # zip_longestで扱えるようにlbws未使用時には空のリストにしておく
+    assert len(args.models) == len(args.ratios), f"number of models must be equal to number of ratios / モデルの数と重みの数は合わせてください"

    def str_to_dtype(p):
        if p == "float":
@@ -425,15 +187,9 @@ def merge(args):

    new_conv_rank = args.new_conv_rank if args.new_conv_rank is not None else args.new_rank
    state_dict, metadata, v2, base_model = merge_lora_models(
-        args.models, args.ratios, args.lbws, args.new_rank, new_conv_rank, args.device, merge_dtype
+        args.models, args.ratios, args.new_rank, new_conv_rank, args.device, merge_dtype
    )

-    # cast to save_dtype before calculating hashes
-    for key in list(state_dict.keys()):
-        value = state_dict[key]
-        if type(value) == torch.Tensor and value.dtype.is_floating_point and value.dtype != save_dtype:
-            state_dict[key] = value.to(save_dtype)
-
    logger.info(f"calculating hashes and creating metadata...")

    model_hash, legacy_hash = train_util.precalculate_safetensors_hashes(state_dict, metadata)
@@ -455,7 +211,7 @@ def merge(args):
        metadata.update(sai_metadata)

    logger.info(f"saving model to: {args.save_to}")
-    save_to_file(args.save_to, state_dict, metadata)
+    save_to_file(args.save_to, state_dict, save_dtype, metadata)


 def setup_parser() -> argparse.ArgumentParser:
@@ -475,19 +231,12 @@ def setup_parser() -> argparse.ArgumentParser:
        help="precision in merging (float is recommended) / マージの計算時の精度（floatを推奨）",
    )
    parser.add_argument(
-        "--save_to",
-        type=str,
-        default=None,
-        help="destination file name: ckpt or safetensors file / 保存先のファイル名、ckptまたはsafetensors",
+        "--save_to", type=str, default=None, help="destination file name: ckpt or safetensors file / 保存先のファイル名、ckptまたはsafetensors"
    )
    parser.add_argument(
-        "--models",
-        type=str,
-        nargs="*",
-        help="LoRA models to merge: ckpt or safetensors file / マージするLoRAモデル、ckptまたはsafetensors",
+        "--models", type=str, nargs="*", help="LoRA models to merge: ckpt or safetensors file / マージするLoRAモデル、ckptまたはsafetensors"
    )
    parser.add_argument("--ratios", type=float, nargs="*", help="ratios for each model / それぞれのLoRAモデルの比率")
-    parser.add_argument("--lbws", type=str, nargs="*", help="lbw for each model / それぞれのLoRAモデルの層別適用率")
    parser.add_argument("--new_rank", type=int, default=4, help="Specify rank of output LoRA / 出力するLoRAのrank (dim)")
    parser.add_argument(
        "--new_conv_rank",
@@ -495,9 +244,7 @@ def setup_parser() -> argparse.ArgumentParser:
        default=None,
        help="Specify rank of output LoRA for Conv2d 3x3, None for same as new_rank / 出力するConv2D 3x3 LoRAのrank (dim)、Noneでnew_rankと同じ",
    )
-    parser.add_argument(
-        "--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイス、cuda でGPUを使う"
-    )
+    parser.add_argument("--device", type=str, default=None, help="device to use, cuda for GPU / 計算を行うデバイス、cuda でGPUを使う")
    parser.add_argument(
        "--no_metadata",
        action="store_true",
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,24 +1,20 @@
-accelerate==0.30.0
-transformers==4.44.0
+accelerate==0.25.0
+transformers==4.36.2
 diffusers[torch]==0.25.0
 ftfy==6.1.1
 # albumentations==1.3.0
-opencv-python==4.8.1.78
+opencv-python==4.7.0.68
 einops==0.7.0
 pytorch-lightning==1.9.0
-bitsandbytes==0.44.0
-prodigyopt==1.0
-lion-pytorch==0.0.6
-tensorboard
+# bitsandbytes==0.39.1
+tensorboard==2.10.1
 safetensors==0.4.2
 # gradio==3.16.2
 altair==4.2.2
 easygui==0.98.3
 toml==0.10.2
 voluptuous==0.13.1
-huggingface-hub==0.24.5
-# for Image utils
-imagesize==1.4.1
+huggingface-hub==0.20.1
 # for BLIP captioning
 # requests==2.28.2
 # timm==0.6.12
@@ -26,16 +22,13 @@ imagesize==1.4.1
 # for WD14 captioning (tensorflow)
 # tensorflow==2.10.1
 # for WD14 captioning (onnx)
-# onnx==1.15.0
-# onnxruntime-gpu==1.17.1
-# onnxruntime==1.17.1
-# for cuda 12.1(default 11.8)
-# onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
-
+# onnx==1.14.1
+# onnxruntime-gpu==1.16.0
+# onnxruntime==1.16.0
 # this is for onnx: 
 # protobuf==3.20.3
 # open clip for SDXL
-# open-clip-torch==2.20.0
+open-clip-torch==2.20.0
 # For logging
 rich==13.7.0
 # for kohya_ss library
--- a/sdxl_minimal_inference.py
+++ b/sdxl_minimal_inference.py
@@ -11,24 +11,20 @@ import numpy as np

 import torch
 from library.device_utils import init_ipex, get_preferred_device
-
 init_ipex()

 from tqdm import tqdm
 from transformers import CLIPTokenizer
 from diffusers import EulerDiscreteScheduler
 from PIL import Image
-
-# import open_clip
+import open_clip
 from safetensors.torch import load_file

 from library import model_util, sdxl_model_util
 import networks.lora as lora
 from library.utils import setup_logging
-
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

 # scheduler: このあたりの設定はSD1/2と同じでいいらしい
@@ -158,13 +154,12 @@ if __name__ == "__main__":
    text_model2.eval()

    unet.set_use_memory_efficient_attention(True, False)
-    if torch.__version__ >= "2.0.0":  # PyTorch 2.0.0 以上対応のxformersなら以下が使える
+    if torch.__version__ >= "2.0.0": # PyTorch 2.0.0 以上対応のxformersなら以下が使える
        vae.set_use_memory_efficient_attention_xformers(True)

    # Tokenizers
    tokenizer1 = CLIPTokenizer.from_pretrained(text_encoder_1_name)
-    # tokenizer2 = lambda x: open_clip.tokenize(x, context_length=77)
-    tokenizer2 = CLIPTokenizer.from_pretrained(text_encoder_2_name)
+    tokenizer2 = lambda x: open_clip.tokenize(x, context_length=77)

    # LoRA
    for weights_file in args.lora_weights:
@@ -197,9 +192,7 @@ if __name__ == "__main__":
            emb3 = get_timestep_embedding(torch.FloatTensor([target_height, target_width]).unsqueeze(0), 256)
            # logger.info("emb1", emb1.shape)
            c_vector = torch.cat([emb1, emb2, emb3], dim=1).to(DEVICE, dtype=DTYPE)
-            uc_vector = c_vector.clone().to(
-                DEVICE, dtype=DTYPE
-            )  # ちょっとここ正しいかどうかわからない I'm not sure if this is right
+            uc_vector = c_vector.clone().to(DEVICE, dtype=DTYPE)  # ちょっとここ正しいかどうかわからない I'm not sure if this is right

            # crossattn

@@ -222,22 +215,13 @@ if __name__ == "__main__":
                # text_embedding = pipe.text_encoder.text_model.final_layer_norm(text_embedding)    # layer normは通さないらしい

            # text encoder 2
-            # tokens = tokenizer2(text2).to(DEVICE)
-            tokens = tokenizer2(
-                text,
-                truncation=True,
-                return_length=True,
-                return_overflowing_tokens=False,
-                padding="max_length",
-                return_tensors="pt",
-            )
-            tokens = batch_encoding["input_ids"].to(DEVICE)
-
            with torch.no_grad():
+                tokens = tokenizer2(text2).to(DEVICE)
+
                enc_out = text_model2(tokens, output_hidden_states=True, return_dict=True)
                text_embedding2_penu = enc_out["hidden_states"][-2]
                # logger.info("hidden_states2", text_embedding2_penu.shape)
-                text_embedding2_pool = enc_out["text_embeds"]  # do not support Textual Inversion
+                text_embedding2_pool = enc_out["text_embeds"]   # do not support Textual Inversion

            # 連結して終了 concat and finish
            text_embedding = torch.cat([text_embedding1, text_embedding2_penu], dim=2)
--- a/sdxl_train.py
+++ b/sdxl_train.py
@@ -11,13 +11,11 @@ from tqdm import tqdm

 import torch
 from library.device_utils import init_ipex, clean_memory_on_device
-
-
 init_ipex()

 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler
-from library import deepspeed_utils, sdxl_model_util
+from library import sdxl_model_util

 import library.train_util as train_util

@@ -41,7 +39,6 @@ from library.custom_train_functions import (
    scale_v_prediction_loss_like_noise_prediction,
    add_v_prediction_like_loss,
    apply_debiased_estimation,
-    apply_masked_loss,
 )
 from library.sdxl_original_unet import SdxlUNet2DConditionModel

@@ -100,7 +97,6 @@ def train(args):
    train_util.verify_training_args(args)
    train_util.prepare_dataset_args(args, True)
    sdxl_train_util.verify_sdxl_training_args(args)
-    deepspeed_utils.prepare_deepspeed_args(args)
    setup_logging(args, reset=True)

    assert (
@@ -128,7 +124,7 @@ def train(args):

    # データセットを準備する
    if args.dataset_class is None:
-        blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, args.masked_loss, True))
+        blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False, True))
        if args.dataset_config is not None:
            logger.info(f"Load dataset config from {args.dataset_config}")
            user_config = config_util.load_user_config(args.dataset_config)
@@ -272,7 +268,7 @@ def train(args):
    # 学習を準備する：モデルを適切な状態にする
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()
-    train_unet = args.learning_rate != 0
+    train_unet = args.learning_rate > 0
    train_text_encoder1 = False
    train_text_encoder2 = False

@@ -284,8 +280,8 @@ def train(args):
            text_encoder2.gradient_checkpointing_enable()
        lr_te1 = args.learning_rate_te1 if args.learning_rate_te1 is not None else args.learning_rate  # 0 means not train
        lr_te2 = args.learning_rate_te2 if args.learning_rate_te2 is not None else args.learning_rate  # 0 means not train
-        train_text_encoder1 = lr_te1 != 0
-        train_text_encoder2 = lr_te2 != 0
+        train_text_encoder1 = lr_te1 > 0
+        train_text_encoder2 = lr_te2 > 0

        # caching one text encoder output is not supported
        if not train_text_encoder1:
@@ -345,8 +341,8 @@ def train(args):

    # calculate number of trainable parameters
    n_params = 0
-    for group in params_to_optimize:
-        for p in group["params"]:
+    for params in params_to_optimize:
+        for p in params["params"]:
            n_params += p.numel()

    accelerator.print(f"train unet: {train_unet}, text_encoder1: {train_text_encoder1}, text_encoder2: {train_text_encoder2}")
@@ -355,53 +351,7 @@ def train(args):

    # 学習に必要なクラスを準備する
    accelerator.print("prepare optimizer, data loader etc.")
-
-    if args.fused_optimizer_groups:
-        # fused backward pass: https://pytorch.org/tutorials/intermediate/optimizer_step_in_backward_tutorial.html
-        # Instead of creating an optimizer for all parameters as in the tutorial, we create an optimizer for each group of parameters.
-        # This balances memory usage and management complexity.
-
-        # calculate total number of parameters
-        n_total_params = sum(len(params["params"]) for params in params_to_optimize)
-        params_per_group = math.ceil(n_total_params / args.fused_optimizer_groups)
-
-        # split params into groups, keeping the learning rate the same for all params in a group
-        # this will increase the number of groups if the learning rate is different for different params (e.g. U-Net and text encoders)
-        grouped_params = []
-        param_group = []
-        param_group_lr = -1
-        for group in params_to_optimize:
-            lr = group["lr"]
-            for p in group["params"]:
-                # if the learning rate is different for different params, start a new group
-                if lr != param_group_lr:
-                    if param_group:
-                        grouped_params.append({"params": param_group, "lr": param_group_lr})
-                        param_group = []
-                    param_group_lr = lr
-
-                param_group.append(p)
-
-                # if the group has enough parameters, start a new group
-                if len(param_group) == params_per_group:
-                    grouped_params.append({"params": param_group, "lr": param_group_lr})
-                    param_group = []
-                    param_group_lr = -1
-
-        if param_group:
-            grouped_params.append({"params": param_group, "lr": param_group_lr})
-
-        # prepare optimizers for each group
-        optimizers = []
-        for group in grouped_params:
-            _, _, optimizer = train_util.get_optimizer(args, trainable_params=[group])
-            optimizers.append(optimizer)
-        optimizer = optimizers[0]  # avoid error in the following code
-
-        logger.info(f"using {len(optimizers)} optimizers for fused optimizer groups")
-
-    else:
-        _, _, optimizer = train_util.get_optimizer(args, trainable_params=params_to_optimize)
+    _, _, optimizer = train_util.get_optimizer(args, trainable_params=params_to_optimize)

    # dataloaderを準備する
    # DataLoaderのプロセス数：0 は persistent_workers が使えないので注意
@@ -428,12 +378,7 @@ def train(args):
    train_dataset_group.set_max_train_steps(args.max_train_steps)

    # lr schedulerを用意する
-    if args.fused_optimizer_groups:
-        # prepare lr schedulers for each optimizer
-        lr_schedulers = [train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes) for optimizer in optimizers]
-        lr_scheduler = lr_schedulers[0]  # avoid error in the following code
-    else:
-        lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
+    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)

    # 実験的機能：勾配も含めたfp16/bf16学習を行う　モデル全体をfp16/bf16にする
    if args.full_fp16:
@@ -453,33 +398,18 @@ def train(args):
        text_encoder1.to(weight_dtype)
        text_encoder2.to(weight_dtype)

-    # freeze last layer and final_layer_norm in te1 since we use the output of the penultimate layer
+    # acceleratorがなんかよろしくやってくれるらしい
+    if train_unet:
+        unet = accelerator.prepare(unet)
    if train_text_encoder1:
+        # freeze last layer and final_layer_norm in te1 since we use the output of the penultimate layer
        text_encoder1.text_model.encoder.layers[-1].requires_grad_(False)
        text_encoder1.text_model.final_layer_norm.requires_grad_(False)
+        text_encoder1 = accelerator.prepare(text_encoder1)
+    if train_text_encoder2:
+        text_encoder2 = accelerator.prepare(text_encoder2)

-    if args.deepspeed:
-        ds_model = deepspeed_utils.prepare_deepspeed_model(
-            args,
-            unet=unet if train_unet else None,
-            text_encoder1=text_encoder1 if train_text_encoder1 else None,
-            text_encoder2=text_encoder2 if train_text_encoder2 else None,
-        )
-        # most of ZeRO stage uses optimizer partitioning, so we have to prepare optimizer and ds_model at the same time. # pull/1139#issuecomment-1986790007
-        ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
-            ds_model, optimizer, train_dataloader, lr_scheduler
-        )
-        training_models = [ds_model]
-
-    else:
-        # acceleratorがなんかよろしくやってくれるらしい
-        if train_unet:
-            unet = accelerator.prepare(unet)
-        if train_text_encoder1:
-            text_encoder1 = accelerator.prepare(text_encoder1)
-        if train_text_encoder2:
-            text_encoder2 = accelerator.prepare(text_encoder2)
-        optimizer, train_dataloader, lr_scheduler = accelerator.prepare(optimizer, train_dataloader, lr_scheduler)
+    optimizer, train_dataloader, lr_scheduler = accelerator.prepare(optimizer, train_dataloader, lr_scheduler)

    # TextEncoderの出力をキャッシュするときにはCPUへ移動する
    if args.cache_text_encoder_outputs:
@@ -494,64 +424,11 @@ def train(args):

    # 実験的機能：勾配も含めたfp16学習を行う　PyTorchにパッチを当ててfp16でのgrad scaleを有効にする
    if args.full_fp16:
-        # During deepseed training, accelerate not handles fp16/bf16|mixed precision directly via scaler. Let deepspeed engine do.
-        # -> But we think it's ok to patch accelerator even if deepspeed is enabled.
        train_util.patch_accelerator_for_fp16_training(accelerator)

    # resumeする
    train_util.resume_from_local_or_hf_if_specified(accelerator, args)

-    if args.fused_backward_pass:
-        # use fused optimizer for backward pass: other optimizers will be supported in the future
-        import library.adafactor_fused
-
-        library.adafactor_fused.patch_adafactor_fused(optimizer)
-        for param_group in optimizer.param_groups:
-            for parameter in param_group["params"]:
-                if parameter.requires_grad:
-
-                    def __grad_hook(tensor: torch.Tensor, param_group=param_group):
-                        if accelerator.sync_gradients and args.max_grad_norm != 0.0:
-                            accelerator.clip_grad_norm_(tensor, args.max_grad_norm)
-                        optimizer.step_param(tensor, param_group)
-                        tensor.grad = None
-
-                    parameter.register_post_accumulate_grad_hook(__grad_hook)
-
-    elif args.fused_optimizer_groups:
-        # prepare for additional optimizers and lr schedulers
-        for i in range(1, len(optimizers)):
-            optimizers[i] = accelerator.prepare(optimizers[i])
-            lr_schedulers[i] = accelerator.prepare(lr_schedulers[i])
-
-        # counters are used to determine when to step the optimizer
-        global optimizer_hooked_count
-        global num_parameters_per_group
-        global parameter_optimizer_map
-
-        optimizer_hooked_count = {}
-        num_parameters_per_group = [0] * len(optimizers)
-        parameter_optimizer_map = {}
-
-        for opt_idx, optimizer in enumerate(optimizers):
-            for param_group in optimizer.param_groups:
-                for parameter in param_group["params"]:
-                    if parameter.requires_grad:
-
-                        def optimizer_hook(parameter: torch.Tensor):
-                            if accelerator.sync_gradients and args.max_grad_norm != 0.0:
-                                accelerator.clip_grad_norm_(parameter, args.max_grad_norm)
-
-                            i = parameter_optimizer_map[parameter]
-                            optimizer_hooked_count[i] += 1
-                            if optimizer_hooked_count[i] == num_parameters_per_group[i]:
-                                optimizers[i].step()
-                                optimizers[i].zero_grad(set_to_none=True)
-
-                        parameter.register_post_accumulate_grad_hook(optimizer_hook)
-                        parameter_optimizer_map[parameter] = opt_idx
-                        num_parameters_per_group[opt_idx] += 1
-
    # epoch数を計算する
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
@@ -589,11 +466,7 @@ def train(args):
            init_kwargs["wandb"] = {"name": args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
-        accelerator.init_trackers(
-            "finetuning" if args.log_tracker_name is None else args.log_tracker_name,
-            config=train_util.get_sanitized_config_or_none(args),
-            init_kwargs=init_kwargs,
-        )
+        accelerator.init_trackers("finetuning" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs)

    # For --sample_at_first
    sdxl_train_util.sample_images(
@@ -610,10 +483,6 @@ def train(args):

        for step, batch in enumerate(train_dataloader):
            current_step.value = global_step
-
-            if args.fused_optimizer_groups:
-                optimizer_hooked_count = {i: 0 for i in range(len(optimizers))}  # reset counter for each step
-
            with accelerator.accumulate(*training_models):
                if "latents" in batch and batch["latents"] is not None:
                    latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
@@ -692,9 +561,7 @@ def train(args):

                # Sample noise, sample a random timestep for each image, and add noise to the latents,
                # with noise offset and/or multires noise if specified
-                noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(
-                    args, noise_scheduler, latents
-                )
+                noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)

                noisy_latents = noisy_latents.to(weight_dtype)  # TODO check why noisy_latents is not weight_dtype

@@ -702,60 +569,41 @@ def train(args):
                with accelerator.autocast():
                    noise_pred = unet(noisy_latents, timesteps, text_embedding, vector_embedding)

-                if args.v_parameterization:
-                    # v-parameterization training
-                    target = noise_scheduler.get_velocity(latents, noise, timesteps)
-                else:
-                    target = noise
+                target = noise

                if (
                    args.min_snr_gamma
                    or args.scale_v_pred_loss_like_noise_pred
                    or args.v_pred_like_loss
                    or args.debiased_estimation_loss
-                    or args.masked_loss
                ):
                    # do not mean over batch dimension for snr weight or scale v-pred loss
-                    loss = train_util.conditional_loss(
-                        noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c
-                    )
-                    if args.masked_loss or ("alpha_masks" in batch and batch["alpha_masks"] is not None):
-                        loss = apply_masked_loss(loss, batch)
+                    loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                    loss = loss.mean([1, 2, 3])

                    if args.min_snr_gamma:
-                        loss = apply_snr_weight(loss, timesteps, noise_scheduler, args.min_snr_gamma, args.v_parameterization)
+                        loss = apply_snr_weight(loss, timesteps, noise_scheduler, args.min_snr_gamma)
                    if args.scale_v_pred_loss_like_noise_pred:
                        loss = scale_v_prediction_loss_like_noise_prediction(loss, timesteps, noise_scheduler)
                    if args.v_pred_like_loss:
                        loss = add_v_prediction_like_loss(loss, timesteps, noise_scheduler, args.v_pred_like_loss)
                    if args.debiased_estimation_loss:
-                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler)

                    loss = loss.mean()  # mean over batch dimension
                else:
-                    loss = train_util.conditional_loss(
-                        noise_pred.float(), target.float(), reduction="mean", loss_type=args.loss_type, huber_c=huber_c
-                    )
+                    loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="mean")

                accelerator.backward(loss)
+                if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                    params_to_clip = []
+                    for m in training_models:
+                        params_to_clip.extend(m.parameters())
+                    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)

-                if not (args.fused_backward_pass or args.fused_optimizer_groups):
-                    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
-                        params_to_clip = []
-                        for m in training_models:
-                            params_to_clip.extend(m.parameters())
-                        accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
-
-                    optimizer.step()
-                    lr_scheduler.step()
-                    optimizer.zero_grad(set_to_none=True)
-                else:
-                    # optimizer.step() and optimizer.zero_grad() are called in the optimizer hook
-                    lr_scheduler.step()
-                    if args.fused_optimizer_groups:
-                        for i in range(1, len(optimizers)):
-                            lr_schedulers[i].step()
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad(set_to_none=True)

            # Checks if the accelerator has performed an optimization step behind the scenes
            if accelerator.sync_gradients:
@@ -864,7 +712,7 @@ def train(args):

    accelerator.end_training()

-    if args.save_state or args.save_state_on_train_end:
+    if args.save_state:  # and is_main_process:
        train_util.save_state_on_train_end(args, accelerator)

    del accelerator  # この後メモリを使うのでこれは消す
@@ -896,8 +744,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, True, True, True)
    train_util.add_training_arguments(parser, False)
-    train_util.add_masked_loss_arguments(parser)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_sd_saving_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
@@ -933,12 +779,7 @@ def setup_parser() -> argparse.ArgumentParser:
        help=f"learning rates for each block of U-Net, comma-separated, {UNET_NUM_BLOCKS_FOR_BLOCK_LR} values / "
        + f"U-Netの各ブロックの学習率、カンマ区切り、{UNET_NUM_BLOCKS_FOR_BLOCK_LR}個の値",
    )
-    parser.add_argument(
-        "--fused_optimizer_groups",
-        type=int,
-        default=None,
-        help="number of optimizers for fused backward pass and optimizer step / fused backward passとoptimizer stepのためのoptimizer数",
-    )
+
    return parser


@@ -946,7 +787,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    train(args)
--- a/sdxl_train_control_net_lllite.py
+++ b/sdxl_train_control_net_lllite.py
@@ -15,7 +15,6 @@ from tqdm import tqdm

 import torch
 from library.device_utils import init_ipex, clean_memory_on_device
-
 init_ipex()

 from torch.nn.parallel import DistributedDataParallel as DDP
@@ -23,7 +22,7 @@ from accelerate.utils import set_seed
 import accelerate
 from diffusers import DDPMScheduler, ControlNetModel
 from safetensors.torch import load_file
-from library import deepspeed_utils, sai_model_spec, sdxl_model_util, sdxl_original_unet, sdxl_train_util
+from library import sai_model_spec, sdxl_model_util, sdxl_original_unet, sdxl_train_util

 import library.model_util as model_util
 import library.train_util as train_util
@@ -289,9 +288,6 @@ def train(args):
    # acceleratorがなんかよろしくやってくれるらしい
    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, optimizer, train_dataloader, lr_scheduler)

-    if isinstance(unet, DDP):
-        unet._set_static_graph() # avoid error for multiple use of the parameter
-
    if args.gradient_checkpointing:
        unet.train()  # according to TI example in Diffusers, train is required -> これオリジナルのU-Netしたので本当は外せる
    else:
@@ -357,7 +353,7 @@ def train(args):
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers(
-            "lllite_control_net_train" if args.log_tracker_name is None else args.log_tracker_name, config=train_util.get_sanitized_config_or_none(args), init_kwargs=init_kwargs
+            "lllite_control_net_train" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs
        )

    loss_recorder = train_util.LossRecorder()
@@ -398,10 +394,10 @@ def train(args):
            with accelerator.accumulate(unet):
                with torch.no_grad():
                    if "latents" in batch and batch["latents"] is not None:
-                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                        latents = batch["latents"].to(accelerator.device)
                    else:
                        # latentに変換
-                        latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample().to(dtype=weight_dtype)
+                        latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample()

                        # NaNが含まれていれば警告を表示し0に置き換える
                        if torch.any(torch.isnan(latents)):
@@ -443,9 +439,7 @@ def train(args):

                # Sample noise, sample a random timestep for each image, and add noise to the latents,
                # with noise offset and/or multires noise if specified
-                noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(
-                    args, noise_scheduler, latents
-                )
+                noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)

                noisy_latents = noisy_latents.to(weight_dtype)  # TODO check why noisy_latents is not weight_dtype

@@ -464,9 +458,7 @@ def train(args):
                else:
                    target = noise

-                loss = train_util.conditional_loss(
-                    noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c
-                )
+                loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                loss = loss.mean([1, 2, 3])

                loss_weights = batch["loss_weights"]  # 各sampleごとのweight
@@ -479,13 +471,13 @@ def train(args):
                if args.v_pred_like_loss:
                    loss = add_v_prediction_like_loss(loss, timesteps, noise_scheduler, args.v_pred_like_loss)
                if args.debiased_estimation_loss:
-                    loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+                    loss = apply_debiased_estimation(loss, timesteps, noise_scheduler)

                loss = loss.mean()  # 平均なのでbatch_sizeで割る必要なし

                accelerator.backward(loss)
                if accelerator.sync_gradients and args.max_grad_norm != 0.0:
-                    params_to_clip = accelerator.unwrap_model(unet).get_trainable_params()
+                    params_to_clip = unet.get_trainable_params()
                    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)

                optimizer.step()
@@ -557,7 +549,7 @@ def train(args):

    accelerator.end_training()

-    if is_main_process and (args.save_state or args.save_state_on_train_end):
+    if is_main_process and args.save_state:
        train_util.save_state_on_train_end(args, accelerator)

    if is_main_process:
@@ -574,7 +566,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, False, True, True)
    train_util.add_training_arguments(parser, False)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
    custom_train_functions.add_custom_train_arguments(parser)
@@ -620,7 +611,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    train(args)
--- a/sdxl_train_control_net_lllite_old.py
+++ b/sdxl_train_control_net_lllite_old.py
@@ -18,7 +18,7 @@ from torch.nn.parallel import DistributedDataParallel as DDP
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler, ControlNetModel
 from safetensors.torch import load_file
-from library import deepspeed_utils, sai_model_spec, sdxl_model_util, sdxl_original_unet, sdxl_train_util
+from library import sai_model_spec, sdxl_model_util, sdxl_original_unet, sdxl_train_util

 import library.model_util as model_util
 import library.train_util as train_util
@@ -324,7 +324,7 @@ def train(args):
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers(
-            "lllite_control_net_train" if args.log_tracker_name is None else args.log_tracker_name, config=train_util.get_sanitized_config_or_none(args), init_kwargs=init_kwargs
+            "lllite_control_net_train" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs
        )

    loss_recorder = train_util.LossRecorder()
@@ -361,10 +361,10 @@ def train(args):
            with accelerator.accumulate(network):
                with torch.no_grad():
                    if "latents" in batch and batch["latents"] is not None:
-                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                        latents = batch["latents"].to(accelerator.device)
                    else:
                        # latentに変換
-                        latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample().to(dtype=weight_dtype)
+                        latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample()

                        # NaNが含まれていれば警告を表示し0に置き換える
                        if torch.any(torch.isnan(latents)):
@@ -406,7 +406,7 @@ def train(args):

                # Sample noise, sample a random timestep for each image, and add noise to the latents,
                # with noise offset and/or multires noise if specified
-                noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)
+                noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)

                noisy_latents = noisy_latents.to(weight_dtype)  # TODO check why noisy_latents is not weight_dtype

@@ -426,7 +426,7 @@ def train(args):
                else:
                    target = noise

-                loss = train_util.conditional_loss(noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c)
+                loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                loss = loss.mean([1, 2, 3])

                loss_weights = batch["loss_weights"]  # 各sampleごとのweight
@@ -439,7 +439,7 @@ def train(args):
                if args.v_pred_like_loss:
                    loss = add_v_prediction_like_loss(loss, timesteps, noise_scheduler, args.v_pred_like_loss)
                if args.debiased_estimation_loss:
-                    loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+                    loss = apply_debiased_estimation(loss, timesteps, noise_scheduler)

                loss = loss.mean()  # 平均なのでbatch_sizeで割る必要なし

@@ -534,7 +534,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, False, True, True)
    train_util.add_training_arguments(parser, False)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
    custom_train_functions.add_custom_train_arguments(parser)
@@ -580,7 +579,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    train(args)
--- a/sdxl_train_network.py
+++ b/sdxl_train_network.py
@@ -178,7 +178,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    trainer = SdxlNetworkTrainer()
--- a/sdxl_train_textual_inversion.py
+++ b/sdxl_train_textual_inversion.py
@@ -7,6 +7,7 @@ import torch
 from library.device_utils import init_ipex
 init_ipex()

+import open_clip
 from library import sdxl_model_util, sdxl_train_util, train_util

 import train_textual_inversion
@@ -131,7 +132,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    trainer = SdxlTextualInversionTrainer()
--- a/stable_cascade_gen_img.py
+++ b/stable_cascade_gen_img.py
@@ -0,0 +1,367 @@
+import argparse
+import importlib
+import math
+import os
+import random
+import time
+import numpy as np
+
+from safetensors.torch import load_file, save_file
+import torch
+from tqdm import tqdm
+from transformers import AutoTokenizer, CLIPTextModelWithProjection, CLIPTextConfig
+from PIL import Image
+from accelerate import init_empty_weights
+
+import library.stable_cascade as sc
+import library.stable_cascade_utils as sc_utils
+import library.device_utils as device_utils
+from library import train_util
+from library.sdxl_model_util import _load_state_dict_on_device
+
+
+def main(args):
+    device = device_utils.get_preferred_device()
+
+    loading_device = device if not args.lowvram else "cpu"
+    text_model_device = "cpu"
+
+    dtype = torch.float32
+    if args.bf16:
+        dtype = torch.bfloat16
+    elif args.fp16:
+        dtype = torch.float16
+
+    text_model_dtype = torch.float32
+
+    # EfficientNet encoder
+    effnet = sc_utils.load_effnet(args.effnet_checkpoint_path, loading_device)
+    effnet.eval().requires_grad_(False).to(loading_device)
+
+    generator_c = sc_utils.load_stage_c_model(args.stage_c_checkpoint_path, dtype=dtype, device=loading_device)
+    generator_c.eval().requires_grad_(False).to(loading_device)
+    # if args.xformers or args.sdpa:
+    print(f"Stage C: use_xformers_or_sdpa: {args.xformers} {args.sdpa}")
+    generator_c.set_use_xformers_or_sdpa(args.xformers, args.sdpa)
+
+    generator_b = sc_utils.load_stage_b_model(args.stage_b_checkpoint_path, dtype=dtype, device=loading_device)
+    generator_b.eval().requires_grad_(False).to(loading_device)
+    # if args.xformers or args.sdpa:
+    print(f"Stage B: use_xformers_or_sdpa: {args.xformers} {args.sdpa}")
+    generator_b.set_use_xformers_or_sdpa(args.xformers, args.sdpa)
+
+    # CLIP encoders
+    tokenizer = sc_utils.load_tokenizer(args)
+
+    text_model = sc_utils.load_clip_text_model(
+        args.text_model_checkpoint_path, text_model_dtype, text_model_device, args.save_text_model
+    )
+    text_model = text_model.requires_grad_(False).to(text_model_dtype).to(text_model_device)
+
+    # image_model = (
+    #     CLIPVisionModelWithProjection.from_pretrained(clip_image_model_name).requires_grad_(False).to(dtype).to(device)
+    # )
+
+    # vqGAN
+    stage_a = sc_utils.load_stage_a_model(args.stage_a_checkpoint_path, dtype=dtype, device=loading_device)
+    stage_a.eval().requires_grad_(False)
+
+    # previewer
+    if args.previewer_checkpoint_path is not None:
+        previewer = sc_utils.load_previewer_model(args.previewer_checkpoint_path, dtype=dtype, device=loading_device)
+        previewer.eval().requires_grad_(False)
+    else:
+        previewer = None
+
+    # LoRA
+    if args.network_module:
+        for i, network_module in enumerate(args.network_module):
+            print("import network module:", network_module)
+            imported_module = importlib.import_module(network_module)
+
+            network_mul = 1.0 if args.network_mul is None or len(args.network_mul) <= i else args.network_mul[i]
+
+            net_kwargs = {}
+            if args.network_args and i < len(args.network_args):
+                network_args = args.network_args[i]
+                # TODO escape special chars
+                network_args = network_args.split(";")
+                for net_arg in network_args:
+                    key, value = net_arg.split("=")
+                    net_kwargs[key] = value
+
+            if args.network_weights is None or len(args.network_weights) <= i:
+                raise ValueError("No weight. Weight is required.")
+
+            network_weight = args.network_weights[i]
+            print("load network weights from:", network_weight)
+
+            network, weights_sd = imported_module.create_network_from_weights(
+                network_mul, network_weight, effnet, text_model, generator_c, for_inference=True, **net_kwargs
+            )
+            if network is None:
+                return
+
+            mergeable = network.is_mergeable()
+            assert mergeable, "not-mergeable network is not supported yet."
+
+            network.merge_to(text_model, generator_c, weights_sd, dtype, device)
+
+    # mysterious class: GDF
+    gdf_c = sc.GDF(
+        schedule=sc.CosineSchedule(clamp_range=[0.0001, 0.9999]),
+        input_scaler=sc.VPScaler(),
+        target=sc.EpsilonTarget(),
+        noise_cond=sc.CosineTNoiseCond(),
+        loss_weight=None,
+    )
+    gdf_b = sc.GDF(
+        schedule=sc.CosineSchedule(clamp_range=[0.0001, 0.9999]),
+        input_scaler=sc.VPScaler(),
+        target=sc.EpsilonTarget(),
+        noise_cond=sc.CosineTNoiseCond(),
+        loss_weight=None,
+    )
+
+    # Stage C Parameters
+
+    # extras.sampling_configs["cfg"] = 4
+    # extras.sampling_configs["shift"] = 2
+    # extras.sampling_configs["timesteps"] = 20
+    # extras.sampling_configs["t_start"] = 1.0
+
+    # # Stage B Parameters
+    # extras_b.sampling_configs["cfg"] = 1.1
+    # extras_b.sampling_configs["shift"] = 1
+    # extras_b.sampling_configs["timesteps"] = 10
+    # extras_b.sampling_configs["t_start"] = 1.0
+    b_cfg = 1.1
+    b_shift = 1
+    b_timesteps = 10
+    b_t_start = 1.0
+
+    # caption = "Cinematic photo of an anthropomorphic penguin sitting in a cafe reading a book and having a coffee"
+    # height, width = 1024, 1024
+
+    while True:
+        print("type caption:")
+        # EOF (Ctrl+Z on Windows, Ctrl+D on Linux/macOS) raises EOFError and ends the loop
+        try:
+            caption = input()
+        except EOFError:
+            break
+
+        caption = caption.strip()
+        if caption == "":
+            continue
+
+        # parse options: '--w'/'--h' size, '--l' cfg, '--s' timesteps, '--f' shift, '--t' t_start, '--n' negative prompt (one token), '--d' seed
+        # defaults are used for unspecified options, e.g. "caption --w 1024 --h 1024 --l 4 --s 20 --f 2 --d 42"
+
+        tokens = caption.split()
+        width = height = 1024
+        cfg = 4
+        timesteps = 20
+        shift = 2
+        t_start = 1.0
+        negative_prompt = ""
+        seed = None
+
+        caption_tokens = []
+        i = 0
+        while i < len(tokens):
+            token = tokens[i]
+            if i == len(tokens) - 1:
+                caption_tokens.append(token)
+                i += 1
+                continue
+
+            if token == "--w":
+                width = int(tokens[i + 1])
+            elif token == "--h":
+                height = int(tokens[i + 1])
+            elif token == "--l":
+                cfg = float(tokens[i + 1])
+            elif token == "--s":
+                timesteps = int(tokens[i + 1])
+            elif token == "--f":
+                shift = float(tokens[i + 1])
+            elif token == "--t":
+                t_start = float(tokens[i + 1])
+            elif token == "--n":
+                negative_prompt = tokens[i + 1]
+            elif token == "--d":
+                seed = int(tokens[i + 1])
+            else:
+                caption_tokens.append(token)
+                i += 1
+                continue
+
+            i += 2
+
+        caption = " ".join(caption_tokens)
+
+        stage_c_latent_shape, stage_b_latent_shape = sc_utils.calculate_latent_sizes(height, width, batch_size=1)
+
+        # PREPARE CONDITIONS
+        # cond_text, cond_pooled = sc.get_clip_conditions([caption], None, tokenizer, text_model)
+        input_ids = tokenizer(
+            [caption], truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt"
+        )["input_ids"].to(text_model.device)
+        cond_text, cond_pooled = train_util.get_hidden_states_stable_cascade(
+            tokenizer.model_max_length, input_ids, tokenizer, text_model
+        )
+        cond_text = cond_text.to(device, dtype=dtype)
+        cond_pooled = cond_pooled.unsqueeze(1).to(device, dtype=dtype)
+
+        # uncond_text, uncond_pooled = sc.get_clip_conditions([""], None, tokenizer, text_model)
+        input_ids = tokenizer(
+            [negative_prompt], truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt"
+        )["input_ids"].to(text_model.device)
+        uncond_text, uncond_pooled = train_util.get_hidden_states_stable_cascade(
+            tokenizer.model_max_length, input_ids, tokenizer, text_model
+        )
+        uncond_text = uncond_text.to(device, dtype=dtype)
+        uncond_pooled = uncond_pooled.unsqueeze(1).to(device, dtype=dtype)
+
+        zero_img_emb = torch.zeros(1, 768, device=device)
+
+        # would rather not use dicts here, but changing everything downstream of GDF is a hassle, so keep them as dicts for now
+        conditions = {"clip_text_pooled": cond_pooled, "clip": cond_pooled, "clip_text": cond_text, "clip_img": zero_img_emb}
+        unconditions = {
+            "clip_text_pooled": uncond_pooled,
+            "clip": uncond_pooled,
+            "clip_text": uncond_text,
+            "clip_img": zero_img_emb,
+        }
+        conditions_b = {}
+        conditions_b.update(conditions)
+        unconditions_b = {}
+        unconditions_b.update(unconditions)
+
+        # seed everything
+        if seed is not None:
+            torch.manual_seed(seed)
+            torch.cuda.manual_seed_all(seed)
+            random.seed(seed)
+            np.random.seed(seed)
+            # torch.backends.cudnn.deterministic = True
+            # torch.backends.cudnn.benchmark = False
+
+        if args.lowvram:
+            generator_c = generator_c.to(device)
+
+        with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
+            sampling_c = gdf_c.sample(
+                generator_c,
+                conditions,
+                stage_c_latent_shape,
+                unconditions,
+                device=device,
+                cfg=cfg,
+                shift=shift,
+                timesteps=timesteps,
+                t_start=t_start,
+            )
+            for sampled_c, _, _ in tqdm(sampling_c, total=timesteps):
+                pass  # exhaust the sampler; sampled_c keeps the latent from the final step
+
+            conditions_b["effnet"] = sampled_c
+            unconditions_b["effnet"] = torch.zeros_like(sampled_c)
+
+        if previewer is not None:
+            with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
+                preview = previewer(sampled_c)
+                preview = preview.clamp(0, 1)
+            preview = preview.permute(0, 2, 3, 1).squeeze(0)
+            preview = preview.detach().float().cpu().numpy()
+            preview = Image.fromarray((preview * 255).astype(np.uint8))
+
+            timestamp_str = time.strftime("%Y%m%d_%H%M%S")
+            os.makedirs(args.outdir, exist_ok=True)
+            preview.save(os.path.join(args.outdir, f"preview_{timestamp_str}.png"))
+
+        if args.lowvram:
+            generator_c = generator_c.to(loading_device)
+            device_utils.clean_memory_on_device(device)
+            generator_b = generator_b.to(device)
+
+        with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
+            sampling_b = gdf_b.sample(
+                generator_b,
+                conditions_b,
+                stage_b_latent_shape,
+                unconditions_b,
+                device=device,
+                cfg=b_cfg,
+                shift=b_shift,
+                timesteps=b_timesteps,
+                t_start=b_t_start,
+            )
+            for sampled_b, _, _ in tqdm(sampling_b, total=b_timesteps):
+                pass  # exhaust the sampler; sampled_b keeps the latent from the final step
+
+        if args.lowvram:
+            generator_b = generator_b.to(loading_device)
+            device_utils.clean_memory_on_device(device)
+            stage_a = stage_a.to(device)
+
+        with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
+            sampled = stage_a.decode(sampled_b).float()
+        # print(sampled.shape, sampled.min(), sampled.max())
+
+        if args.lowvram:
+            stage_a = stage_a.to(loading_device)
+            device_utils.clean_memory_on_device(device)
+
+        # float 0-1 to PIL Image
+        sampled = sampled.clamp(0, 1)
+        sampled = sampled.mul(255).to(dtype=torch.uint8)
+        sampled = sampled.permute(0, 2, 3, 1)
+        sampled = sampled.cpu().numpy()
+        sampled = Image.fromarray(sampled[0])
+
+        timestamp_str = time.strftime("%Y%m%d_%H%M%S")
+        os.makedirs(args.outdir, exist_ok=True)
+        sampled.save(os.path.join(args.outdir, f"sampled_{timestamp_str}.png"))
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    sc_utils.add_effnet_arguments(parser)
+    train_util.add_tokenizer_arguments(parser)
+    sc_utils.add_stage_a_arguments(parser)
+    sc_utils.add_stage_b_arguments(parser)
+    sc_utils.add_stage_c_arguments(parser)
+    sc_utils.add_previewer_arguments(parser)
+    sc_utils.add_text_model_arguments(parser)
+    parser.add_argument("--bf16", action="store_true")
+    parser.add_argument("--fp16", action="store_true")
+    parser.add_argument("--xformers", action="store_true")
+    parser.add_argument("--sdpa", action="store_true")
+    parser.add_argument("--outdir", type=str, default="../outputs", help="dir to write results to / 生成画像の出力先")
+    parser.add_argument("--lowvram", action="store_true", help="if specified, use low VRAM mode")
+    parser.add_argument(
+        "--network_module",
+        type=str,
+        default=None,
+        nargs="*",
+        help="additional network module to use / 追加ネットワークを使う時そのモジュール名",
+    )
+    parser.add_argument(
+        "--network_weights", type=str, default=None, nargs="*", help="additional network weights to load / 追加ネットワークの重み"
+    )
+    parser.add_argument(
+        "--network_mul", type=float, default=None, nargs="*", help="additional network multiplier / 追加ネットワークの効果の倍率"
+    )
+    parser.add_argument(
+        "--network_args",
+        type=str,
+        default=None,
+        nargs="*",
+        help="additional arguments for network (key=value) / ネットワークへの追加の引数",
+    )
+    args = parser.parse_args()
+
+    main(args)
--- a/stable_cascade_train_c_network.py
+++ b/stable_cascade_train_c_network.py
--- a/stable_cascade_train_stage_c.py
+++ b/stable_cascade_train_stage_c.py
@@ -0,0 +1,564 @@
+# training with captions
+
+import argparse
+import math
+import os
+from multiprocessing import Value
+from typing import List
+import toml
+
+from tqdm import tqdm
+
+import torch
+from library.device_utils import init_ipex, clean_memory_on_device
+
+init_ipex()
+
+from accelerate.utils import set_seed
+from diffusers import DDPMScheduler
+
+import library.train_util as train_util
+from library.sdxl_train_util import add_sdxl_training_arguments
+import library.stable_cascade_utils as sc_utils
+import library.stable_cascade as sc
+
+from library.utils import setup_logging, add_logging_arguments
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+import library.config_util as config_util
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
+
+
+def train(args):
+    train_util.verify_training_args(args)
+    train_util.prepare_dataset_args(args, True)
+    setup_logging(args, reset=True)
+
+    # assert (
+    #     not args.weighted_captions
+    # ), "weighted_captions is not supported currently / weighted_captionsは現在サポートされていません"
+
+    # TODO add assertions for other unsupported options
+
+    cache_latents = args.cache_latents
+    use_dreambooth_method = args.in_json is None
+
+    if args.seed is not None:
+        set_seed(args.seed)  # initialize the random number sequence
+
+    tokenizer = sc_utils.load_tokenizer(args)
+
+    # prepare the dataset
+    if args.dataset_class is None:
+        blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False, True))
+        if args.dataset_config is not None:
+            logger.info(f"Load dataset config from {args.dataset_config}")
+            user_config = config_util.load_user_config(args.dataset_config)
+            ignored = ["train_data_dir", "in_json"]
+            if any(getattr(args, attr) is not None for attr in ignored):
+                logger.warning(
+                    "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(
+                        ", ".join(ignored)
+                    )
+                )
+        else:
+            if use_dreambooth_method:
+                logger.info("Using DreamBooth method.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(
+                                args.train_data_dir, args.reg_data_dir
+                            )
+                        }
+                    ]
+                }
+            else:
+                logger.info("Training with captions.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": [
+                                {
+                                    "image_dir": args.train_data_dir,
+                                    "metadata_file": args.in_json,
+                                }
+                            ]
+                        }
+                    ]
+                }
+
+        blueprint = blueprint_generator.generate(user_config, args, tokenizer=[tokenizer])
+        train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+    else:
+        train_dataset_group = train_util.load_arbitrary_dataset(args, [tokenizer])
+
+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collator = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)
+
+    train_dataset_group.verify_bucket_reso_steps(32)
+
+    if args.debug_dataset:
+        train_util.debug_dataset(train_dataset_group, True)
+        return
+    if len(train_dataset_group) == 0:
+        logger.error(
+            "No data found. Please verify the metadata file and train_data_dir option. / 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。"
+        )
+        return
+
+    if cache_latents:
+        assert (
+            train_dataset_group.is_latent_cacheable()
+        ), "when caching latents, neither color_aug nor random_crop can be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
+
+    if args.cache_text_encoder_outputs:
+        assert (
+            train_dataset_group.is_text_encoder_output_cacheable()
+        ), "when caching text encoder outputs, none of caption_dropout_rate, shuffle_caption, token_warmup_step or caption_tag_dropout_rate can be used / text encoderの出力をキャッシュするときはcaption_dropout_rate, shuffle_caption, token_warmup_step, caption_tag_dropout_rateは使えません"
+
+    # prepare the accelerator
+    logger.info("prepare accelerator")
+    accelerator = train_util.prepare_accelerator(args)
+
+    # prepare the dtypes for mixed precision and cast as needed
+    weight_dtype, save_dtype = train_util.prepare_dtype(args)
+    effnet_dtype = torch.float32 if args.no_half_vae else weight_dtype
+
+    # load the models
+    loading_device = accelerator.device if args.lowram else "cpu"
+    effnet = sc_utils.load_effnet(args.effnet_checkpoint_path, loading_device)
+    stage_c = sc_utils.load_stage_c_model(args.stage_c_checkpoint_path, device=loading_device)  # dtype is as it is
+    text_encoder1 = sc_utils.load_clip_text_model(args.text_model_checkpoint_path, dtype=weight_dtype, device=loading_device)
+
+    if args.sample_at_first or args.sample_every_n_steps is not None or args.sample_every_n_epochs is not None:
+        # Previewer is small enough to be loaded on CPU
+        previewer = sc_utils.load_previewer_model(args.previewer_checkpoint_path, dtype=torch.float32, device="cpu")
+        previewer.eval()
+    else:
+        previewer = None
+
+    # enable xformers or memory-efficient attention (SDPA) in the model
+    stage_c.set_use_xformers_or_sdpa(args.xformers, args.sdpa)
+
+    # prepare for training
+    if cache_latents:
+        effnet.to(accelerator.device, dtype=effnet_dtype)
+        effnet.requires_grad_(False)
+        effnet.eval()
+        with torch.no_grad():
+            train_dataset_group.cache_latents(
+                effnet,
+                args.vae_batch_size,
+                args.cache_latents_to_disk,
+                accelerator.is_main_process,
+                train_util.STABLE_CASCADE_LATENTS_CACHE_SUFFIX,
+                32,
+            )
+        effnet.to("cpu")
+        clean_memory_on_device(accelerator.device)
+
+        accelerator.wait_for_everyone()
+
+    # prepare for training: put the models into the appropriate state
+    if args.gradient_checkpointing:
+        accelerator.print("enable gradient checkpointing")
+        stage_c.set_gradient_checkpointing(True)
+
+    train_stage_c = args.learning_rate > 0
+    train_text_encoder1 = False
+
+    if args.train_text_encoder:
+        accelerator.print("enable text encoder training")
+        if args.gradient_checkpointing:
+            text_encoder1.gradient_checkpointing_enable()
+        lr_te1 = args.learning_rate_te1 if args.learning_rate_te1 is not None else args.learning_rate  # 0 means not train
+        train_text_encoder1 = lr_te1 > 0
+        assert (
+            train_text_encoder1
+        ), "text_encoder1 learning rate is 0. Please set a positive value / text_encoder1の学習率が0です。正の値を設定してください。"
+
+        if not train_text_encoder1:
+            text_encoder1.to(weight_dtype)
+        text_encoder1.requires_grad_(train_text_encoder1)
+        text_encoder1.train(train_text_encoder1)
+    else:
+        text_encoder1.to(weight_dtype)
+        text_encoder1.requires_grad_(False)
+        text_encoder1.eval()
+
+    # cache the text encoder outputs
+    if args.cache_text_encoder_outputs:
+        # text encoders are in eval mode with gradients disabled
+        with torch.no_grad(), accelerator.autocast():
+            train_dataset_group.cache_text_encoder_outputs(
+                (tokenizer,),
+                (text_encoder1,),
+                accelerator.device,
+                None,
+                args.cache_text_encoder_outputs_to_disk,
+                accelerator.is_main_process,
+                sc_utils.TEXT_ENCODER_OUTPUTS_CACHE_SUFFIX,
+            )
+        accelerator.wait_for_everyone()
+
+    if not cache_latents:
+        effnet.requires_grad_(False)
+        effnet.eval()
+        effnet.to(accelerator.device, dtype=effnet_dtype)
+
+    stage_c.requires_grad_(True)
+    if not train_stage_c:
+        stage_c.to(accelerator.device, dtype=weight_dtype)  # because of stage_c will not be prepared
+
+    training_models = []
+    params_to_optimize = []
+    if train_stage_c:
+        training_models.append(stage_c)
+        params_to_optimize.append({"params": list(stage_c.parameters()), "lr": args.learning_rate})
+
+    if train_text_encoder1:
+        training_models.append(text_encoder1)
+        params_to_optimize.append({"params": list(text_encoder1.parameters()), "lr": args.learning_rate_te1 or args.learning_rate})
+
+    # calculate number of trainable parameters
+    n_params = 0
+    for params in params_to_optimize:
+        for p in params["params"]:
+            n_params += p.numel()
+
+    accelerator.print(f"train stage-C: {train_stage_c}, text_encoder1: {train_text_encoder1}")
+    accelerator.print(f"number of models: {len(training_models)}")
+    accelerator.print(f"number of trainable parameters: {n_params}")
+
+    # prepare the classes needed for training
+    accelerator.print("prepare optimizer, data loader etc.")
+    _, _, optimizer = train_util.get_optimizer(args, trainable_params=params_to_optimize)
+
+    # prepare the dataloader
+    # number of DataLoader worker processes: 0 means loading in the main process
+    n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)  # cpu_count - 1, capped at the specified value
+    train_dataloader = torch.utils.data.DataLoader(
+        train_dataset_group,
+        batch_size=1,
+        shuffle=True,
+        collate_fn=collator,
+        num_workers=n_workers,
+        persistent_workers=args.persistent_data_loader_workers,
+    )
+
+    # compute the number of training steps
+    if args.max_train_epochs is not None:
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
+        )
+        accelerator.print(
+            f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}"
+        )
+
+    # pass the number of training steps to the dataset as well
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
+    # prepare the lr scheduler
+    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
+
+    # experimental: train in fp16/bf16 including gradients; cast the whole model to fp16/bf16
+    if args.full_fp16:
+        assert (
+            args.mixed_precision == "fp16"
+        ), "full_fp16 requires mixed precision='fp16' / full_fp16を使う場合はmixed_precision='fp16'を指定してください。"
+        accelerator.print("enable full fp16 training.")
+        stage_c.to(weight_dtype)
+        text_encoder1.to(weight_dtype)
+    elif args.full_bf16:
+        assert (
+            args.mixed_precision == "bf16"
+        ), "full_bf16 requires mixed precision='bf16' / full_bf16を使う場合はmixed_precision='bf16'を指定してください。"
+        accelerator.print("enable full bf16 training.")
+        stage_c.to(weight_dtype)
+        text_encoder1.to(weight_dtype)
+
+    # accelerator apparently takes care of the rest for us
+    if train_stage_c:
+        stage_c = accelerator.prepare(stage_c)
+    if train_text_encoder1:
+        text_encoder1 = accelerator.prepare(text_encoder1)
+
+    optimizer, train_dataloader, lr_scheduler = accelerator.prepare(optimizer, train_dataloader, lr_scheduler)
+
+    # when caching text encoder outputs, move the text encoder to CPU
+    if args.cache_text_encoder_outputs:
+        # move Text Encoders for sampling images. Text Encoder doesn't work on CPU with fp16
+        text_encoder1.to("cpu", dtype=torch.float32)
+        clean_memory_on_device(accelerator.device)
+    else:
+        # make sure Text Encoders are on GPU
+        text_encoder1.to(accelerator.device)
+
+    # experimental: train in fp16 including gradients; patch PyTorch to enable grad scaling in fp16
+    if args.full_fp16:
+        train_util.patch_accelerator_for_fp16_training(accelerator)
+
+    # resume from a saved state if specified
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)
+
+    # compute the number of epochs
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
+    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
+    if (args.save_n_epoch_ratio is not None) and (args.save_n_epoch_ratio > 0):
+        args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
+
+    # run training
+    # total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    accelerator.print("running training / 学習開始")
+    accelerator.print(f"  num examples / サンプル数: {train_dataset_group.num_train_images}")
+    accelerator.print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
+    accelerator.print(f"  num epochs / epoch数: {num_train_epochs}")
+    accelerator.print(
+        f"  batch size per device / バッチサイズ: {', '.join([str(d.batch_size) for d in train_dataset_group.datasets])}"
+    )
+    # accelerator.print(
+    #     f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}"
+    # )
+    accelerator.print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
+    accelerator.print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
+
+    progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
+    global_step = 0
+
+    # GDF: the (somewhat mysterious) diffusion framework class
+    gdf = sc.GDF(
+        schedule=sc.CosineSchedule(clamp_range=[0.0001, 0.9999]),
+        input_scaler=sc.VPScaler(),
+        target=sc.EpsilonTarget(),
+        noise_cond=sc.CosineTNoiseCond(),
+        loss_weight=sc.AdaptiveLossWeight() if args.adaptive_loss_weight else sc.P2LossWeight(),
+    )
+
+    # the following two variables apparently stay at their defaults
+    # gdf.loss_weight.bucket_ranges = torch.tensor(self.info.adaptive_loss['bucket_ranges'])
+    # gdf.loss_weight.bucket_losses = torch.tensor(self.info.adaptive_loss['bucket_losses'])
+
+    if accelerator.is_main_process:
+        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs["wandb"] = {"name": args.wandb_run_name}
+        if args.log_tracker_config is not None:
+            init_kwargs = toml.load(args.log_tracker_config)
+        accelerator.init_trackers("finetuning" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs)
+
+    # For --sample_at_first
+    sc_utils.sample_images(accelerator, args, 0, global_step, previewer, tokenizer, text_encoder1, stage_c, gdf)
+
+    loss_recorder = train_util.LossRecorder()
+    for epoch in range(num_train_epochs):
+        accelerator.print(f"\nepoch {epoch+1}/{num_train_epochs}")
+        current_epoch.value = epoch + 1
+
+        for m in training_models:
+            m.train()
+
+        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
+            with accelerator.accumulate(*training_models):
+                if "latents" in batch and batch["latents"] is not None:
+                    latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                else:
+                    with torch.no_grad():
+                        # convert to latents
+                        # XXX Effnet preprocessing is included in encode method
+                        latents = effnet.encode(batch["images"].to(effnet_dtype)).latent_dist.sample().to(weight_dtype)
+
+                        # if NaNs are present, print a warning and replace them with zeros
+                        if torch.any(torch.isnan(latents)):
+                            accelerator.print("NaN found in latents, replacing with zeros")
+                            latents = torch.nan_to_num(latents, 0, out=latents)
+
+                # # debug: decode latent with previewer and save it
+                # import time
+                # import numpy as np
+                # from PIL import Image
+                # ts = time.time()
+                # images = previewer(latents.to(previewer.device, dtype=previewer.dtype))
+                # for i, img in enumerate(images):
+                #     img = img.detach().cpu().numpy().transpose(1, 2, 0)
+                #     img = np.clip(img, 0, 1)
+                #     img = (img * 255).astype(np.uint8)
+                #     img = Image.fromarray(img)
+                #     img.save(f"logs/previewer_{i}_{ts}.png")
+
+                if "text_encoder_outputs1_list" not in batch or batch["text_encoder_outputs1_list"] is None:
+                    input_ids1 = batch["input_ids"]
+                    with torch.set_grad_enabled(args.train_text_encoder):
+                        # Get the text embedding for conditioning
+                        # TODO support weighted captions
+                        input_ids1 = input_ids1.to(accelerator.device)
+                        # unwrap_model is fine for models not wrapped by accelerator
+                        encoder_hidden_states, pool = train_util.get_hidden_states_stable_cascade(
+                            args.max_token_length,
+                            input_ids1,
+                            tokenizer,
+                            text_encoder1,
+                            None if not args.full_fp16 else weight_dtype,
+                            accelerator,
+                        )
+                else:
+                    encoder_hidden_states = batch["text_encoder_outputs1_list"].to(accelerator.device).to(weight_dtype)
+                    pool = batch["text_encoder_pool2_list"].to(accelerator.device).to(weight_dtype)
+
+                pool = pool.unsqueeze(1)  # add extra dimension b,1280 -> b,1,1280
+
+                # FORWARD PASS
+                with torch.no_grad():
+                    noised, noise, target, logSNR, noise_cond, loss_weight = gdf.diffuse(latents, shift=1, loss_shift=1)
+
+                zero_img_emb = torch.zeros(noised.shape[0], 768, device=accelerator.device)
+                with accelerator.autocast():
+                    pred = stage_c(
+                        noised, noise_cond, clip_text=encoder_hidden_states, clip_text_pooled=pool, clip_img=zero_img_emb
+                    )
+                    loss = torch.nn.functional.mse_loss(pred, target, reduction="none").mean(dim=[1, 2, 3])
+                    loss_adjusted = (loss * loss_weight).mean()
+
+                if args.adaptive_loss_weight:
+                    gdf.loss_weight.update_buckets(logSNR, loss)  # use loss instead of loss_adjusted
+
+                accelerator.backward(loss_adjusted)
+                if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                    params_to_clip = []
+                    for m in training_models:
+                        params_to_clip.extend(m.parameters())
+                    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
+
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad(set_to_none=True)
+
+            # Checks if the accelerator has performed an optimization step behind the scenes
+            if accelerator.sync_gradients:
+                progress_bar.update(1)
+                global_step += 1
+
+                sc_utils.sample_images(accelerator, args, None, global_step, previewer, tokenizer, text_encoder1, stage_c, gdf)
+
+                # save the model every specified number of steps
+                if args.save_every_n_steps is not None and global_step % args.save_every_n_steps == 0:
+                    accelerator.wait_for_everyone()
+                    if accelerator.is_main_process:
+                        sc_utils.save_stage_c_model_on_epoch_end_or_stepwise(
+                            args,
+                            False,
+                            accelerator,
+                            save_dtype,
+                            epoch,
+                            num_train_epochs,
+                            global_step,
+                            accelerator.unwrap_model(stage_c),
+                            accelerator.unwrap_model(text_encoder1) if train_text_encoder1 else None,
+                        )
+
+            current_loss = loss_adjusted.detach().item()  # this is a mean, so batch size should not matter
+            if args.logging_dir is not None:
+                logs = {"loss": current_loss}
+                train_util.append_lr_to_logs(logs, lr_scheduler, args.optimizer_type, including_unet=True)
+
+                accelerator.log(logs, step=global_step)
+
+            loss_recorder.add(epoch=epoch, step=step, loss=current_loss)
+            avr_loss: float = loss_recorder.moving_average
+            logs = {"avr_loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
+            progress_bar.set_postfix(**logs)
+
+            if global_step >= args.max_train_steps:
+                break
+
+        if args.logging_dir is not None:
+            logs = {"loss/epoch": loss_recorder.moving_average}
+            accelerator.log(logs, step=epoch + 1)
+
+        accelerator.wait_for_everyone()
+
+        if args.save_every_n_epochs is not None:
+            if accelerator.is_main_process:
+                sc_utils.save_stage_c_model_on_epoch_end_or_stepwise(
+                    args,
+                    True,
+                    accelerator,
+                    save_dtype,
+                    epoch,
+                    num_train_epochs,
+                    global_step,
+                    accelerator.unwrap_model(stage_c),
+                    accelerator.unwrap_model(text_encoder1) if train_text_encoder1 else None,
+                )
+
+        sc_utils.sample_images(accelerator, args, epoch + 1, global_step, previewer, tokenizer, text_encoder1, stage_c, gdf)
+
+    is_main_process = accelerator.is_main_process
+    # if is_main_process:
+    stage_c = accelerator.unwrap_model(stage_c)
+    text_encoder1 = accelerator.unwrap_model(text_encoder1)
+
+    accelerator.end_training()
+
+    if args.save_state:  # and is_main_process:
+        train_util.save_state_on_train_end(args, accelerator)
+
+    del accelerator  # delete it because memory is needed after this point
+
+    if is_main_process:
+        sc_utils.save_stage_c_model_on_end(
+            args, save_dtype, epoch, global_step, stage_c, text_encoder1 if train_text_encoder1 else None
+        )
+        logger.info("model saved.")
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+
+    add_logging_arguments(parser)
+    sc_utils.add_effnet_arguments(parser)
+    sc_utils.add_stage_c_arguments(parser)
+    sc_utils.add_text_model_arguments(parser)
+    sc_utils.add_previewer_arguments(parser)
+    sc_utils.add_training_arguments(parser)
+    train_util.add_tokenizer_arguments(parser)
+    train_util.add_dataset_arguments(parser, True, True, True)
+    train_util.add_training_arguments(parser, False)
+    train_util.add_sd_saving_arguments(parser)
+    train_util.add_optimizer_arguments(parser)
+    config_util.add_config_arguments(parser)
+    add_sdxl_training_arguments(parser)  # cache text encoder outputs
+
+    parser.add_argument("--train_text_encoder", action="store_true", help="train text encoder / text encoderも学習する")
+    parser.add_argument(
+        "--learning_rate_te1",
+        type=float,
+        default=None,
+        help="learning rate for text encoder / text encoderの学習率",
+    )
+    parser.add_argument(
+        "--no_half_vae",
+        action="store_true",
+        help="do not use fp16/bf16 Effnet in mixed precision (use float Effnet) / mixed precisionでも fp16/bf16 Effnetを使わずfloat Effnetを使う",
+    )
+
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    args = train_util.read_config_from_file(args, parser)
+
+    train(args)
--- a/tools/cache_latents.py
+++ b/tools/cache_latents.py
@@ -16,15 +16,12 @@ from library.config_util import (
    ConfigSanitizer,
    BlueprintGenerator,
 )
-from library.utils import setup_logging, add_logging_arguments
+from library.utils import setup_logging
 setup_logging()
 import logging
-
 logger = logging.getLogger(__name__)

-
 def cache_to_disk(args: argparse.Namespace) -> None:
-    setup_logging(args, reset=True)
    train_util.prepare_dataset_args(args, True)

    # check cache latents arg
@@ -97,7 +94,6 @@ def cache_to_disk(args: argparse.Namespace) -> None:

    # acceleratorを準備する
    logger.info("prepare accelerator")
-    args.deepspeed = False
    accelerator = train_util.prepare_accelerator(args)

    # mixed precisionに対応した型を用意しておき適宜castする
@@ -111,7 +107,7 @@ def cache_to_disk(args: argparse.Namespace) -> None:
    else:
        _, vae, _, _ = train_util.load_target_model(args, weight_dtype, accelerator)

-    if torch.__version__ >= "2.0.0":  # PyTorch 2.0.0 以上対応のxformersなら以下が使える
+    if torch.__version__ >= "2.0.0": # usable with xformers builds that support PyTorch 2.0.0 or later
        vae.set_use_memory_efficient_attention_xformers(args.xformers)
    vae.to(accelerator.device, dtype=vae_dtype)
    vae.requires_grad_(False)
@@ -140,7 +136,6 @@ def cache_to_disk(args: argparse.Namespace) -> None:
        b_size = len(batch["images"])
        vae_batch_size = b_size if args.vae_batch_size is None else args.vae_batch_size
        flip_aug = batch["flip_aug"]
-        alpha_mask = batch["alpha_mask"]
        random_crop = batch["random_crop"]
        bucket_reso = batch["bucket_reso"]

@@ -159,16 +154,14 @@ def cache_to_disk(args: argparse.Namespace) -> None:
                image_info.latents_npz = os.path.splitext(absolute_path)[0] + ".npz"

                if args.skip_existing:
-                    if train_util.is_disk_cached_latents_is_expected(
-                        image_info.bucket_reso, image_info.latents_npz, flip_aug, alpha_mask
-                    ):
+                    if train_util.is_disk_cached_latents_is_expected(image_info.bucket_reso, image_info.latents_npz, flip_aug):
                        logger.warning(f"Skipping {image_info.latents_npz} because it already exists.")
                        continue

                image_infos.append(image_info)

            if len(image_infos) > 0:
-                train_util.cache_batch_latents(vae, True, image_infos, flip_aug, alpha_mask, random_crop)
+                train_util.cache_batch_latents(vae, True, image_infos, flip_aug, random_crop)

    accelerator.wait_for_everyone()
    accelerator.print(f"Finished caching latents for {len(train_dataset_group)} batches.")
@@ -177,7 +170,6 @@ def cache_to_disk(args: argparse.Namespace) -> None:
 def setup_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()

-    add_logging_arguments(parser)
    train_util.add_sd_models_arguments(parser)
    train_util.add_training_arguments(parser, True)
    train_util.add_dataset_arguments(parser, True, True, True)
--- a/tools/cache_text_encoder_outputs.py
+++ b/tools/cache_text_encoder_outputs.py
@@ -16,13 +16,12 @@ from library.config_util import (
    ConfigSanitizer,
    BlueprintGenerator,
 )
-from library.utils import setup_logging, add_logging_arguments
+from library.utils import setup_logging
 setup_logging()
 import logging
 logger = logging.getLogger(__name__)

 def cache_to_disk(args: argparse.Namespace) -> None:
-    setup_logging(args, reset=True)
    train_util.prepare_dataset_args(args, True)

    # check cache arg
@@ -100,7 +99,6 @@ def cache_to_disk(args: argparse.Namespace) -> None:

    # acceleratorを準備する
    logger.info("prepare accelerator")
-    args.deepspeed = False
    accelerator = train_util.prepare_accelerator(args)

    # mixed precisionに対応した型を用意しておき適宜castする
@@ -173,7 +171,6 @@ def cache_to_disk(args: argparse.Namespace) -> None:
 def setup_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()

-    add_logging_arguments(parser)
    train_util.add_sd_models_arguments(parser)
    train_util.add_training_arguments(parser, True)
    train_util.add_dataset_arguments(parser, True, True, True)
--- a/tools/detect_face_rotate.py
+++ b/tools/detect_face_rotate.py
@@ -15,7 +15,7 @@ import os
 from anime_face_detector import create_detector
 from tqdm import tqdm
 import numpy as np
-from library.utils import setup_logging, pil_resize
+from library.utils import setup_logging
 setup_logging()
 import logging
 logger = logging.getLogger(__name__)
@@ -172,10 +172,7 @@ def process(args):
        if scale != 1.0:
          w = int(w * scale + .5)
          h = int(h * scale + .5)
-          if scale < 1.0:
-            face_img = cv2.resize(face_img, (w, h), interpolation=cv2.INTER_AREA)
-          else:
-            face_img = pil_resize(face_img, (w, h))
+          face_img = cv2.resize(face_img, (w, h), interpolation=cv2.INTER_AREA if scale < 1.0 else cv2.INTER_LANCZOS4)
          cx = int(cx * scale + .5)
          cy = int(cy * scale + .5)
          fw = int(fw * scale + .5)
--- a/tools/resize_images_to_resolution.py
+++ b/tools/resize_images_to_resolution.py
@@ -6,7 +6,7 @@ import shutil
 import math
 from PIL import Image
 import numpy as np
-from library.utils import setup_logging, pil_resize
+from library.utils import setup_logging
 setup_logging()
 import logging
 logger = logging.getLogger(__name__)
@@ -24,9 +24,9 @@ def resize_images(src_img_folder, dst_img_folder, max_resolution="512x512", divi

  # Select interpolation method
  if interpolation == 'lanczos4':
-    pil_interpolation = Image.LANCZOS
+    cv2_interpolation = cv2.INTER_LANCZOS4
  elif interpolation == 'cubic':
-    pil_interpolation = Image.BICUBIC
+    cv2_interpolation = cv2.INTER_CUBIC
  else:
    cv2_interpolation = cv2.INTER_AREA

@@ -64,10 +64,7 @@ def resize_images(src_img_folder, dst_img_folder, max_resolution="512x512", divi
        new_width = int(img.shape[1] * math.sqrt(scale_factor))

        # Resize image
-        if cv2_interpolation:
-          img = cv2.resize(img, (new_width, new_height), interpolation=cv2_interpolation)
-        else:
-          img = pil_resize(img, (new_width, new_height), interpolation=pil_interpolation)
+        img = cv2.resize(img, (new_width, new_height), interpolation=cv2_interpolation)
      else:
        new_height, new_width = img.shape[0:2]

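Review note on the resize step in tools/resize_images_to_resolution.py: the hunk above multiplies both sides by math.sqrt(scale_factor). The following is a minimal sketch of that arithmetic. It assumes scale_factor is the ratio of the maximum allowed pixel area to the current pixel area; that definition lies outside the lines shown. The helper name scaled_size is hypothetical and does not exist in the repository.

```python
import math


def scaled_size(width: int, height: int, max_pixels: int) -> tuple[int, int]:
    """Return (new_width, new_height) whose area is at most max_pixels.

    Hypothetical helper: scaling each side by sqrt(scale_factor) scales the
    area by scale_factor while keeping the aspect ratio.
    """
    current_pixels = width * height
    if current_pixels <= max_pixels:
        # Already small enough: keep the original size (no upscaling).
        return width, height
    scale_factor = max_pixels / current_pixels
    new_width = int(width * math.sqrt(scale_factor))
    new_height = int(height * math.sqrt(scale_factor))
    return new_width, new_height


# Shrink a 2048x1024 image to fit a 512x512 pixel budget.
print(scaled_size(2048, 1024, 512 * 512))
```

Truncating with int() keeps the result at or below the pixel budget, at the cost of a slightly smaller image than the budget allows.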
--- a/tools/stable_cascade_cache_latents.py
+++ b/tools/stable_cascade_cache_latents.py
@@ -0,0 +1,191 @@
+# Stable Cascadeのlatentsをdiskにキャッシュする
+# cache latents of Stable Cascade to disk
+
+import argparse
+import math
+from multiprocessing import Value
+import os
+
+from accelerate.utils import set_seed
+import torch
+from tqdm import tqdm
+
+from library import stable_cascade_utils as sc_utils
+from library import config_util
+from library import train_util
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def cache_to_disk(args: argparse.Namespace) -> None:
+    train_util.prepare_dataset_args(args, True)
+
+    # check cache latents arg
+    assert args.cache_latents_to_disk, "cache_latents_to_disk must be True / cache_latents_to_diskはTrueである必要があります"
+
+    use_dreambooth_method = args.in_json is None
+
+    if args.seed is not None:
+        set_seed(args.seed)  # initialize the random number sequence
+
+    # prepare the tokenizer: required to run the dataset
+    tokenizer = sc_utils.load_tokenizer(args)
+    tokenizers = [tokenizer]
+
+    # prepare the dataset
+    if args.dataset_class is None:
+        blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False, True))
+        if args.dataset_config is not None:
+            logger.info(f"Load dataset config from {args.dataset_config}")
+            user_config = config_util.load_user_config(args.dataset_config)
+            ignored = ["train_data_dir", "in_json"]
+            if any(getattr(args, attr) is not None for attr in ignored):
+                logger.warning(
+                    "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(
+                        ", ".join(ignored)
+                    )
+                )
+        else:
+            if use_dreambooth_method:
+                logger.info("Using DreamBooth method.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(
+                                args.train_data_dir, args.reg_data_dir
+                            )
+                        }
+                    ]
+                }
+            else:
+                logger.info("Training with captions.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": [
+                                {
+                                    "image_dir": args.train_data_dir,
+                                    "metadata_file": args.in_json,
+                                }
+                            ]
+                        }
+                    ]
+                }
+
+        blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizers)
+        train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+    else:
+        train_dataset_group = train_util.load_arbitrary_dataset(args, tokenizers)
+
+    # unless the dataset's cache_latents is called, raw images are returned
+
+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collator = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)
+
+    # prepare the accelerator
+    logger.info("prepare accelerator")
+    accelerator = train_util.prepare_accelerator(args)
+
+    # prepare the dtype for mixed precision and cast as needed
+    weight_dtype, _ = train_util.prepare_dtype(args)
+    effnet_dtype = torch.float32 if args.no_half_vae else weight_dtype
+
+    # load the model
+    logger.info("load model")
+    effnet = sc_utils.load_effnet(args.effnet_checkpoint_path, accelerator.device)
+    effnet.to(accelerator.device, dtype=effnet_dtype)
+    effnet.requires_grad_(False)
+    effnet.eval()
+
+    # prepare the dataloader
+    train_dataset_group.set_caching_mode("latents")
+
+    # number of DataLoader worker processes: 0 means loading in the main process
+    n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)  # cpu_count - 1, capped at the specified number
+
+    train_dataloader = torch.utils.data.DataLoader(
+        train_dataset_group,
+        batch_size=1,
+        shuffle=True,
+        collate_fn=collator,
+        num_workers=n_workers,
+        persistent_workers=args.persistent_data_loader_workers,
+    )
+
+    # prepare the dataloader with accelerator: this should make it usable with multiple GPUs
+    train_dataloader = accelerator.prepare(train_dataloader)
+
+    # loop that fetches the data
+    for batch in tqdm(train_dataloader):
+        b_size = len(batch["images"])
+        vae_batch_size = b_size if args.vae_batch_size is None else args.vae_batch_size
+        flip_aug = batch["flip_aug"]
+        random_crop = batch["random_crop"]
+        bucket_reso = batch["bucket_reso"]
+
+        # split the batch and process it in chunks
+        for i in range(0, b_size, vae_batch_size):
+            images = batch["images"][i : i + vae_batch_size]
+            absolute_paths = batch["absolute_paths"][i : i + vae_batch_size]
+            resized_sizes = batch["resized_sizes"][i : i + vae_batch_size]
+
+            image_infos = []
+                for image, absolute_path, resized_size in zip(images, absolute_paths, resized_sizes):
+                image_info = train_util.ImageInfo(absolute_path, 1, "dummy", False, absolute_path)
+                image_info.image = image
+                image_info.bucket_reso = bucket_reso
+                image_info.resized_size = resized_size
+                image_info.latents_npz = os.path.splitext(absolute_path)[0] + train_util.STABLE_CASCADE_LATENTS_CACHE_SUFFIX
+
+                if args.skip_existing:
+                    if train_util.is_disk_cached_latents_is_expected(image_info.bucket_reso, image_info.latents_npz, flip_aug, 32):
+                        logger.warning(f"Skipping {image_info.latents_npz} because it already exists.")
+                        continue
+
+                image_infos.append(image_info)
+
+            if len(image_infos) > 0:
+                train_util.cache_batch_latents(effnet, True, image_infos, flip_aug, random_crop)
+
+    accelerator.wait_for_everyone()
+    accelerator.print(f"Finished caching latents for {len(train_dataset_group)} batches.")
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+
+    train_util.add_tokenizer_arguments(parser)
+    sc_utils.add_effnet_arguments(parser)
+    train_util.add_training_arguments(parser, True)
+    train_util.add_dataset_arguments(parser, True, True, True)
+    config_util.add_config_arguments(parser)
+    parser.add_argument(
+        "--no_half_vae",
+        action="store_true",
+        help="do not use fp16/bf16 Effnet in mixed precision (use float Effnet) / mixed precisionでも fp16/bf16 Effnetを使わずfloat Effnetを使う",
+    )
+    parser.add_argument(
+        "--skip_existing",
+        action="store_true",
+        help="skip images if npz already exists (both normal and flipped exists if flip_aug is enabled) / npzが既に存在する画像をスキップする（flip_aug有効時は通常、反転の両方が存在する画像をスキップ）",
+    )
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    args = train_util.read_config_from_file(args, parser)
+
+    cache_to_disk(args)
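Review note on tools/stable_cascade_cache_latents.py: the main loop derives each cache file name from the image path, optionally skips images that are already cached, and feeds the rest to the encoder in slices of vae_batch_size. The sketch below shows only that control flow in plain Python. It is not the repository's implementation: the suffix value is a placeholder (the real one is train_util.STABLE_CASCADE_LATENTS_CACHE_SUFFIX, whose value is not shown in this diff), the helper names are hypothetical, and the real skip check (is_disk_cached_latents_is_expected) takes more than the file path, namely the bucket resolution, the flip_aug flag and the literal 32.

```python
import os

# Placeholder value, assumed for illustration only.
LATENTS_CACHE_SUFFIX = "_sc_latents.npz"


def cache_path_for(image_path: str, suffix: str = LATENTS_CACHE_SUFFIX) -> str:
    """Replace the image extension with the cache suffix."""
    return os.path.splitext(image_path)[0] + suffix


def iter_chunks(items: list, chunk_size: int):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start : start + chunk_size]


def select_uncached(image_paths: list, existing: set) -> list:
    """Keep only the images whose cache file is not listed in `existing`."""
    return [p for p in image_paths if cache_path_for(p) not in existing]


paths = ["data/a.png", "data/b.jpg", "data/c.webp"]
already_cached = {cache_path_for("data/b.jpg")}
for chunk in iter_chunks(select_uncached(paths, already_cached), 2):
    print(chunk)
```

Using distinct loop variable names for the chunk offset and the per-image loop avoids the shadowing of `i` that the original inner loop had.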
--- a/tools/stable_cascade_cache_text_encoder_outputs.py
+++ b/tools/stable_cascade_cache_text_encoder_outputs.py
@@ -0,0 +1,183 @@
+# text encoder出力のdiskへの事前キャッシュを行う / cache text encoder outputs to disk in advance
+
+import argparse
+import math
+from multiprocessing import Value
+import os
+
+from accelerate.utils import set_seed
+import torch
+from tqdm import tqdm
+
+from library import config_util
+from library import train_util
+from library import sdxl_train_util
+from library import stable_cascade_utils as sc_utils
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def cache_to_disk(args: argparse.Namespace) -> None:
+    train_util.prepare_dataset_args(args, True)
+
+    # check cache arg
+    assert (
+        args.cache_text_encoder_outputs_to_disk
+    ), "cache_text_encoder_outputs_to_disk must be True / cache_text_encoder_outputs_to_diskはTrueである必要があります"
+
+    use_dreambooth_method = args.in_json is None
+
+    if args.seed is not None:
+        set_seed(args.seed)  # initialize the random number sequence
+
+    # prepare the tokenizer: required to run the dataset
+    tokenizer = sc_utils.load_tokenizer(args)
+    tokenizers = [tokenizer]
+
+    # prepare the dataset
+    if args.dataset_class is None:
+        blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False, True))
+        if args.dataset_config is not None:
+            logger.info(f"Load dataset config from {args.dataset_config}")
+            user_config = config_util.load_user_config(args.dataset_config)
+            ignored = ["train_data_dir", "in_json"]
+            if any(getattr(args, attr) is not None for attr in ignored):
+                logger.warning(
+                    "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(
+                        ", ".join(ignored)
+                    )
+                )
+        else:
+            if use_dreambooth_method:
+                logger.info("Using DreamBooth method.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(
+                                args.train_data_dir, args.reg_data_dir
+                            )
+                        }
+                    ]
+                }
+            else:
+                logger.info("Training with captions.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": [
+                                {
+                                    "image_dir": args.train_data_dir,
+                                    "metadata_file": args.in_json,
+                                }
+                            ]
+                        }
+                    ]
+                }
+
+        blueprint = blueprint_generator.generate(user_config, args, tokenizer=tokenizers)
+        train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+    else:
+        train_dataset_group = train_util.load_arbitrary_dataset(args, tokenizers)
+
+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collator = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)
+
+    # prepare the accelerator
+    logger.info("prepare accelerator")
+    accelerator = train_util.prepare_accelerator(args)
+
+    # prepare the dtype for mixed precision and cast as needed
+    weight_dtype, _ = train_util.prepare_dtype(args)
+
+    # load the model
+    logger.info("load model")
+    text_encoder = sc_utils.load_clip_text_model(
+        args.text_model_checkpoint_path, weight_dtype, accelerator.device, args.save_text_model
+    )
+    text_encoders = [text_encoder]
+    for text_encoder in text_encoders:
+        text_encoder.to(accelerator.device, dtype=weight_dtype)
+        text_encoder.requires_grad_(False)
+        text_encoder.eval()
+
+    # prepare the dataloader
+    train_dataset_group.set_caching_mode("text")
+
+    # number of DataLoader worker processes: 0 means loading in the main process
+    n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)  # cpu_count - 1, capped at the specified number
+
+    train_dataloader = torch.utils.data.DataLoader(
+        train_dataset_group,
+        batch_size=1,
+        shuffle=True,
+        collate_fn=collator,
+        num_workers=n_workers,
+        persistent_workers=args.persistent_data_loader_workers,
+    )
+
+    # prepare the dataloader with accelerator: this should make it usable with multiple GPUs
+    train_dataloader = accelerator.prepare(train_dataloader)
+
+    # loop that fetches the data
+    for batch in tqdm(train_dataloader):
+        absolute_paths = batch["absolute_paths"]
+        input_ids1_list = batch["input_ids1_list"]
+
+        image_infos = []
+        for absolute_path, input_ids1 in zip(absolute_paths, input_ids1_list):
+            image_info = train_util.ImageInfo(absolute_path, 1, "dummy", False, absolute_path)
+            npz_base = os.path.splitext(absolute_path)[0]
+            image_info.text_encoder_outputs_npz = npz_base + sc_utils.TEXT_ENCODER_OUTPUTS_CACHE_SUFFIX
+
+            if args.skip_existing:
+                if os.path.exists(image_info.text_encoder_outputs_npz):
+                    logger.warning(f"Skipping {image_info.text_encoder_outputs_npz} because it already exists.")
+                    continue
+
+            image_info.input_ids1 = input_ids1
+            image_infos.append(image_info)
+
+        if len(image_infos) > 0:
+            b_input_ids1 = torch.stack([image_info.input_ids1 for image_info in image_infos])
+            train_util.cache_batch_text_encoder_outputs(
+                image_infos, tokenizers, text_encoders, args.max_token_length, True, b_input_ids1, None, weight_dtype
+            )
+
+    accelerator.wait_for_everyone()
+    accelerator.print(f"Finished caching text encoder outputs for {len(train_dataset_group)} batches.")
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+
+    train_util.add_tokenizer_arguments(parser)
+    sc_utils.add_text_model_arguments(parser)
+    train_util.add_training_arguments(parser, True)
+    train_util.add_dataset_arguments(parser, True, True, True)
+    config_util.add_config_arguments(parser)
+    sdxl_train_util.add_sdxl_training_arguments(parser)
+    parser.add_argument(
+        "--skip_existing",
+        action="store_true",
+        help="skip images if npz already exists / npzが既に存在する画像をスキップする",
+    )
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    args = train_util.read_config_from_file(args, parser)
+
+    cache_to_disk(args)
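Review note on the worker count used by both new caching scripts: min(args.max_data_loader_n_workers, os.cpu_count() - 1) raises a TypeError if os.cpu_count() returns None, which the standard library permits when the count cannot be determined. A defensive variant could look like the sketch below. The helper name clamp_workers is hypothetical and not part of the repository.

```python
import os
from typing import Optional


def clamp_workers(requested: int, cpu_count: Optional[int] = None) -> int:
    """Return a DataLoader worker count between 0 and cpu_count - 1.

    Hypothetical helper. `cpu_count` can be passed explicitly for testing;
    otherwise os.cpu_count() is used, falling back to 1 when it is None.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # Leave one core for the main process and never go below zero.
    return max(0, min(requested, cpu_count - 1))


print(clamp_workers(8, cpu_count=4))
print(clamp_workers(2, cpu_count=16))
print(clamp_workers(4, cpu_count=1))
```

A result of 0 means the data is loaded in the main process. torch.utils.data.DataLoader rejects persistent_workers=True in that case, so a caller would likely want to pass persistent_workers only when the clamped count is greater than zero.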
--- a/train_controlnet.py
+++ b/train_controlnet.py
@@ -5,16 +5,13 @@ import os
 import random
 import time
 from multiprocessing import Value
-
-# from omegaconf import OmegaConf
+from types import SimpleNamespace
 import toml

 from tqdm import tqdm

 import torch
-from library import deepspeed_utils
 from library.device_utils import init_ipex, clean_memory_on_device
-
 init_ipex()

 from torch.nn.parallel import DistributedDataParallel as DDP
@@ -107,8 +104,6 @@ def train(args):
    ds_for_collator = train_dataset_group if args.max_data_loader_n_workers == 0 else None
    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)

-    train_dataset_group.verify_bucket_reso_steps(64)
-
    if args.debug_dataset:
        train_util.debug_dataset(train_dataset_group)
        return
@@ -152,10 +147,8 @@ def train(args):
            "in_channels": 4,
            "layers_per_block": 2,
            "mid_block_scale_factor": 1,
-            "mid_block_type": "UNetMidBlock2DCrossAttn",
            "norm_eps": 1e-05,
            "norm_num_groups": 32,
-            "num_attention_heads": [5, 10, 20, 20],
            "num_class_embeds": None,
            "only_cross_attention": False,
            "out_channels": 4,
@@ -185,10 +178,8 @@ def train(args):
            "in_channels": 4,
            "layers_per_block": 2,
            "mid_block_scale_factor": 1,
-            "mid_block_type": "UNetMidBlock2DCrossAttn",
            "norm_eps": 1e-05,
            "norm_num_groups": 32,
-            "num_attention_heads": 8,
            "out_channels": 4,
            "sample_size": 64,
            "up_block_types": ["UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"],
@@ -201,23 +192,7 @@ def train(args):
            "resnet_time_scale_shift": "default",
            "projection_class_embeddings_input_dim": None,
        }
-    # unet.config = OmegaConf.create(unet.config)
-
-    # make unet.config iterable and accessible by attribute
-    class CustomConfig:
-        def __init__(self, **kwargs):
-            self.__dict__.update(kwargs)
-
-        def __getattr__(self, name):
-            if name in self.__dict__:
-                return self.__dict__[name]
-            else:
-                raise AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'")
-
-        def __contains__(self, name):
-            return name in self.__dict__
-
-    unet.config = CustomConfig(**unet.config)
+    unet.config = SimpleNamespace(**unet.config)

    controlnet = ControlNetModel.from_unet(unet)

@@ -250,7 +225,7 @@ def train(args):
            )
        vae.to("cpu")
        clean_memory_on_device(accelerator.device)
-
+        
        accelerator.wait_for_everyone()

    if args.gradient_checkpointing:
@@ -259,7 +234,7 @@ def train(args):
    # 学習に必要なクラスを準備する
    accelerator.print("prepare optimizer, data loader etc.")

-    trainable_params = list(controlnet.parameters())
+    trainable_params = controlnet.parameters()

    _, _, optimizer = train_util.get_optimizer(args, trainable_params)

@@ -368,9 +343,7 @@ def train(args):
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers(
-            "controlnet_train" if args.log_tracker_name is None else args.log_tracker_name,
-            config=train_util.get_sanitized_config_or_none(args),
-            init_kwargs=init_kwargs,
+            "controlnet_train" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs
        )

    loss_recorder = train_util.LossRecorder()
@@ -423,7 +396,7 @@ def train(args):
            with accelerator.accumulate(controlnet):
                with torch.no_grad():
                    if "latents" in batch and batch["latents"] is not None:
-                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                        latents = batch["latents"].to(accelerator.device)
                    else:
                        # latentに変換
                        latents = vae.encode(batch["images"].to(dtype=weight_dtype)).latent_dist.sample()
@@ -446,10 +419,13 @@ def train(args):
                    )

                # Sample a random timestep for each image
-                timesteps, huber_c = train_util.get_timesteps_and_huber_c(
-                    args, 0, noise_scheduler.config.num_train_timesteps, noise_scheduler, b_size, latents.device
+                timesteps = torch.randint(
+                    0,
+                    noise_scheduler.config.num_train_timesteps,
+                    (b_size,),
+                    device=latents.device,
                )
-
+                timesteps = timesteps.long()
                # Add noise to the latents according to the noise magnitude at each timestep
                # (this is the forward diffusion process)
                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
@@ -480,9 +456,7 @@ def train(args):
                else:
                    target = noise

-                loss = train_util.conditional_loss(
-                    noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c
-                )
+                loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                loss = loss.mean([1, 2, 3])

                loss_weights = batch["loss_weights"]  # 各sampleごとのweight
@@ -591,7 +565,7 @@ def train(args):

    accelerator.end_training()

-    if is_main_process and (args.save_state or args.save_state_on_train_end):
+    if is_main_process and args.save_state:
        train_util.save_state_on_train_end(args, accelerator)

    # del accelerator  # この後メモリを使うのでこれは消す→printで使うので消さずにおく
@@ -610,7 +584,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, False, True, True)
    train_util.add_training_arguments(parser, False)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
    custom_train_functions.add_custom_train_arguments(parser)
@@ -642,7 +615,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    train(args)
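Review note on train_controlnet.py: this side of the diff replaces the CustomConfig wrapper with types.SimpleNamespace. The two are not fully interchangeable. CustomConfig defined __contains__, whereas SimpleNamespace supports attribute access but not the `in` operator. The sketch below demonstrates the difference with made-up config keys. Whether any caller of unet.config actually relies on `in` is not visible in this diff. The helper name supports_membership_test is hypothetical.

```python
from types import SimpleNamespace


def supports_membership_test(obj, name: str) -> bool:
    """Return True if `name in obj` can be evaluated without a TypeError."""
    try:
        name in obj
    except TypeError:
        return False
    return True


# Illustrative keys only; the real config dict is much larger.
config = SimpleNamespace(in_channels=4, sample_size=64)

print(config.in_channels)                             # attribute access works
print(hasattr(config, "sample_size"))                 # membership via hasattr
print("sample_size" in vars(config))                  # membership via the underlying dict
print(supports_membership_test(config, "in_channels"))  # `in` on the namespace itself
```

If `in` checks are needed, hasattr(config, name) or name in vars(config) are drop-in alternatives that work with SimpleNamespace.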
--- a/train_db.py
+++ b/train_db.py
@@ -11,10 +11,7 @@ import toml
 from tqdm import tqdm

 import torch
-from library import deepspeed_utils
 from library.device_utils import init_ipex, clean_memory_on_device
-
-
 init_ipex()

 from accelerate.utils import set_seed
@@ -35,7 +32,6 @@ from library.custom_train_functions import (
    apply_noise_offset,
    scale_v_prediction_loss_like_noise_prediction,
    apply_debiased_estimation,
-    apply_masked_loss,
 )
 from library.utils import setup_logging, add_logging_arguments

@@ -50,7 +46,6 @@ logger = logging.getLogger(__name__)
 def train(args):
    train_util.verify_training_args(args)
    train_util.prepare_dataset_args(args, False)
-    deepspeed_utils.prepare_deepspeed_args(args)
    setup_logging(args, reset=True)

    cache_latents = args.cache_latents
@@ -62,7 +57,7 @@ def train(args):

    # データセットを準備する
    if args.dataset_class is None:
-        blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, False, args.masked_loss, True))
+        blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, False, False, True))
        if args.dataset_config is not None:
            logger.info(f"Load dataset config from {args.dataset_config}")
            user_config = config_util.load_user_config(args.dataset_config)
@@ -93,8 +88,6 @@ def train(args):
    if args.no_token_padding:
        train_dataset_group.disable_token_padding()

-    train_dataset_group.verify_bucket_reso_steps(64)
-
    if args.debug_dataset:
        train_util.debug_dataset(train_dataset_group)
        return
@@ -226,25 +219,12 @@ def train(args):
        text_encoder.to(weight_dtype)

    # acceleratorがなんかよろしくやってくれるらしい
-    if args.deepspeed:
-        if args.train_text_encoder:
-            ds_model = deepspeed_utils.prepare_deepspeed_model(args, unet=unet, text_encoder=text_encoder)
-        else:
-            ds_model = deepspeed_utils.prepare_deepspeed_model(args, unet=unet)
-        ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
-            ds_model, optimizer, train_dataloader, lr_scheduler
+    if train_text_encoder:
+        unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+            unet, text_encoder, optimizer, train_dataloader, lr_scheduler
        )
-        training_models = [ds_model]
-
    else:
-        if train_text_encoder:
-            unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
-                unet, text_encoder, optimizer, train_dataloader, lr_scheduler
-            )
-            training_models = [unet, text_encoder]
-        else:
-            unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, optimizer, train_dataloader, lr_scheduler)
-            training_models = [unet]
+        unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, optimizer, train_dataloader, lr_scheduler)

    if not train_text_encoder:
        text_encoder.to(accelerator.device, dtype=weight_dtype)  # to avoid 'cpu' vs 'cuda' error
@@ -292,7 +272,7 @@ def train(args):
            init_kwargs["wandb"] = {"name": args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
-        accelerator.init_trackers("dreambooth" if args.log_tracker_name is None else args.log_tracker_name, config=train_util.get_sanitized_config_or_none(args), init_kwargs=init_kwargs)
+        accelerator.init_trackers("dreambooth" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs)

    # For --sample_at_first
    train_util.sample_images(accelerator, args, 0, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
@@ -316,14 +296,12 @@ def train(args):
                if not args.gradient_checkpointing:
                    text_encoder.train(False)
                text_encoder.requires_grad_(False)
-                if len(training_models) == 2:
-                    training_models = training_models[0]  # remove text_encoder from training_models

-            with accelerator.accumulate(*training_models):
+            with accelerator.accumulate(unet):
                with torch.no_grad():
                    # latentに変換
                    if cache_latents:
-                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                        latents = batch["latents"].to(accelerator.device)
                    else:
                        latents = vae.encode(batch["images"].to(dtype=weight_dtype)).latent_dist.sample()
                    latents = latents * 0.18215
@@ -348,7 +326,7 @@ def train(args):

                # Sample noise, sample a random timestep for each image, and add noise to the latents,
                # with noise offset and/or multires noise if specified
-                noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)
+                noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)

                # Predict the noise residual
                with accelerator.autocast():
@@ -360,9 +338,7 @@ def train(args):
                else:
                    target = noise

-                loss = train_util.conditional_loss(noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c)
-                if args.masked_loss or ("alpha_masks" in batch and batch["alpha_masks"] is not None):
-                    loss = apply_masked_loss(loss, batch)
+                loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                loss = loss.mean([1, 2, 3])

                loss_weights = batch["loss_weights"]  # 各sampleごとのweight
@@ -373,7 +349,7 @@ def train(args):
                if args.scale_v_pred_loss_like_noise_pred:
                    loss = scale_v_prediction_loss_like_noise_prediction(loss, timesteps, noise_scheduler)
                if args.debiased_estimation_loss:
-                    loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+                    loss = apply_debiased_estimation(loss, timesteps, noise_scheduler)

                loss = loss.mean()  # 平均なのでbatch_sizeで割る必要なし

@@ -468,7 +444,7 @@ def train(args):

    accelerator.end_training()

-    if is_main_process and (args.save_state or args.save_state_on_train_end):
+    if args.save_state and is_main_process:
        train_util.save_state_on_train_end(args, accelerator)

    del accelerator  # この後メモリを使うのでこれは消す
@@ -488,8 +464,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, True, False, True)
    train_util.add_training_arguments(parser, True)
-    train_util.add_masked_loss_arguments(parser)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_sd_saving_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
@@ -525,7 +499,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    train(args)
--- a/train_network.py
+++ b/train_network.py
@@ -13,15 +13,18 @@ from tqdm import tqdm

 import torch
 from library.device_utils import init_ipex, clean_memory_on_device
-
 init_ipex()

+from torch.nn.parallel import DistributedDataParallel as DDP
+
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler
-from library import deepspeed_utils, model_util
+from library import model_util

 import library.train_util as train_util
-from library.train_util import DreamBoothDataset
+from library.train_util import (
+    DreamBoothDataset,
+)
 import library.config_util as config_util
 from library.config_util import (
    ConfigSanitizer,
@@ -36,7 +39,6 @@ from library.custom_train_functions import (
    scale_v_prediction_loss_like_noise_prediction,
    add_v_prediction_like_loss,
    apply_debiased_estimation,
-    apply_masked_loss,
 )
 from library.utils import setup_logging, add_logging_arguments

@@ -53,15 +55,7 @@ class NetworkTrainer:

    # TODO 他のスクリプトと共通化する
    def generate_step_logs(
-        self,
-        args: argparse.Namespace,
-        current_loss,
-        avr_loss,
-        lr_scheduler,
-        lr_descriptions,
-        keys_scaled=None,
-        mean_norm=None,
-        maximum_norm=None,
+        self, args: argparse.Namespace, current_loss, avr_loss, lr_scheduler, keys_scaled=None, mean_norm=None, maximum_norm=None
    ):
        logs = {"loss/current": current_loss, "loss/average": avr_loss}

@@ -71,31 +65,39 @@ class NetworkTrainer:
            logs["max_norm/max_key_norm"] = maximum_norm

        lrs = lr_scheduler.get_last_lr()
-        for i, lr in enumerate(lrs):
-            if lr_descriptions is not None:
-                lr_desc = lr_descriptions[i]
+
+        if args.network_train_text_encoder_only or len(lrs) <= 2:  # not block lr (or single block)
+            if args.network_train_unet_only:
+                logs["lr/unet"] = float(lrs[0])
+            elif args.network_train_text_encoder_only:
+                logs["lr/textencoder"] = float(lrs[0])
            else:
-                idx = i - (0 if args.network_train_unet_only else -1)
-                if idx == -1:
-                    lr_desc = "textencoder"
-                else:
-                    if len(lrs) > 2:
-                        lr_desc = f"group{idx}"
-                    else:
-                        lr_desc = "unet"
+                logs["lr/textencoder"] = float(lrs[0])
+                logs["lr/unet"] = float(lrs[-1])  # may be same to textencoder

-            logs[f"lr/{lr_desc}"] = lr
-
-            if args.optimizer_type.lower().startswith("DAdapt".lower()) or args.optimizer_type.lower() == "Prodigy".lower():
-                # tracking d*lr value
-                logs[f"lr/d*lr/{lr_desc}"] = (
-                    lr_scheduler.optimizers[-1].param_groups[i]["d"] * lr_scheduler.optimizers[-1].param_groups[i]["lr"]
+            if (
+                args.optimizer_type.lower().startswith("DAdapt".lower()) or args.optimizer_type.lower() == "Prodigy".lower()
+            ):  # tracking d*lr value of unet.
+                logs["lr/d*lr"] = (
+                    lr_scheduler.optimizers[-1].param_groups[0]["d"] * lr_scheduler.optimizers[-1].param_groups[0]["lr"]
                )
+        else:
+            idx = 0
+            if not args.network_train_unet_only:
+                logs["lr/textencoder"] = float(lrs[0])
+                idx = 1
+
+            for i in range(idx, len(lrs)):
+                logs[f"lr/group{i}"] = float(lrs[i])
+                if args.optimizer_type.lower().startswith("DAdapt".lower()) or args.optimizer_type.lower() == "Prodigy".lower():
+                    logs[f"lr/d*lr/group{i}"] = (
+                        lr_scheduler.optimizers[-1].param_groups[i]["d"] * lr_scheduler.optimizers[-1].param_groups[i]["lr"]
+                    )

        return logs

    def assert_extra_args(self, args, train_dataset_group):
-        train_dataset_group.verify_bucket_reso_steps(64)
+        pass

    def load_target_model(self, args, weight_dtype, accelerator):
        text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype, accelerator)
@@ -139,7 +141,6 @@ class NetworkTrainer:
        training_started_at = time.time()
        train_util.verify_training_args(args)
        train_util.prepare_dataset_args(args, True)
-        deepspeed_utils.prepare_deepspeed_args(args)
        setup_logging(args, reset=True)

        cache_latents = args.cache_latents
@@ -156,7 +157,7 @@ class NetworkTrainer:

        # prepare the dataset
        if args.dataset_class is None:
-            blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, args.masked_loss, True))
+            blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False, True))
            if use_user_config:
                logger.info(f"Loading dataset config from {args.dataset_config}")
                user_config = config_util.load_user_config(args.dataset_config)
@@ -323,7 +324,6 @@ class NetworkTrainer:
        network.apply_to(text_encoder, unet, train_text_encoder, train_unet)

        if args.network_weights is not None:
-            # FIXME consider alpha of weights
            info = network.load_weights(args.network_weights)
            accelerator.print(f"load network weights from {args.network_weights}: {info}")

@@ -339,30 +339,12 @@ class NetworkTrainer:

        # ensure backward compatibility
        try:
-            results = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr, args.learning_rate)
-            if type(results) is tuple:
-                trainable_params = results[0]
-                lr_descriptions = results[1]
-            else:
-                trainable_params = results
-                lr_descriptions = None
-        except TypeError as e:
-            # logger.warning(f"{e}")
-            # accelerator.print(
-            #     "Deprecated: use prepare_optimizer_params(text_encoder_lr, unet_lr, learning_rate) instead of prepare_optimizer_params(text_encoder_lr, unet_lr)"
-            # )
+            trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr, args.learning_rate)
+        except TypeError:
+            accelerator.print(
+                "Deprecated: use prepare_optimizer_params(text_encoder_lr, unet_lr, learning_rate) instead of prepare_optimizer_params(text_encoder_lr, unet_lr)"
+            )
            trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
-            lr_descriptions = None
-
-        # if len(trainable_params) == 0:
-        #     accelerator.print("no trainable parameters found / 学習可能なパラメータが見つかりませんでした")
-        # for params in trainable_params:
-        #     for k, v in params.items():
-        #         if type(v) == float:
-        #             pass
-        #         else:
-        #             v = len(v)
-        #         accelerator.print(f"trainable_params: {k} = {v}")

        optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable_params)

@@ -431,36 +413,20 @@ class NetworkTrainer:
                t_enc.text_model.embeddings.to(dtype=(weight_dtype if te_weight_dtype != weight_dtype else te_weight_dtype))

        # accelerator apparently takes care of this for us
-        if args.deepspeed:
-            ds_model = deepspeed_utils.prepare_deepspeed_model(
-                args,
-                unet=unet if train_unet else None,
-                text_encoder1=text_encoders[0] if train_text_encoder else None,
-                text_encoder2=text_encoders[1] if train_text_encoder and len(text_encoders) > 1 else None,
-                network=network,
-            )
-            ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
-                ds_model, optimizer, train_dataloader, lr_scheduler
-            )
-            training_model = ds_model
+        if train_unet:
+            unet = accelerator.prepare(unet)
        else:
-            if train_unet:
-                unet = accelerator.prepare(unet)
+            unet.to(accelerator.device, dtype=unet_weight_dtype)  # move to device because unet is not prepared by accelerator
+        if train_text_encoder:
+            if len(text_encoders) > 1:
+                text_encoder = text_encoders = [accelerator.prepare(t_enc) for t_enc in text_encoders]
            else:
-                unet.to(accelerator.device, dtype=unet_weight_dtype)  # move to device because unet is not prepared by accelerator
-            if train_text_encoder:
-                if len(text_encoders) > 1:
-                    text_encoder = text_encoders = [accelerator.prepare(t_enc) for t_enc in text_encoders]
-                else:
-                    text_encoder = accelerator.prepare(text_encoder)
-                    text_encoders = [text_encoder]
-            else:
-                pass  # if text_encoder is not trained, no need to prepare. and device and dtype are already set
+                text_encoder = accelerator.prepare(text_encoder)
+                text_encoders = [text_encoder]
+        else:
+            pass  # if text_encoder is not trained, no need to prepare. and device and dtype are already set

-            network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
-                network, optimizer, train_dataloader, lr_scheduler
-            )
-            training_model = network
+        network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(network, optimizer, train_dataloader, lr_scheduler)

        if args.gradient_checkpointing:
            # according to TI example in Diffusers, train is required
@@ -490,51 +456,6 @@ class NetworkTrainer:
        if args.full_fp16:
            train_util.patch_accelerator_for_fp16_training(accelerator)

-        # before resuming make hook for saving/loading to save/load the network weights only
-        def save_model_hook(models, weights, output_dir):
-            # pop weights of other models than network to save only network weights
-            # only main process or deepspeed https://github.com/huggingface/diffusers/issues/2606
-            if accelerator.is_main_process or args.deepspeed:
-                remove_indices = []
-                for i, model in enumerate(models):
-                    if not isinstance(model, type(accelerator.unwrap_model(network))):
-                        remove_indices.append(i)
-                for i in reversed(remove_indices):
-                    if len(weights) > i:
-                        weights.pop(i)
-                # print(f"save model hook: {len(weights)} weights will be saved")
-
-            # save current epoch and step
-            train_state_file = os.path.join(output_dir, "train_state.json")
-            # +1 is needed because the state is saved before current_step is set from global_step
-            logger.info(f"save train state to {train_state_file} at epoch {current_epoch.value} step {current_step.value+1}")
-            with open(train_state_file, "w", encoding="utf-8") as f:
-                json.dump({"current_epoch": current_epoch.value, "current_step": current_step.value + 1}, f)
-
-        steps_from_state = None
-
-        def load_model_hook(models, input_dir):
-            # remove models except network
-            remove_indices = []
-            for i, model in enumerate(models):
-                if not isinstance(model, type(accelerator.unwrap_model(network))):
-                    remove_indices.append(i)
-            for i in reversed(remove_indices):
-                models.pop(i)
-            # print(f"load model hook: {len(models)} models will be loaded")
-
-            # load current epoch and step to
-            nonlocal steps_from_state
-            train_state_file = os.path.join(input_dir, "train_state.json")
-            if os.path.exists(train_state_file):
-                with open(train_state_file, "r", encoding="utf-8") as f:
-                    data = json.load(f)
-                steps_from_state = data["current_step"]
-                logger.info(f"load train state from {train_state_file}: {data}")
-
-        accelerator.register_save_state_pre_hook(save_model_hook)
-        accelerator.register_load_state_pre_hook(load_model_hook)
-
        # resume
        train_util.resume_from_local_or_hf_if_specified(accelerator, args)

@@ -608,11 +529,6 @@ class NetworkTrainer:
            "ss_scale_weight_norms": args.scale_weight_norms,
            "ss_ip_noise_gamma": args.ip_noise_gamma,
            "ss_debiased_estimation": bool(args.debiased_estimation_loss),
-            "ss_noise_offset_random_strength": args.noise_offset_random_strength,
-            "ss_ip_noise_gamma_random_strength": args.ip_noise_gamma_random_strength,
-            "ss_loss_type": args.loss_type,
-            "ss_huber_schedule": args.huber_schedule,
-            "ss_huber_c": args.huber_c,
        }

        if use_user_config:
@@ -648,11 +564,6 @@ class NetworkTrainer:
                        "random_crop": bool(subset.random_crop),
                        "shuffle_caption": bool(subset.shuffle_caption),
                        "keep_tokens": subset.keep_tokens,
-                        "keep_tokens_separator": subset.keep_tokens_separator,
-                        "secondary_separator": subset.secondary_separator,
-                        "enable_wildcard": bool(subset.enable_wildcard),
-                        "caption_prefix": subset.caption_prefix,
-                        "caption_suffix": subset.caption_suffix,
                    }

                    image_dir_or_metadata_file = None
@@ -775,54 +686,7 @@ class NetworkTrainer:
            if key in metadata:
                minimum_metadata[key] = metadata[key]

-        # calculate steps to skip when resuming or starting from a specific step
-        initial_step = 0
-        if args.initial_epoch is not None or args.initial_step is not None:
-            # if initial_epoch or initial_step is specified, steps_from_state is ignored even when resuming
-            if steps_from_state is not None:
-                logger.warning(
-                    "steps from the state is ignored because initial_step is specified / initial_stepが指定されているため、stateからのステップ数は無視されます"
-                )
-            if args.initial_step is not None:
-                initial_step = args.initial_step
-            else:
-                # num steps per epoch is calculated by num_processes and gradient_accumulation_steps
-                initial_step = (args.initial_epoch - 1) * math.ceil(
-                    len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
-                )
-        else:
-            # if initial_epoch and initial_step are not specified, steps_from_state is used when resuming
-            if steps_from_state is not None:
-                initial_step = steps_from_state
-                steps_from_state = None
-
-        if initial_step > 0:
-            assert (
-                args.max_train_steps > initial_step
-            ), f"max_train_steps should be greater than initial step / max_train_stepsは初期ステップより大きい必要があります: {args.max_train_steps} vs {initial_step}"
-
-        progress_bar = tqdm(
-            range(args.max_train_steps - initial_step), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps"
-        )
-
-        epoch_to_start = 0
-        if initial_step > 0:
-            if args.skip_until_initial_step:
-                # if skip_until_initial_step is specified, load data and discard it to ensure the same data is used
-                if not args.resume:
-                    logger.info(
-                        f"initial_step is specified but not resuming. lr scheduler will be started from the beginning / initial_stepが指定されていますがresumeしていないため、lr schedulerは最初から始まります"
-                    )
-                logger.info(f"skipping {initial_step} steps / {initial_step}ステップをスキップします")
-                initial_step *= args.gradient_accumulation_steps
-
-                # set epoch to start to make initial_step less than len(train_dataloader)
-                epoch_to_start = initial_step // math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
-            else:
-                # if not, only epoch no is skipped for informative purpose
-                epoch_to_start = initial_step // math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
-                initial_step = 0  # do not skip
-
+        progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
        global_step = 0

        noise_scheduler = DDPMScheduler(
@@ -839,9 +703,7 @@ class NetworkTrainer:
            if args.log_tracker_config is not None:
                init_kwargs = toml.load(args.log_tracker_config)
            accelerator.init_trackers(
-                "network_train" if args.log_tracker_name is None else args.log_tracker_name,
-                config=train_util.get_sanitized_config_or_none(args),
-                init_kwargs=init_kwargs,
+                "network_train" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs
            )

        loss_recorder = train_util.LossRecorder()
@@ -881,13 +743,7 @@ class NetworkTrainer:
        self.sample_images(accelerator, args, 0, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)

        # training loop
-        if initial_step > 0:  # only if skip_until_initial_step is specified
-            for skip_epoch in range(epoch_to_start):  # skip epochs
-                logger.info(f"skipping epoch {skip_epoch+1} because initial_step (multiplied) is {initial_step}")
-                initial_step -= len(train_dataloader)
-            global_step = initial_step
-
-        for epoch in range(epoch_to_start, num_train_epochs):
+        for epoch in range(num_train_epochs):
            accelerator.print(f"\nepoch {epoch+1}/{num_train_epochs}")
            current_epoch.value = epoch + 1

@@ -895,32 +751,23 @@ class NetworkTrainer:

            accelerator.unwrap_model(network).on_epoch_start(text_encoder, unet)

-            skipped_dataloader = None
-            if initial_step > 0:
-                skipped_dataloader = accelerator.skip_first_batches(train_dataloader, initial_step - 1)
-                initial_step = 1
-
-            for step, batch in enumerate(skipped_dataloader or train_dataloader):
+            for step, batch in enumerate(train_dataloader):
                current_step.value = global_step
-                if initial_step > 0:
-                    initial_step -= 1
-                    continue
-
-                with accelerator.accumulate(training_model):
+                with accelerator.accumulate(network):
                    on_step_start(text_encoder, unet)

-                    if "latents" in batch and batch["latents"] is not None:
-                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
-                    else:
-                        with torch.no_grad():
+                    with torch.no_grad():
+                        if "latents" in batch and batch["latents"] is not None:
+                            latents = batch["latents"].to(accelerator.device)
+                        else:
                            # convert to latents
-                            latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample().to(dtype=weight_dtype)
+                            latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample()

                            # if NaNs are present, print a warning and replace them with zeros
                            if torch.any(torch.isnan(latents)):
                                accelerator.print("NaN found in latents, replacing with zeros")
                                latents = torch.nan_to_num(latents, 0, out=latents)
-                    latents = latents * self.vae_scale_factor
+                        latents = latents * self.vae_scale_factor

                    # get multiplier for each sample
                    if network_has_multiplier:
@@ -951,7 +798,7 @@ class NetworkTrainer:

                    # Sample noise, sample a random timestep for each image, and add noise to the latents,
                    # with noise offset and/or multires noise if specified
-                    noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(
+                    noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(
                        args, noise_scheduler, latents
                    )

@@ -981,11 +828,7 @@ class NetworkTrainer:
                    else:
                        target = noise

-                    loss = train_util.conditional_loss(
-                        noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c
-                    )
-                    if args.masked_loss or ("alpha_masks" in batch and batch["alpha_masks"] is not None):
-                        loss = apply_masked_loss(loss, batch)
+                    loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                    loss = loss.mean([1, 2, 3])

                    loss_weights = batch["loss_weights"]  # per-sample weight
@@ -998,7 +841,7 @@ class NetworkTrainer:
                    if args.v_pred_like_loss:
                        loss = add_v_prediction_like_loss(loss, timesteps, noise_scheduler, args.v_pred_like_loss)
                    if args.debiased_estimation_loss:
-                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler)

                    loss = loss.mean()  # this is a mean, so no need to divide by batch_size

@@ -1053,9 +896,7 @@ class NetworkTrainer:
                    progress_bar.set_postfix(**{**max_mean_logs, **logs})

                if args.logging_dir is not None:
-                    logs = self.generate_step_logs(
-                        args, current_loss, avr_loss, lr_scheduler, lr_descriptions, keys_scaled, mean_norm, maximum_norm
-                    )
+                    logs = self.generate_step_logs(args, current_loss, avr_loss, lr_scheduler, keys_scaled, mean_norm, maximum_norm)
                    accelerator.log(logs, step=global_step)

                if global_step >= args.max_train_steps:
@@ -1094,7 +935,7 @@ class NetworkTrainer:

        accelerator.end_training()

-        if is_main_process and (args.save_state or args.save_state_on_train_end):
+        if is_main_process and args.save_state:
            train_util.save_state_on_train_end(args, accelerator)

        if is_main_process:
@@ -1111,8 +952,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, True, True, True)
    train_util.add_training_arguments(parser, True)
-    train_util.add_masked_loss_arguments(parser)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
    custom_train_functions.add_custom_train_arguments(parser)
@@ -1206,28 +1045,6 @@ def setup_parser() -> argparse.ArgumentParser:
        action="store_true",
        help="do not use fp16/bf16 VAE in mixed precision (use float VAE) / mixed precisionでも fp16/bf16 VAEを使わずfloat VAEを使う",
    )
-    parser.add_argument(
-        "--skip_until_initial_step",
-        action="store_true",
-        help="skip training until initial_step is reached / initial_stepに到達するまで学習をスキップする",
-    )
-    parser.add_argument(
-        "--initial_epoch",
-        type=int,
-        default=None,
-        help="initial epoch number, 1 means first epoch (same as not specifying). NOTE: initial_epoch/step doesn't affect to lr scheduler. Which means lr scheduler will start from 0 without `--resume`."
-        + " / 初期エポック数、1で最初のエポック（未指定時と同じ）。注意：initial_epoch/stepはlr schedulerに影響しないため、`--resume`しない場合はlr schedulerは0から始まる",
-    )
-    parser.add_argument(
-        "--initial_step",
-        type=int,
-        default=None,
-        help="initial step number including all epochs, 0 means first step (same as not specifying). overwrites initial_epoch."
-        + " / 初期ステップ数、全エポックを含むステップ数、0で最初のステップ（未指定時と同じ）。initial_epochを上書きする",
-    )
-    # parser.add_argument("--loraplus_lr_ratio", default=None, type=float, help="LoRA+ learning rate ratio")
-    # parser.add_argument("--loraplus_unet_lr_ratio", default=None, type=float, help="LoRA+ UNet learning rate ratio")
-    # parser.add_argument("--loraplus_text_encoder_lr_ratio", default=None, type=float, help="LoRA+ text encoder learning rate ratio")
    return parser


@@ -1235,7 +1052,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    trainer = NetworkTrainer()
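Reviewer's sketch: the learning-rate logging branch added to `generate_step_logs` in the train_network.py diff above can be read as the standalone function below. This is a minimal illustration under stated assumptions, not code from the repository: `lr_log_entries` is a hypothetical name, a plain list stands in for `lr_scheduler.get_last_lr()`, the two booleans stand in for `args.network_train_unet_only` and `args.network_train_text_encoder_only`, and the `d*lr` entries for DAdapt/Prodigy optimizers are left out.

```python
def lr_log_entries(lrs, train_unet_only=False, train_text_encoder_only=False):
    """Hypothetical helper mirroring the key layout of the added logging branch.

    lrs: list of learning rates, one per optimizer param group
         (stand-in for lr_scheduler.get_last_lr()).
    """
    logs = {}
    if train_text_encoder_only or len(lrs) <= 2:
        # no block-wise learning rates (or a single block)
        if train_unet_only:
            logs["lr/unet"] = float(lrs[0])
        elif train_text_encoder_only:
            logs["lr/textencoder"] = float(lrs[0])
        else:
            logs["lr/textencoder"] = float(lrs[0])
            logs["lr/unet"] = float(lrs[-1])  # same value when there is one group
    else:
        # block-wise learning rates: one "group{i}" key per param group
        idx = 0
        if not train_unet_only:
            logs["lr/textencoder"] = float(lrs[0])
            idx = 1
        for i in range(idx, len(lrs)):
            logs[f"lr/group{i}"] = float(lrs[i])
    return logs


if __name__ == "__main__":
    print(lr_log_entries([1e-4, 5e-5]))
    print(lr_log_entries([1e-5, 1e-4, 2e-4]))
```

Note that in the block-wise case the group index follows the param-group position, so numbering starts at `group1` when the text encoder occupies index 0 and at `group0` when only the U-Net is trained.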
--- a/train_textual_inversion.py
+++ b/train_textual_inversion.py
@@ -8,14 +8,12 @@ from tqdm import tqdm

 import torch
 from library.device_utils import init_ipex, clean_memory_on_device
-
-
 init_ipex()

 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler
 from transformers import CLIPTokenizer
-from library import deepspeed_utils, model_util
+from library import model_util

 import library.train_util as train_util
 import library.huggingface_util as huggingface_util
@@ -31,7 +29,6 @@ from library.custom_train_functions import (
    scale_v_prediction_loss_like_noise_prediction,
    add_v_prediction_like_loss,
    apply_debiased_estimation,
-    apply_masked_loss,
 )
 from library.utils import setup_logging, add_logging_arguments

@@ -99,7 +96,7 @@ class TextualInversionTrainer:
        self.is_sdxl = False

    def assert_extra_args(self, args, train_dataset_group):
-        train_dataset_group.verify_bucket_reso_steps(64)
+        pass

    def load_target_model(self, args, weight_dtype, accelerator):
        text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype, accelerator)
@@ -271,7 +268,7 @@ class TextualInversionTrainer:

        # prepare the dataset
        if args.dataset_class is None:
-            blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, args.masked_loss, False))
+            blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False, False))
            if args.dataset_config is not None:
                accelerator.print(f"Load dataset config from {args.dataset_config}")
                user_config = config_util.load_user_config(args.dataset_config)
@@ -510,7 +507,7 @@ class TextualInversionTrainer:
            if args.log_tracker_config is not None:
                init_kwargs = toml.load(args.log_tracker_config)
            accelerator.init_trackers(
-                "textual_inversion" if args.log_tracker_name is None else args.log_tracker_name, config=train_util.get_sanitized_config_or_none(args), init_kwargs=init_kwargs
+                "textual_inversion" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs
            )

        # function for saving/removing
@@ -561,10 +558,10 @@ class TextualInversionTrainer:
                with accelerator.accumulate(text_encoders[0]):
                    with torch.no_grad():
                        if "latents" in batch and batch["latents"] is not None:
-                            latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                            latents = batch["latents"].to(accelerator.device)
                        else:
                            # convert to latents
-                            latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample().to(dtype=weight_dtype)
+                            latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample()
                        latents = latents * self.vae_scale_factor

                    # Get the text embedding for conditioning
@@ -572,7 +569,7 @@ class TextualInversionTrainer:

                    # Sample noise, sample a random timestep for each image, and add noise to the latents,
                    # with noise offset and/or multires noise if specified
-                    noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(
+                    noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(
                        args, noise_scheduler, latents
                    )

@@ -588,9 +585,7 @@ class TextualInversionTrainer:
                    else:
                        target = noise

-                    loss = train_util.conditional_loss(noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c)
-                    if args.masked_loss or ("alpha_masks" in batch and batch["alpha_masks"] is not None):
-                        loss = apply_masked_loss(loss, batch)
+                    loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                    loss = loss.mean([1, 2, 3])

                    loss_weights = batch["loss_weights"]  # per-sample weight
@@ -603,7 +598,7 @@ class TextualInversionTrainer:
                    if args.v_pred_like_loss:
                        loss = add_v_prediction_like_loss(loss, timesteps, noise_scheduler, args.v_pred_like_loss)
                    if args.debiased_estimation_loss:
-                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler)

                    loss = loss.mean()  # this is a mean, so no need to divide by batch_size

@@ -737,7 +732,7 @@ class TextualInversionTrainer:

        accelerator.end_training()

-        if is_main_process and (args.save_state or args.save_state_on_train_end):
+        if args.save_state and is_main_process:
            train_util.save_state_on_train_end(args, accelerator)

        if is_main_process:
@@ -754,8 +749,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, True, True, False)
    train_util.add_training_arguments(parser, True)
-    train_util.add_masked_loss_arguments(parser)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
    custom_train_functions.add_custom_train_arguments(parser, False)
@@ -806,7 +799,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    trainer = TextualInversionTrainer()
--- a/train_textual_inversion_XTI.py
+++ b/train_textual_inversion_XTI.py
@@ -8,9 +8,7 @@ from multiprocessing import Value
 from tqdm import tqdm

 import torch
-from library import deepspeed_utils
 from library.device_utils import init_ipex, clean_memory_on_device
-
 init_ipex()

 from accelerate.utils import set_seed
@@ -33,7 +31,6 @@ from library.custom_train_functions import (
    apply_noise_offset,
    scale_v_prediction_loss_like_noise_prediction,
    apply_debiased_estimation,
-    apply_masked_loss,
 )
 import library.original_unet as original_unet
 from XTI_hijack import unet_forward_XTI, downblock_forward_XTI, upblock_forward_XTI
@@ -203,7 +200,7 @@ def train(args):
    logger.info(f"create embeddings for {args.num_vectors_per_token} tokens, for {args.token_string}")

    # prepare the dataset
-    blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, args.masked_loss, False))
+    blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, False, False))
    if args.dataset_config is not None:
        logger.info(f"Load dataset config from {args.dataset_config}")
        user_config = config_util.load_user_config(args.dataset_config)
@@ -407,7 +404,7 @@ def train(args):
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers(
-            "textual_inversion" if args.log_tracker_name is None else args.log_tracker_name, config=train_util.get_sanitized_config_or_none(args), init_kwargs=init_kwargs
+            "textual_inversion" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs
        )

    # function for saving/removing
@@ -442,7 +439,7 @@ def train(args):
            with accelerator.accumulate(text_encoder):
                with torch.no_grad():
                    if "latents" in batch and batch["latents"] is not None:
-                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                        latents = batch["latents"].to(accelerator.device)
                    else:
                        # convert to latents
                        latents = vae.encode(batch["images"].to(dtype=weight_dtype)).latent_dist.sample()
@@ -461,7 +458,7 @@ def train(args):

                # Sample noise, sample a random timestep for each image, and add noise to the latents,
                # with noise offset and/or multires noise if specified
-                noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)
+                noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)

                # Predict the noise residual
                with accelerator.autocast():
@@ -473,9 +470,7 @@ def train(args):
                else:
                    target = noise

-                loss = train_util.conditional_loss(noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c)
-                if args.masked_loss or ("alpha_masks" in batch and batch["alpha_masks"] is not None):
-                    loss = apply_masked_loss(loss, batch)
+                loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")
                loss = loss.mean([1, 2, 3])

                loss_weights = batch["loss_weights"]  # per-sample weight
@@ -486,7 +481,7 @@ def train(args):
                if args.scale_v_pred_loss_like_noise_pred:
                    loss = scale_v_prediction_loss_like_noise_prediction(loss, timesteps, noise_scheduler)
                if args.debiased_estimation_loss:
-                    loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+                    loss = apply_debiased_estimation(loss, timesteps, noise_scheduler)

                loss = loss.mean()  # this is a mean, so no need to divide by batch_size

@@ -591,7 +586,7 @@ def train(args):

    accelerator.end_training()

-    if is_main_process and (args.save_state or args.save_state_on_train_end):
+    if args.save_state and is_main_process:
        train_util.save_state_on_train_end(args, accelerator)

    updated_embs = text_encoder.get_input_embeddings().weight[token_ids_XTI].data.detach().clone()
@@ -667,8 +662,6 @@ def setup_parser() -> argparse.ArgumentParser:
    train_util.add_sd_models_arguments(parser)
    train_util.add_dataset_arguments(parser, True, True, False)
    train_util.add_training_arguments(parser, True)
-    train_util.add_masked_loss_arguments(parser)
-    deepspeed_utils.add_deepspeed_arguments(parser)
    train_util.add_optimizer_arguments(parser)
    config_util.add_config_arguments(parser)
    custom_train_functions.add_custom_train_arguments(parser, False)
@@ -714,7 +707,6 @@ if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
-    train_util.verify_command_line_training_args(args)
    args = train_util.read_config_from_file(args, parser)

    train(args)
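
Note on the `save_state` hunk above: the removed line saved the final state when either `args.save_state` or `args.save_state_on_train_end` was set, while the added line consults only `args.save_state`. The following is a minimal sketch of that difference, not code from the repository. The helper names `should_save_old` and `should_save_new` are hypothetical and exist only for illustration; the two boolean expressions are copied from the `-` and `+` lines of the hunk.

```python
from argparse import Namespace


def should_save_old(args: Namespace, is_main_process: bool) -> bool:
    # Condition from the removed ("-") line: either flag triggers a save,
    # but only on the main process. (Hypothetical helper name.)
    return bool(is_main_process and (args.save_state or args.save_state_on_train_end))


def should_save_new(args: Namespace, is_main_process: bool) -> bool:
    # Condition from the added ("+") line: only args.save_state is checked.
    # (Hypothetical helper name.)
    return bool(args.save_state and is_main_process)


# A run that asks only for a save at the end of training.
args = Namespace(save_state=False, save_state_on_train_end=True)
print(should_save_old(args, True), should_save_new(args, True))
```

Under these assumptions, a run that sets only `save_state_on_train_end` would save its final state with the removed condition but not with the added one, so the two sides of the diff are not equivalent for that configuration.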
Author	SHA1	Message	Date
Kohya S	235a1ea2c6	Merge branch 'dev' into stable-cascade	2024-02-25 20:03:39 +09:00
Kohya S	cb648a2bf8	update readme	2024-02-25 20:03:00 +09:00
Kohya S	3a2a48c15d	make LoRA compatible with ComfyUI #1119	2024-02-25 20:01:37 +09:00
Kohya S	40f2c688db	fix stage c weight is loaded in bf16/fp16 #1119	2024-02-25 09:39:53 +09:00
Kohya S	e4f8736c60	Merge branch 'dev' into stable-cascade	2024-02-25 08:58:27 +09:00
Kohya S	13f49d1e4a	update readme	2024-02-22 23:50:10 +09:00
Kohya S	df7648245e	update readme	2024-02-22 23:41:46 +09:00
Kohya S	3368fb1af7	Modify nn.MHA to attn with q/k/v	2024-02-22 23:39:28 +09:00
Kohya S	417f14d245	Merge pull request #1130 from sdbds/fixbugs [stable-cascade]add save parser and fix lora scripts model name and hash	2024-02-22 12:30:59 +09:00
青龍聖者@bdsqlsz	86503cb945	add save parser and fix lora scripts model name and hash	2024-02-21 19:38:12 +08:00
Kohya S	d91b1d3793	update readme	2024-02-20 22:39:57 +09:00
Kohya S	70917077a6	update readme	2024-02-20 22:38:36 +09:00
Kohya S	69dbc50912	fix effnet encoder preprocess issue	2024-02-20 22:34:06 +09:00
Kohya S	985761ca43	fix to work without network module	2024-02-20 20:33:03 +09:00
Kohya S	71e03559e2	support LoRA training for Stable Cascade Stage C	2024-02-20 08:27:11 +09:00
Kohya S	806a6237fb	minor fixes	2024-02-18 21:57:16 +09:00
Kohya S	9b0e532942	add command line sample	2024-02-18 21:40:36 +09:00
Kohya S	c26f01241f	input prompt from console	2024-02-18 21:29:46 +09:00
Kohya S	ac71168939	add train_text_encoder arg	2024-02-18 21:29:10 +09:00
Kohya S	4e37d950d2	fix typos	2024-02-18 18:02:20 +09:00
Kohya S	4b5784eb44	update stable cascade stage C training #1119	2024-02-18 17:54:21 +09:00
Kohya S	856df07f49	Merge branch 'dev' into stable-cascade	2024-02-18 09:15:12 +09:00
Kohya S	80ef59c115	support text encoder training in stable cascade	2024-02-18 09:12:37 +09:00
Kohya S	319bbf8057	add stage c tmp training code	2024-02-17 23:59:20 +09:00
Kohya S	fa440208b7	add inference script	2024-02-17 17:57:30 +09:00