feat: add Chroma model implementation (WIP)

fix: update parameter names for CFG truncate and Renorm CFG in documentation and code
fix: update default values for timestep_sampling and model_prediction_type in training arguments
2026-04-06 21:52:27 +00:00 · 2025-07-14 08:52:35 +09:00 · 2025-07-13 21:00:27 +09:00 · 2025-07-13 20:52:00 +09:00 · 2025-07-13 20:49:38 +09:00 · 2025-07-13 20:46:24 +09:00
171 changed files with 81493 additions and 8475 deletions
--- a/.ai/claude.prompt.md
+++ b/.ai/claude.prompt.md
@@ -0,0 +1,9 @@
+## About This File
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## 1. Project Context
+Here is the essential context for our project. Please read and understand it thoroughly.
+
+### Project Overview
+@./context/01-overview.md
--- a/.ai/context/01-overview.md
+++ b/.ai/context/01-overview.md
@@ -0,0 +1,101 @@
+This file provides the overview and guidance for developers working with the codebase, including setup instructions, architecture details, and common commands.
+
+## Project Architecture
+
+### Core Training Framework
+The codebase is built around a **strategy pattern architecture** that supports multiple diffusion model families:
+
+- **`library/strategy_base.py`**: Base classes for tokenization, text encoding, latent caching, and training strategies
+- **`library/strategy_*.py`**: Model-specific implementations for SD, SDXL, SD3, FLUX, etc.
+- **`library/train_util.py`**: Core training utilities shared across all model types
+- **`library/config_util.py`**: Configuration management with TOML support
+
+### Model Support Structure
+Each supported model family has a consistent structure:
+- **Training script**: `{model}_train.py` (full fine-tuning), `{model}_train_network.py` (LoRA/network training)
+- **Model utilities**: `library/{model}_models.py`, `library/{model}_train_utils.py`, `library/{model}_utils.py`
+- **Networks**: `networks/lora_{model}.py`, `networks/oft_{model}.py` for adapter training
+
+### Supported Models
+- **Stable Diffusion 1.x**: `train*.py`, `library/train_util.py`, `train_db.py` (for DreamBooth)
+- **SDXL**: `sdxl_train*.py`, `library/sdxl_*`
+- **SD3**: `sd3_train*.py`, `library/sd3_*`
+- **FLUX.1**: `flux_train*.py`, `library/flux_*`
+
+### Key Components
+
+#### Memory Management
+- **Block swapping**: CPU-GPU memory optimization via `--blocks_to_swap` parameter, works with custom offloading. Only available for models with transformer architectures like SD3 and FLUX.1.
+- **Custom offloading**: `library/custom_offloading_utils.py` for advanced memory management
+- **Gradient checkpointing**: Memory reduction during training
+
+#### Training Features
+- **LoRA training**: Low-rank adaptation networks in `networks/lora*.py`
+- **ControlNet training**: Conditional generation control
+- **Textual Inversion**: Custom embedding training
+- **Multi-resolution training**: Bucket-based aspect ratio handling
+- **Validation loss**: Real-time training monitoring, only for LoRA training
+
+#### Configuration System
+Dataset configuration uses TOML files with structured validation:
+```toml
+[datasets.sample_dataset]
+  resolution = 1024
+  batch_size = 2
+  
+  [[datasets.sample_dataset.subsets]]
+    image_dir = "path/to/images"
+    caption_extension = ".txt"
+```
+
+## Common Development Commands
+
+### Training Commands Pattern
+All training scripts follow this general pattern:
+```bash
+accelerate launch --mixed_precision bf16 {script_name}.py \
+  --pretrained_model_name_or_path model.safetensors \
+  --dataset_config config.toml \
+  --output_dir output \
+  --output_name model_name \
+  [model-specific options]
+```
+
+### Memory Optimization
+For low VRAM environments, use block swapping:
+```bash
+# Add to any training command for memory reduction
+--blocks_to_swap 10  # Swap 10 blocks to CPU (adjust number as needed)
+```
+
+### Utility Scripts
+Located in `tools/` directory:
+- `tools/merge_lora.py`: Merge LoRA weights into base models
+- `tools/cache_latents.py`: Pre-cache VAE latents for faster training
+- `tools/cache_text_encoder_outputs.py`: Pre-cache text encoder outputs
+
+## Development Notes
+
+### Strategy Pattern Implementation
+When adding support for new models, implement the four core strategies:
+1. `TokenizeStrategy`: Text tokenization handling
+2. `TextEncodingStrategy`: Text encoder forward pass
+3. `LatentsCachingStrategy`: VAE encoding/caching
+4. `TextEncoderOutputsCachingStrategy`: Text encoder output caching
+
+### Testing Approach
+- Unit tests focus on utility functions and model loading
+- Integration tests validate training script syntax and basic execution
+- Most tests use mocks to avoid requiring actual model files
+- Add tests for new model support in `tests/test_{model}_*.py`
+
+### Configuration System
+- Use `config_util.py` dataclasses for type-safe configuration
+- Support both command-line arguments and TOML file configuration
+- Validate configuration early in training scripts to prevent runtime errors
+
+### Memory Management
+- Always consider VRAM limitations when implementing features
+- Use gradient checkpointing for large models
+- Implement block swapping for models with transformer architectures
+- Cache intermediate results (latents, text embeddings) when possible
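
Review note on the "Strategy Pattern Implementation" list in `01-overview.md`: the four strategy roles could be sketched as below. This is a minimal illustration, not the code in `library/strategy_base.py`. Only the four class names come from the overview; every method name and signature, and the `ToyTokenizeStrategy` class, are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class TokenizeStrategy(ABC):
    """Turns caption text into token ids for one model family."""

    @abstractmethod
    def tokenize(self, text: str) -> List[int]:  # hypothetical signature
        ...


class TextEncodingStrategy(ABC):
    """Runs the text encoder forward pass on token ids."""

    @abstractmethod
    def encode_tokens(self, tokens: List[int]) -> List[float]:  # hypothetical
        ...


class LatentsCachingStrategy(ABC):
    """Encodes images with the VAE and stores the latents."""

    @abstractmethod
    def cache_latents(self, image_path: str) -> str:  # hypothetical
        ...


class TextEncoderOutputsCachingStrategy(ABC):
    """Stores text encoder outputs so the encoder can be skipped later."""

    @abstractmethod
    def cache_outputs(self, caption: str, encoded: List[float]) -> None:  # hypothetical
        ...


class ToyTokenizeStrategy(TokenizeStrategy):
    """Stand-in for a model-specific tokenizer: whitespace split plus a growing vocab."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def tokenize(self, text: str) -> List[int]:
        # Assign the next free id to unseen words, reuse ids for known ones.
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.lower().split()]


tokens = ToyTokenizeStrategy().tokenize("a cat a dog")
print(tokens)
```

Keeping each role behind its own abstract class means a new model family only has to supply four subclasses, while the shared training utilities stay unchanged.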
--- a/.ai/gemini.prompt.md
+++ b/.ai/gemini.prompt.md
@@ -0,0 +1,9 @@
+## About This File
+
+This file provides guidance to Gemini CLI (https://github.com/google-gemini/gemini-cli) when working with code in this repository.
+
+## 1. Project Context
+Here is the essential context for our project. Please read and understand it thoroughly.
+
+### Project Overview
+@./context/01-overview.md
--- a/.github/FUNDING.yml
+++ b/.github/FUNDING.yml
@@ -0,0 +1,3 @@
+# These are supported funding model platforms
+
+github: kohya-ss
--- a/.github/dependabot.yml
+++ b/.github/dependabot.yml
@@ -0,0 +1,7 @@
+---
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "monthly"
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -0,0 +1,51 @@
+name: Test with pytest
+
+on: 
+  push:
+    branches:
+      - main
+      - dev
+      - sd3
+  pull_request:
+    branches:
+      - main
+      - dev
+      - sd3
+
+# CKV2_GHA_1: "Ensure top-level permissions are not set to write-all"
+permissions: read-all
+
+jobs:
+  build:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest]
+        python-version: ["3.10"] # Python versions to test
+        pytorch-version: ["2.4.0"] # PyTorch versions to test
+
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          # https://woodruffw.github.io/zizmor/audits/#artipacked
+          persist-credentials: false
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: 'pip' 
+
+      - name: Install and update pip, setuptools, wheel
+        run: |
+          # Setuptools, wheel for compiling some packages
+          python -m pip install --upgrade pip setuptools wheel
+
+      - name: Install dependencies
+        run: |
+          # Pre-install torch to pin version (requirements.txt has dependencies like transformers which requires pytorch)
+          pip install dadaptation==3.2 torch==${{ matrix.pytorch-version }} torchvision pytest==8.3.4
+          pip install -r requirements.txt
+
+      - name: Test with pytest
+        run: pytest # See pytest.ini for configuration
+
--- a/.github/workflows/typos.yml
+++ b/.github/workflows/typos.yml
@@ -0,0 +1,29 @@
+---
+name: Typos
+
+on: 
+  push:
+    branches:
+      - main
+      - dev
+  pull_request:
+    types:
+      - opened
+      - synchronize
+      - reopened
+
+# CKV2_GHA_1: "Ensure top-level permissions are not set to write-all"
+permissions: read-all
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          # https://woodruffw.github.io/zizmor/audits/#artipacked
+          persist-credentials: false
+
+      - name: typos-action
+        uses: crate-ci/typos@v1.28.1
--- a/.gitignore
+++ b/.gitignore
@@ -4,4 +4,10 @@ wd14_tagger_model
 venv
 *.egg-info
 build
-.vscode
+.vscode
+wandb
+CLAUDE.md
+GEMINI.md
+.claude
+.gemini
+MagicMock
--- a/README-ja.md
+++ b/README-ja.md
@@ -3,25 +3,30 @@ Stable Diffusionの学習、画像生成、その他のスクリプトを入れ

 [README in English](./README.md) ←更新情報はこちらにあります

+開発中のバージョンはdevブランチにあります。最新の変更点はdevブランチをご確認ください。
+
+FLUX.1およびSD3/SD3.5対応はsd3ブランチで行っています。それらの学習を行う場合はsd3ブランチをご利用ください。
+
 GUIやPowerShellスクリプトなど、より使いやすくする機能が[bmaltais氏のリポジトリ](https://github.com/bmaltais/kohya_ss)で提供されています（英語です）のであわせてご覧ください。bmaltais氏に感謝します。

 以下のスクリプトがあります。

 * DreamBooth、U-NetおよびText Encoderの学習をサポート
 * fine-tuning、同上
+* LoRAの学習をサポート
 * 画像生成
 * モデル変換（Stable Diffision ckpt/safetensorsとDiffusersの相互変換）

 ## 使用法について

-当リポジトリ内およびnote.comに記事がありますのでそちらをご覧ください（将来的にはすべてこちらへ移すかもしれません）。
-
-* [DreamBoothの学習について](./train_db_README-ja.md)
-* [fine-tuningのガイド](./fine_tune_README_ja.md):
-BLIPによるキャプショニングと、DeepDanbooruまたはWD14 taggerによるタグ付けを含みます
-* [LoRAの学習について](./train_network_README-ja.md)
-* [Textual Inversionの学習について](./train_ti_README-ja.md)
-* note.com [画像生成スクリプト](https://note.com/kohya_ss/n/n2693183a798e)
+* [学習について、共通編](./docs/train_README-ja.md) : データ整備やオプションなど
+    * [データセット設定](./docs/config_README-ja.md)
+* [SDXL学習](./docs/train_SDXL-en.md) （英語版）
+* [DreamBoothの学習について](./docs/train_db_README-ja.md)
+* [fine-tuningのガイド](./docs/fine_tune_README_ja.md):
+* [LoRAの学習について](./docs/train_network_README-ja.md)
+* [Textual Inversionの学習について](./docs/train_ti_README-ja.md)
+* [画像生成スクリプト](./docs/gen_img_README-ja.md)
 * note.com [モデル変換スクリプト](https://note.com/kohya_ss/n/n374f316fe4ad)

 ## Windowsでの動作に必要なプログラム
@@ -31,6 +36,8 @@ Python 3.10.6およびGitが必要です。
 - Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
 - git: https://git-scm.com/download/win

+Python 3.10.x、3.11.x、3.12.xでも恐らく動作しますが、3.10.6でテストしています。
+
 PowerShellを使う場合、venvを使えるようにするためには以下の手順でセキュリティ設定を変更してください。
 （venvに限らずスクリプトの実行が可能になりますので注意してください。）

@@ -40,11 +47,11 @@ PowerShellを使う場合、venvを使えるようにするためには以下の

 ## Windows環境でのインストール

-以下の例ではPyTorchは1.12.1／CUDA 11.6版をインストールします。CUDA 11.3版やPyTorch 1.13を使う場合は適宜書き換えください。
+スクリプトはPyTorch 2.1.2でテストしています。PyTorch 2.2以降でも恐らく動作します。

 （なお、python -m venv～の行で「python」とだけ表示された場合、py -m venv～のようにpythonをpyに変更してください。）

-通常の（管理者ではない）PowerShellを開き以下を順に実行します。
+PowerShellを使う場合、通常の（管理者ではない）PowerShellを開き以下を順に実行します。

 ```powershell
 git clone https://github.com/kohya-ss/sd-scripts.git
@@ -53,44 +60,23 @@ cd sd-scripts
 python -m venv venv
 .\venv\Scripts\activate

-pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
 pip install --upgrade -r requirements.txt
-pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
-
-cp .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
-cp .\bitsandbytes_windows\cextension.py .\venv\Lib\site-packages\bitsandbytes\cextension.py
-cp .\bitsandbytes_windows\main.py .\venv\Lib\site-packages\bitsandbytes\cuda_setup\main.py
+pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118

 accelerate config
 ```

-コマンドプロンプトでは以下になります。
+コマンドプロンプトでも同一です。

+注：`bitsandbytes==0.44.0`、`prodigyopt==1.0`、`lion-pytorch==0.0.6` は `requirements.txt` に含まれるようになりました。他のバージョンを使う場合は適宜インストールしてください。

-```bat
-git clone https://github.com/kohya-ss/sd-scripts.git
-cd sd-scripts
+この例では PyTorch および xformers は2.1.2／CUDA 11.8版をインストールします。CUDA 12.1版やPyTorch 1.12.1を使う場合は適宜書き換えください。たとえば CUDA 12.1版の場合は `pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121` および `pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu121` としてください。

-python -m venv venv
-.\venv\Scripts\activate
-
-pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
-pip install --upgrade -r requirements.txt
-pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
-
-copy /y .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
-copy /y .\bitsandbytes_windows\cextension.py .\venv\Lib\site-packages\bitsandbytes\cextension.py
-copy /y .\bitsandbytes_windows\main.py .\venv\Lib\site-packages\bitsandbytes\cuda_setup\main.py
-
-accelerate config
-```
-
-（注:``python -m venv venv`` のほうが ``python -m venv --system-site-packages venv`` より安全そうなため書き換えました。globalなpythonにパッケージがインストールしてあると、後者だといろいろと問題が起きます。）
+PyTorch 2.2以降を用いる場合は、`torch==2.1.2` と `torchvision==0.16.2` 、および `xformers==0.0.23.post1` を適宜変更してください。

 accelerate configの質問には以下のように答えてください。（bf16で学習する場合、最後の質問にはbf16と答えてください。）

-※0.15.0から日本語環境では選択のためにカーソルキーを押すと落ちます（……）。数字キーの0、1、2……で選択できますので、そちらを使ってください。
-
 ```txt
 - This machine
 - No distributed training
@@ -104,10 +90,6 @@ accelerate configの質問には以下のように答えてください。（bf1
 ※場合によって ``ValueError: fp16 mixed precision requires a GPU`` というエラーが出ることがあるようです。この場合、6番目の質問（
 ``What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:``）に「0」と答えてください。（id `0`のGPUが使われます。）

-### PyTorchとxformersのバージョンについて
-
-他のバージョンでは学習がうまくいかない場合があるようです。特に他の理由がなければ指定のバージョンをお使いください。
-
 ## アップグレード

 新しいリリースがあった場合、以下のコマンドで更新できます。
@@ -116,7 +98,7 @@ accelerate configの質問には以下のように答えてください。（bf1
 cd sd-scripts
 git pull
 .\venv\Scripts\activate
-pip install --upgrade -r requirements.txt
+pip install --use-pep517 --upgrade -r requirements.txt
 ```

 コマンドが成功すれば新しいバージョンが使用できます。
@@ -125,6 +107,8 @@ pip install --upgrade -r requirements.txt

 LoRAの実装は[cloneofsimo氏のリポジトリ](https://github.com/cloneofsimo/lora)を基にしたものです。感謝申し上げます。

+Conv2d 3x3への拡大は [cloneofsimo氏](https://github.com/cloneofsimo/lora) が最初にリリースし、KohakuBlueleaf氏が [LoCon](https://github.com/KohakuBlueleaf/LoCon) でその有効性を明らかにしたものです。KohakuBlueleaf氏に深く感謝します。
+
 ## ライセンス

 スクリプトのライセンスはASL 2.0ですが（Diffusersおよびcloneofsimo氏のリポジトリ由来のものも同様）、一部他のライセンスのコードを含みます。
@@ -135,4 +119,47 @@ LoRAの実装は[cloneofsimo氏のリポジトリ](https://github.com/cloneofsim

 [BLIP](https://github.com/salesforce/BLIP): BSD-3-Clause

+## その他の情報

+### LoRAの名称について
+
+`train_network.py` がサポートするLoRAについて、混乱を避けるため名前を付けました。ドキュメントは更新済みです。以下は当リポジトリ内の独自の名称です。
+
+1. __LoRA-LierLa__ : (LoRA for __Li__ n __e__ a __r__  __La__ yers、リエラと読みます)
+
+    Linear 層およびカーネルサイズ 1x1 の Conv2d 層に適用されるLoRA
+
+2. __LoRA-C3Lier__ : (LoRA for __C__ onvolutional layers with __3__ x3 Kernel and  __Li__ n __e__ a __r__ layers、セリアと読みます)
+
+    1.に加え、カーネルサイズ 3x3 の Conv2d 層に適用されるLoRA
+
+デフォルトではLoRA-LierLaが使われます。LoRA-C3Lierを使う場合は `--network_args` に `conv_dim` を指定してください。
+
+<!-- 
+LoRA-LierLa は[Web UI向け拡張](https://github.com/kohya-ss/sd-webui-additional-networks)、またはAUTOMATIC1111氏のWeb UIのLoRA機能で使用することができます。
+
+LoRA-C3Lierを使いWeb UIで生成するには拡張を使用してください。
+-->
+
+### 学習中のサンプル画像生成
+
+プロンプトファイルは例えば以下のようになります。
+
+```
+# prompt 1
+masterpiece, best quality, (1girl), in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
+
+# prompt 2
+masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n (low quality, worst quality), bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
+```
+
+  `#` で始まる行はコメントになります。`--n` のように「ハイフン二個＋英小文字」の形でオプションを指定できます。以下が使用できます。
+
+  * `--n` Negative prompt up to the next option.
+  * `--w` Specifies the width of the generated image.
+  * `--h` Specifies the height of the generated image.
+  * `--d` Specifies the seed of the generated image.
+  * `--l` Specifies the CFG scale of the generated image.
+  * `--s` Specifies the number of steps in the generation.
+
+  `( )` や `[ ]` などの重みづけも動作します。
--- a/README.md
+++ b/README.md
--- a/XTI_hijack.py
+++ b/XTI_hijack.py
@@ -0,0 +1,204 @@
+import torch
+from library.device_utils import init_ipex
+init_ipex()
+
+from typing import Union, List, Optional, Dict, Any, Tuple
+from diffusers.models.unet_2d_condition import UNet2DConditionOutput
+
+from library.original_unet import SampleOutput
+
+
+def unet_forward_XTI(
+    self,
+    sample: torch.FloatTensor,
+    timestep: Union[torch.Tensor, float, int],
+    encoder_hidden_states: torch.Tensor,
+    class_labels: Optional[torch.Tensor] = None,
+    return_dict: bool = True,
+) -> Union[Dict, Tuple]:
+    r"""
+    Args:
+        sample (`torch.FloatTensor`): (batch, channel, height, width) noisy inputs tensor
+        timestep (`torch.FloatTensor` or `float` or `int`): (batch) timesteps
+        encoder_hidden_states (`torch.FloatTensor`): (batch, sequence_length, feature_dim) encoder hidden states
+        return_dict (`bool`, *optional*, defaults to `True`):
+            Whether or not to return a dict instead of a plain tuple.
+
+    Returns:
+        `SampleOutput` or `tuple`:
+        `SampleOutput` if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is the sample tensor.
+    """
+    # By default samples have to be at least a multiple of the overall upsampling factor.
+    # The overall upsampling factor is equal to 2 ** (# num of upsampling layers).
+    # However, the upsampling interpolation output size can be forced to fit any upsampling size
+    # on the fly if necessary.
+    # By default the sample must be a multiple of 2^(number of upsamplers), i.e. a multiple of 64.
+    # To support other sizes as well, the upsample size is changed when necessary.
+    # Quality probably degrades in that case, so keeping sizes divisible by 64 is preferable.
+    default_overall_up_factor = 2**self.num_upsamplers
+
+    # upsample size should be forwarded when sample is not a multiple of `default_overall_up_factor`
+    # 64で割り切れないときはupsamplerにサイズを伝える
+    forward_upsample_size = False
+    upsample_size = None
+
+    if any(s % default_overall_up_factor != 0 for s in sample.shape[-2:]):
+        # logger.info("Forward upsample size to force interpolation output size.")
+        forward_upsample_size = True
+
+    # 1. time
+    timesteps = timestep
+    timesteps = self.handle_unusual_timesteps(sample, timesteps)  # only does work for unusual inputs
+
+    t_emb = self.time_proj(timesteps)
+
+    # timesteps does not contain any weights and will always return f32 tensors
+    # but time_embedding might actually be running in fp16. so we need to cast here.
+    # there might be better ways to encapsulate this.
+    # timestepsは重みを含まないので常にfloat32のテンソルを返す
+    # しかしtime_embeddingはfp16で動いているかもしれないので、ここでキャストする必要がある
+    # couldn't this cast simply be done inside time_proj?
+    t_emb = t_emb.to(dtype=self.dtype)
+    emb = self.time_embedding(t_emb)
+
+    # 2. pre-process
+    sample = self.conv_in(sample)
+
+    # 3. down
+    down_block_res_samples = (sample,)
+    down_i = 0
+    for downsample_block in self.down_blocks:
+        # down blocks could always accept encoder_hidden_states in forward,
+        # but this form is probably easier to follow
+        if downsample_block.has_cross_attention:
+            sample, res_samples = downsample_block(
+                hidden_states=sample,
+                temb=emb,
+                encoder_hidden_states=encoder_hidden_states[down_i : down_i + 2],
+            )
+            down_i += 2
+        else:
+            sample, res_samples = downsample_block(hidden_states=sample, temb=emb)
+
+        down_block_res_samples += res_samples
+
+    # 4. mid
+    sample = self.mid_block(sample, emb, encoder_hidden_states=encoder_hidden_states[6])
+
+    # 5. up
+    up_i = 7
+    for i, upsample_block in enumerate(self.up_blocks):
+        is_final_block = i == len(self.up_blocks) - 1
+
+        res_samples = down_block_res_samples[-len(upsample_block.resnets) :]
+        down_block_res_samples = down_block_res_samples[: -len(upsample_block.resnets)]  # skip connection
+
+        # if we have not reached the final block and need to forward the upsample size, we do it here
+        # 前述のように最後のブロック以外ではupsample_sizeを伝える
+        if not is_final_block and forward_upsample_size:
+            upsample_size = down_block_res_samples[-1].shape[2:]
+
+        if upsample_block.has_cross_attention:
+            sample = upsample_block(
+                hidden_states=sample,
+                temb=emb,
+                res_hidden_states_tuple=res_samples,
+                encoder_hidden_states=encoder_hidden_states[up_i : up_i + 3],
+                upsample_size=upsample_size,
+            )
+            up_i += 3
+        else:
+            sample = upsample_block(
+                hidden_states=sample, temb=emb, res_hidden_states_tuple=res_samples, upsample_size=upsample_size
+            )
+
+    # 6. post-process
+    sample = self.conv_norm_out(sample)
+    sample = self.conv_act(sample)
+    sample = self.conv_out(sample)
+
+    if not return_dict:
+        return (sample,)
+
+    return SampleOutput(sample=sample)
+
+
+def downblock_forward_XTI(
+    self, hidden_states, temb=None, encoder_hidden_states=None, attention_mask=None, cross_attention_kwargs=None
+):
+    output_states = ()
+    i = 0
+
+    for resnet, attn in zip(self.resnets, self.attentions):
+        if self.training and self.gradient_checkpointing:
+
+            def create_custom_forward(module, return_dict=None):
+                def custom_forward(*inputs):
+                    if return_dict is not None:
+                        return module(*inputs, return_dict=return_dict)
+                    else:
+                        return module(*inputs)
+
+                return custom_forward
+
+            hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states, temb)
+            hidden_states = torch.utils.checkpoint.checkpoint(
+                create_custom_forward(attn, return_dict=False), hidden_states, encoder_hidden_states[i]
+            )[0]
+        else:
+            hidden_states = resnet(hidden_states, temb)
+            hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states[i]).sample
+
+        output_states += (hidden_states,)
+        i += 1
+
+    if self.downsamplers is not None:
+        for downsampler in self.downsamplers:
+            hidden_states = downsampler(hidden_states)
+
+        output_states += (hidden_states,)
+
+    return hidden_states, output_states
+
+
+def upblock_forward_XTI(
+    self,
+    hidden_states,
+    res_hidden_states_tuple,
+    temb=None,
+    encoder_hidden_states=None,
+    upsample_size=None,
+):
+    i = 0
+    for resnet, attn in zip(self.resnets, self.attentions):
+        # pop res hidden states
+        res_hidden_states = res_hidden_states_tuple[-1]
+        res_hidden_states_tuple = res_hidden_states_tuple[:-1]
+        hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
+
+        if self.training and self.gradient_checkpointing:
+
+            def create_custom_forward(module, return_dict=None):
+                def custom_forward(*inputs):
+                    if return_dict is not None:
+                        return module(*inputs, return_dict=return_dict)
+                    else:
+                        return module(*inputs)
+
+                return custom_forward
+
+            hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states, temb)
+            hidden_states = torch.utils.checkpoint.checkpoint(
+                create_custom_forward(attn, return_dict=False), hidden_states, encoder_hidden_states[i]
+            )[0]
+        else:
+            hidden_states = resnet(hidden_states, temb)
+            hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states[i]).sample
+
+        i += 1
+
+    if self.upsamplers is not None:
+        for upsampler in self.upsamplers:
+            hidden_states = upsampler(hidden_states, upsample_size)
+
+    return hidden_states
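
Review note on `unet_forward_XTI`: the function indexes `encoder_hidden_states` per block, in steps of 2 for down blocks with cross-attention, index `6` for the mid block, and steps of 3 from `7` for up blocks with cross-attention. The sketch below derives that layout instead of hard-coding `6` and `7`. The helper name `xti_embedding_layout` is hypothetical, and the block configuration passed in is an assumption (the usual Stable Diffusion 1.x U-Net layout), which the diff itself does not state.

```python
from typing import Dict, List, Optional, Tuple

Span = Optional[Tuple[int, int]]


def xti_embedding_layout(down_has_attn: List[bool], up_has_attn: List[bool]) -> Dict[str, object]:
    """Return which encoder_hidden_states indices each U-Net block consumes."""
    down: List[Span] = []
    i = 0
    for has_attn in down_has_attn:
        if has_attn:
            down.append((i, i + 2))  # slice [i : i + 2], as in the down loop
            i += 2
        else:
            down.append(None)  # block without cross-attention takes no embedding

    mid = i  # the mid block reads a single embedding
    i += 1

    up: List[Span] = []
    for has_attn in up_has_attn:
        if has_attn:
            up.append((i, i + 3))  # slice [i : i + 3], as in the up loop
            i += 3
        else:
            up.append(None)

    return {"down": down, "mid": mid, "up": up, "total": i}


# Assumed Stable Diffusion 1.x layout: three cross-attention blocks on each side.
layout = xti_embedding_layout(
    down_has_attn=[True, True, True, False],
    up_has_attn=[False, True, True, True],
)
print(layout)
```

Under that assumption the derived mid index and first up index agree with the constants `6` and `7` in the function, and the caller has to supply 16 per-layer embeddings in total.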
--- a/_typos.toml
+++ b/_typos.toml
@@ -0,0 +1,35 @@
+# Files for typos
+# Instruction:  https://github.com/marketplace/actions/typos-action#getting-started
+
+[default.extend-identifiers]
+ddPn08="ddPn08"
+
+[default.extend-words]
+NIN="NIN"
+parms="parms"
+nin="nin"
+extention="extention" # Intentionally left
+nd="nd"
+shs="shs"
+sts="sts"
+scs="scs"
+cpc="cpc"
+coc="coc"
+cic="cic"
+msm="msm"
+usu="usu"
+ici="ici"
+lvl="lvl"
+dii="dii"
+muk="muk"
+ori="ori"
+hru="hru"
+rik="rik"
+koo="koo"
+yos="yos"
+wn="wn"
+hime="hime"
+
+
+[files]
+extend-exclude = ["_typos.toml", "venv"]
--- a/bitsandbytes_windows/libbitsandbytes_cuda118.dll
+++ b/bitsandbytes_windows/libbitsandbytes_cuda118.dll
--- a/bitsandbytes_windows/main.py
+++ b/bitsandbytes_windows/main.py
@@ -1,166 +1,166 @@
-"""
-extract factors the build is dependent on:
-[X] compute capability
-    [ ] TODO: Q - What if we have multiple GPUs of different makes?
- CUDA version
- Software:
-    - CPU-only: only CPU quantization functions (no optimizer, no matrix multiple)
-    - CuBLAS-LT: full-build 8-bit optimizer
-    - no CuBLAS-LT: no 8-bit matrix multiplication (`nomatmul`)
-
-evaluation:
-    - if paths faulty, return meaningful error
-    - else:
-        - determine CUDA version
-        - determine capabilities
-        - based on that set the default path
-"""
-
-import ctypes
-
-from .paths import determine_cuda_runtime_lib_path
-
-
-def check_cuda_result(cuda, result_val):
-    # 3. Check for CUDA errors
-    if result_val != 0:
-        error_str = ctypes.c_char_p()
-        cuda.cuGetErrorString(result_val, ctypes.byref(error_str))
-        print(f"CUDA exception! Error code: {error_str.value.decode()}")
-
-def get_cuda_version(cuda, cudart_path):
-    # https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html#group__CUDART____VERSION
-    try:
-        cudart = ctypes.CDLL(cudart_path)
-    except OSError:
-        # TODO: shouldn't we error or at least warn here?
-        print(f'ERROR: libcudart.so could not be read from path: {cudart_path}!')
-        return None
-
-    version = ctypes.c_int()
-    check_cuda_result(cuda, cudart.cudaRuntimeGetVersion(ctypes.byref(version)))
-    version = int(version.value)
-    major = version//1000
-    minor = (version-(major*1000))//10
-
-    if major < 11:
-       print('CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!')
-
-    return f'{major}{minor}'
-
-
-def get_cuda_lib_handle():
-    # 1. find libcuda.so library (GPU driver) (/usr/lib)
-    try:
-        cuda = ctypes.CDLL("libcuda.so")
-    except OSError:
-        # TODO: shouldn't we error or at least warn here?
-        print('CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!')
-        return None
-    check_cuda_result(cuda, cuda.cuInit(0))
-
-    return cuda
-
-
-def get_compute_capabilities(cuda):
-    """
-    1. find libcuda.so library (GPU driver) (/usr/lib)
-       init_device -> init variables -> call function by reference
-    2. call extern C function to determine CC
-       (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE__DEPRECATED.html)
-    3. Check for CUDA errors
-       https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
-    # bits taken from https://gist.github.com/f0k/63a664160d016a491b2cbea15913d549
-    """
-
-
-    nGpus = ctypes.c_int()
-    cc_major = ctypes.c_int()
-    cc_minor = ctypes.c_int()
-
-    device = ctypes.c_int()
-
-    check_cuda_result(cuda, cuda.cuDeviceGetCount(ctypes.byref(nGpus)))
-    ccs = []
-    for i in range(nGpus.value):
-        check_cuda_result(cuda, cuda.cuDeviceGet(ctypes.byref(device), i))
-        ref_major = ctypes.byref(cc_major)
-        ref_minor = ctypes.byref(cc_minor)
-        # 2. call extern C function to determine CC
-        check_cuda_result(
-            cuda, cuda.cuDeviceComputeCapability(ref_major, ref_minor, device)
-        )
-        ccs.append(f"{cc_major.value}.{cc_minor.value}")
-
-    return ccs
-
-
-# def get_compute_capability()-> Union[List[str, ...], None]: # FIXME: error
-def get_compute_capability(cuda):
-    """
-    Extracts the highest compute capbility from all available GPUs, as compute
-    capabilities are downwards compatible. If no GPUs are detected, it returns
-    None.
-    """
-    ccs = get_compute_capabilities(cuda)
-    if ccs is not None:
-        # TODO: handle different compute capabilities; for now, take the max
-        return ccs[-1]
-    return None
-
-
-def evaluate_cuda_setup():
-    print('')
-    print('='*35 + 'BUG REPORT' + '='*35)
-    print('Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues')
-    print('For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link')
-    print('='*80)
-    return "libbitsandbytes_cuda116.dll"            # $$$
-    
-    binary_name = "libbitsandbytes_cpu.so"
-    #if not torch.cuda.is_available():
-        #print('No GPU detected. Loading CPU library...')
-        #return binary_name
-
-    cudart_path = determine_cuda_runtime_lib_path()
-    if cudart_path is None:
-        print(
-            "WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!"
-        )
-        return binary_name
-
-    print(f"CUDA SETUP: CUDA runtime path found: {cudart_path}")
-    cuda = get_cuda_lib_handle()
-    cc = get_compute_capability(cuda)
-    print(f"CUDA SETUP: Highest compute capability among GPUs detected: {cc}")
-    cuda_version_string = get_cuda_version(cuda, cudart_path)
-
-
-    if cc == '':
-        print(
-            "WARNING: No GPU detected! Check your CUDA paths. Processing to load CPU-only library..."
-        )
-        return binary_name
-
-    # 7.5 is the minimum CC vor cublaslt
-    has_cublaslt = cc in ["7.5", "8.0", "8.6"]
-
-    # TODO:
-    # (1) CUDA missing cases (no CUDA installed by CUDA driver (nvidia-smi accessible)
-    # (2) Multiple CUDA versions installed
-
-    # we use ls -l instead of nvcc to determine the cuda version
-    # since most installations will have the libcudart.so installed, but not the compiler
-    print(f'CUDA SETUP: Detected CUDA version {cuda_version_string}')
-
-    def get_binary_name():
-        "if not has_cublaslt (CC < 7.5), then we have to choose  _nocublaslt.so"
-        bin_base_name = "libbitsandbytes_cuda"
-        if has_cublaslt:
-            return f"{bin_base_name}{cuda_version_string}.so"
-        else:
-            return f"{bin_base_name}{cuda_version_string}_nocublaslt.so"
-
-    binary_name = get_binary_name()
-
-    return binary_name
+"""
+extract factors the build is dependent on:
+[X] compute capability
+    [ ] TODO: Q - What if we have multiple GPUs of different makes?
+- CUDA version
+- Software:
+    - CPU-only: only CPU quantization functions (no optimizer, no matrix multiplication)
+    - CuBLAS-LT: full-build 8-bit optimizer
+    - no CuBLAS-LT: no 8-bit matrix multiplication (`nomatmul`)
+
+evaluation:
+    - if paths faulty, return meaningful error
+    - else:
+        - determine CUDA version
+        - determine capabilities
+        - based on that set the default path
+"""
+
+import ctypes
+
+from .paths import determine_cuda_runtime_lib_path
+
+
+def check_cuda_result(cuda, result_val):
+    # 3. Check for CUDA errors
+    if result_val != 0:
+        error_str = ctypes.c_char_p()
+        cuda.cuGetErrorString(result_val, ctypes.byref(error_str))
+        print(f"CUDA exception! Error code: {error_str.value.decode()}")
+
+def get_cuda_version(cuda, cudart_path):
+    # https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html#group__CUDART____VERSION
+    try:
+        cudart = ctypes.CDLL(cudart_path)
+    except OSError:
+        # TODO: shouldn't we error or at least warn here?
+        print(f'ERROR: libcudart.so could not be read from path: {cudart_path}!')
+        return None
+
+    version = ctypes.c_int()
+    check_cuda_result(cuda, cudart.cudaRuntimeGetVersion(ctypes.byref(version)))
+    version = int(version.value)
+    major = version//1000
+    minor = (version-(major*1000))//10
+
+    if major < 11:
+        print('CUDA SETUP: CUDA versions lower than 11 are currently not supported for LLM.int8(). You will only be able to use 8-bit optimizers and quantization routines!')
+
+    return f'{major}{minor}'
+
+
+def get_cuda_lib_handle():
+    # 1. find libcuda.so library (GPU driver) (/usr/lib)
+    try:
+        cuda = ctypes.CDLL("libcuda.so")
+    except OSError:
+        # TODO: shouldn't we error or at least warn here?
+        print('CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!')
+        return None
+    check_cuda_result(cuda, cuda.cuInit(0))
+
+    return cuda
+
+
+def get_compute_capabilities(cuda):
+    """
+    1. find libcuda.so library (GPU driver) (/usr/lib)
+       init_device -> init variables -> call function by reference
+    2. call extern C function to determine CC
+       (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE__DEPRECATED.html)
+    3. Check for CUDA errors
+       https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
+    # bits taken from https://gist.github.com/f0k/63a664160d016a491b2cbea15913d549
+    """
+
+
+    nGpus = ctypes.c_int()
+    cc_major = ctypes.c_int()
+    cc_minor = ctypes.c_int()
+
+    device = ctypes.c_int()
+
+    check_cuda_result(cuda, cuda.cuDeviceGetCount(ctypes.byref(nGpus)))
+    ccs = []
+    for i in range(nGpus.value):
+        check_cuda_result(cuda, cuda.cuDeviceGet(ctypes.byref(device), i))
+        ref_major = ctypes.byref(cc_major)
+        ref_minor = ctypes.byref(cc_minor)
+        # 2. call extern C function to determine CC
+        check_cuda_result(
+            cuda, cuda.cuDeviceComputeCapability(ref_major, ref_minor, device)
+        )
+        ccs.append(f"{cc_major.value}.{cc_minor.value}")
+
+    return ccs
+
+
+# def get_compute_capability()-> Union[List[str, ...], None]: # FIXME: error
+def get_compute_capability(cuda):
+    """
+    Extracts the highest compute capability from all available GPUs, as compute
+    capabilities are downwards compatible. If no GPUs are detected, it returns
+    None.
+    """
+    ccs = get_compute_capabilities(cuda)
+    if ccs is not None:
+        # TODO: handle different compute capabilities; for now, take the max
+        return ccs[-1]
+    return None
+
+
+def evaluate_cuda_setup():
+    print('')
+    print('='*35 + 'BUG REPORT' + '='*35)
+    print('Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues')
+    print('For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link')
+    print('='*80)
+    return "libbitsandbytes_cuda116.dll"            # $$$
+    
+    binary_name = "libbitsandbytes_cpu.so"
+    #if not torch.cuda.is_available():
+        #print('No GPU detected. Loading CPU library...')
+        #return binary_name
+
+    cudart_path = determine_cuda_runtime_lib_path()
+    if cudart_path is None:
+        print(
+            "WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!"
+        )
+        return binary_name
+
+    print(f"CUDA SETUP: CUDA runtime path found: {cudart_path}")
+    cuda = get_cuda_lib_handle()
+    cc = get_compute_capability(cuda)
+    print(f"CUDA SETUP: Highest compute capability among GPUs detected: {cc}")
+    cuda_version_string = get_cuda_version(cuda, cudart_path)
+
+
+    if cc == '':
+        print(
+            "WARNING: No GPU detected! Check your CUDA paths. Processing to load CPU-only library..."
+        )
+        return binary_name
+
+    # 7.5 is the minimum CC for cublaslt
+    has_cublaslt = cc in ["7.5", "8.0", "8.6"]
+
+    # TODO:
+    # (1) CUDA missing cases (no CUDA installed by CUDA driver (nvidia-smi accessible)
+    # (2) Multiple CUDA versions installed
+
+    # we use ls -l instead of nvcc to determine the cuda version
+    # since most installations will have the libcudart.so installed, but not the compiler
+    print(f'CUDA SETUP: Detected CUDA version {cuda_version_string}')
+
+    def get_binary_name():
+        "if not has_cublaslt (CC < 7.5), then we have to choose  _nocublaslt.so"
+        bin_base_name = "libbitsandbytes_cuda"
+        if has_cublaslt:
+            return f"{bin_base_name}{cuda_version_string}.so"
+        else:
+            return f"{bin_base_name}{cuda_version_string}_nocublaslt.so"
+
+    binary_name = get_binary_name()
+
+    return binary_name
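
The docstring of `get_compute_capability` promises the highest compute capability, but `ccs[-1]` returns the last device's entry, and `has_cublaslt` is a fixed list of strings. A minimal sketch of a numeric comparison follows, assuming (as the comment above says) that 7.5 is the minimum capability for cuBLASLt. The helper names `parse_cc`, `highest_compute_capability`, `cuda_version_string`, and `select_binary_name` are hypothetical and not part of bitsandbytes.

```python
def parse_cc(cc):
    """Turn a string such as "8.6" into a comparable tuple (8, 6)."""
    major, minor = cc.split(".")
    return int(major), int(minor)


def highest_compute_capability(ccs):
    """Return the highest capability string, or None when no GPU was found."""
    if not ccs:
        return None
    return max(ccs, key=parse_cc)


def cuda_version_string(version):
    """Convert the integer from cudaRuntimeGetVersion (e.g. 11060) to "116"."""
    major = version // 1000
    minor = (version % 1000) // 10
    return f"{major}{minor}"


def select_binary_name(ccs, version):
    """Pick a library name from the detected capabilities and runtime version."""
    cc = highest_compute_capability(ccs)
    if cc is None:
        # No GPU detected: fall back to the CPU-only library.
        return "libbitsandbytes_cpu.so"
    # Assumption: every capability >= 7.5 supports cuBLASLt.
    suffix = "" if parse_cc(cc) >= (7, 5) else "_nocublaslt"
    return f"libbitsandbytes_cuda{cuda_version_string(version)}{suffix}.so"
```

Unlike the fixed list in the diff, this comparison also accepts capabilities newer than 8.6; whether the corresponding binaries exist is a separate question that the sketch does not address.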
--- a/docs/config_README-en.md
+++ b/docs/config_README-en.md
@@ -0,0 +1,389 @@
+Original Source by kohya-ss
+
+First version:
+A.I Translation by Model: NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO, editing by Darkstorm2150
+
+Some parts are manually added.
+
+# Config Readme
+
+This README is about the configuration files that can be passed with the `--dataset_config` option.
+
+## Overview
+
+Passing a configuration file lets users control dataset settings in detail.
+
+* Multiple datasets can be configured
+   * For example, each dataset can have its own `resolution`, and the datasets can then be mixed for training.
+   * With training scripts that support both the DreamBooth method and the fine-tuning method, DreamBooth-style and fine-tuning-style datasets can be mixed.
+* Settings can be changed for each subset
+   * A subset is a partition of a dataset by image directory or metadata file. Several subsets make up a dataset.
+   * Options such as `keep_tokens` and `flip_aug` can be set per subset. Options such as `resolution` and `batch_size` are set per dataset, and their values are shared by all subsets in that dataset. Details are given below.
+
+The configuration file can be written in JSON or TOML. [TOML](https://toml.io/ja/v1.0.0-rc.2) is recommended because it is easier to write. The rest of this document assumes TOML.
+
+
+Here is an example of a configuration file written in TOML.
+
+```toml
+[general]
+shuffle_caption = true
+caption_extension = '.txt'
+keep_tokens = 1
+
+# This is a DreamBooth-style dataset
+[[datasets]]
+resolution = 512
+batch_size = 4
+keep_tokens = 2
+
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+  class_tokens = 'hoge girl'
+  # This subset uses keep_tokens = 2 (the value of the parent datasets)
+
+  [[datasets.subsets]]
+  image_dir = 'C:\fuga'
+  class_tokens = 'fuga boy'
+  keep_tokens = 3
+
+  [[datasets.subsets]]
+  is_reg = true
+  image_dir = 'C:\reg'
+  class_tokens = 'human'
+  keep_tokens = 1
+
+# This is a fine-tuning dataset
+[[datasets]]
+resolution = [768, 768]
+batch_size = 2
+
+  [[datasets.subsets]]
+  image_dir = 'C:\piyo'
+  metadata_file = 'C:\piyo\piyo_md.json'
+  # This subset uses keep_tokens = 1 (the value of [general])
+```
+
+In this example, three directories are trained as a DreamBooth-style dataset at 512x512 (batch size 4), and one directory is trained as a fine-tuning dataset at 768x768 (batch size 2).
+
+## Settings for datasets and subsets
+
+Settings for datasets and subsets can be registered in several places.
+
+* `[general]`
+    * Options that apply to all datasets or all subsets are specified here.
+    * If an option with the same name also appears in a dataset-specific or subset-specific setting, the dataset-specific or subset-specific value takes precedence.
+* `[[datasets]]`
+    * `datasets` is where dataset settings are registered. Options that apply to an individual dataset are specified here.
+    * If a subset-specific setting exists, the subset-specific value takes precedence.
+* `[[datasets.subsets]]`
+    * `datasets.subsets` is where subset settings are registered. Options that apply to an individual subset are specified here.
+
+The following diagram shows how the image directories in the previous example correspond to these registration locations.
+
+```
+C:\
+├─ hoge  ->  [[datasets.subsets]] No.1  ┐                        ┐
+├─ fuga  ->  [[datasets.subsets]] No.2  |->  [[datasets]] No.1   |->  [general]
+├─ reg   ->  [[datasets.subsets]] No.3  ┘                        |
+└─ piyo  ->  [[datasets.subsets]] No.4  -->  [[datasets]] No.2   ┘
+```
+
+Each image directory corresponds to one `[[datasets.subsets]]`. One or more `[[datasets.subsets]]` are combined to form one `[[datasets]]`. All `[[datasets]]` and `[[datasets.subsets]]` belong to `[general]`.
+
+The options available in each registration location differ, but when the same option is specified in more than one location, the value in the lower (more specific) location takes precedence. The handling of `keep_tokens` in the previous example illustrates this.
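
The precedence rule can be sketched in Python as a lookup from the most specific location to the least specific one. This is an illustration only: `resolve_option` is a hypothetical helper, not the function the training scripts use, and the dictionaries simply mirror the TOML example above.

```python
def resolve_option(name, general, dataset, subset, default=None):
    """Look up an option from the most specific location to the least specific."""
    for scope in (subset, dataset, general):
        if name in scope:
            return scope[name]
    return default


general = {"shuffle_caption": True, "caption_extension": ".txt", "keep_tokens": 1}
datasets = [
    {
        "resolution": 512,
        "keep_tokens": 2,
        "subsets": [
            {"image_dir": r"C:\hoge"},
            {"image_dir": r"C:\fuga", "keep_tokens": 3},
            {"image_dir": r"C:\reg", "is_reg": True, "keep_tokens": 1},
        ],
    },
    {
        "resolution": [768, 768],
        "subsets": [{"image_dir": r"C:\piyo", "metadata_file": r"C:\piyo\piyo_md.json"}],
    },
]

# keep_tokens for hoge, fuga, reg, piyo in that order.
resolved = [
    resolve_option("keep_tokens", general, dataset, subset)
    for dataset in datasets
    for subset in dataset["subsets"]
]
print(resolved)  # [2, 3, 1, 1]
```

The result matches the comments in the TOML example: `hoge` inherits 2 from its dataset, `fuga` and `reg` use their own values, and `piyo` falls back to `[general]`.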
+
+The available options also depend on which methods the training script supports.
+
+* Options specific to the DreamBooth method
+* Options specific to the fine-tuning method
+* Options available when using the caption dropout technique
+
+Training scripts that support both the DreamBooth method and the fine-tuning method can use the two together.
+Note that the method is determined per dataset, so DreamBooth-style subsets and fine-tuning-style subsets cannot be mixed within the same dataset.
+In other words, to use both methods together, subsets of different methods must belong to different datasets.
+
+In terms of program behavior, a subset is treated as a fine-tuning subset if it has the `metadata_file` option. Therefore, the subsets belonging to one dataset are valid as long as either all of them have the `metadata_file` option or none of them do.
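
A small sketch of that rule, assuming subsets are plain dictionaries; `dataset_method` and its return values are hypothetical names chosen for illustration, not identifiers from the codebase.

```python
def dataset_method(subsets):
    """Classify a dataset as 'fine_tuning' or 'dreambooth' from its subsets."""
    if not subsets:
        raise ValueError("dataset has no subsets")
    has_metadata = ["metadata_file" in subset for subset in subsets]
    if all(has_metadata):
        return "fine_tuning"
    if not any(has_metadata):
        return "dreambooth"
    # Some subsets have metadata_file and some do not: not allowed.
    raise ValueError("a dataset cannot mix subsets with and without metadata_file")
```

Calling it on the two datasets from the earlier example would classify the first as `dreambooth` and the second as `fine_tuning`.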
+
+The available options are explained below. Options with the same name as a command-line argument are generally not explained here; please refer to the other READMEs.
+
+### Common options for all training methods
+
+These options can be specified regardless of the training method.
+
+#### Dataset-specific options
+
+These options relate to dataset configuration. They cannot be written in `datasets.subsets`.
+
+
+| Option Name | Example Setting | `[general]` | `[[datasets]]` |
+| ---- | ---- | ---- | ---- |
+| `batch_size` | `1` | o | o |
+| `bucket_no_upscale` | `true` | o | o |
+| `bucket_reso_steps` | `64` | o | o |
+| `enable_bucket` | `true` | o | o |
+| `max_bucket_reso` | `1024` | o | o |
+| `min_bucket_reso` | `128` | o | o |
+| `resolution` | `256`, `[512, 512]` | o | o |
+
+* `batch_size`
+    * This corresponds to the command-line argument `--train_batch_size`.
+* `max_bucket_reso`, `min_bucket_reso`
+    * Specify the maximum and minimum resolutions of the buckets. Both values must be divisible by `bucket_reso_steps`.
+
+These settings are fixed per dataset. That means that subsets belonging to the same dataset will share these settings. For example, if you want to prepare datasets with different resolutions, you can define them as separate datasets as shown in the example above, and set different resolutions for each.
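
The divisibility requirement on `min_bucket_reso` and `max_bucket_reso` can be checked before training starts. The sketch below is a hypothetical helper (`check_bucket_resolutions` is not part of the library); the additional ordering check is an assumption, not something this README states.

```python
def check_bucket_resolutions(min_bucket_reso, max_bucket_reso, bucket_reso_steps):
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    for name, value in (
        ("min_bucket_reso", min_bucket_reso),
        ("max_bucket_reso", max_bucket_reso),
    ):
        if value % bucket_reso_steps != 0:
            problems.append(
                f"{name}={value} is not divisible by bucket_reso_steps={bucket_reso_steps}"
            )
    # Assumption: the minimum should not exceed the maximum.
    if min_bucket_reso > max_bucket_reso:
        problems.append("min_bucket_reso is larger than max_bucket_reso")
    return problems
```

With the example values from the table (`128`, `1024`, and steps of `64`) the function reports no problems.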
+
+#### Options for Subsets
+
+These options are related to subset configuration.
+
+| Option Name | Example Setting | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
+| ---- | ---- | ---- | ---- | ---- |
+| `color_aug` | `false` | o | o | o |
+| `face_crop_aug_range` | `[1.0, 3.0]` | o | o | o |
+| `flip_aug` | `true` | o | o | o |
+| `keep_tokens` | `2` | o | o | o |
+| `num_repeats` | `10` | o | o | o |
+| `random_crop` | `false` | o | o | o |
+| `shuffle_caption` | `true` | o | o | o |
+| `caption_prefix` | `"masterpiece, best quality, "` | o | o | o |
+| `caption_suffix` | `", from side"` | o | o | o |
+| `caption_separator` | (not specified) | o | o | o |
+| `keep_tokens_separator` | `"\|\|\|"` | o | o | o |
+| `secondary_separator` | `";;;"` | o | o | o |
+| `enable_wildcard` | `true` | o | o | o |
+| `resize_interpolation` | (not specified) | o | o | o |
+
+* `num_repeats`
+    * Specifies the number of repeats for images in a subset. This is equivalent to `--dataset_repeats` in fine-tuning but can be specified for any training method.
+* `caption_prefix`, `caption_suffix`
+    * Specifies the strings to be prepended and appended to the captions. Shuffling is performed with these strings included. Be cautious when using `keep_tokens`.
+* `caption_separator`
+    * Specifies the string to separate the tags. The default is `,`. This option is usually not necessary to set.
+* `keep_tokens_separator`
+    * Specifies the string that separates the parts of the caption to be kept fixed. For example, with the caption `aaa, bbb ||| ccc, ddd, eee, fff ||| ggg, hhh`, the parts `aaa, bbb` and `ggg, hhh` are kept as they are, while the rest is subject to shuffling and dropping. No comma is needed next to the separator. The resulting prompt will be, for example, `aaa, bbb, eee, ccc, fff, ggg, hhh` or `aaa, bbb, fff, ccc, eee, ggg, hhh`.
+* `secondary_separator`
+    * Specifies an additional separator. Parts joined by this separator are treated as a single tag during shuffling and dropping. The separator is then replaced by `caption_separator`. For example, `aaa;;;bbb;;;ccc` is either replaced by `aaa,bbb,ccc` or dropped as a whole.
+* `enable_wildcard`
+    * Enables wildcard notation and multi-line captions. Both are explained later.
+* `resize_interpolation`
+    * Specifies the interpolation method used when resizing images. Normally, there is no need to specify this. The following options can be specified: `lanczos`, `nearest`, `bilinear`, `linear`, `bicubic`, `cubic`, `area`, `box`. By default (when not specified), `area` is used for downscaling, and `lanczos` is used for upscaling. If this option is specified, the same interpolation method will be used for both upscaling and downscaling. When `lanczos` or `box` is specified, PIL is used; for other options, OpenCV is used.
+
+### DreamBooth-specific options
+
+DreamBooth-specific options exist only as subset-specific options.
+
+#### Subset-specific options
+
+These options relate to the configuration of DreamBooth subsets.
+
+| Option Name | Example Setting | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
+| ---- | ---- | ---- | ---- | ---- |
+| `image_dir` | `'C:\hoge'` | - | - | o (required) |
+| `caption_extension` | `".txt"` | o | o | o |
+| `class_tokens` | `"sks girl"` | - | - | o |
+| `cache_info` | `false` | o | o | o |
+| `is_reg` | `false` | - | - | o |
+
+First, note that `image_dir` must point to a directory that contains the image files directly. This is not compatible with the conventional DreamBooth layout, where images are placed in subdirectories. Also, naming a folder something like `5_cat` does not set the number of repeats or the class name. To set these individually, specify them explicitly with `num_repeats` and `class_tokens`.
+
+* `image_dir`
+    * Specifies the path to the image directory. This is a required option.
+    * Images must be placed directly under the directory.
+* `class_tokens`
+    * Sets the class tokens.
+    * Only used during training when a corresponding caption file does not exist. The determination of whether or not to use it is made on a per-image basis. If `class_tokens` is not specified and a caption file is not found, an error will occur.
+* `cache_info`
+    * Specifies whether to cache the image size and caption. If not specified, it is set to `false`. The cache is saved in `metadata_cache.json` in `image_dir`.
+    * Caching speeds up the loading of the dataset after the first time. It is effective when dealing with thousands of images or more.
+* `is_reg`
+    * Specifies whether the subset images are regularization images. If not specified, it is set to `false`, meaning that the images are not regularization images.
+
+### Fine-tuning method specific options
+
+Options for the fine-tuning method exist only as subset-specific options.
+
+#### Subset-specific options
+
+These options relate to the configuration of fine-tuning subsets.
+
+| Option Name | Example Setting | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
+| ---- | ---- | ---- | ---- | ---- |
+| `image_dir` | `'C:\hoge'` | - | - | o |
+| `metadata_file` | `'C:\piyo\piyo_md.json'` | - | - | o (required) |
+
+* `image_dir`
+    * Specify the path to the image directory. Unlike the DreamBooth method, specifying it is not mandatory, but it is recommended to do so.
+        * It can be omitted only when `--full_path` was added to the command line when the metadata file was generated.
+    * The images must be placed directly under the directory.
+* `metadata_file`
+    * Specify the path to the metadata file used for the subset. This is a required option.
+        * It is equivalent to the command-line argument `--in_json`.
+    * Because a metadata file must be specified for each subset, avoid combining images from different directories into a single metadata file. It is strongly recommended to prepare a separate metadata file for each image directory and register them as separate subsets.
+
+### Options available when caption dropout can be used
+
+Caption dropout options exist only as subset-specific options. They can be specified for both the DreamBooth method and the fine-tuning method, as long as the training script supports caption dropout.
+
+#### Subset-specific options
+
+These options relate to the configuration of subsets that can use caption dropout.
+
+| Option Name | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
+| ---- | ---- | ---- | ---- |
+| `caption_dropout_every_n_epochs` | o | o | o |
+| `caption_dropout_rate` | o | o | o |
+| `caption_tag_dropout_rate` | o | o | o |
+
+## Behavior when there are duplicate subsets
+
+In a DreamBooth dataset, subsets with the same `image_dir` are considered duplicates. In a fine-tuning dataset, subsets with the same `metadata_file` are considered duplicates. If duplicate subsets exist within a dataset, the second and later ones are ignored.
+
+However, if they belong to different datasets, they are not considered duplicates. For example, if you have subsets with the same `image_dir` in different datasets, they will not be considered duplicates. This is useful when you want to train with the same image but with different resolutions.
+
+```toml
+# Subsets in separate datasets are not considered duplicates, so both are used for training.
+
+[[datasets]]
+resolution = 512
+
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+
+[[datasets]]
+resolution = 768
+
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+```
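
The duplicate rule can be sketched as a per-dataset filter that keeps the first subset for each key. `drop_duplicate_subsets` is a hypothetical helper for illustration; the key is assumed to be `image_dir` for DreamBooth datasets and `metadata_file` for fine-tuning datasets.

```python
def drop_duplicate_subsets(subsets, key):
    """Keep the first subset for each value of `key` and drop later repeats."""
    seen = set()
    kept = []
    for subset in subsets:
        value = subset.get(key)
        if value is not None and value in seen:
            continue  # a later subset with the same key is ignored
        seen.add(value)
        kept.append(subset)
    return kept


# Applied per dataset, so the same image_dir in two datasets survives in both.
datasets = [
    {"resolution": 512, "subsets": [{"image_dir": r"C:\hoge"}, {"image_dir": r"C:\hoge"}]},
    {"resolution": 768, "subsets": [{"image_dir": r"C:\hoge"}]},
]
kept_counts = [len(drop_duplicate_subsets(d["subsets"], "image_dir")) for d in datasets]
print(kept_counts)  # [1, 1]
```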
+
+## Command Line Argument and Configuration File
+
+Some options in the configuration file overlap in role with command-line arguments.
+
+The following command-line arguments are ignored if a configuration file is passed:
+
+* `--train_data_dir`
+* `--reg_data_dir`
+* `--in_json`
+
+When one of the following options is specified both on the command line and in the configuration file, the value in the configuration file takes precedence over the command-line value. In most cases the configuration file option has the same name as the command-line argument; where the name differs, it is listed in the second column.
+
+| Command Line Argument Option   | Prioritized Configuration File Option |
+| ------------------------------- | ------------------------------------- |
+| `--bucket_no_upscale`           |                                       |
+| `--bucket_reso_steps`           |                                       |
+| `--caption_dropout_every_n_epochs` |                                       |
+| `--caption_dropout_rate`        |                                       |
+| `--caption_extension`           |                                       |
+| `--caption_tag_dropout_rate`    |                                       |
+| `--color_aug`                   |                                       |
+| `--dataset_repeats`             | `num_repeats`                          |
+| `--enable_bucket`               |                                       |
+| `--face_crop_aug_range`         |                                       |
+| `--flip_aug`                    |                                       |
+| `--keep_tokens`                 |                                       |
+| `--min_bucket_reso`             |                                       |
+| `--random_crop`                 |                                       |
+| `--resolution`                  |                                       |
+| `--shuffle_caption`             |                                       |
+| `--train_batch_size`            | `batch_size`                           |
+
+## Error Guide
+
+The configuration file is currently validated with an external library, but this integration is not yet polished and the error messages can be unclear. We plan to improve this in the future.
+
+As a temporary measure, common errors and their solutions are listed below. If you get an error even though the configuration should be correct, or if you cannot understand the error, please contact us, as it may be a bug.
+
+* `voluptuous.error.MultipleInvalid: required key not provided @ ...`: A required option is missing. Most likely you forgot to specify the option or misspelled its name.
+  * The `...` part of the message shows where the error occurred. For example, `voluptuous.error.MultipleInvalid: required key not provided @ data['datasets'][0]['subsets'][0]['image_dir']` means that `image_dir` is missing from the 0th `subsets` entry of the 0th `datasets` entry.
+* `voluptuous.error.MultipleInvalid: expected int for dictionary value @ ...`: The value has the wrong format. The `int` part changes depending on the option. The example settings in this README may help.
+* `voluptuous.error.MultipleInvalid: extra keys not allowed @ ...`: An unsupported option name is present. Most likely you misspelled the option name or included it by mistake.
+
+## Miscellaneous
+
+### Multi-line captions
+
+Setting `enable_wildcard = true` also enables multi-line captions. If a caption file consists of multiple lines, one line is randomly selected as the caption.
+
+```txt
+1girl, hatsune miku, vocaloid, upper body, looking at viewer, microphone, stage
+a girl with a microphone standing on a stage
+detailed digital art of a girl with a microphone on a stage
+```
+
+It can be combined with wildcard notation.
+
+In metadata files, you can also specify multiple-line captions. In the `.json` metadata file, use `\n` to represent a line break. If the caption file consists of multiple lines, `merge_captions_to_metadata.py` will create a metadata file in this format.
+
+The tags in the metadata (`tags`) are added to each line of the caption.
+
+```json
+{
+    "/path/to/image.png": {
+        "caption": "a cartoon of a frog with the word frog on it\ntest multiline caption1\ntest multiline caption2",
+        "tags": "open mouth, simple background, standing, no humans, animal, black background, frog, animal costume, animal focus"
+    },
+    ...
+}
+```
+
+In this case, the actual caption will be `a cartoon of a frog with the word frog on it, open mouth, simple background ...`, `test multiline caption1, open mouth, simple background ...`, `test multiline caption2, open mouth, simple background ...`, etc.
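
The combination of a multi-line caption with `tags` can be sketched as follows. The helpers `caption_candidates` and `pick_caption` are hypothetical names, and joining with `", "` is an assumption based on the example captions above.

```python
import random


def caption_candidates(caption, tags=""):
    """List every caption that could be selected for one metadata entry."""
    lines = [line for line in caption.split("\n") if line.strip()]
    if not tags:
        return lines
    # The tags are appended to each line of the caption.
    return [f"{line}, {tags}" for line in lines]


def pick_caption(caption, tags="", rng=random):
    """Randomly select one candidate, as done once per training sample."""
    return rng.choice(caption_candidates(caption, tags))


entry = {
    "caption": "a cartoon of a frog with the word frog on it\ntest multiline caption1\ntest multiline caption2",
    "tags": "open mouth, simple background",
}
candidates = caption_candidates(entry["caption"], entry["tags"])
```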
+
+### Example of configuration file : `secondary_separator`, wildcard notation, `keep_tokens_separator`, etc.
+
+```toml
+[general]
+flip_aug = true
+color_aug = false
+resolution = [1024, 1024]
+
+[[datasets]]
+batch_size = 6
+enable_bucket = true
+bucket_no_upscale = true
+caption_extension = ".txt"
+keep_tokens_separator= "|||"
+shuffle_caption = true
+caption_tag_dropout_rate = 0.1
+secondary_separator = ";;;" # can also be written on the subset side
+enable_wildcard = true # same as above
+
+  [[datasets.subsets]]
+  image_dir = "/path/to/image_dir"
+  num_repeats = 1
+
+  # No comma is required before or after ||| (it is added automatically)
+  caption_prefix = "1girl, hatsune miku, vocaloid |||"
+
+  # The part after ||| is kept as is, without being shuffled or dropped
+  # The strings are simply concatenated, so you need to add commas yourself
+  caption_suffix = ", anime screencap ||| masterpiece, rating: general"
+  caption_suffix = ", anime screencap ||| masterpiece, rating: general"
+```
+
+### Example of caption, secondary_separator notation: `secondary_separator = ";;;"`
+
+```txt
+1girl, hatsune miku, vocaloid, upper body, looking at viewer, sky;;;cloud;;;day, outdoors
+```
+The part `sky;;;cloud;;;day` is replaced with `sky,cloud,day` without shuffling or dropping. When shuffling and dropping are enabled, it is processed as a whole (as one tag). For example, it becomes `vocaloid, 1girl, upper body, sky,cloud,day, outdoors, hatsune miku` (shuffled) or `vocaloid, 1girl, outdoors, looking at viewer, upper body, hatsune miku` (dropped).
+
+### Example of caption, enable_wildcard notation: `enable_wildcard = true`
+
+```txt
+1girl, hatsune miku, vocaloid, upper body, looking at viewer, {simple|white} background
+```
+`simple` or `white` is randomly selected, and it becomes `simple background` or `white background`.
+
+```txt
+1girl, hatsune miku, vocaloid, {{retro style}}
+```
+If you want to include `{` or `}` in the tag string, double them like `{{` or `}}` (in this example, the actual caption used for training is `{retro style}`).
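
A minimal sketch of the wildcard expansion described above, including the `{{` / `}}` escape. `expand_wildcards` is a hypothetical helper, and the placeholder characters are assumed never to appear in a real caption.

```python
import random
import re

_OPEN, _CLOSE = "\x00", "\x01"  # placeholders, assumed not to occur in captions


def expand_wildcards(caption, rng=random):
    """Replace each {a|b|c} group with one randomly chosen alternative."""
    # Protect escaped braces so that they are not treated as wildcards.
    caption = caption.replace("{{", _OPEN).replace("}}", _CLOSE)
    caption = re.sub(
        r"\{([^{}]+)\}",
        lambda match: rng.choice(match.group(1).split("|")),
        caption,
    )
    # Restore the escaped braces as single literal braces.
    return caption.replace(_OPEN, "{").replace(_CLOSE, "}")
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which can help when checking a caption file by hand.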
+
+### Example of caption, `keep_tokens_separator` notation: `keep_tokens_separator = "|||"`
+
+```txt
+1girl, hatsune miku, vocaloid ||| stage, microphone, white shirt, smile ||| best quality, rating: general
+```
+It becomes `1girl, hatsune miku, vocaloid, microphone, stage, white shirt, best quality, rating: general` or `1girl, hatsune miku, vocaloid, white shirt, smile, stage, microphone, best quality, rating: general` etc.
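
The behavior in this example can be sketched as: split the caption at the separator, keep the first and last parts fixed, and shuffle or drop only the tags in between. `process_caption` is a hypothetical helper; its handling of captions with fewer or more than three parts is an assumption.

```python
import random


def process_caption(caption, keep_tokens_separator="|||", caption_separator=",",
                    shuffle=True, tag_dropout_rate=0.0, rng=random):
    """Shuffle and drop only the tags that are not protected by the separator."""

    def split_tags(text):
        return [tag.strip() for tag in text.split(caption_separator) if tag.strip()]

    parts = caption.split(keep_tokens_separator)
    # Assumption: one part means nothing is fixed; parts after the third are ignored.
    fixed_prefix = split_tags(parts[0]) if len(parts) > 1 else []
    fixed_suffix = split_tags(parts[2]) if len(parts) > 2 else []
    flexible = split_tags(parts[1] if len(parts) > 1 else parts[0])

    # Drop each flexible tag independently with the given probability.
    flexible = [tag for tag in flexible if rng.random() >= tag_dropout_rate]
    if shuffle:
        rng.shuffle(flexible)
    return ", ".join(fixed_prefix + flexible + fixed_suffix)


caption = "1girl, hatsune miku, vocaloid ||| stage, microphone, white shirt, smile ||| best quality, rating: general"
unchanged = process_caption(caption, shuffle=False)
```

Whatever the random choices, the result always starts with `1girl, hatsune miku, vocaloid` and ends with `best quality, rating: general`.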
+
--- a/docs/config_README-ja.md
+++ b/docs/config_README-ja.md
@@ -0,0 +1,392 @@
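+This document describes the configuration files that can be passed with `--dataset_config`.
+
+## Overview
+
+Passing a configuration file lets users control dataset settings in detail.
+
+* Multiple datasets can be configured
+    * For example, each dataset can have its own `resolution`, and the datasets can then be mixed for training.
+    * With training scripts that support both the DreamBooth method and the fine-tuning method, DreamBooth-style and fine-tuning-style datasets can be mixed.
+* Settings can be changed for each subset
+    * A subset is a partition of a dataset by image directory or metadata file. Several subsets make up a dataset.
+    * Options such as `keep_tokens` and `flip_aug` can be set per subset. Options such as `resolution` and `batch_size` are set per dataset, and their values are shared by all subsets in that dataset. Details are given below.
+
+The configuration file can be written in JSON or TOML. [TOML](https://toml.io/ja/v1.0.0-rc.2) is recommended because it is easier to write. The rest of this document assumes TOML.
+
+Here is an example of a configuration file written in TOML.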
+
+```toml
+[general]
+shuffle_caption = true
+caption_extension = '.txt'
+keep_tokens = 1
+
+# This is a DreamBooth-style dataset
+[[datasets]]
+resolution = 512
+batch_size = 4
+keep_tokens = 2
+
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+  class_tokens = 'hoge girl'
+  # This subset uses keep_tokens = 2 (the value of the parent datasets is used)
+
+  [[datasets.subsets]]
+  image_dir = 'C:\fuga'
+  class_tokens = 'fuga boy'
+  keep_tokens = 3
+
+  [[datasets.subsets]]
+  is_reg = true
+  image_dir = 'C:\reg'
+  class_tokens = 'human'
+  keep_tokens = 1
+
+# This is a fine-tuning-style dataset
+[[datasets]]
+resolution = [768, 768]
+batch_size = 2
+
+  [[datasets.subsets]]
+  image_dir = 'C:\piyo'
+  metadata_file = 'C:\piyo\piyo_md.json'
+  # This subset uses keep_tokens = 1 (the value of [general] is used)
+```
+
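+In this example, three directories are trained as a DreamBooth-style dataset at 512x512 (batch size 4), and one directory is trained as a fine-tuning-style dataset at 768x768 (batch size 2).
+
+## Settings for datasets and subsets
+
+Settings for datasets and subsets can be registered in several places.
+
+* `[general]`
+    * Options that apply to all datasets or all subsets are specified here.
+    * If an option with the same name also appears in a dataset-specific or subset-specific setting, the dataset-specific or subset-specific value takes precedence.
+* `[[datasets]]`
+    * `datasets` is where dataset settings are registered. Options that apply to an individual dataset are specified here.
+    * If a subset-specific setting exists, the subset-specific value takes precedence.
+* `[[datasets.subsets]]`
+    * `datasets.subsets` is where subset settings are registered. Options that apply to an individual subset are specified here.
+
+The following diagram shows how the image directories in the previous example correspond to these registration locations.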
+
+```
+C:\
+├─ hoge  ->  [[datasets.subsets]] No.1  ┐                        ┐
+├─ fuga  ->  [[datasets.subsets]] No.2  |->  [[datasets]] No.1   |->  [general]
+├─ reg   ->  [[datasets.subsets]] No.3  ┘                        |
+└─ piyo  ->  [[datasets.subsets]] No.4  -->  [[datasets]] No.2   ┘
+```
+
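+Each image directory corresponds to one `[[datasets.subsets]]`. One or more `[[datasets.subsets]]` are combined to form one `[[datasets]]`. All `[[datasets]]` and `[[datasets.subsets]]` belong to `[general]`.
+
+The options available in each registration location differ, but when the same option is specified in more than one location, the value in the lower (more specific) location takes precedence. The handling of `keep_tokens` in the previous example illustrates this.
+
+The available options also depend on which methods the training script supports.
+
+* Options specific to the DreamBooth method
+* Options specific to the fine-tuning method
+* Options available when caption dropout can be used
+
+Training scripts that support both the DreamBooth method and the fine-tuning method can use the two together.
+Note that the method is determined per dataset, so DreamBooth-style subsets and fine-tuning-style subsets cannot be mixed within the same dataset.
+In other words, to use both methods together, subsets of different methods must belong to different datasets.
+
+In terms of program behavior, a subset is treated as a fine-tuning subset if it has the `metadata_file` option described later.
+Therefore, the subsets belonging to one dataset are valid as long as either all of them have the `metadata_file` option or none of them do.
+
+The available options are explained below. Options with the same name as a command-line argument are generally not explained here; please refer to the other READMEs.
+
+### Common options for all training methods
+
+These options can be specified regardless of the training method.
+
+#### Dataset-specific options
+
+These options relate to dataset configuration. They cannot be written in `datasets.subsets`.
+
+| Option Name | Example Setting | `[general]` | `[[datasets]]` |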
+| ---- | ---- | ---- | ---- |
+| `batch_size` | `1` | o | o |
+| `bucket_no_upscale` | `true` | o | o |
+| `bucket_reso_steps` | `64` | o | o |
+| `enable_bucket` | `true` | o | o |
+| `max_bucket_reso` | `1024` | o | o |
+| `min_bucket_reso` | `128` | o | o |
+| `resolution` | `256`, `[512, 512]` | o | o |
+
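+* `batch_size`
+    * Equivalent to the command-line argument `--train_batch_size`.
+* `max_bucket_reso`, `min_bucket_reso`
+    * Specify the maximum and minimum resolutions of the buckets. Both values must be divisible by `bucket_reso_steps`.
+
+These settings are fixed per dataset.
+In other words, subsets belonging to the same dataset share these settings.
+For example, to prepare datasets with different resolutions, define them as separate datasets as in the example above and set a different resolution for each.
+
+#### Subset-specific options
+
+These options relate to subset configuration.
+
+| Option Name | Example Setting | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |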
+| ---- | ---- | ---- | ---- | ---- |
+| `color_aug` | `false` | o | o | o |
+| `face_crop_aug_range` | `[1.0, 3.0]` | o | o | o |
+| `flip_aug` | `true` | o | o | o |
+| `keep_tokens` | `2` | o | o | o |
+| `num_repeats` | `10` | o | o | o |
+| `random_crop` | `false` | o | o | o |
+| `shuffle_caption` | `true` | o | o | o |
+| `caption_prefix` | `"masterpiece, best quality, "` | o | o | o |
+| `caption_suffix` | `", from side"` | o | o | o |
+| `caption_separator` | (usually not set) | o | o | o |
+| `keep_tokens_separator` | `"\|\|\|"` | o | o | o |
+| `secondary_separator` | `";;;"` | o | o | o |
+| `enable_wildcard` | `true` | o | o | o |
+| `resize_interpolation` | (usually not set) | o | o | o |
+
+* `num_repeats`
+    * サブセットの画像の繰り返し回数を指定します。fine tuning における `--dataset_repeats` に相当しますが、`num_repeats` はどの学習方法でも指定可能です。
+* `caption_prefix`, `caption_suffix`
+    * キャプションの前、後に付与する文字列を指定します。シャッフルはこれらの文字列を含めた状態で行われます。`keep_tokens` を指定する場合には注意してください。
+
+* `caption_separator`
+    * タグを区切る文字列を指定します。デフォルトは `,` です。このオプションは通常は設定する必要はありません。
+
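+* `keep_tokens_separator`
+    * Specifies the string that delimits the parts of the caption to keep fixed. For example, with `aaa, bbb ||| ccc, ddd, eee, fff ||| ggg, hhh`, the parts `aaa, bbb` and `ggg, hhh` are kept as they are, without being shuffled or dropped. No comma is needed next to the separator. The resulting prompt is, for example, `aaa, bbb, eee, ccc, fff, ggg, hhh` or `aaa, bbb, fff, ccc, eee, ggg, hhh` (in both of these results `ddd` has been dropped).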
+
+* `secondary_separator`
+    * 追加の区切り文字を指定します。この区切り文字で区切られた部分は一つのタグとして扱われ、シャッフル、drop されます。その後、`caption_separator` に置き換えられます。たとえば `aaa;;;bbb;;;ccc` のように指定すると、`aaa,bbb,ccc` に置き換えられるか、まとめて drop されます。
+
+* `enable_wildcard`
+    * ワイルドカード記法および複数行キャプションを有効にします。ワイルドカード記法、複数行キャプションについては後述します。
+
+* `resize_interpolation`
+    * 画像のリサイズ時に使用する補間方法を指定します。通常は指定しなくて構いません。`lanczos`, `nearest`, `bilinear`, `linear`, `bicubic`, `cubic`, `area`, `box` が指定可能です。デフォルト（未指定時）は、縮小時は `area`、拡大時は `lanczos` になります。このオプションを指定すると、拡大時・縮小時とも同じ補間方法が使用されます。`lanczos`、`box`を指定するとPILが、それ以外を指定するとOpenCVが使用されます。
+
+### DreamBooth 方式専用のオプション
+
+DreamBooth 方式のオプションは、サブセット向けオプションのみ存在します。
+
+#### サブセット向けオプション
+
+DreamBooth 方式のサブセットの設定に関わるオプションです。
+
+| Option name | Example value | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
+| ---- | ---- | ---- | ---- | ---- |
+| `image_dir` | `'C:\hoge'` | - | - | o (required) |
+| `caption_extension` | `".txt"` | o | o | o |
+| `class_tokens` | `"sks girl"` | - | - | o |
+| `cache_info` | `false` | o | o | o |
+| `is_reg` | `false` | - | - | o |
+
+まず注意点として、 `image_dir` には画像ファイルが直下に置かれているパスを指定する必要があります。従来の DreamBooth の手法ではサブディレクトリに画像を置く必要がありましたが、そちらとは仕様に互換性がありません。また、`5_cat` のようなフォルダ名にしても、画像の繰り返し回数とクラス名は反映されません。これらを個別に設定したい場合、`num_repeats` と `class_tokens` で明示的に指定する必要があることに注意してください。
+
+* `image_dir`
+    * 画像ディレクトリのパスを指定します。指定必須オプションです。
+    * 画像はディレクトリ直下に置かれている必要があります。
+* `class_tokens`
+    * クラストークンを設定します。
+    * 画像に対応する caption ファイルが存在しない場合にのみ学習時に利用されます。利用するかどうかの判定は画像ごとに行います。`class_tokens` を指定しなかった場合に caption ファイルも見つからなかった場合にはエラーになります。
+* `cache_info`
+    * 画像サイズ、キャプションをキャッシュするかどうかを指定します。指定しなかった場合は `false` になります。キャッシュは `image_dir` に `metadata_cache.json` というファイル名で保存されます。
+    * キャッシュを行うと、二回目以降のデータセット読み込みが高速化されます。数千枚以上の画像を扱う場合には有効です。
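+* `is_reg`
+    * Specifies whether the images of the subset are regularization images. If not specified, it is treated as `false`, that is, the images are not treated as regularization images.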
+
+### fine tuning 方式専用のオプション
+
+fine tuning 方式のオプションは、サブセット向けオプションのみ存在します。
+
+#### サブセット向けオプション
+
+fine tuning 方式のサブセットの設定に関わるオプションです。
+
+| Option name | Example value | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
+| ---- | ---- | ---- | ---- | ---- |
+| `image_dir` | `'C:\hoge'` | - | - | o |
+| `metadata_file` | `'C:\piyo\piyo_md.json'` | - | - | o (required) |
+
+* `image_dir`
+    * 画像ディレクトリのパスを指定します。DreamBooth の手法の方とは異なり指定は必須ではありませんが、設定することを推奨します。
+        * 指定する必要がない状況としては、メタデータファイルの生成時に `--full_path` を付与して実行していた場合です。
+    * 画像はディレクトリ直下に置かれている必要があります。
+* `metadata_file`
+    * サブセットで利用されるメタデータファイルのパスを指定します。指定必須オプションです。
+        * コマンドライン引数の `--in_json` と同等です。
+    * サブセットごとにメタデータファイルを指定する必要がある仕様上、ディレクトリを跨いだメタデータを1つのメタデータファイルとして作成することは避けた方が良いでしょう。画像ディレクトリごとにメタデータファイルを用意し、それらを別々のサブセットとして登録することを強く推奨します。
+
+### caption dropout の手法が使える場合に指定可能なオプション
+
+caption dropout の手法が使える場合のオプションは、サブセット向けオプションのみ存在します。
+DreamBooth 方式か fine tuning 方式かに関わらず、caption dropout に対応している学習方法であれば指定可能です。
+
+#### サブセット向けオプション
+
+caption dropout が使えるサブセットの設定に関わるオプションです。
+
+| Option name | `[general]` | `[[datasets]]` | `[[datasets.subsets]]` |
+| ---- | ---- | ---- | ---- |
+| `caption_dropout_every_n_epochs` | o | o | o |
+| `caption_dropout_rate` | o | o | o |
+| `caption_tag_dropout_rate` | o | o | o |
+
+## 重複したサブセットが存在する時の挙動
+
+DreamBooth 方式のデータセットの場合、その中にある `image_dir` が同一のサブセットは重複していると見なされます。
+fine tuning 方式のデータセットの場合は、その中にある `metadata_file` が同一のサブセットは重複していると見なされます。
+データセット中に重複したサブセットが存在する場合、2個目以降は無視されます。
+
+一方、異なるデータセットに所属している場合は、重複しているとは見なされません。
+例えば、以下のように同一の `image_dir` を持つサブセットを別々のデータセットに入れた場合には、重複していないと見なします。
+これは、同じ画像でも異なる解像度で学習したい場合に役立ちます。
+
+```toml
+# 別々のデータセットに存在している場合は重複とは見なされず、両方とも学習に使われる
+
+[[datasets]]
+resolution = 512
+
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+
+[[datasets]]
+resolution = 768
+
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'
+```
+
+## コマンドライン引数との併用
+
+設定ファイルのオプションの中には、コマンドライン引数のオプションと役割が重複しているものがあります。
+
+以下に挙げるコマンドライン引数のオプションは、設定ファイルを渡した場合には無視されます。
+
+* `--train_data_dir`
+* `--reg_data_dir`
+* `--in_json`
+
+以下に挙げるコマンドライン引数のオプションは、コマンドライン引数と設定ファイルで同時に指定された場合、コマンドライン引数の値よりも設定ファイルの値が優先されます。特に断りがなければ同名のオプションとなります。
+
+| コマンドライン引数のオプション     | 優先される設定ファイルのオプション |
+| ---------------------------------- | ---------------------------------- |
+| `--bucket_no_upscale`              |                                    |
+| `--bucket_reso_steps`              |                                    |
+| `--caption_dropout_every_n_epochs` |                                    |
+| `--caption_dropout_rate`           |                                    |
+| `--caption_extension`              |                                    |
+| `--caption_tag_dropout_rate`       |                                    |
+| `--color_aug`                      |                                    |
+| `--dataset_repeats`                | `num_repeats`                      |
+| `--enable_bucket`                  |                                    |
+| `--face_crop_aug_range`            |                                    |
+| `--flip_aug`                       |                                    |
+| `--keep_tokens`                    |                                    |
+| `--min_bucket_reso`                |                                    |
+| `--random_crop`                    |                                    |
+| `--resolution`                     |                                    |
+| `--shuffle_caption`                |                                    |
+| `--train_batch_size`               | `batch_size`                       |
+
+## エラーの手引き
+
+現在、外部ライブラリを利用して設定ファイルの記述が正しいかどうかをチェックしているのですが、整備が行き届いておらずエラーメッセージがわかりづらいという問題があります。
+将来的にはこの問題の改善に取り組む予定です。
+
+次善策として、頻出のエラーとその対処法について載せておきます。
+正しいはずなのにエラーが出る場合、エラー内容がどうしても分からない場合は、バグかもしれないのでご連絡ください。
+
+* `voluptuous.error.MultipleInvalid: required key not provided @ ...`: 指定必須のオプションが指定されていないというエラーです。指定を忘れているか、オプション名を間違って記述している可能性が高いです。
+  * `...` の箇所にはエラーが発生した場所が載っています。例えば `voluptuous.error.MultipleInvalid: required key not provided @ data['datasets'][0]['subsets'][0]['image_dir']` のようなエラーが出たら、0 番目の `datasets` 中の 0 番目の `subsets` の設定に `image_dir` が存在しないということになります。
+* `voluptuous.error.MultipleInvalid: expected int for dictionary value @ ...`: 指定する値の形式が不正というエラーです。値の形式が間違っている可能性が高いです。`int` の部分は対象となるオプションによって変わります。この README に載っているオプションの「設定例」が役立つかもしれません。
+* `voluptuous.error.MultipleInvalid: extra keys not allowed @ ...`: 対応していないオプション名が存在している場合に発生するエラーです。オプション名を間違って記述しているか、誤って紛れ込んでいる可能性が高いです。
+
+## その他
+
+### 複数行キャプション
+
+`enable_wildcard = true` を設定することで、複数行キャプションも同時に有効になります。キャプションファイルが複数の行からなる場合、ランダムに一つの行が選ばれてキャプションとして利用されます。
+
+```txt
+1girl, hatsune miku, vocaloid, upper body, looking at viewer, microphone, stage
+a girl with a microphone standing on a stage
+detailed digital art of a girl with a microphone on a stage
+```
+
+ワイルドカード記法と組み合わせることも可能です。
+
+メタデータファイルでも同様に複数行キャプションを指定することができます。メタデータの .json 内には、`\n` を使って改行を表現してください。キャプションファイルが複数行からなる場合、`merge_captions_to_metadata.py` を使うと、この形式でメタデータファイルが作成されます。
+
+メタデータのタグ (`tags`) は、キャプションの各行に追加されます。
+
+```json
+{
+    "/path/to/image.png": {
+        "caption": "a cartoon of a frog with the word frog on it\ntest multiline caption1\ntest multiline caption2",
+        "tags": "open mouth, simple background, standing, no humans, animal, black background, frog, animal costume, animal focus"
+    },
+    ...
+}
+```
+
+この場合、実際のキャプションは `a cartoon of a frog with the word frog on it, open mouth, simple background ...` または `test multiline caption1, open mouth, simple background ...`、 `test multiline caption2, open mouth, simple background ...` 等になります。
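
The selection described above can be sketched as follows. This is a minimal illustration under the assumption that one caption line is chosen uniformly at random and the tags are appended after a comma; `pick_caption` is a hypothetical name, not a function of this repository.

```python
import random


def pick_caption(caption, tags=None, rng=random):
    """Pick one line of a multi-line caption and append the tags, if any."""
    # Each non-empty line of the caption is one candidate.
    lines = [line for line in caption.split("\n") if line.strip()]
    chosen = rng.choice(lines)
    # The metadata tags are appended to whichever line was chosen.
    return f"{chosen}, {tags}" if tags else chosen
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible when experimenting.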
+
+### 設定ファイルの記述例：追加の区切り文字、ワイルドカード記法、`keep_tokens_separator` 等
+
+```toml
+[general]
+flip_aug = true
+color_aug = false
+resolution = [1024, 1024]
+
+[[datasets]]
+batch_size = 6
+enable_bucket = true
+bucket_no_upscale = true
+caption_extension = ".txt"
+keep_tokens_separator = "|||"
+shuffle_caption = true
+caption_tag_dropout_rate = 0.1
+secondary_separator = ";;;" # subset 側に書くこともできます / can be written in the subset side
+enable_wildcard = true # 同上 / same as above
+
+  [[datasets.subsets]]
+  image_dir = "/path/to/image_dir"
+  num_repeats = 1
+
+  # ||| の前後はカンマは不要です（自動的に追加されます） / No comma is required before and after ||| (it is added automatically)
+  caption_prefix = "1girl, hatsune miku, vocaloid |||" 
+  
+  # ||| の後はシャッフル、drop されず残ります / After |||, it is not shuffled or dropped and remains
+  # 単純に文字列として連結されるので、カンマなどは自分で入れる必要があります / It is simply concatenated as a string, so you need to put commas yourself
+  caption_suffix = ", anime screencap ||| masterpiece, rating: general"
+```
+
+### キャプション記述例、secondary_separator 記法：`secondary_separator = ";;;"` の場合
+
+```txt
+1girl, hatsune miku, vocaloid, upper body, looking at viewer, sky;;;cloud;;;day, outdoors
+```
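+When it is not shuffled or dropped, the `sky;;;cloud;;;day` part is replaced with `sky,cloud,day`. When shuffling and dropping are enabled, it is processed as a whole (as a single tag). That is, the caption becomes, for example, `vocaloid, 1girl, upper body, sky,cloud,day, outdoors, hatsune miku` (shuffled) or `vocaloid, 1girl, outdoors, looking at viewer, upper body, hatsune miku` (the case where the tag was dropped).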
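
A rough Python sketch of this behavior is shown below. It only illustrates the idea (split on `caption_separator`, shuffle, then replace `secondary_separator`); the function name `process_caption` is an assumption, and tag dropping is left out for brevity.

```python
import random


def process_caption(caption, caption_separator=",", secondary_separator=";;;",
                    shuffle=True, rng=random):
    """Shuffle tags, keeping secondary-separated groups together."""
    # "sky;;;cloud;;;day" contains no caption separator, so it stays one tag.
    tags = [tag.strip() for tag in caption.split(caption_separator)]
    if shuffle:
        rng.shuffle(tags)
    joined = (caption_separator + " ").join(tags)
    # Only after shuffling is the secondary separator replaced.
    return joined.replace(secondary_separator, caption_separator)
```

Because the replacement happens last, the grouped tags always stay adjacent, wherever the group ends up.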
+
+### キャプション記述例、ワイルドカード記法： `enable_wildcard = true` の場合
+
+```txt
+1girl, hatsune miku, vocaloid, upper body, looking at viewer, {simple|white} background
+```
+ランダムに `simple` または `white` が選ばれ、`simple background` または `white background` になります。
+
+```txt
+1girl, hatsune miku, vocaloid, {{retro style}}
+```
+タグ文字列に `{` や `}` そのものを含めたい場合は `{{` や `}}` のように二つ重ねてください（この例では実際に学習に用いられるキャプションは `{retro style}` になります）。
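
The wildcard notation can be sketched with a small regular expression. This is an illustration only, not the repository's implementation; `expand_wildcards` is a hypothetical name and nested groups are not handled.

```python
import random
import re


def expand_wildcards(caption, rng=random):
    """Expand {a|b} groups and turn doubled braces into literal braces."""
    # Protect escaped braces first so they are not treated as wildcards.
    caption = caption.replace("{{", "\x00").replace("}}", "\x01")
    # Replace each {a|b|c} group with one randomly chosen alternative.
    caption = re.sub(
        r"\{([^{}]*)\}",
        lambda match: rng.choice(match.group(1).split("|")),
        caption,
    )
    # Restore the protected braces as single literal braces.
    return caption.replace("\x00", "{").replace("\x01", "}")
```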
+
+### キャプション記述例、`keep_tokens_separator` 記法： `keep_tokens_separator = "|||"` の場合
+
+```txt
+1girl, hatsune miku, vocaloid ||| stage, microphone, white shirt, smile ||| best quality, rating: general
+```
+`1girl, hatsune miku, vocaloid, microphone, stage, white shirt, best quality, rating: general` や `1girl, hatsune miku, vocaloid, white shirt, smile, stage, microphone, best quality, rating: general` などになります。
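
The following sketch shows one way the fixed and shuffled parts could be handled. It assumes at most two separators (fixed head, shuffled middle, fixed tail) and a per-tag dropout rate; `shuffle_with_fixed_parts` and its parameters are illustrative names, not the repository's API.

```python
import random


def shuffle_with_fixed_parts(caption, separator="|||", dropout_rate=0.0, rng=random):
    """Shuffle/drop only the middle part; keep the head and tail fixed."""
    parts = [part.strip() for part in caption.split(separator)]
    if len(parts) == 1:
        parts = [""] + parts  # no separator: nothing is fixed
    head, middle = parts[0], parts[1]
    tail = parts[2] if len(parts) > 2 else ""

    tags = [tag.strip() for tag in middle.split(",") if tag.strip()]
    # Drop each tag independently, then shuffle the survivors.
    tags = [tag for tag in tags if rng.random() >= dropout_rate]
    rng.shuffle(tags)

    # Commas around the separator are added here, so none are needed in the caption.
    pieces = [piece for piece in (head, ", ".join(tags), tail) if piece]
    return ", ".join(pieces)
```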
--- a/docs/fine_tune_README_ja.md
+++ b/docs/fine_tune_README_ja.md
@@ -0,0 +1,140 @@
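+Fine tuning that supports the training method proposed by NovelAI, automatic captioning, tagging, and environments such as Windows with 12 GB of VRAM (for SD v1.x). Here, fine tuning means training the model on images and captions (LoRA, Textual Inversion, and Hypernetworks are not included).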
+
+[学習についての共通ドキュメント](./train_README-ja.md) もあわせてご覧ください。
+
+# 概要
+
+Diffusersを用いてStable DiffusionのU-Netのfine tuningを行います。NovelAIの記事にある以下の改善に対応しています（Aspect Ratio BucketingについてはNovelAIのコードを参考にしましたが、最終的なコードはすべてオリジナルです）。
+
+* CLIP（Text Encoder）の最後の層ではなく最後から二番目の層の出力を用いる。
+* 正方形以外の解像度での学習（Aspect Ratio Bucketing） 。
+* トークン長を75から225に拡張する。
+* BLIPによるキャプショニング（キャプションの自動作成）、DeepDanbooruまたはWD14Taggerによる自動タグ付けを行う。
+* Hypernetworkの学習にも対応する。
+* Stable Diffusion v2.0（baseおよび768/v）に対応。
+* VAEの出力をあらかじめ取得しディスクに保存しておくことで、学習の省メモリ化、高速化を図る。
+
+デフォルトではText Encoderの学習は行いません。モデル全体のfine tuningではU-Netだけを学習するのが一般的なようです（NovelAIもそのようです）。オプション指定でText Encoderも学習対象とできます。
+
+# 追加機能について
+
+## CLIPの出力の変更
+
+プロンプトを画像に反映するため、テキストの特徴量への変換を行うのがCLIP（Text Encoder）です。Stable DiffusionではCLIPの最後の層の出力を用いていますが、それを最後から二番目の層の出力を用いるよう変更できます。NovelAIによると、これによりより正確にプロンプトが反映されるようになるとのことです。
+元のまま、最後の層の出力を用いることも可能です。
+
+※Stable Diffusion 2.0では最後から二番目の層をデフォルトで使います。clip_skipオプションを指定しないでください。
+
+## 正方形以外の解像度での学習
+
+Stable Diffusionは512\*512で学習されていますが、それに加えて256\*1024や384\*640といった解像度でも学習します。これによりトリミングされる部分が減り、より正しくプロンプトと画像の関係が学習されることが期待されます。
+学習解像度はパラメータとして与えられた解像度の面積（＝メモリ使用量）を超えない範囲で、64ピクセル単位で縦横に調整、作成されます。
+
+機械学習では入力サイズをすべて統一するのが一般的ですが、特に制約があるわけではなく、実際は同一のバッチ内で統一されていれば大丈夫です。NovelAIの言うbucketingは、あらかじめ教師データを、アスペクト比に応じた学習解像度ごとに分類しておくことを指しているようです。そしてバッチを各bucket内の画像で作成することで、バッチの画像サイズを統一します。
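
The bucketing idea can be sketched as below. This is a simplified illustration, not the script's actual algorithm: the function names and the default limits (`min_size`, `max_size`) are assumptions. It does reproduce the resolutions mentioned above (256\*1024, 384\*640) for a 512\*512 area.

```python
def make_buckets(max_area=512 * 512, min_size=256, max_size=1024, step=64):
    """List (width, height) buckets whose area does not exceed max_area."""
    buckets = set()
    for width in range(min_size, max_size + 1, step):
        # Largest height (a multiple of `step`) that keeps the area within the limit.
        height = min(max_size, (max_area // width) // step * step)
        if height >= min_size:
            buckets.add((width, height))
            buckets.add((height, width))  # the transposed resolution as well
    return sorted(buckets)


def assign_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda bucket: abs(bucket[0] / bucket[1] - ratio))
```

Every image assigned to the same bucket is resized to the same resolution, so a batch drawn from one bucket has a uniform size.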
+
+## トークン長の75から225への拡張
+
+Stable Diffusionでは最大75トークン（開始・終了を含むと77トークン）ですが、それを225トークンまで拡張します。
+ただしCLIPが受け付ける最大長は75トークンですので、225トークンの場合、単純に三分割してCLIPを呼び出してから結果を連結しています。
+
+※これが望ましい実装なのかどうかはいまひとつわかりません。とりあえず動いてはいるようです。特に2.0では何も参考になる実装がないので独自に実装してあります。
+
+※Automatic1111氏のWeb UIではカンマを意識して分割、といったこともしているようですが、私の場合はそこまでしておらず単純な分割です。
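
The simple three-way split can be sketched as follows. The token ids `49406` and `49407` are the start and end tokens of the CLIP tokenizer as far as I know; padding with the end token is an assumption made for this example, and `split_into_chunks` is a hypothetical name.

```python
BOS, EOS = 49406, 49407  # CLIP start/end token ids (for illustration)


def split_into_chunks(token_ids, chunk_len=75, num_chunks=3):
    """Split up to 225 token ids into three 77-token inputs for CLIP."""
    max_len = chunk_len * num_chunks
    ids = list(token_ids)[:max_len]       # truncate to 225 tokens
    ids += [EOS] * (max_len - len(ids))   # pad up to 225 tokens
    # Each chunk gets its own start and end token: 75 + 2 = 77.
    return [
        [BOS] + ids[start:start + chunk_len] + [EOS]
        for start in range(0, max_len, chunk_len)
    ]
```

Each chunk would then be passed to the Text Encoder separately and the outputs concatenated, as described above.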
+
+# 学習の手順
+
+あらかじめこのリポジトリのREADMEを参照し、環境整備を行ってください。
+
+## データの準備
+
+[学習データの準備について](./train_README-ja.md) を参照してください。fine tuningではメタデータを用いるfine tuning方式のみ対応しています。
+
+## 学習の実行
+たとえば以下のように実行します。以下は省メモリ化のための設定です。それぞれの行を必要に応じて書き換えてください。
+
+```
+accelerate launch --num_cpu_threads_per_process 1 fine_tune.py 
+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory>
+    --output_dir=<output folder for the trained model>
+    --output_name=<file name for the trained model>
+    --dataset_config=<.toml file created during data preparation>
+    --save_model_as=safetensors 
+    --learning_rate=5e-6 --max_train_steps=10000 
+    --use_8bit_adam --xformers --gradient_checkpointing
+    --mixed_precision=fp16
+```
+
+`num_cpu_threads_per_process` には通常は1を指定するとよいようです。
+
+`pretrained_model_name_or_path` に追加学習を行う元となるモデルを指定します。Stable Diffusionのcheckpointファイル（.ckptまたは.safetensors）、Diffusersのローカルディスクにあるモデルディレクトリ、DiffusersのモデルID（"stabilityai/stable-diffusion-2"など）が指定できます。
+
+`output_dir` に学習後のモデルを保存するフォルダを指定します。`output_name` にモデルのファイル名を拡張子を除いて指定します。`save_model_as` でsafetensors形式での保存を指定しています。
+
+`dataset_config` に `.toml` ファイルを指定します。ファイル内でのバッチサイズ指定は、当初はメモリ消費を抑えるために `1` としてください。
+
+学習させるステップ数 `max_train_steps` を10000とします。学習率 `learning_rate` はここでは5e-6を指定しています。
+
+省メモリ化のため `mixed_precision="fp16"` を指定します（RTX30 シリーズ以降では `bf16` も指定できます。環境整備時にaccelerateに行った設定と合わせてください）。また `gradient_checkpointing` を指定します。
+
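+To use 8-bit AdamW, an optimizer (the class that optimizes, i.e. trains, the model to fit the training data) with low memory consumption, the example above passes `--use_8bit_adam`. Specifying `optimizer_type="AdamW8bit"` selects the same optimizer.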
+
+`xformers` オプションを指定し、xformersのCrossAttentionを用います。xformersをインストールしていない場合やエラーとなる場合（環境にもよりますが `mixed_precision="no"` の場合など）、代わりに `mem_eff_attn` オプションを指定すると省メモリ版CrossAttentionを使用します（速度は遅くなります）。
+
+ある程度メモリがある場合は、`.toml` ファイルを編集してバッチサイズをたとえば `4` くらいに増やしてください（高速化と精度向上の可能性があります）。
+
+### よく使われるオプションについて
+
+以下の場合にはオプションに関するドキュメントを参照してください。
+
+- Training Stable Diffusion 2.x or a model derived from it
+- Training a model that assumes a clip skip of 2 or more
+- Training with captions longer than 75 tokens
+
+### バッチサイズについて
+
+モデル全体を学習するためLoRA等の学習に比べるとメモリ消費量は多くなります（DreamBoothと同じ）。
+
+### 学習率について
+
+1e-6から5e-6程度が一般的なようです。他のfine tuningの例なども参照してみてください。
+
+### 以前の形式のデータセット指定をした場合のコマンドライン
+
+解像度やバッチサイズをオプションで指定します。コマンドラインの例は以下の通りです。
+
+```
+accelerate launch --num_cpu_threads_per_process 1 fine_tune.py 
+    --pretrained_model_name_or_path=model.ckpt 
+    --in_json meta_lat.json 
+    --train_data_dir=train_data 
+    --output_dir=fine_tuned 
+    --shuffle_caption 
+    --train_batch_size=1 --learning_rate=5e-6 --max_train_steps=10000 
+    --use_8bit_adam --xformers --gradient_checkpointing
+    --mixed_precision=bf16
+    --save_every_n_epochs=4
+```
+
+<!-- 
+### 勾配をfp16とした学習（実験的機能）
+full_fp16オプションを指定すると勾配を通常のfloat32からfloat16（fp16）に変更して学習します（mixed precisionではなく完全なfp16学習になるようです）。これによりSD1.xの512*512サイズでは8GB未満、SD2.xの512*512サイズで12GB未満のVRAM使用量で学習できるようです。
+
+あらかじめaccelerate configでfp16を指定し、オプションでmixed_precision="fp16"としてください（bf16では動作しません）。
+
+メモリ使用量を最小化するためには、xformers、use_8bit_adam、gradient_checkpointingの各オプションを指定し、train_batch_sizeを1としてください。
+（余裕があるようならtrain_batch_sizeを段階的に増やすと若干精度が上がるはずです。）
+
+PyTorchのソースにパッチを当てて無理やり実現しています（PyTorch 1.12.1と1.13.0で確認）。精度はかなり落ちますし、途中で学習失敗する確率も高くなります。学習率やステップ数の設定もシビアなようです。それらを認識したうえで自己責任でお使いください。
+-->
+
+# fine tuning特有のその他の主なオプション
+
+すべてのオプションについては別文書を参照してください。
+
+## `train_text_encoder`
+Text Encoderも学習対象とします。メモリ使用量が若干増加します。
+
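+In ordinary fine tuning the Text Encoder is not trained (probably because the U-Net is trained to follow the Text Encoder's output), but when the amount of training data is small, training the Text Encoder as well, as in DreamBooth, also appears to be effective.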
+
+## `diffusers_xformers`
+スクリプト独自のxformers置換機能ではなくDiffusersのxformers機能を利用します。Hypernetworkの学習はできなくなります。
--- a/docs/gen_img_README-ja.md
+++ b/docs/gen_img_README-ja.md
@@ -0,0 +1,487 @@
+SD 1.xおよび2.xのモデル、当リポジトリで学習したLoRA、ControlNet（v1.0のみ動作確認）などに対応した、Diffusersベースの推論（画像生成）スクリプトです。コマンドラインから用います。
+
+# 概要
+
+* Diffusers (v0.10.2) ベースの推論（画像生成）スクリプト。
+* SD 1.xおよび2.x (base/v-parameterization)モデルに対応。
+* txt2img、img2img、inpaintingに対応。
+* 対話モード、およびファイルからのプロンプト読み込み、連続生成に対応。
+* プロンプト1行あたりの生成枚数を指定可能。
+* 全体の繰り返し回数を指定可能。
+* `fp16`だけでなく`bf16`にも対応。
+* xformersに対応し高速生成が可能。
+    * xformersにより省メモリ生成を行いますが、Automatic 1111氏のWeb UIほど最適化していないため、512*512の画像生成でおおむね6GB程度のVRAMを使用します。
+* プロンプトの225トークンへの拡張。ネガティブプロンプト、重みづけに対応。
+* Diffusersの各種samplerに対応（Web UIよりもsampler数は少ないです）。
+* Text Encoderのclip skip（最後からn番目の層の出力を用いる）に対応。
+* VAEの別途読み込み。
+* CLIP Guided Stable Diffusion、VGG16 Guided Stable Diffusion、Highres. fix、upscale対応。
+    * Highres. fixはWeb UIの実装を全く確認していない独自実装のため、出力結果は異なるかもしれません。
+* LoRA対応。適用率指定、複数LoRA同時利用、重みのマージに対応。
+    * Text EncoderとU-Netで別の適用率を指定することはできません。
+* Attention Coupleに対応。
+* ControlNet v1.0に対応。
+* 途中でモデルを切り替えることはできませんが、バッチファイルを組むことで対応できます。
+* 個人的に欲しくなった機能をいろいろ追加。
+
+機能追加時にすべてのテストを行っているわけではないため、以前の機能に影響が出て一部機能が動かない可能性があります。何か問題があればお知らせください。
+
+# 基本的な使い方
+
+## 対話モードでの画像生成
+
+以下のように入力してください。
+
+```batchfile
+python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先> --xformers --fp16 --interactive
+```
+
+`--ckpt`オプションにモデル（Stable Diffusionのcheckpointファイル、またはDiffusersのモデルフォルダ）、`--outdir`オプションに画像の出力先フォルダを指定します。
+
+The `--xformers` option enables xformers (remove it if you do not use xformers). The `--fp16` option runs inference in fp16 (half precision). On RTX 30 series GPUs, you can also run inference in bf16 (bfloat16) with the `--bf16` option.
+
+`--interactive`オプションで対話モードを指定しています。
+
+Stable Diffusion 2.0（またはそこからの追加学習モデル）を使う場合は`--v2`オプションを追加してください。v-parameterizationを使うモデル（`768-v-ema.ckpt`およびそこからの追加学習モデル）を使う場合はさらに`--v_parameterization`を追加してください。
+
+`--v2`の指定有無が間違っているとモデル読み込み時にエラーになります。`--v_parameterization`の指定有無が間違っていると茶色い画像が表示されます。
+
+`Type prompt:`と表示されたらプロンプトを入力してください。
+
+![image](https://user-images.githubusercontent.com/52813779/235343115-f3b8ac82-456d-4aab-9724-0cc73c4534aa.png)
+
+※画像が表示されずエラーになる場合、headless（画面表示機能なし）のOpenCVがインストールされているかもしれません。`pip install opencv-python`として通常のOpenCVを入れてください。または`--no_preview`オプションで画像表示を止めてください。
+
+画像ウィンドウを選択してから何らかのキーを押すとウィンドウが閉じ、次のプロンプトが入力できます。プロンプトでCtrl+Z、エンターの順に打鍵するとスクリプトを閉じます。
+
+## 単一のプロンプトで画像を一括生成
+
+以下のように入力します（実際には1行で入力します）。
+
+```batchfile
+python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先> 
+    --xformers --fp16 --images_per_prompt <生成枚数> --prompt "<プロンプト>"
+```
+
+`--images_per_prompt`オプションで、プロンプト1件当たりの生成枚数を指定します。`--prompt`オプションでプロンプトを指定します。スペースを含む場合はダブルクォーテーションで囲んでください。
+
+`--batch_size`オプションでバッチサイズを指定できます（後述）。
+
+## ファイルからプロンプトを読み込み一括生成
+
+以下のように入力します。
+
+```batchfile
+python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先> 
+    --xformers --fp16 --from_file <プロンプトファイル名>
+```
+
+`--from_file`オプションで、プロンプトが記述されたファイルを指定します。1行1プロンプトで記述してください。`--images_per_prompt`オプションを指定して1行あたり生成枚数を指定できます。
+
+## ネガティブプロンプト、重みづけの使用
+
+プロンプトオプション（プロンプト内で`--x`のように指定、後述）で`--n`を書くと、以降がネガティブプロンプトとなります。
+
+Weighting with `()`, `[]`, `(xxx:1.3)` and so on is also available, in the same way as in AUTOMATIC1111's Web UI (the implementation is copied from Diffusers' [Long Prompt Weighting Stable Diffusion](https://github.com/huggingface/diffusers/blob/main/examples/community/README.md#long-prompt-weighting-stable-diffusion)).
+
+コマンドラインからのプロンプト指定、ファイルからのプロンプト読み込みでも同様に指定できます。
+
+![image](https://user-images.githubusercontent.com/52813779/235343128-e79cd768-ec59-46f5-8395-fce9bdc46208.png)
+
+# 主なオプション
+
+コマンドラインから指定してください。
+
+## モデルの指定
+
+- `--ckpt <モデル名>`：モデル名を指定します。`--ckpt`オプションは必須です。Stable Diffusionのcheckpointファイル、またはDiffusersのモデルフォルダ、Hugging FaceのモデルIDを指定できます。
+
+- `--v2`：Stable Diffusion 2.x系のモデルを使う場合に指定します。1.x系の場合には指定不要です。
+
+- `--v_parameterization`：v-parameterizationを使うモデルを使う場合に指定します（`768-v-ema.ckpt`およびそこからの追加学習モデル、Waifu Diffusion v1.5など）。
+    
+    `--v2`の指定有無が間違っているとモデル読み込み時にエラーになります。`--v_parameterization`の指定有無が間違っていると茶色い画像が表示されます。
+
+- `--vae`：使用するVAEを指定します。未指定時はモデル内のVAEを使用します。
+
+## 画像生成と出力
+
+- `--interactive`：インタラクティブモードで動作します。プロンプトを入力すると画像が生成されます。
+
+- `--prompt <プロンプト>`：プロンプトを指定します。スペースを含む場合はダブルクォーテーションで囲んでください。
+
+- `--from_file <プロンプトファイル名>`：プロンプトが記述されたファイルを指定します。1行1プロンプトで記述してください。なお画像サイズやguidance scaleはプロンプトオプション（後述）で指定できます。
+
+- `--W <画像幅>`：画像の幅を指定します。デフォルトは`512`です。
+
+- `--H <画像高さ>`：画像の高さを指定します。デフォルトは`512`です。
+
+- `--steps <ステップ数>`：サンプリングステップ数を指定します。デフォルトは`50`です。
+
+- `--scale <ガイダンススケール>`：unconditionalガイダンススケールを指定します。デフォルトは`7.5`です。
+
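+- `--sampler <sampler name>`: Specifies the sampler. The default is `ddim`. The samplers provided by Diffusers, `ddim`, `pndm`, `dpmsolver`, `dpmsolver++`, `lms`, `euler`, and `euler_a`, can be specified (the last three can also be specified as `k_lms`, `k_euler`, and `k_euler_a`).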
+
+- `--outdir <画像出力先フォルダ>`：画像の出力先を指定します。
+
+- `--images_per_prompt <生成枚数>`：プロンプト1件当たりの生成枚数を指定します。デフォルトは`1`です。
+
+- `--clip_skip <スキップ数>`：CLIPの後ろから何番目の層を使うかを指定します。省略時は最後の層を使います。
+
+- `--max_embeddings_multiples <倍数>`：CLIPの入出力長をデフォルト（75）の何倍にするかを指定します。未指定時は75のままです。たとえば3を指定すると入出力長が225になります。
+
+- `--negative_scale`: Specifies the guidance scale for unconditioning separately. It was implemented with reference to [this article by gcem156](https://note.com/gcem156/n/ne9a53e4a6f43).
+
+## メモリ使用量や生成速度の調整
+
+- `--batch_size <バッチサイズ>`：バッチサイズを指定します。デフォルトは`1`です。バッチサイズが大きいとメモリを多く消費しますが、生成速度が速くなります。
+
+- `--vae_batch_size <VAEのバッチサイズ>`：VAEのバッチサイズを指定します。デフォルトはバッチサイズと同じです。
+    VAEのほうがメモリを多く消費するため、デノイジング後（stepが100%になった後）でメモリ不足になる場合があります。このような場合にはVAEのバッチサイズを小さくしてください。
+
+- `--xformers`：xformersを使う場合に指定します。
+
+- `--fp16`: Runs inference in fp16 (half precision). If neither `--fp16` nor `--bf16` is specified, inference runs in fp32 (single precision).
+
+- `--bf16`：bf16（bfloat16）での推論を行います。RTX 30系のGPUでのみ指定可能です。`--bf16`オプションはRTX 30系以外のGPUではエラーになります。`fp16`よりも`bf16`のほうが推論結果がNaNになる（真っ黒の画像になる）可能性が低いようです。
+
+## 追加ネットワーク（LoRA等）の使用
+
+- `--network_module`：使用する追加ネットワークを指定します。LoRAの場合は`--network_module networks.lora`と指定します。複数のLoRAを使用する場合は`--network_module networks.lora networks.lora networks.lora`のように指定します。
+
+- `--network_weights`：使用する追加ネットワークの重みファイルを指定します。`--network_weights model.safetensors`のように指定します。複数のLoRAを使用する場合は`--network_weights model1.safetensors model2.safetensors model3.safetensors`のように指定します。引数の数は`--network_module`で指定した数と同じにしてください。
+
+- `--network_mul`：使用する追加ネットワークの重みを何倍にするかを指定します。デフォルトは`1`です。`--network_mul 0.8`のように指定します。複数のLoRAを使用する場合は`--network_mul 0.4 0.5 0.7`のように指定します。引数の数は`--network_module`で指定した数と同じにしてください。
+
+- `--network_merge`：使用する追加ネットワークの重みを`--network_mul`に指定した重みであらかじめマージします。`--network_pre_calc` と同時に使用できません。プロンプトオプションの`--am`、およびRegional LoRAは使用できなくなりますが、LoRA未使用時と同じ程度まで生成が高速化されます。
+
+- `--network_pre_calc`：使用する追加ネットワークの重みを生成ごとにあらかじめ計算します。プロンプトオプションの`--am`が使用できます。LoRA未使用時と同じ程度まで生成は高速化されますが、生成前に重みを計算する時間が必要で、またメモリ使用量も若干増加します。Regional LoRA使用時は無効になります 。
+
+# 主なオプションの指定例
+
+次は同一プロンプトで64枚をバッチサイズ4で一括生成する例です。
+
+```batchfile
+python gen_img_diffusers.py --ckpt model.ckpt --outdir outputs 
+    --xformers --fp16 --W 512 --H 704 --scale 12.5 --sampler k_euler_a 
+    --steps 32 --batch_size 4 --images_per_prompt 64 
+    --prompt "beautiful flowers --n monochrome"
+```
+
+次はファイルに書かれたプロンプトを、それぞれ10枚ずつ、バッチサイズ4で一括生成する例です。
+
+```batchfile
+python gen_img_diffusers.py --ckpt model.ckpt --outdir outputs 
+    --xformers --fp16 --W 512 --H 704 --scale 12.5 --sampler k_euler_a 
+    --steps 32 --batch_size 4 --images_per_prompt 10 
+    --from_file prompts.txt
+```
+
+Textual Inversion（後述）およびLoRAの使用例です。
+
+```batchfile
+python gen_img_diffusers.py --ckpt model.safetensors 
+    --scale 8 --steps 48 --outdir txt2img --xformers 
+    --W 512 --H 768 --fp16 --sampler k_euler_a 
+    --textual_inversion_embeddings goodembed.safetensors negprompt.pt 
+    --network_module networks.lora networks.lora 
+    --network_weights model1.safetensors model2.safetensors 
+    --network_mul 0.4 0.8 
+    --clip_skip 2 --max_embeddings_multiples 1 
+    --batch_size 8 --images_per_prompt 1 --interactive
+```
+
+# プロンプトオプション
+
+プロンプト内で、`--n`のように「ハイフンふたつ+アルファベットn文字」でプロンプトから各種オプションの指定が可能です。対話モード、コマンドライン、ファイル、いずれからプロンプトを指定する場合でも有効です。
+
+プロンプトのオプション指定`--n`の前後にはスペースを入れてください。
+
+- `--n`：ネガティブプロンプトを指定します。
+
+- `--w`：画像幅を指定します。コマンドラインからの指定を上書きします。
+
+- `--h`：画像高さを指定します。コマンドラインからの指定を上書きします。
+
+- `--s`：ステップ数を指定します。コマンドラインからの指定を上書きします。
+
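+- `--d`: Specifies the random seed for this image. When `--images_per_prompt` is specified, give multiple seeds separated by commas, as in `--d 1,2,3,4`.
+    Note: for various reasons, the generated image may differ from the Web UI even with the same random seed.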
+
+- `--l`：guidance scaleを指定します。コマンドラインからの指定を上書きします。
+
+- `--t`：img2img（後述）のstrengthを指定します。コマンドラインからの指定を上書きします。
+
+- `--nl`：ネガティブプロンプトのguidance scaleを指定します（後述）。コマンドラインからの指定を上書きします。
+
+- `--am`：追加ネットワークの重みを指定します。コマンドラインからの指定を上書きします。複数の追加ネットワークを使用する場合は`--am 0.8,0.5,0.3`のように __カンマ区切りで__ 指定します。
+
+※これらのオプションを指定すると、バッチサイズよりも小さいサイズでバッチが実行される場合があります（これらの値が異なると一括生成できないため）。（あまり気にしなくて大丈夫ですが、ファイルからプロンプトを読み込み生成する場合は、これらの値が同一のプロンプトを並べておくと効率が良くなります。）
+
+例：
+```
+(masterpiece, best quality), 1girl, in shirt and plated skirt, standing at street under cherry blossoms, upper body, [from below], kind smile, looking at another, [goodembed] --n realistic, real life, (negprompt), (lowres:1.1), (worst quality:1.2), (low quality:1.1), bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, normal quality, jpeg artifacts, signature, watermark, username, blurry --w 960 --h 640 --s 28 --d 1
+```
+
+![image](https://user-images.githubusercontent.com/52813779/235343446-25654172-fff4-4aaf-977a-20d262b51676.png)
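
To make the notation concrete, here is a minimal sketch of how such a prompt line could be split into the prompt and its options. It is not the script's parser: `parse_prompt` is a hypothetical name, and it relies on the rule above that each option is preceded by a space.

```python
def parse_prompt(line):
    """Split a prompt line into the prompt text and its `--x` options."""
    parts = line.strip().split(" --")
    result = {"prompt": parts[0].strip()}
    for part in parts[1:]:
        key, _, value = part.partition(" ")
        value = value.strip()
        if key in ("w", "h", "s"):
            result[key] = int(value)       # width, height, steps
        elif key == "d":
            # Seeds are given as a comma-separated list.
            result[key] = [int(seed) for seed in value.split(",")]
        elif key in ("l", "t", "nl"):
            result[key] = float(value)     # scales and strength
        else:
            result[key] = value            # e.g. "n" for the negative prompt
    return result
```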
+
+# img2img
+
+## オプション
+
+- `--image_path`：img2imgに利用する画像を指定します。`--image_path template.png`のように指定します。フォルダを指定すると、そのフォルダの画像を順次利用します。
+
+- `--strength`：img2imgのstrengthを指定します。`--strength 0.8`のように指定します。デフォルトは`0.8`です。
+
+- `--sequential_file_name`：ファイル名を連番にするかどうかを指定します。指定すると生成されるファイル名が`im_000001.png`からの連番になります。
+
+- `--use_original_file_name`：指定すると生成ファイル名がオリジナルのファイル名と同じになります。
+
+## コマンドラインからの実行例
+
+```batchfile
+python gen_img_diffusers.py --ckpt trinart_characters_it4_v1_vae_merged.ckpt 
+    --outdir outputs --xformers --fp16 --scale 12.5 --sampler k_euler --steps 32 
+    --image_path template.png --strength 0.8 
+    --prompt "1girl, cowboy shot, brown hair, pony tail, brown eyes, 
+          sailor school uniform, outdoors 
+          --n lowres, bad anatomy, bad hands, error, missing fingers, cropped, 
+          worst quality, low quality, normal quality, jpeg artifacts, (blurry), 
+          hair ornament, glasses" 
+    --batch_size 8 --images_per_prompt 32
+```
+
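+If a folder is specified for the `--image_path` option, the images in that folder are read in order. The number of generated images is determined by the number of prompts, not the number of images, so use the `--images_per_prompt` option to match the number of prompts to the number of images to process with img2img.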
+
+ファイルはファイル名でソートして読み込みます。なおソート順は文字列順となりますので（`1.jpg→2.jpg→10.jpg`ではなく`1.jpg→10.jpg→2.jpg`の順）、頭を0埋めするなどしてご対応ください（`01.jpg→02.jpg→10.jpg`）。
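
The difference between string order and numeric order is easy to reproduce in Python. The helper `zero_pad` below is a hypothetical example of renaming files with zero padding; it is not part of the script.

```python
files = ["1.jpg", "2.jpg", "10.jpg"]

# Plain string sorting compares character by character, so "10" sorts before "2".
string_sorted = sorted(files)


def zero_pad(name, width=2):
    """Zero-pad the part of the file name before the extension."""
    stem, dot, ext = name.rpartition(".")
    return stem.zfill(width) + dot + ext


# After zero padding, string order and numeric order agree.
padded_sorted = sorted(zero_pad(name) for name in files)
```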
+
+## img2imgを利用したupscale
+
+img2img時にコマンドラインオプションの`--W`と`--H`で生成画像サイズを指定すると、元画像をそのサイズにリサイズしてからimg2imgを行います。
+
+またimg2imgの元画像がこのスクリプトで生成した画像の場合、プロンプトを省略すると、元画像のメタデータからプロンプトを取得しそのまま用います。これによりHighres. fixの2nd stageの動作だけを行うことができます。
+
+## img2img時のinpainting
+
+画像およびマスク画像を指定してinpaintingできます（inpaintingモデルには対応しておらず、単にマスク領域を対象にimg2imgするだけです）。
+
+オプションは以下の通りです。
+
+- `--mask_image`: Specifies the mask image. As with `--image_path`, if a folder is specified, the images in that folder are used in order.
+
+マスク画像はグレースケール画像で、白の部分がinpaintingされます。境界をグラデーションしておくとなんとなく滑らかになりますのでお勧めです。
+
+![image](https://user-images.githubusercontent.com/52813779/235343795-9eaa6d98-02ff-4f32-b089-80d1fc482453.png)
+
+# その他の機能
+
+## Textual Inversion
+
+`--textual_inversion_embeddings`オプションで使用するembeddingsを指定します（複数指定可）。拡張子を除いたファイル名をプロンプト内で使用することで、そのembeddingsを利用します（Web UIと同様の使用法です）。ネガティブプロンプト内でも使用できます。
+
+モデルとして、当リポジトリで学習したTextual Inversionモデル、およびWeb UIで学習したTextual Inversionモデル（画像埋め込みは非対応）を利用できます
+
+## Extended Textual Inversion
+
+`--textual_inversion_embeddings`の代わりに`--XTI_embeddings`オプションを指定してください。使用法は`--textual_inversion_embeddings`と同じです。
+
+## Highres. fix
+
+AUTOMATIC1111氏のWeb UIにある機能の類似機能です（独自実装のためもしかしたらいろいろ異なるかもしれません）。最初に小さめの画像を生成し、その画像を元にimg2imgすることで、画像全体の破綻を防ぎつつ大きな解像度の画像を生成します。
+
+2nd stageのstep数は`--steps` と`--strength`オプションの値から計算されます（`steps*strength`）。
+
+img2imgと併用できません。
+
+以下のオプションがあります。
+
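+- `--highres_fix_scale`: Enables Highres. fix and specifies, as a scale factor, the size of the image generated in the 1st stage. If the final output is 1024x1024 and a 512x512 image is generated first, specify `--highres_fix_scale 0.5`. Note that this is the reciprocal of the value specified in the Web UI.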
+
+- `--highres_fix_steps`：1st stageの画像のステップ数を指定します。デフォルトは`28`です。
+
+- `--highres_fix_save_1st`：1st stageの画像を保存するかどうかを指定します。
+
+- `--highres_fix_latents_upscaling`：指定すると2nd stageの画像生成時に1st stageの画像をlatentベースでupscalingします（bilinearのみ対応）。未指定時は画像をLANCZOS4でupscalingします。
+
+- `--highres_fix_upscaler`：2nd stageに任意のupscalerを利用します。現在は`--highres_fix_upscaler tools.latent_upscaler` のみ対応しています。
+
+- `--highres_fix_upscaler_args`：`--highres_fix_upscaler`で指定したupscalerに渡す引数を指定します。
+    `tools.latent_upscaler`の場合は、`--highres_fix_upscaler_args "weights=D:\Work\SD\Models\others\etc\upscaler-v1-e100-220.safetensors"`のように重みファイルを指定します。 
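
The arithmetic behind these options can be sketched as follows. The truncation with `int()` is an assumption made for this example (the script's exact rounding is not described here), and `highres_fix_plan` is a hypothetical name.

```python
def highres_fix_plan(width, height, scale, steps, strength):
    """Return the 1st stage image size and the 2nd stage step count."""
    # The 1st stage image is the final size multiplied by --highres_fix_scale.
    first_stage_size = (int(width * scale), int(height * scale))
    # The 2nd stage (img2img) runs steps * strength steps.
    second_stage_steps = int(steps * strength)
    return first_stage_size, second_stage_steps
```

For example, `--W 1024 --H 1024 --highres_fix_scale 0.5 --steps 48 --strength 0.5` gives a 512x512 1st stage image and 24 steps in the 2nd stage under these assumptions.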
+
+コマンドラインの例です。
+
+```batchfile
+python gen_img_diffusers.py  --ckpt trinart_characters_it4_v1_vae_merged.ckpt
+    --n_iter 1 --scale 7.5 --W 1024 --H 1024 --batch_size 1 --outdir ../txt2img 
+    --steps 48 --sampler ddim --fp16 
+    --xformers 
+    --images_per_prompt 1  --interactive 
+    --highres_fix_scale 0.5 --highres_fix_steps 28 --strength 0.5
+```
+
+## ControlNet
+
+現在はControlNet 1.0のみ動作確認しています。プリプロセスはCannyのみサポートしています。
+
+以下のオプションがあります。
+
+- `--control_net_models`：ControlNetのモデルファイルを指定します。
+    複数指定すると、それらをstepごとに切り替えて利用します（Web UIのControlNet拡張の実装と異なります）。diffと通常の両方をサポートします。
+
+- `--guide_image_path`: Specifies the hint image used for ControlNet. As with `--image_path`, if a folder is specified, the images in that folder are used in order. For models other than Canny, run the preprocessing beforehand.
+
+- `--control_net_preps`：ControlNetのプリプロセスを指定します。`--control_net_models`と同様に複数指定可能です。現在はcannyのみ対応しています。対象モデルでプリプロセスを使用しない場合は `none` を指定します。
+   cannyの場合 `--control_net_preps canny_63_191`のように、閾値1と2を'_'で区切って指定できます。
+
+- `--control_net_weights`：ControlNetの適用時の重みを指定します（`1.0`で通常、`0.5`なら半分の影響力で適用）。`--control_net_models`と同様に複数指定可能です。
+
+- `--control_net_ratios`：ControlNetを適用するstepの範囲を指定します。`0.5`の場合は、step数の半分までControlNetを適用します。`--control_net_models`と同様に複数指定可能です。
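
The `--control_net_preps` notation (`canny_63_191`, `none`) can be parsed along these lines. This is a sketch for illustration only; `parse_prep` is a hypothetical name, and what the script does when the thresholds are omitted is not covered here.

```python
def parse_prep(prep):
    """Parse a preprocess spec such as "canny_63_191" or "none"."""
    if prep == "none":
        return None  # no preprocessing for this model
    # The name and the two thresholds are separated by underscores.
    name, *thresholds = prep.split("_")
    return name, [int(value) for value in thresholds]
```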
+
+コマンドラインの例です。
+
+```batchfile
+python gen_img_diffusers.py --ckpt model.ckpt --scale 8 --steps 48 --outdir txt2img --xformers 
+    --W 512 --H 768 --bf16 --sampler k_euler_a 
+    --control_net_models diff_control_sd15_canny.safetensors --control_net_weights 1.0 
+    --guide_image_path guide.png --control_net_ratios 1.0 --interactive
+```
+
+## Attention Couple + Regional LoRA
+
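+This feature splits the prompt into several parts and lets you specify which region of the image each part is applied to. There is no dedicated option; it is specified through `--mask_path` and the prompt.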
+
+まず、プロンプトで` AND `を利用して、複数部分を定義します。最初の3つに対して領域指定ができ、以降の部分は画像全体へ適用されます。ネガティブプロンプトは画像全体に適用されます。
+
+以下ではANDで3つの部分を定義しています。
+
+```
+shs 2girls, looking at viewer, smile AND bsb 2girls, looking back AND 2girls --n bad quality, worst quality
+```
+
+次にマスク画像を用意します。マスク画像はカラーの画像で、RGBの各チャネルがプロンプトのANDで区切られた部分に対応します。またあるチャネルの値がすべて0の場合、画像全体に適用されます。
+
+上記の例では、Rチャネルが`shs 2girls, looking at viewer, smile`、Gチャネルが`bsb 2girls, looking back`に、Bチャネルが`2girls`に対応します。次のようなマスク画像を使用すると、Bチャネルに指定がありませんので、`2girls`は画像全体に適用されます。
+
+![image](https://user-images.githubusercontent.com/52813779/235343061-b4dc9392-3dae-4831-8347-1e9ae5054251.png)
+
+マスク画像は`--mask_path`で指定します。現在は1枚のみ対応しています。指定した画像サイズに自動的にリサイズされ適用されます。
+
+ControlNetと組み合わせることも可能です（細かい位置指定にはControlNetとの組み合わせを推奨します）。
+
+LoRAを指定すると、`--network_weights`で指定した複数のLoRAがそれぞれANDの各部分に対応します。現在の制約として、LoRAの数はANDの部分の数と同じである必要があります。
+
+## CLIP Guided Stable Diffusion
+
+DiffusersのCommunity Examplesの[こちらのcustom pipeline](https://github.com/huggingface/diffusers/blob/main/examples/community/README.md#clip-guided-stable-diffusion)からソースをコピー、変更したものです。
+
+通常のプロンプトによる生成指定に加えて、追加でより大規模のCLIPでプロンプトのテキストの特徴量を取得し、生成中の画像の特徴量がそのテキストの特徴量に近づくよう、生成される画像をコントロールします（私のざっくりとした理解です）。大きめのCLIPを使いますのでVRAM使用量はかなり増加し（VRAM 8GBでは512*512でも厳しいかもしれません）、生成時間も掛かります。
+
+なお選択できるサンプラーはDDIM、PNDM、LMSのみとなります。
+
+`--clip_guidance_scale`オプションにどの程度、CLIPの特徴量を反映するかを数値で指定します。先のサンプルでは100になっていますので、そのあたりから始めて増減すると良いようです。
+
+デフォルトではプロンプトの先頭75トークン（重みづけの特殊文字を除く）がCLIPに渡されます。プロンプトの`--c`オプションで、通常のプロンプトではなく、CLIPに渡すテキストを別に指定できます（たとえばCLIPはDreamBoothのidentifier（識別子）や「1girl」などのモデル特有の単語は認識できないと思われますので、それらを省いたテキストが良いと思われます）。
+
+コマンドラインの例です。
+
+```batchfile
+python gen_img_diffusers.py --ckpt v1-5-pruned-emaonly.ckpt --n_iter 1 ^
+    --scale 2.5 --W 512 --H 512 --batch_size 1 --outdir ../txt2img --steps 36 ^
+    --sampler ddim --fp16 --opt_channels_last --xformers --images_per_prompt 1 ^
+    --interactive --clip_guidance_scale 100
+```
+
+## CLIP Image Guided Stable Diffusion
+
+This feature passes another image, rather than text, to CLIP and steers generation toward its features. Specify the amount to apply with `--clip_image_guidance_scale` and the guide image (file or folder) with `--guide_image_path`.
+
+Command line example:
+
+```batchfile
+python gen_img_diffusers.py --ckpt trinart_characters_it4_v1_vae_merged.ckpt ^
+    --n_iter 1 --scale 7.5 --W 512 --H 512 --batch_size 1 --outdir ../txt2img ^
+    --steps 80 --sampler ddim --fp16 --opt_channels_last --xformers ^
+    --images_per_prompt 1 --interactive --clip_image_guidance_scale 100 ^
+    --guide_image_path YUKA160113420I9A4104_TP_V.jpg
+```
+
+### VGG16 Guided Stable Diffusion
+
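+This feature generates an image that moves closer to a specified image. In addition to the usual prompt-based generation, VGG16 features are extracted and generation is steered so that the image being generated approaches the specified guide image. Use with img2img is recommended (with normal generation the image tends to look blurry). It is an original feature that reuses the mechanism of CLIP Guided Stable Diffusion; the idea is borrowed from style transfer using VGG.
+
+Only the DDIM, PNDM and LMS samplers can be selected.
+
+`--vgg16_guidance_scale` takes a number that specifies how strongly the VGG16 features are reflected. From my trials, starting around 100 and adjusting up or down seems to work well. Specify the guide image (file or folder) with `--guide_image_path`.
+
+To convert multiple images with img2img in one batch and use each source image as its guide image, specify the same value for `--guide_image_path` and `--image_path`.
+
+Command line example: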
+
+```batchfile
+python gen_img_diffusers.py --ckpt wd-v1-3-full-pruned-half.ckpt ^
+    --n_iter 1 --scale 5.5 --steps 60 --outdir ../txt2img ^
+    --xformers --sampler ddim --fp16 --W 512 --H 704 ^
+    --batch_size 1 --images_per_prompt 1 ^
+    --prompt "picturesque, 1girl, solo, anime face, skirt, beautiful face --n lowres, bad anatomy, bad hands, error, missing fingers, cropped, worst quality, low quality, normal quality, jpeg artifacts, blurry, 3d, bad face, monochrome --d 1" ^
+    --strength 0.8 --image_path ..\src_image ^
+    --vgg16_guidance_scale 100 --guide_image_path ..\src_image
+```
+
+`--vgg16_guidance_layer` specifies the number of the VGG16 layer used for feature extraction (the default is 20, the ReLU of conv4-2). Upper layers are said to express style, and lower layers content.
+
+![image](https://user-images.githubusercontent.com/52813779/235343813-3c1f0d7a-4fb3-4274-98e4-b92d76b551df.png)
+
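+# Other Options
+
+- `--no_preview` : Does not show preview images in interactive mode. Specify this when OpenCV is not installed or when you check the output files directly.
+
+- `--n_iter` : Specifies the number of times generation is repeated. The default is 1. Use it when prompts are read from a file and you want to generate several times.
+
+- `--tokenizer_cache_dir` : Specifies the cache directory for the tokenizer. (Work in progress)
+
+- `--seed` : Specifies the random seed. When generating one image it is the seed of that image; when generating multiple images it is the seed of the random numbers used to generate each image's seed (when generating multiple images with `--from_file`, specifying `--seed` gives each image the same seed across repeated runs).
+
+- `--iter_same_seed` : When the prompt does not specify a seed, uses the same seed for everything within one `--n_iter` iteration. Use it to compare multiple prompts given with `--from_file` under a unified seed.
+
+- `--diffusers_xformers` : Uses the xformers support built into Diffusers.
+
+- `--opt_channels_last` : Places tensor channels last during inference. This may speed things up in some cases.
+
+- `--network_show_meta` : Shows the metadata of the additional networks.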
+
+
+--- 
+
+# About Gradual Latent
+
+Gradual Latent is a Hires fix that gradually increases the size of the latent.  `gen_img.py`, `sdxl_gen_img.py`, and `gen_img_diffusers.py` have the following options.
+
+- `--gradual_latent_timesteps`: Specifies the timestep to start increasing the size of the latent. The default is None, which means Gradual Latent is not used. Please try around 750 at first.
+- `--gradual_latent_ratio`: Specifies the initial size of the latent. The default is 0.5, which means it starts with half the default latent size.
+- `--gradual_latent_ratio_step`: Specifies the amount added to the latent size ratio at each increase. The default is 0.125, which means the ratio grows step by step to 0.625, 0.75, 0.875, 1.0.
+- `--gradual_latent_ratio_every_n_steps`: Specifies the interval to increase the size of the latent. The default is 3, which means the latent size is increased every 3 steps.
+
+Each option can also be specified with prompt options, `--glt`, `--glr`, `--gls`, `--gle`.
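
As a worked example of the defaults listed above, the sequence of latent size ratios can be computed as follows. This is a minimal sketch of the arithmetic only; `gradual_latent_ratios` is a hypothetical helper, not a function from the scripts.

```python
def gradual_latent_ratios(ratio=0.5, ratio_step=0.125):
    """Return the successive latent size ratios, ending at full size (1.0)."""
    ratios = [ratio]
    while ratios[-1] < 1.0:
        # Each increase adds ratio_step, capped at the full latent size.
        ratios.append(min(1.0, ratios[-1] + ratio_step))
    return ratios


print(gradual_latent_ratios())  # [0.5, 0.625, 0.75, 0.875, 1.0]
```

With `--gradual_latent_ratio_every_n_steps 3`, each of these increases would happen three sampling steps after the previous one.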
+
+__Please specify `euler_a` for the sampler.__ The sampler's source code has been modified for this feature, so it will not work with other samplers.
+
+It is more effective with SD 1.5. It is quite subtle with SDXL.
+
+# Gradual Latent について
+
+latentのサイズを徐々に大きくしていくHires fixです。`gen_img.py` 、`sdxl_gen_img.py` 、`gen_img_diffusers.py` に以下のオプションが追加されています。
+
+- `--gradual_latent_timesteps` : latentのサイズを大きくし始めるタイムステップを指定します。デフォルトは None で、Gradual Latentを使用しません。750 くらいから始めてみてください。
+- `--gradual_latent_ratio` : latentの初期サイズを指定します。デフォルトは 0.5 で、デフォルトの latent サイズの半分のサイズから始めます。
+- `--gradual_latent_ratio_step`: latentのサイズを大きくする割合を指定します。デフォルトは 0.125 で、latentのサイズを 0.625, 0.75, 0.875, 1.0 と徐々に大きくします。
+- `--gradual_latent_ratio_every_n_steps`: latentのサイズを大きくする間隔を指定します。デフォルトは 3 で、3ステップごとに latent のサイズを大きくします。
+
+それぞれのオプションは、プロンプトオプション、`--glt`、`--glr`、`--gls`、`--gle` でも指定できます。
+
+サンプラーに手を加えているため、__サンプラーに `euler_a` を指定してください。__ 他のサンプラーでは動作しません。
+
+SD 1.5 のほうが効果があります。SDXL ではかなり微妙です。
+
--- a/docs/lumina_train_network.md
+++ b/docs/lumina_train_network.md
@@ -0,0 +1,302 @@
+Status: reviewed
+
+# LoRA Training Guide for Lumina Image 2.0 using `lumina_train_network.py` / `lumina_train_network.py` を用いたLumina Image 2.0モデルのLoRA学習ガイド
+
+This document explains how to train LoRA (Low-Rank Adaptation) models for Lumina Image 2.0 using `lumina_train_network.py` in the `sd-scripts` repository.
+
+## 1. Introduction / はじめに
+
+`lumina_train_network.py` trains additional networks such as LoRA for Lumina Image 2.0 models. Lumina Image 2.0 adopts a Next-DiT (Next-generation Diffusion Transformer) architecture, which differs from previous Stable Diffusion models. It uses a single text encoder (Gemma2) and a dedicated AutoEncoder (AE).
+
+This guide assumes you already understand the basics of LoRA training. For common usage and options, see the [`train_network.py` guide](train_network.md). Some parameters are similar to those in [`sd3_train_network.py`](sd3_train_network.md) and [`flux_train_network.py`](flux_train_network.md).
+
+**Prerequisites:**
+
+* The `sd-scripts` repository has been cloned and the Python environment is ready.
+* A training dataset has been prepared. See the [Dataset Configuration Guide](./config_README-en.md).
+* Lumina Image 2.0 model files for training are available.
+
+<details>
+<summary>日本語</summary>
+
+`lumina_train_network.py`は、Lumina Image 2.0モデルに対してLoRAなどの追加ネットワークを学習させるためのスクリプトです。Lumina Image 2.0は、Next-DiT (Next-generation Diffusion Transformer) と呼ばれる新しいアーキテクチャを採用しており、従来のStable Diffusionモデルとは構造が異なります。テキストエンコーダーとしてGemma2を単体で使用し、専用のAutoEncoder (AE) を使用します。
+
+このガイドは、基本的なLoRA学習の手順を理解しているユーザーを対象としています。基本的な使い方や共通のオプションについては、`train_network.py`のガイド（作成中）を参照してください。また一部のパラメータは [`sd3_train_network.py`](sd3_train_network.md) や [`flux_train_network.py`](flux_train_network.md) と同様のものがあるため、そちらも参考にしてください。
+
+**前提条件:**
+
+*   `sd-scripts`リポジトリのクローンとPython環境のセットアップが完了していること。
+*   学習用データセットの準備が完了していること。（データセットの準備については[データセット設定ガイド](./config_README-en.md)を参照してください）
+*   学習対象のLumina Image 2.0モデルファイルが準備できていること。
+</details>
+
+## 2. Differences from `train_network.py` / `train_network.py` との違い
+
+`lumina_train_network.py` is based on `train_network.py` but modified for Lumina Image 2.0. Main differences are:
+
+* **Target models:** Lumina Image 2.0 models.
+* **Model structure:** Uses Next-DiT (Transformer based) instead of U-Net and employs a single text encoder (Gemma2). The AutoEncoder (AE) is not the SDXL or SD3 VAE; according to the model file list in section 3, it is the same AE as FLUX.
+* **Arguments:** Options exist to specify the Lumina Image 2.0 model, Gemma2 text encoder and AE. These components are typically provided as separate `.safetensors` files.
+* **Incompatible arguments:** Stable Diffusion v1/v2 options such as `--v2`, `--v_parameterization` and `--clip_skip` are not used.
+* **Lumina specific options:** Additional parameters for timestep sampling, model prediction type, discrete flow shift, and system prompt.
+
+<details>
+<summary>日本語</summary>
+`lumina_train_network.py`は`train_network.py`をベースに、Lumina Image 2.0モデルに対応するための変更が加えられています。主な違いは以下の通りです。
+
+*   **対象モデル:** Lumina Image 2.0モデルを対象とします。
+*   **モデル構造:** U-Netの代わりにNext-DiT (Transformerベース) を使用します。Text EncoderとしてGemma2を単体で使用し、専用のAutoEncoder (AE) を使用します。
+*   **引数:** Lumina Image 2.0モデル、Gemma2 Text Encoder、AEを指定する引数があります。通常、これらのコンポーネントは個別に提供されます。
+*   **一部引数の非互換性:** Stable Diffusion v1/v2向けの引数（例: `--v2`, `--v_parameterization`, `--clip_skip`）はLumina Image 2.0の学習では使用されません。
+*   **Lumina特有の引数:** タイムステップのサンプリング、モデル予測タイプ、離散フローシフト、システムプロンプトに関する引数が追加されています。
+</details>
+
+## 3. Preparation / 準備
+
+The following files are required before starting training:
+
+1. **Training script:** `lumina_train_network.py`
+2. **Lumina Image 2.0 model file:** `.safetensors` file for the base model.
+3. **Gemma2 text encoder file:** `.safetensors` file for the text encoder.
+4. **AutoEncoder (AE) file:** `.safetensors` file for the AE.
+5. **Dataset definition file (.toml):** Dataset settings in TOML format. See the [Dataset Configuration Guide](./config_README-en.md). This document uses `my_lumina_dataset_config.toml` as an example.
+
+
+**Model Files:**
+* Lumina Image 2.0: `lumina-image-2.safetensors` ([full precision link](https://huggingface.co/rockerBOO/lumina-image-2/blob/main/lumina-image-2.safetensors)) or `lumina_2_model_bf16.safetensors` ([bf16 link](https://huggingface.co/Comfy-Org/Lumina_Image_2.0_Repackaged/blob/main/split_files/diffusion_models/lumina_2_model_bf16.safetensors))
+* Gemma2 2B (fp16): `gemma-2-2b.safetensors` ([link](https://huggingface.co/Comfy-Org/Lumina_Image_2.0_Repackaged/blob/main/split_files/text_encoders/gemma_2_2b_fp16.safetensors))
+* AutoEncoder: `ae.safetensors` ([link](https://huggingface.co/Comfy-Org/Lumina_Image_2.0_Repackaged/blob/main/split_files/vae/ae.safetensors)) (same as FLUX)
+
+
+<details>
+<summary>日本語</summary>
+学習を開始する前に、以下のファイルが必要です。
+
+1.  **学習スクリプト:** `lumina_train_network.py`
+2.  **Lumina Image 2.0モデルファイル:** 学習のベースとなるLumina Image 2.0モデルの`.safetensors`ファイル。
+3.  **Gemma2テキストエンコーダーファイル:** Gemma2テキストエンコーダーの`.safetensors`ファイル。
+4.  **AutoEncoder (AE) ファイル:** AEの`.safetensors`ファイル。
+5.  **データセット定義ファイル (.toml):** 学習データセットの設定を記述したTOML形式のファイル。（詳細は[データセット設定ガイド](./config_README-en.md)を参照してください）。
+    *   例として`my_lumina_dataset_config.toml`を使用します。
+
+**モデルファイル** は英語ドキュメントの通りです。
+
+</details>
+
+## 4. Running the Training / 学習の実行
+
+Execute `lumina_train_network.py` from the terminal to start training. The overall command-line format is the same as `train_network.py`, but Lumina Image 2.0 specific options must be supplied.
+
+Example command:
+
+```bash
+accelerate launch --num_cpu_threads_per_process 1 lumina_train_network.py \
+  --pretrained_model_name_or_path="lumina-image-2.safetensors" \
+  --gemma2="gemma-2-2b.safetensors" \
+  --ae="ae.safetensors" \
+  --dataset_config="my_lumina_dataset_config.toml" \
+  --output_dir="./output" \
+  --output_name="my_lumina_lora" \
+  --save_model_as=safetensors \
+  --network_module=networks.lora_lumina \
+  --network_dim=8 \
+  --network_alpha=8 \
+  --learning_rate=1e-4 \
+  --optimizer_type="AdamW" \
+  --lr_scheduler="constant" \
+  --timestep_sampling="nextdit_shift" \
+  --discrete_flow_shift=6.0 \
+  --model_prediction_type="raw" \
+  --system_prompt="You are an assistant designed to generate high-quality images based on user prompts." \
+  --max_train_epochs=10 \
+  --save_every_n_epochs=1 \
+  --mixed_precision="bf16" \
+  --gradient_checkpointing \
+  --cache_latents \
+  --cache_text_encoder_outputs
+```
+
+*(Write the command on one line, or break lines with `\` in bash or `^` in the Windows command prompt.)*
+
+<details>
+<summary>日本語</summary>
+学習は、ターミナルから`lumina_train_network.py`を実行することで開始します。基本的なコマンドラインの構造は`train_network.py`と同様ですが、Lumina Image 2.0特有の引数を指定する必要があります。
+
+以下に、基本的なコマンドライン実行例を示します。
+
+```bash
+accelerate launch --num_cpu_threads_per_process 1 lumina_train_network.py \
+  --pretrained_model_name_or_path="lumina-image-2.safetensors" \
+  --gemma2="gemma-2-2b.safetensors" \
+  --ae="ae.safetensors" \
+  --dataset_config="my_lumina_dataset_config.toml" \
+  --output_dir="./output" \
+  --output_name="my_lumina_lora" \
+  --save_model_as=safetensors \
+  --network_module=networks.lora_lumina \
+  --network_dim=8 \
+  --network_alpha=8 \
+  --learning_rate=1e-4 \
+  --optimizer_type="AdamW" \
+  --lr_scheduler="constant" \
+  --timestep_sampling="nextdit_shift" \
+  --discrete_flow_shift=6.0 \
+  --model_prediction_type="raw" \
+  --system_prompt="You are an assistant designed to generate high-quality images based on user prompts." \
+  --max_train_epochs=10 \
+  --save_every_n_epochs=1 \
+  --mixed_precision="bf16" \
+  --gradient_checkpointing \
+  --cache_latents \
+  --cache_text_encoder_outputs
+```
+
+※実際には1行で書くか、適切な改行文字（`\` または `^`）を使用してください。
+</details>
+
+### 4.1. Explanation of Key Options / 主要なコマンドライン引数の解説
+
+Besides the arguments explained in the [train_network.py guide](train_network.md), specify the following Lumina Image 2.0 options. For shared options (`--output_dir`, `--output_name`, etc.), see that guide.
+
+#### Model Options / モデル関連
+
+* `--pretrained_model_name_or_path="<path to Lumina model>"` **required** – Path to the Lumina Image 2.0 model.
+* `--gemma2="<path to Gemma2 model>"` **required** – Path to the Gemma2 text encoder `.safetensors` file.
+* `--ae="<path to AE model>"` **required** – Path to the AutoEncoder `.safetensors` file.
+
+#### Lumina Image 2.0 Training Parameters / Lumina Image 2.0 学習パラメータ
+
+* `--gemma2_max_token_length=<integer>` – Max token length for Gemma2. Default is 256.
+* `--timestep_sampling=<choice>` – Timestep sampling method. Options: `sigma`, `uniform`, `sigmoid`, `shift`, `nextdit_shift`. Default `shift`. **Recommended: `nextdit_shift`**
+* `--discrete_flow_shift=<float>` – Discrete flow shift for the Euler Discrete Scheduler. Default `6.0`.
+* `--model_prediction_type=<choice>` – Model prediction processing method. Options: `raw`, `additive`, `sigma_scaled`. Default `raw`. **Recommended: `raw`**
+* `--system_prompt=<string>` – System prompt to prepend to all prompts. Recommended: `"You are an assistant designed to generate high-quality images based on user prompts."` or `"You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts."`
+* `--use_flash_attn` – Use Flash Attention. Requires `pip install flash-attn` (may not be supported in all environments). If installed correctly, it speeds up training. 
+* `--sigmoid_scale=<float>` – Scale factor for sigmoid timestep sampling. Default `1.0`.
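
To give a feel for how `--sigmoid_scale` and `--discrete_flow_shift` interact, here is a minimal sketch. It assumes that the `sigmoid` and `shift` sampling options take the same form as in this repository's FLUX.1 training code; that is an assumption, not something this document confirms, and `nextdit_shift` is not covered. The function names are hypothetical.

```python
import math
import random


def sample_sigmoid_timestep(sigmoid_scale=1.0, rng=random):
    """Draw t in (0, 1) as sigmoid(sigmoid_scale * z) with z ~ N(0, 1)."""
    z = rng.gauss(0.0, 1.0)
    return 1.0 / (1.0 + math.exp(-sigmoid_scale * z))


def apply_flow_shift(t, shift=6.0):
    """Remap t with t' = shift * t / (1 + (shift - 1) * t).

    shift = 1 leaves t unchanged; shift > 1 pushes values toward 1.
    """
    return (shift * t) / (1.0 + (shift - 1.0) * t)


random.seed(0)
t = sample_sigmoid_timestep(sigmoid_scale=1.0)
print(t, apply_flow_shift(t, shift=6.0))
```

Under this form, a midpoint of 0.5 is moved to 6/7 with the default shift of 6.0, so most sampled timesteps end up in the upper part of the range.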
+
+#### Memory and Speed / メモリ・速度関連
+
+* `--blocks_to_swap=<integer>` **[experimental]** – Swap a number of Transformer blocks between CPU and GPU. More blocks reduce VRAM but slow training. Cannot be used with `--cpu_offload_checkpointing`.
+* `--cache_text_encoder_outputs` – Cache Gemma2 outputs to reduce memory usage.
+* `--cache_latents`, `--cache_latents_to_disk` – Cache AE outputs.
+* `--fp8_base` – Use FP8 precision for the base model.
+
+#### Network Arguments / ネットワーク引数
+
+For Lumina Image 2.0, you can specify different dimensions for various components:
+
+* `--network_args` can include:
+  * `"attn_dim=4"` – Attention dimension
+  * `"mlp_dim=4"` – MLP dimension  
+  * `"mod_dim=4"` – Modulation dimension
+  * `"refiner_dim=4"` – Refiner blocks dimension
+  * `"embedder_dims=[4,4,4]"` – Embedder dimensions for x, t, and caption embedders
+
+#### Incompatible or Deprecated Options / 非互換・非推奨の引数
+
+* `--v2`, `--v_parameterization`, `--clip_skip` – Options for Stable Diffusion v1/v2 that are not used for Lumina Image 2.0.
+
+<details>
+<summary>日本語</summary>
+[`train_network.py`のガイド](train_network.md)で説明されている引数に加え、以下のLumina Image 2.0特有の引数を指定します。共通の引数については、上記ガイドを参照してください。
+
+#### モデル関連
+
+*   `--pretrained_model_name_or_path="<path to Lumina model>"` **[必須]**
+    *   学習のベースとなるLumina Image 2.0モデルの`.safetensors`ファイルのパスを指定します。
+*   `--gemma2="<path to Gemma2 model>"` **[必須]**
+    *   Gemma2テキストエンコーダーの`.safetensors`ファイルのパスを指定します。
+*   `--ae="<path to AE model>"` **[必須]**
+    *   AutoEncoderの`.safetensors`ファイルのパスを指定します。
+
+#### Lumina Image 2.0 学習パラメータ
+
+*   `--gemma2_max_token_length=<integer>` – Gemma2で使用するトークンの最大長を指定します。デフォルトは256です。
+*   `--timestep_sampling=<choice>` – タイムステップのサンプリング方法を指定します。`sigma`, `uniform`, `sigmoid`, `shift`, `nextdit_shift`から選択します。デフォルトは`shift`です。**推奨: `nextdit_shift`**
+*   `--discrete_flow_shift=<float>` – Euler Discrete Schedulerの離散フローシフトを指定します。デフォルトは`6.0`です。
+*   `--model_prediction_type=<choice>` – モデル予測の処理方法を指定します。`raw`, `additive`, `sigma_scaled`から選択します。デフォルトは`raw`です。**推奨: `raw`**
+*   `--system_prompt=<string>` – 全てのプロンプトに前置するシステムプロンプトを指定します。推奨: `"You are an assistant designed to generate high-quality images based on user prompts."` または `"You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts."`
+*   `--use_flash_attn` – Flash Attentionを使用します。`pip install flash-attn`でインストールが必要です（環境によってはサポートされていません）。正しくインストールされている場合は、指定すると学習が高速化されます。
+*   `--sigmoid_scale=<float>` – sigmoidタイムステップサンプリングのスケール係数を指定します。デフォルトは`1.0`です。
+
+#### メモリ・速度関連
+
+*   `--blocks_to_swap=<integer>` **[実験的機能]** – TransformerブロックをCPUとGPUでスワップしてVRAMを節約します。`--cpu_offload_checkpointing`とは併用できません。
+*   `--cache_text_encoder_outputs` – Gemma2の出力をキャッシュしてメモリ使用量を削減します。
+*   `--cache_latents`, `--cache_latents_to_disk` – AEの出力をキャッシュします。
+*   `--fp8_base` – ベースモデルにFP8精度を使用します。
+
+#### ネットワーク引数
+
+Lumina Image 2.0では、各コンポーネントに対して異なる次元を指定できます：
+
+*   `--network_args` には以下を含めることができます：
+    *   `"attn_dim=4"` – アテンション次元
+    *   `"mlp_dim=4"` – MLP次元
+    *   `"mod_dim=4"` – モジュレーション次元
+    *   `"refiner_dim=4"` – リファイナーブロック次元
+    *   `"embedder_dims=[4,4,4]"` – x、t、キャプションエンベッダーのエンベッダー次元
+
+#### 非互換・非推奨の引数
+
+*   `--v2`, `--v_parameterization`, `--clip_skip` – Stable Diffusion v1/v2向けの引数のため、Lumina Image 2.0学習では使用されません。
+</details>
+
+### 4.2. Starting Training / 学習の開始
+
+After setting the required arguments, run the command to begin training. The overall flow and how to check logs are the same as in the [train_network.py guide](train_network.md#32-starting-the-training--学習の開始).
+
+## 5. Using the Trained Model / 学習済みモデルの利用
+
+When training finishes, a LoRA model file (e.g. `my_lumina_lora.safetensors`) is saved in the directory specified by `output_dir`. Use this file with inference environments that support Lumina Image 2.0, such as ComfyUI with appropriate nodes.
+
+## 6. Others / その他
+
+`lumina_train_network.py` shares many features with `train_network.py`, such as sample image generation (`--sample_prompts`, etc.) and detailed optimizer settings. For these, see the [train_network.py guide](train_network.md#5-other-features--その他の機能) or run `python lumina_train_network.py --help`.
+
+### 6.1. Recommended Settings / 推奨設定
+
+Based on the contributor's recommendations, here are the suggested settings for optimal training:
+
+**Key Parameters:**
+* `--timestep_sampling="nextdit_shift"`
+* `--discrete_flow_shift=6.0`
+* `--model_prediction_type="raw"`
+* `--mixed_precision="bf16"`
+
+**System Prompts:**
+* General purpose: `"You are an assistant designed to generate high-quality images based on user prompts."`
+* High image-text alignment: `"You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts."`
+
+**Sample Prompts:**
+Sample prompts can include CFG truncate (`--ctr`) and Renorm CFG (`--rcfg`) parameters:
+* `--ctr 0.25 --rcfg 1.0` (default values)
+
+<details>
+<summary>日本語</summary>
+
+必要な引数を設定し、コマンドを実行すると学習が開始されます。基本的な流れやログの確認方法は[`train_network.py`のガイド](train_network.md#32-starting-the-training--学習の開始)と同様です。
+
+学習が完了すると、指定した`output_dir`にLoRAモデルファイル（例: `my_lumina_lora.safetensors`）が保存されます。このファイルは、Lumina Image 2.0モデルに対応した推論環境（例: ComfyUI + 適切なノード）で使用できます。
+
+`lumina_train_network.py`には、サンプル画像の生成 (`--sample_prompts`など) や詳細なオプティマイザ設定など、`train_network.py`と共通の機能も多く存在します。これらについては、[`train_network.py`のガイド](train_network.md#5-other-features--その他の機能)やスクリプトのヘルプ (`python lumina_train_network.py --help`) を参照してください。
+
+### 6.1. 推奨設定
+
+コントリビューターの推奨に基づく、最適な学習のための推奨設定：
+
+**主要パラメータ:**
+* `--timestep_sampling="nextdit_shift"`
+* `--discrete_flow_shift=6.0`
+* `--model_prediction_type="raw"`
+* `--mixed_precision="bf16"`
+
+**システムプロンプト:**
+* 汎用目的: `"You are an assistant designed to generate high-quality images based on user prompts."`
+* 高い画像-テキスト整合性: `"You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts."`
+
+**サンプルプロンプト:**
+サンプルプロンプトには CFG truncate (`--ctr`) と Renorm CFG (`--rcfg`) パラメータを含めることができます：
+* `--ctr 0.25 --rcfg 1.0` (デフォルト値)
+
+</details>
--- a/docs/masked_loss_README-ja.md
+++ b/docs/masked_loss_README-ja.md
@@ -0,0 +1,57 @@
+## マスクロスについて
+
+マスクロスは、入力画像のマスクで指定された部分だけ損失計算することで、画像の一部分だけを学習することができる機能です。
+たとえばキャラクタを学習したい場合、キャラクタ部分だけをマスクして学習することで、背景を無視して学習することができます。
+
+マスクロスのマスクには、二種類の指定方法があります。
+
+- マスク画像を用いる方法
+- 透明度（アルファチャネル）を使用する方法
+
+なお、サンプルは [ずんずんPJイラスト/3Dデータ](https://zunko.jp/con_illust.html) の「AI画像モデル用学習データ」を使用しています。
+
+### マスク画像を用いる方法
+
+学習画像それぞれに対応するマスク画像を用意する方法です。学習画像と同じファイル名のマスク画像を用意し、それを学習画像と別のディレクトリに保存します。
+
+- 学習画像
+  ![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/607c5116-5f62-47de-8b66-9c4a597f0441)
+- マスク画像
+  ![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/53e9b0f8-a4bf-49ed-882d-4026f84e8450)
+
+```toml
+[[datasets.subsets]]
+image_dir = "/path/to/a_zundamon"
+caption_extension = ".txt"
+conditioning_data_dir = "/path/to/a_zundamon_mask"
+num_repeats = 8
+```
+
+マスク画像は、学習画像と同じサイズで、学習する部分を白、無視する部分を黒で描画します。グレースケールにも対応しています（127 ならロス重みが 0.5 になります）。なお、正確にはマスク画像の R チャネルが用いられます。
+
+DreamBooth 方式の dataset で、`conditioning_data_dir` で指定したディレクトリにマスク画像を保存してください。ControlNet のデータセットと同じですので、詳細は [ControlNet-LLLite](train_lllite_README-ja.md#データセットの準備) を参照してください。
+
+### 透明度（アルファチャネル）を使用する方法
+
+学習画像の透明度（アルファチャネル）がマスクとして使用されます。透明度が 0 の部分は無視され、255 の部分は学習されます。半透明の場合は、その透明度に応じてロス重みが変化します（127 ならおおむね 0.5）。
+
+![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/0baa129b-446a-4aac-b98c-7208efb0e75e)
+
+※それぞれの画像は透過PNG
+
+学習時のスクリプトのオプションに `--alpha_mask` を指定するか、dataset の設定ファイルの subset で、`alpha_mask` を指定してください。たとえば、以下のようになります。
+
+```toml
+[[datasets.subsets]]
+image_dir = "/path/to/image/dir"
+caption_extension = ".txt"
+num_repeats = 8
+alpha_mask = true
+```
+
+## 学習時の注意事項
+
+- 現時点では DreamBooth 方式の dataset のみ対応しています。
+- マスクは latents のサイズ、つまり 1/8 に縮小されてから適用されます。そのため、細かい部分（たとえばアホ毛やイヤリングなど）はうまく学習できない可能性があります。マスクをわずかに拡張するなどの工夫が必要かもしれません。
+- マスクロスを用いる場合、学習対象外の部分をキャプションに含める必要はないかもしれません。（要検証）
+- `alpha_mask` の場合、マスクの有無を切り替えると latents キャッシュが自動的に再生成されます。
--- a/docs/masked_loss_README.md
+++ b/docs/masked_loss_README.md
@@ -0,0 +1,56 @@
+## Masked Loss
+
+Masked loss is a feature that allows you to train only part of an image by calculating the loss only for the part specified by the mask of the input image. For example, if you want to train a character, you can train only the character part by masking it, ignoring the background.
+
+There are two ways to specify the mask for masked loss.
+
+- Using a mask image
+- Using transparency (alpha channel) of the image
+
+The sample uses the "AI image model training data" from [ZunZunPJ Illustration/3D Data](https://zunko.jp/con_illust.html).
+
+### Using a mask image
+
+This is a method of preparing a mask image corresponding to each training image. Prepare a mask image with the same file name as the training image and save it in a different directory from the training image.
+
+- Training image
+  ![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/607c5116-5f62-47de-8b66-9c4a597f0441)
+- Mask image
+  ![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/53e9b0f8-a4bf-49ed-882d-4026f84e8450)
+
+```toml
+[[datasets.subsets]]
+image_dir = "/path/to/a_zundamon"
+caption_extension = ".txt"
+conditioning_data_dir = "/path/to/a_zundamon_mask"
+num_repeats = 8
+```
+
+The mask image has the same size as the training image, with the parts to train drawn in white and the parts to ignore in black. Grayscale is also supported (127 gives a loss weight of about 0.5). Strictly speaking, the R channel of the mask image is used.
+
+Use a DreamBooth-style dataset and save the mask images in the directory specified by `conditioning_data_dir`. The layout is the same as for a ControlNet dataset, so see [ControlNet-LLLite](train_lllite_README.md#Preparing-the-dataset) for details.
+
+### Using transparency (alpha channel) of the image
+
+The transparency (alpha channel) of the training image is used as a mask. The part with transparency 0 is ignored, the part with transparency 255 is trained. For semi-transparent parts, the loss weight changes according to the transparency (127 gives a weight of about 0.5).
+
+![image](https://github.com/kohya-ss/sd-scripts/assets/52813779/0baa129b-446a-4aac-b98c-7208efb0e75e)
+
+Note: each image is a transparent PNG.
+
+Specify `--alpha_mask` in the training script options, or set `alpha_mask` in the subset of the dataset configuration file. For example:
+
+```toml
+[[datasets.subsets]]
+image_dir = "/path/to/image/dir"
+caption_extension = ".txt"
+num_repeats = 8
+alpha_mask = true
+```
+
+## Notes on training
+
+- At the moment, only DreamBooth-style datasets are supported.
+- The mask is reduced to the size of the latents (1/8) before it is applied. Fine details (such as ahoge or earrings) may therefore not be learned well; slightly dilating the mask may help.
+- When using masked loss, it may not be necessary to include the untrained parts in the caption. (To be verified)
+- With `alpha_mask`, the latents cache is regenerated automatically when the mask is switched on or off.
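
The idea of weighting the loss by a mask reduced to latent size can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: `mask_to_weights` and `masked_mean_loss` are hypothetical names, and the use of average pooling for the 1/8 reduction is an assumption.

```python
def mask_to_weights(mask, factor=8):
    """Average-pool a 2D mask of 0..255 values by `factor`; scale to 0..1."""
    height, width = len(mask), len(mask[0])
    weights = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [
                mask[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            ]
            row.append(sum(block) / (255.0 * len(block)))
        weights.append(row)
    return weights


def masked_mean_loss(per_element_loss, weights):
    """Multiply each loss element by its weight, then average over all elements."""
    total = 0.0
    count = 0
    for loss_row, weight_row in zip(per_element_loss, weights):
        for loss, weight in zip(loss_row, weight_row):
            total += loss * weight
            count += 1
    return total / count


# 16x16 mask: left half white (train), right half black (ignore).
mask = [[255] * 8 + [0] * 8 for _ in range(16)]
weights = mask_to_weights(mask)
print(weights)  # [[1.0, 0.0], [1.0, 0.0]]
print(masked_mean_loss([[2.0, 2.0], [2.0, 2.0]], weights))  # 1.0
```

Because each 8x8 block collapses to a single weight, a feature only a few pixels wide contributes very little, which is why dilating the mask slightly can help.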
--- a/docs/train_README-ja.md
+++ b/docs/train_README-ja.md
--- a/docs/train_README-zh.md
+++ b/docs/train_README-zh.md
@@ -0,0 +1,912 @@
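+__This documentation is being updated, so descriptions may contain errors.__
+
+# About training: common guide
+This repository supports model fine tuning, DreamBooth, LoRA training and Textual Inversion (including [XTI:P+](https://github.com/kohya-ss/sd-scripts/pull/327)).
+This document explains the training data preparation and the options that they share.
+
+# Overview
+
+Please set up the environment beforehand by following this repository's README.
+
+
+This document covers the following.
+
+1. Preparing training data (the new format that uses a configuration file)
+1. A brief explanation of terms used in training
+1. The previous specification format (specifying from the command line, without a configuration file)
+1. Generating sample images during training
+1. Options common to the scripts
+1. Preparing metadata for the fine tuning method: captioning, tagging, etc.
+
+
+Reading item 1 alone is enough to start training (see each script's documentation for the training itself). Refer to item 2 onward as needed.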
+
+
+
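+# Preparing training data
+
+Prepare the training image files in any folder (multiple folders are fine). `.png`, `.jpg`, `.jpeg`, `.webp` and `.bmp` files are supported. Preprocessing such as resizing is generally not required.
+
+However, avoid images that are extremely small; if an image is smaller than the training resolution (described later), it is recommended to upscale it beforehand with a super-resolution AI or similar. Very large images (roughly 3000 x 3000 pixels or more) may cause errors, so shrink them beforehand.
+
+For training, you need to organize the image data to train on and specify it to the script. Training data can be specified in several ways depending on the amount of data, the training target, whether captions (image descriptions) are available, and so on. The following methods are available (the names are not general terms but definitions specific to this repository). Regularization images are explained later.
+
+1. DreamBooth, class + identifier method (regularization images can be used)
+
+    Trains by associating the training target with a specific word (identifier). No captions need to be prepared. This is convenient when, for example, learning a specific character, but because every element of the training data (hairstyle, clothing, background, etc.) becomes tied to the identifier, you may be unable to change the clothing at generation time.
+
+2. DreamBooth, caption method (regularization images can be used)
+
+    Prepare a text file containing a caption for each image and train with it. For example, recording image details in the captions (character A in white clothes, character A in red clothes, and so on) separates the character from the other elements, so the model can be expected to learn the character more precisely.
+
+3. fine tuning method (regularization images cannot be used)
+
+    Captions are collected into a metadata file beforehand. Features such as managing tags and captions separately and caching latents in advance to speed up training are supported (these are described in separate documents). (Although it is called the fine tuning method, it can be used for more than fine tuning.)
+
+The combinations of training target and available specification method are as follows.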
+
+| Training target or method | Script | DB / class+identifier | DB / caption | fine tuning |
+|----------------| ----- | ----- | ----- | ----- |
+| Fine tuning the model | `fine_tune.py`| x | x | o |
+| DreamBooth training of the model | `train_db.py`| o | o | x |
+| LoRA           | `train_network.py`| o | o | o |
+| Textual Inversion | `train_textual_inversion.py`| o | o | o |
+
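+## Which one to choose
+
+For LoRA and Textual Inversion, the DreamBooth class+identifier method is a good choice if you want to train without preparing caption files; if you can prepare them, the DreamBooth caption method is better. If you have a large amount of training data and do not use regularization images, also consider the fine tuning method.
+
+The same applies to DreamBooth, except that the fine tuning method cannot be used. For fine tuning, only the fine tuning method is available.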
+
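+# How to specify each method
+
+Only the typical pattern for each method is described here. For more detailed specification, see [Dataset Configuration](./config_README-ja.md).
+
+# DreamBooth, class+identifier method (regularization images can be used)
+
+In this method, each image is trained as if it had the caption `class identifier` (for example `shs dog`).
+
+## Step 1. Decide the identifier and class
+
+Decide the word (identifier) to associate with the training target and the class the target belongs to.
+
+(Various names are in use, but for now this follows the original paper.)
+
+A brief explanation follows (look up the details yourself).
+
+The class is the general category of the training target. For example, to learn a specific breed of dog, the class is dog. For anime characters, depending on the model, it may be boy or girl, or 1boy or 1girl.
+
+The identifier is a word for identifying and learning the training target. Any word will do, but according to the original paper, a rare word of three characters or fewer that the tokenizer turns into a single token works best.
+
+By using the identifier and class, for example training the model with `shs dog`, the target can be distinguished from the class and learned.
+
+At image generation time, specifying `shs dog` generates images of the learned dog breed.
+
+(For reference, identifiers I have used recently include `shs sts scs cpc coc cic msm usu ici lvl cic dii muk ori hru rik koo yos wny`. Words that are not contained in Danbooru tags are even better.)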
+
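+## Step 2. Decide whether to use regularization images, and generate them if so
+
+Regularization images are images used to prevent the whole class from being pulled toward the training target (language drift). Without regularization images, if you train a specific character with `shs 1girl`, for example, even a plain `1girl` prompt will generate images that increasingly resemble that character. This is because `1girl` is included in the training captions.
+
+By training on the target images and regularization images together, the class stays as a class, and the training target is generated only when the identifier is added to the prompt.
+
+If you only need a specific character to come out with LoRA or DreamBooth, regularization images are not necessary.
+
+They are also unnecessary for Textual Inversion (nothing is learned when the token string being trained is not included in the caption).
+
+A common approach is to use, as regularization images, images generated by the model being trained with only the class name as the prompt (for example `1girl`). If the quality of the generated images is poor, you can refine the prompt or use images downloaded separately from the internet.
+
+(Regularization images are trained on as well, so their quality affects the model.)
+
+In general, preparing several hundred images is desirable (with too few images the class images are not generalized, and their particular features end up being learned).
+
+When using generated images, their size should normally match the training resolution (more precisely, the bucket resolution, described later).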
+
+
+
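+## Step 3. Write the configuration file
+
+Create a text file and give it the `.toml` extension. For example, write it as follows.
+
+(Parts starting with `#` are comments, so you can copy and paste them as they are or delete them.)
+
+```toml
+[general]
+enable_bucket = true                        # Whether to use Aspect Ratio Bucketing
+
+[[datasets]]
+resolution = 512                            # Training resolution
+batch_size = 4                              # Batch size
+
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'                     # Folder containing the training images
+  class_tokens = 'hoge girl'                # identifier and class
+  num_repeats = 10                          # Number of repeats for the training images
+
+  # Write the following only when using regularization images; delete it otherwise
+  [[datasets.subsets]]
+  is_reg = true
+  image_dir = 'C:\reg'                      # Folder containing the regularization images
+  class_tokens = 'girl'                     # class
+  num_repeats = 1                           # Number of repeats for the regularization images; 1 is usually fine
+```
+
+Basically, you can train by changing only the following items.
+
+1. Training resolution
+
+    A single number means a square (`512` is 512x512); two numbers in brackets separated by a comma mean width x height (`[512,768]` is 512x768). For the SD1.x series the original training resolution is 512. Specifying a larger resolution such as `[512,768]` may reduce failures when generating portrait or landscape images. For the SD2.x 768 series it is `768`.
+
+1. Batch size
+
+    Specifies how many items are trained at once. It depends on the GPU's VRAM size and the training resolution; details are given later. It also differs between fine tuning, DreamBooth, LoRA and so on, so see each script's documentation.
+
+1. Folders
+
+    Specify the folders for the training images and, only if used, the regularization images. Specify the folder that directly contains the image data.
+
+1. identifier and class
+
+    As in the example above.
+
+1. Number of repeats
+
+    Explained below.
+
+### About the number of repeats
+
+The number of repeats is used to balance the number of regularization images against the number of training images. There are usually more regularization images than training images, so the training images are repeated to match the counts and train at a 1:1 ratio.
+
+Set the number of repeats so that " __training image repeats × number of training images ≥ regularization image repeats × number of regularization images__ ".
+
+(The amount of data in one epoch, i.e. one pass over the data, is "training image repeats × number of training images". If there are more regularization images than that, the surplus regularization images are not used.)
+
+## Step 4. Training
+
+Train by following the documentation for each script.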
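
The repeat-count condition from the section on the number of repeats can be turned into a small calculation. This is a sketch of the arithmetic only; `min_train_repeats` is a hypothetical helper, and the image counts are made-up examples.

```python
def min_train_repeats(num_train, num_reg, reg_repeats=1):
    """Smallest r such that r * num_train >= reg_repeats * num_reg."""
    needed = reg_repeats * num_reg
    # Integer ceiling division avoids floating-point rounding.
    return max(1, -(-needed // num_train))


print(min_train_repeats(num_train=20, num_reg=200))  # 10
```

With 20 training images and 200 regularization images, a `num_repeats` of 10 or more for the training subset satisfies the condition, so no regularization image goes unused.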
+
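+# DreamBooth, caption method (regularization images can be used)
+
+In this method, each image is trained with its caption.
+
+## Step 1. Prepare the caption files
+
+Place a file with the same name as each image and the extension `.caption` (changeable in the configuration) in the folder containing the training images. Each file should contain a single line. The encoding is `UTF-8`.
+
+## Step 2. Decide whether to use regularization images, and generate them if so
+
+Same as for the class+identifier method. Captions can also be attached to regularization images, but this is usually unnecessary.
+
+## Step 3. Write the configuration file
+
+Create a text file and give it the `.toml` extension. For example, write it as follows.
+
+```toml
+[general]
+enable_bucket = true                        # Whether to use Aspect Ratio Bucketing
+
+[[datasets]]
+resolution = 512                            # Training resolution
+batch_size = 4                              # Batch size
+
+  [[datasets.subsets]]
+  image_dir = 'C:\hoge'                     # Folder containing the training images
+  caption_extension = '.caption'            # Caption file extension; change this if you use .txt
+  num_repeats = 10                          # Number of repeats for the training images
+
+  # Write the following only when using regularization images; delete it otherwise
+  [[datasets.subsets]]
+  is_reg = true
+  image_dir = 'C:\reg'                      # Folder containing the regularization images
+  class_tokens = 'girl'                     # class
+  num_repeats = 1                           # Number of repeats for the regularization images; 1 is usually fine
+```
+
+Basically, you can train by changing only the following items. Unless noted otherwise, they are the same as for the class+identifier method.
+
+1. Training resolution
+2. Batch size
+3. Folders
+4. Caption file extension
+
+    Any extension can be specified.
+5. Number of repeats
+
+## Step 4. Training
+
+Train by following the documentation for each script.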
+
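+# fine tuning method
+
+## Step 1. Prepare the metadata
+
+The management file that gathers captions and tags is called metadata. It is in JSON format with the `.json` extension. Creating it takes some explanation, so it is described at the end of this document.
+
+## Step 2. Write the configuration file
+
+Create a text file and give it the `.toml` extension. For example, write it as follows.
+```toml
+[general]
+shuffle_caption = true
+keep_tokens = 1
+
+[[datasets]]
+resolution = 512                                    # Training resolution
+batch_size = 4                                      # Batch size
+
+  [[datasets.subsets]]
+  image_dir = 'C:\piyo'                             # Folder containing the training images
+  metadata_file = 'C:\piyo\piyo_md.json'            # Metadata file name
+```
+
+Basically, you can train by changing only the following items. Unless noted otherwise, they are the same as for the DreamBooth class+identifier method.
+
+1. Training resolution
+2. Batch size
+3. Folders
+4. Metadata file name
+
+    Specify the metadata file created by the method described later.
+
+
+## Step 3. Training
+
+Train by following the documentation for each script.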
+
+# 训练中使用的术语简单解释
+
+由于省略了细节并且我自己也没有完全理解，因此请自行查阅详细信息。
+
+## 微调（fine tuning）
+
+指训练模型并微调其性能。具体含义因用法而异，但在 Stable Diffusion 中，狭义的微调是指使用图像和caption进行训练模型。DreamBooth 可视为狭义微调的一种特殊方法。广义的微调包括 LoRA、Textual Inversion、Hypernetworks 等，包括训练模型的所有内容。
+
+## 步骤（step）
+
+粗略地说，每次在训练数据上进行一次计算即为一步。具体来说，“将训练数据的caption传递给当前模型，将生成的图像与训练数据的图像进行比较，稍微更改模型，以使其更接近训练数据”即为一步。
+
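+## Batch size
+
+The batch size specifies how many data items are processed in each step. Processing data in batches is faster, and in general a larger batch size also tends to give better accuracy.
+
+"Batch size × number of steps" is the number of data items used for training. So when you increase the batch size, it is advisable to reduce the number of steps accordingly.
+
+(However, "batch size 1 with 1600 steps" and "batch size 4 with 400 steps" do not give the same result. With the same learning rate, the latter usually undertrains. Try raising the learning rate a little (for example `2e-6`) or setting the number of steps to, say, 500.)
+
+A larger batch size consumes more GPU memory. Running out of memory causes an error, and running close to the limit slows training down. Check memory usage in Task Manager or with the `nvidia-smi` command while adjusting.
+
+Note that "batch" here simply means "a chunk of data".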
+
+## 学习率
+
+ 学习率指的是每个步骤中改变的程度。如果指定一个大的值，学习速度就会加快，但是可能会出现变化太大导致模型崩溃或无法达到最佳状态的情况。如果指定一个小的值，学习速度会变慢，同时可能无法达到最佳状态。
+
+在fine tuning、DreamBooth、LoRA等过程中，学习率会有很大的差异，并且也会受到训练数据、所需训练的模型、批次大小和步骤数等因素的影响。建议从通常值开始，观察训练状态并逐渐调整。
+
+默认情况下，整个训练过程中学习率是固定的。但是可以通过调度程序指定学习率如何变化，因此结果也会有所不同。
+
+## Epoch
+
+Epoch指的是训练数据被完整训练一遍（即数据已经迭代一轮）。如果指定了重复次数，则在重复后的数据迭代一轮后，为1个epoch。
+
+1个epoch的步骤数通常为“数据量÷批次大小”，但如果使用Aspect Ratio Bucketing，则略微增加（由于不同bucket的数据不能在同一个批次中，因此步骤数会增加）。
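The effect of bucketing on the step count described above can be estimated with a short calculation. This is a minimal sketch assuming each bucket is batched on its own and the last, partially filled batch of each bucket still counts as one step; `steps_per_epoch` and the bucket counts are hypothetical and not taken from the scripts.

```python
def steps_per_epoch(bucket_counts: dict, batch_size: int) -> int:
    """Steps in one epoch when images from different buckets never share a batch.

    bucket_counts maps a bucket resolution to the number of images in it
    (after repeats have been applied).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Ceiling division per bucket: a partially filled batch is still one step.
    return sum(-(-count // batch_size) for count in bucket_counts.values())


if __name__ == "__main__":
    single = {(512, 512): 100}
    bucketed = {(512, 512): 50, (384, 640): 25, (640, 384): 25}
    print(steps_per_epoch(single, 4), steps_per_epoch(bucketed, 4))
```

The same 100 images need slightly more steps once they are spread over three buckets, because each bucket ends with its own partial batch.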
+
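+## Aspect Ratio Bucketing
+
+Stable Diffusion v1 was trained at 512\*512, but with bucketing the model is additionally trained at other resolutions such as 256\*1024 and 384\*640. This reduces how much of each image is cropped away, so the relationship between images and captions should be learned more accurately.
+
+In addition, because training can run at arbitrary resolutions, you no longer need to unify the aspect ratios of the image data beforehand.
+
+Bucketing is switched on and off in the settings; it is enabled (set to `true`) in the configuration file examples earlier in this document.
+
+Training resolutions are created by adjusting width and height in 64-pixel increments (the default, which can be changed) within a range that does not exceed the area (= memory usage) of the resolution given as a parameter.
+
+In machine learning, input sizes are usually all unified, but it is actually enough for them to be uniform within a batch. Bucketing, as described by NovelAI, means classifying the training data in advance into training resolutions according to aspect ratio, and building each batch from images within a single bucket so that image sizes within the batch match.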
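The bucket construction described above can be sketched as follows. This is an illustrative approximation under stated assumptions (area limit taken from the training resolution, 64-pixel steps, bounds of 256 and 1024), not the scripts' actual implementation; `make_bucket_resolutions` and `assign_bucket` are hypothetical names.

```python
def make_bucket_resolutions(max_reso=(512, 512), min_size=256, max_size=1024, step=64):
    """Enumerate (width, height) pairs in `step` increments whose area
    does not exceed the area of `max_reso`."""
    max_area = max_reso[0] * max_reso[1]
    resolutions = set()
    width = min_size
    while width <= max_size:
        # Largest height (a multiple of `step`) that keeps the area within the limit.
        height = min(max_size, (max_area // width) // step * step)
        if height >= min_size:
            resolutions.add((width, height))
            resolutions.add((height, width))  # the transposed bucket has the same area
        width += step
    return sorted(resolutions)


def assign_bucket(image_size, resolutions):
    """Pick the bucket whose aspect ratio is closest to that of the image."""
    ratio = image_size[0] / image_size[1]
    return min(resolutions, key=lambda r: abs(r[0] / r[1] - ratio))


if __name__ == "__main__":
    buckets = make_bucket_resolutions()
    print(buckets)
    print(assign_bucket((600, 1000), buckets))
```

Every image in a batch is then resized and cropped to its bucket's resolution, so sizes match within the batch while differing across batches.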
+
+# 以前的指定格式（不使用 .toml 文件，而是使用命令行选项指定）
+
+这是一种通过命令行选项而不是指定 .toml 文件的方法。有 DreamBooth 类+标识符方法、DreamBooth caption方法、微调方法三种方式。
+
+## DreamBooth、类+标识符方式
+
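+The number of repeats is specified through the folder name. The `train_data_dir` and `reg_data_dir` options are also used.
+
+### Step 1. Prepare the training images
+
+Create a folder to store the training images. __Inside it__, create a directory named as follows.
+
+```
+<number of repeats>_<identifier> <class>
+```
+
+Do not forget the underscore ``_`` in between.
+
+For example, to repeat the data 20 times with the prompt "sls frog", the name is "20_sls frog". It looks like this: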
+
+![image](https://user-images.githubusercontent.com/52813779/210770636-1c851377-5936-4c15-90b7-8ac8ad6c2074.png)
+
+### 多个类别、多个标识符的训练
+
+该方法很简单，在用于训练的图像文件夹中，需要准备多个文件夹，每个文件夹都是以“重复次数_<标识符> <类别>”命名的，同样，在正则化图像文件夹中，也需要准备多个文件夹，每个文件夹都是以“重复次数_<类别>”命名的。
+
+例如，如果要同时训练“sls青蛙”和“cpc兔子”，则应按以下方式准备文件夹。
+
+![image](https://user-images.githubusercontent.com/52813779/210777933-a22229db-b219-4cd8-83ca-e87320fc4192.png)
+
+如果一个类别包含多个对象，可以只使用一个正则化图像文件夹。例如，如果在1girl类别中有角色A和角色B，则可以按照以下方式处理：
+
+- train_girls
+  - 10_sls 1girl
+  - 10_cpc 1girl
+- reg_girls
+  - 1_1girl
+
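+### Step 2. Prepare the regularization images
+
+This step applies when regularization images are used.
+
+Create a folder to store the regularization images. __Inside it__, create a directory named ``<number of repeats>_<class>``.
+
+For example, with the prompt "frog" and no repetition (a single pass over the data):
+![image](https://user-images.githubusercontent.com/52813779/210770897-329758e5-3675-49f1-b345-c135f1725832.png)
+
+
+### Step 3. Run training
+
+Run each training script. Use the `--train_data_dir` option to specify the parent folder of the training data folders (not the folder that directly contains the images), and the `--reg_data_dir` option to specify the parent folder of the regularization image folders (again, not the folder that directly contains the images).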
+
+## DreamBooth，带文本说明（caption）的方式
+
+在包含训练图像和正则化图像的文件夹中，将与图像具有相同文件名的文件.caption（可以使用选项进行更改）放置在该文件夹中，然后从该文件中加载caption所作为提示进行训练。
+
+※文件夹名称（标识符类）不再用于这些图像的训练。
+
+默认的caption文件扩展名为.caption。可以使用训练脚本的 `--caption_extension` 选项进行更改。 使用 `--shuffle_caption` 选项，同时对每个逗号分隔的部分进行训练时会对训练时的caption进行混洗。
+
+## 微调方式
+
+创建元数据的方式与使用配置文件相同。 使用 `in_json` 选项指定元数据文件。
+
+# 训练过程中的样本输出
+
+通过在训练中使用模型生成图像，可以检查训练进度。将以下选项指定为训练脚本。
+
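+- `--sample_every_n_steps` / `--sample_every_n_epochs`
+    
+    Specify the number of steps or epochs between sample outputs. A sample is output every time this many steps or epochs have passed. If both are specified, the epoch count takes precedence.
+- `--sample_prompts`
+
+    Specify the prompt file used for sample output.
+
+- `--sample_sampler`
+
+    Specify the sampler used for sample output.
+    You can choose from `'ddim', 'pndm', 'heun', 'dpmsolver', 'dpmsolver++', 'dpmsingle', 'k_lms', 'k_euler', 'k_euler_a', 'k_dpm_2', 'k_dpm_2_a'`.
+
+To output samples, prepare a text file containing the prompts in advance. Write one prompt per line.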
+
+```txt
+# prompt 1
+masterpiece, best quality, 1girl, in white shirts, upper body, looking at viewer, simple background --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 768 --h 768 --d 1 --l 7.5 --s 28
+
+# prompt 2
+masterpiece, best quality, 1boy, in business suit, standing at street, looking back --n low quality, worst quality, bad anatomy,bad composition, poor, low effort --w 576 --h 832 --d 2 --l 5.5 --s 40
+```
+
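+Lines starting with `#` are comments. Options for the generated image are given as "`--` + a lowercase letter", for example `--n`. The following options are available:
+
+- `--n` Treats everything up to the next option as the negative prompt.
+- `--w` Specifies the width of the generated image.
+- `--h` Specifies the height of the generated image.
+- `--d` Specifies the seed of the generated image.
+- `--l` Specifies the CFG scale of the generated image.
+- `--s` Specifies the number of steps during generation.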
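To make the prompt-file format concrete, here is a minimal sketch of how one line could be split into the prompt and its options. It is not the parser used by the training scripts; `parse_sample_prompt` and the result keys are names chosen for this illustration, and it assumes the positive prompt comes first on the line.

```python
import re

# Option letter -> (result key, type); the keys are illustrative only.
_OPTION_FIELDS = {
    "n": ("negative_prompt", str),
    "w": ("width", int),
    "h": ("height", int),
    "d": ("seed", int),
    "l": ("cfg_scale", float),
    "s": ("steps", int),
}


def parse_sample_prompt(line: str):
    """Parse one prompt-file line; returns None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Splitting on " --x " with a capture group keeps the option letters:
    # [prompt, "n", value, "w", value, ...]
    parts = re.split(r"\s+--([a-z])\s+", line)
    result = {"prompt": parts[0].strip()}
    for flag, value in zip(parts[1::2], parts[2::2]):
        if flag in _OPTION_FIELDS:
            key, cast = _OPTION_FIELDS[flag]
            result[key] = cast(value.strip())
    return result


if __name__ == "__main__":
    print(parse_sample_prompt(
        "masterpiece, best quality, 1girl --n low quality, worst quality "
        "--w 768 --h 768 --d 1 --l 7.5 --s 28"
    ))
```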
+
+
+# 每个脚本通用的常用选项
+
+文档更新可能跟不上脚本更新。在这种情况下，请使用 `--help` 选项检查可用选项。
+## 学习模型规范
+
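+- `--v2` / `--v_parameterization`
+    
+    When the model to be trained is Hugging Face's stable-diffusion-2-base or a model fine-tuned from it (a model that is instructed to use `v2-inference.yaml` at inference time), specify the `--v2` option. When it is stable-diffusion-2, 768-v-ema.ckpt, or a model fine-tuned from them (a model that uses `v2-inference-v.yaml` at inference time), specify both the `--v2` and `--v_parameterization` options.
+
+    The following points changed significantly in Stable Diffusion 2.0.
+
+    1. The tokenizer used
+    2. Which Text Encoder is used, and which output layer is used (2.0 uses the second-to-last layer)
+    3. The output dimension of the Text Encoder (768 -> 1024)
+    4. The structure of the U-Net (number of CrossAttention heads, etc.)
+    5. v-parameterization (the sampling method seems to have changed)
+
+    Of these, the base model adopts 1 to 4, and the non-base model (768-v) adopts 1 to 5. The `v2` option enables 1 to 4, and the `v_parameterization` option enables 5.
+- `--pretrained_model_name_or_path`
+    
+    Specify the model to start additional training from. You can specify a Stable Diffusion checkpoint file (.ckpt or .safetensors), a Diffusers model directory on the local disk, or a Diffusers model ID (such as "stabilityai/stable-diffusion-2").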
+## 训练设置
+
+- `--output_dir` 
+
+    指定训练后保存模型的文件夹。
+    
+- `--output_name` 
+    
+    指定不带扩展名的模型文件名。
+    
+- `--dataset_config` 
+
+    指定描述数据集配置的 .toml 文件。
+
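+- `--max_train_steps` / `--max_train_epochs`
+
+    Specify the number of steps or epochs to train. If both are specified, the epoch count takes precedence.
+
+- `--mixed_precision`
+
+    Trains with mixed precision to save memory. Specify it as `--mixed_precision="fp16"`. Accuracy may be lower than without mixed precision (the default), but the GPU memory required for training is considerably reduced.
+    
+    (On RTX 30 series and later, `bf16` can also be specified. Match it to the accelerate settings you made during environment setup.)
+- `--gradient_checkpointing`
+
+    Reduces the GPU memory required for training by performing the weight computation during training in small pieces rather than all at once. Turning it on or off does not affect accuracy by itself, but turning it on allows a larger batch size, which does have an effect.
+    
+    Turning it on generally slows training down, but because the batch size can be increased, the total training time may actually be shorter.
+
+- `--xformers` / `--mem_eff_attn`
+
+    When the xformers option is specified, xformers' CrossAttention is used. If xformers is not installed or causes an error (depending on the environment, for example with `mixed_precision="no"`), specify the `mem_eff_attn` option instead to use the memory-efficient version of CrossAttention (it is slower than xformers).
+- `--save_precision`
+
+    Specify the data precision used when saving. Specifying float, fp16, or bf16 for the save_precision option saves the model in that format (this has no effect when saving in Diffusers format with DreamBooth or fine tuning). Use it when you want to reduce the model size.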
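+- `--save_every_n_epochs` / `--save_state` / `--resume`
+
+    Specifying a number for the save_every_n_epochs option saves the model during training every that many epochs.
+
+    If the save_state option is specified at the same time, the training state, including the optimizer state, is saved as well. The save destination is a folder.
+    
+    The training state is written to a folder named "<output_name>-??????-state" (?????? is the epoch number) in the output folder. Use this for long training runs.
+
+    Use the resume option to resume training from a saved training state. Specify the training state folder (the state folder inside it, not `output_dir`).
+
+    Note that, due to the Accelerator specification, the epoch number and global step are not saved, so they start from 1 again after resuming.
+- `--save_model_as` (DreamBooth and fine tuning only)
+
+    You can choose the model save format from `ckpt, safetensors, diffusers, diffusers_safetensors`.
+
+    Specify it like `--save_model_as=safetensors`. When a model in Stable Diffusion format (ckpt or safetensors) is loaded and saved in Diffusers format, the missing information is supplemented by fetching the v1.5 or v2.1 information from Hugging Face.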
+    
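+- `--clip_skip`
+    
+    If `2` is specified, the output of the second-to-last layer of the text encoder (CLIP) is used. If `1` is specified or the option is omitted, the last layer is used.
+
+    *SD2.0 uses the second-to-last layer by default, so do not specify this option when training SD2.0.
+
+    If the model being trained was originally trained to use the second-to-last layer, `2` is a good value.
+
+    If the model was instead trained to use the last layer, the whole model has been trained on that assumption. Training it again with the second-to-last layer may therefore require a certain amount of training data and longer training to get the desired result.
+- `--max_token_length`
+
+    The default is 75. Specifying `150` or `225` extends the token length for training. Specify it when training with long captions.
+    
+    However, the token extension during training differs slightly from Automatic1111's web UI (for example in how the text is split), so training with 75 is recommended unless a longer length is necessary.
+
+    As with clip_skip, training with a length different from the one the model was trained with may require a certain amount of training data and longer training time.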
+
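+- `--persistent_data_loader_workers`
+
+    Specifying this on Windows significantly reduces the wait time between epochs.
+
+- `--max_data_loader_n_workers`
+
+    Specify the number of processes used for data loading. More processes load data faster and use the GPU more efficiently, but consume more main memory. The default is the smaller of `8` and `number of concurrent CPU threads - 1`. If main memory is tight, or if GPU utilization is already around 90% or higher, lower this value to `2` or `1` while watching those figures.
+- `--logging_dir` / `--log_prefix`
+
+    Options for saving the training log. Specify the log destination folder with the logging_dir option. Logs are saved in TensorBoard format.
+
+    For example, specifying `--logging_dir=logs` creates a logs folder in the working folder, and logs are saved in a date/time folder inside it.
+    If the `--log_prefix` option is also specified, the given string is prepended to the date/time. Use something like `--logging_dir=logs --log_prefix=db_style1_` to tell runs apart.
+
+    To view the logs in TensorBoard, open another command prompt and type the following in the working folder:
+    ```
+    tensorboard --logdir=logs
+    ```
+
+    (TensorBoard should be installed during environment setup; if it is not, install it with `pip install tensorboard`.)
+
+    Then open a browser at http://localhost:6006/ to see the logs.
+- `--noise_offset`
+
+    An implementation of this article: https://www.crosslabs.org//blog/diffusion-with-offset-noise
+    
+    It may give better results for images that are dark or bright overall. It also seems to be effective for LoRA training. A value of around `0.1` seems to work well.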
+
+- `--debug_dataset`
+
+   通过添加此选项，您可以在训练之前检查将训练什么样的图像数据和标题。按 Esc 退出并返回命令行。按 `S` 进入下一步（批次），按 `E` 进入下一个epoch。
+
+    *图片在 Linux 环境（包括 Colab）下不显示。
+
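+- `--vae`
+
+    If you specify a Stable Diffusion checkpoint, a VAE checkpoint file, a Diffusers model, or a VAE (the latter two can be either local or a Hugging Face model ID) with the vae option, that VAE is used for training (when caching latents or when obtaining latents during training).
+
+    For DreamBooth and fine tuning, the saved model will include this VAE.
+
+- `--cache_latents`
+
+    Caches the VAE outputs in main memory to reduce VRAM usage. Augmentations other than flip_aug become unavailable. Overall training is also slightly faster.
+- `--min_snr_gamma`
+
+    Specifies the Min-SNR Weighting strategy. See [here](https://github.com/kohya-ss/sd-scripts/pull/308) for details. The paper recommends `5`.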
+
+## 优化器相关
+
+- `--optimizer_type`
+
+    Specify the optimizer type. The following can be specified:
+    - AdamW : [torch.optim.AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)
+        - Same as when no option was specified in past versions
+    - AdamW8bit : same arguments as above
+        - Same as specifying --use_8bit_adam in past versions
+    - PagedAdamW8bit : same arguments as above
+    - Lion : https://github.com/lucidrains/lion-pytorch
+        - Same as specifying --use_lion_optimizer in past versions
+    - Lion8bit : same arguments as above
+    - PagedLion8bit : same arguments as above
+    - SGDNesterov : [torch.optim.SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html), nesterov=True
+    - SGDNesterov8bit : same arguments as above
+    - DAdaptation(DAdaptAdamPreprint) : https://github.com/facebookresearch/dadaptation
+    - DAdaptAdam : same arguments as above
+    - DAdaptAdaGrad : same arguments as above
+    - DAdaptAdan : same arguments as above
+    - DAdaptAdanIP : same arguments as above
+    - DAdaptLion : same arguments as above
+    - DAdaptSGD : same arguments as above
+    - Prodigy : https://github.com/konstmish/prodigy
+    - AdaFactor : [Transformers AdaFactor](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules)
+    - Any other optimizer
+
+- `--learning_rate`
+
+   指定学习率。合适的学习率取决于训练脚本，所以请参考每个解释。
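+- `--lr_scheduler` / `--lr_warmup_steps` / `--lr_scheduler_num_cycles` / `--lr_scheduler_power`
+  
+    Options related to the learning rate scheduler.
+
+    With the lr_scheduler option, you can choose the scheduler from linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, or an arbitrary scheduler. The default is constant.
+    
+    With lr_warmup_steps, you can specify the number of scheduler warm-up steps (during which the learning rate is changed gradually).
+    
+    lr_scheduler_num_cycles is the number of restarts for the cosine with restarts scheduler, and lr_scheduler_power is the polynomial power for the polynomial scheduler.
+
+    Please look up the details yourself.
+
+    To use an arbitrary scheduler, specify its optional arguments with `--lr_scheduler_args`, in the same way as for an arbitrary optimizer.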
+### 关于指定优化器
+
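+Use the --optimizer_args option to specify optional arguments for the optimizer. Multiple values can be given in key=value format, and each value can itself contain several comma-separated values. For example, to specify arguments for the AdamW optimizer, use ``--optimizer_args weight_decay=0.01 betas=.9,.999``.
+
+When specifying optional arguments, check the specification of each optimizer.
+Some optimizers have required arguments, which are added automatically if omitted (for example, momentum for SGDNesterov). Check the console output.
+
+The D-Adaptation optimizer adjusts the learning rate automatically. The value given to the learning rate option is not the learning rate itself but the rate at which the learning rate determined by D-Adaptation is applied, so normally specify 1.0. If you want the Text Encoder to use half the learning rate of the U-Net, specify ``--text_encoder_lr=0.5 --unet_lr=1.0``.
+The AdaFactor optimizer can adjust the learning rate automatically when relative_step=True is specified (it is added by default if omitted). When adjusting automatically, the learning rate scheduler is forced to adafactor_scheduler. Specifying scale_parameter and warmup_init also seems to be a good idea.
+
+The options for automatic adjustment look like ``--optimizer_args "relative_step=True" "scale_parameter=True" "warmup_init=True"``.
+
+If you do not want the learning rate adjusted automatically, add the optional argument ``relative_step=False``. In that case, it seems recommended to use constant_with_warmup as the learning rate scheduler and not to clip the gradient norm. The arguments then look like ``--optimizer_type=adafactor --optimizer_args "relative_step=False" --lr_scheduler="constant_with_warmup" --max_grad_norm=0.0``.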
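The key=value format of `--optimizer_args` can be illustrated with a small parser. This is a sketch of one plausible way to turn such strings into keyword arguments, assuming values are Python literals; it is not claimed to be how the training scripts parse them, and `parse_optimizer_args` is a hypothetical name.

```python
import ast


def parse_optimizer_args(args):
    """Turn ["weight_decay=0.01", "betas=.9,.999"] into a kwargs dict."""
    kwargs = {}
    for item in args:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        try:
            # ".9,.999" evaluates to the tuple (0.9, 0.999); "True" to a bool.
            kwargs[key.strip()] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Anything that is not a Python literal is kept as a plain string.
            kwargs[key.strip()] = value
    return kwargs


if __name__ == "__main__":
    print(parse_optimizer_args(["weight_decay=0.01", "betas=.9,.999"]))
    print(parse_optimizer_args(["relative_step=False"]))
```

The resulting dictionary could then be passed to an optimizer constructor as `**kwargs`.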
+
+### 使用任何优化器
+
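+When using an optimizer from ``torch.optim``, specify only the class name (for example ``--optimizer_type=RMSprop``). When using an optimizer from another module, specify "module name.class name" (for example ``--optimizer_type=bitsandbytes.optim.lamb.LAMB``).
+
+(Internally the class is simply loaded with importlib, and operation has not been verified. Install the required packages as needed.)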
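The name resolution described above can be sketched with `importlib`. This is an assumption-laden illustration rather than the scripts' code: `resolve_optimizer_class` is a hypothetical helper, and the demo resolves standard-library classes so that it runs without PyTorch installed.

```python
import importlib


def resolve_optimizer_class(optimizer_type: str, default_module: str = "torch.optim"):
    """Resolve "ClassName" against default_module, or "package.module.ClassName"
    against the module named in the string."""
    if "." in optimizer_type:
        module_name, class_name = optimizer_type.rsplit(".", 1)
    else:
        module_name, class_name = default_module, optimizer_type
    module = importlib.import_module(module_name)  # ImportError if the package is missing
    return getattr(module, class_name)             # AttributeError if the class is missing


if __name__ == "__main__":
    # Standard-library stand-ins for the demo; with PyTorch installed,
    # resolve_optimizer_class("RMSprop") would look the class up in torch.optim.
    print(resolve_optimizer_class("collections.OrderedDict"))
    print(resolve_optimizer_class("Counter", default_module="collections"))
```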
+<!-- 
+## 使用任意大小的图像进行训练 --resolution
+你可以在广场外训练。请在分辨率中指定“宽度、高度”，如“448,640”。宽度和高度必须能被 64 整除。匹配训练图像和正则化图像的大小。
+
+就我个人而言，我经常生成垂直长的图像，所以我有时会用“448、640”来训练。
+
+## 纵横比分桶 --enable_bucket / --min_bucket_reso / --max_bucket_reso
+它通过指定 enable_bucket 选项来启用。 Stable Diffusion 在 512x512 分辨率下训练，但也在 256x768 和 384x640 等分辨率下训练。
+
+如果指定此选项，则不需要将训练图像和正则化图像统一为特定分辨率。从多种分辨率（纵横比）中进行选择，并在该分辨率下训练。
+由于分辨率为 64 像素，纵横比可能与原始图像不完全相同。
+
+您可以使用 min_bucket_reso 选项指定分辨率的最小大小，使用 max_bucket_reso 指定最大大小。默认值分别为 256 和 1024。
+例如，将最小尺寸指定为 384 将不会使用 256x1024 或 320x768 等分辨率。
+如果将分辨率增加到 768x768，您可能需要将 1280 指定为最大尺寸。
+
+启用 Aspect Ratio Ratio Bucketing 时，最好准备具有与训练图像相似的各种分辨率的正则化图像。
+
+（因为一批中的图像不偏向于训练图像和正则化图像。
+
+## 扩充 --color_aug / --flip_aug
+增强是一种通过在训练过程中动态改变数据来提高模型性能的方法。在使用 color_aug 巧妙地改变色调并使用 flip_aug 左右翻转的同时训练。
+
+由于数据是动态变化的，因此不能与 cache_latents 选项一起指定。
+
+## 使用 fp16 梯度训练（实验特征）--full_fp16
+如果指定 full_fp16 选项，梯度从普通 float32 变为 float16 (fp16) 并训练（它似乎是 full fp16 训练而不是混合精度）。
+结果，似乎 SD1.x 512x512 大小可以在 VRAM 使用量小于 8GB 的情况下训练，而 SD2.x 512x512 大小可以在 VRAM 使用量小于 12GB 的情况下训练。
+
+预先在加速配置中指定 fp16，并可选择设置 ``mixed_precision="fp16"``（bf16 不起作用）。
+
+为了最大限度地减少内存使用，请使用 xformers、use_8bit_adam、cache_latents、gradient_checkpointing 选项并将 train_batch_size 设置为 1。
+
+（如果你负担得起，逐步增加 train_batch_size 应该会提高一点精度。）
+
+它是通过修补 PyTorch 源代码实现的（已通过 PyTorch 1.12.1 和 1.13.0 确认）。准确率会大幅下降，途中学习失败的概率也会增加。
+学习率和步数的设置似乎很严格。请注意它们并自行承担使用它们的风险。
+-->
+
+# 创建元数据文件
+
+## 准备训练数据
+
+如上所述准备好你要训练的图像数据，放在任意文件夹中。
+
+例如，存储这样的图像：
+
+![教师数据文件夹的屏幕截图](https://user-images.githubusercontent.com/52813779/208907739-8e89d5fa-6ca8-4b60-8927-f484d2a9ae04.png)
+
+## 自动captioning
+
+如果您只想训练没有标题的标签，请跳过。
+
+另外，手动准备caption时，请准备在与教师数据图像相同的目录下，文件名相同，扩展名.caption等。每个文件应该是只有一行的文本文件。
+### 使用 BLIP 添加caption
+
+最新版本不再需要 BLIP 下载、权重下载和额外的虚拟环境。按原样工作。
+
+运行 finetune 文件夹中的 make_captions.py。
+
+```
+python finetune\make_captions.py --batch_size <batch size> <training data folder>
+```
+
+如果batch size为8，训练数据放在父文件夹train_data中，则会如下所示
+```
+python finetune\make_captions.py --batch_size 8 ..\train_data
+```
+
+caption文件创建在与教师数据图像相同的目录中，具有相同的文件名和扩展名.caption。
+
+根据 GPU 的 VRAM 容量增加或减少 batch_size。越大越快（我认为 12GB 的 VRAM 可以多一点）。
+您可以使用 max_length 选项指定caption的最大长度。默认值为 75。如果使用 225 的令牌长度训练模型，它可能会更长。
+您可以使用 caption_extension 选项更改caption扩展名。默认为 .caption（.txt 与稍后描述的 DeepDanbooru 冲突）。
+如果有多个教师数据文件夹，则对每个文件夹执行。
+
+请注意，推理是随机的，因此每次运行时结果都会发生变化。如果要修复它，请使用 --seed 选项指定一个随机数种子，例如 `--seed 42`。
+
+其他的选项，请参考help with `--help`（好像没有文档说明参数的含义，得看源码）。
+
+默认情况下，会生成扩展名为 .caption 的caption文件。
+
+![caption生成的文件夹](https://user-images.githubusercontent.com/52813779/208908845-48a9d36c-f6ee-4dae-af71-9ab462d1459e.png)
+
+例如，标题如下：
+
+![caption和图像](https://user-images.githubusercontent.com/52813779/208908947-af936957-5d73-4339-b6c8-945a52857373.png)
+
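+## Tagging with DeepDanbooru
+
+If you do not want to tag with danbooru tags at all, proceed to "Preprocessing caption and tag information".
+
+Tagging is done with either DeepDanbooru or WD14Tagger. WD14Tagger seems to be more accurate. To tag with WD14Tagger, skip to the next chapter.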
+### 环境布置
+
+将 DeepDanbooru https://github.com/KichangKim/DeepDanbooru 克隆到您的工作文件夹中，或下载并展开 zip。我解压缩了它。
+另外，从 DeepDanbooru 发布页面 https://github.com/KichangKim/DeepDanbooru/releases 上的“DeepDanbooru 预训练模型 v3-20211112-sgd-e28”的资产下载 deepdanbooru-v3-20211112-sgd-e28.zip 并解压到 DeepDanbooru 文件夹。
+
+从下面下载。单击以打开资产并从那里下载。
+
+![DeepDanbooru下载页面](https://user-images.githubusercontent.com/52813779/208909417-10e597df-7085-41ee-bd06-3e856a1339df.png)
+
+做一个这样的目录结构
+
+![DeepDanbooru的目录结构](https://user-images.githubusercontent.com/52813779/208909486-38935d8b-8dc6-43f1-84d3-fef99bc471aa.png)
+为diffusers环境安装必要的库。进入 DeepDanbooru 文件夹并安装它（我认为它实际上只是添加了 tensorflow-io）。
+```
+pip install -r requirements.txt
+```
+
+接下来，安装 DeepDanbooru 本身。
+
+```
+pip install .
+```
+
+这样就完成了标注环境的准备工作。
+
+### 实施标记
+转到 DeepDanbooru 的文件夹并运行 deepdanbooru 进行标记。
+```
+deepdanbooru evaluate <教师资料夹> --project-path deepdanbooru-v3-20211112-sgd-e28 --allow-folder --save-txt
+```
+
+如果将训练数据放在父文件夹train_data中，则如下所示。
+```
+deepdanbooru evaluate ../train_data --project-path deepdanbooru-v3-20211112-sgd-e28 --allow-folder --save-txt
+```
+
+在与教师数据图像相同的目录中创建具有相同文件名和扩展名.txt 的标记文件。它很慢，因为它是一个接一个地处理的。
+
+如果有多个教师数据文件夹，则对每个文件夹执行。
+
+它生成如下。
+
+![DeepDanbooru生成的文件](https://user-images.githubusercontent.com/52813779/208909855-d21b9c98-f2d3-4283-8238-5b0e5aad6691.png)
+
+它会被这样标记（信息量很大...）。
+
+![DeepDanbooru标签和图片](https://user-images.githubusercontent.com/52813779/208909908-a7920174-266e-48d5-aaef-940aba709519.png)
+
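+## Tagging with WD14Tagger
+
+This procedure uses WD14Tagger instead of DeepDanbooru.
+
+It uses the tagger that is used in Automatic1111's WebUI. I referred to the information on this GitHub page (https://github.com/toriato/stable-diffusion-webui-wd14-tagger#mrsmilingwolfs-model-aka-waifu-diffusion-14-tagger).
+
+The required modules are already installed during the initial environment setup. The weights are downloaded automatically from Hugging Face.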
+### 实施标记
+
+运行脚本以进行标记。
+```
+python tag_images_by_wd14_tagger.py --batch_size <batch size> <training data folder>
+```
+
+如果将训练数据放在父文件夹train_data中，则如下所示
+```
+python tag_images_by_wd14_tagger.py --batch_size 4 ..\train_data
+```
+
+模型文件将在首次启动时自动下载到 wd14_tagger_model 文件夹（文件夹可以在选项中更改）。它将如下所示。
+![下载文件](https://user-images.githubusercontent.com/52813779/208910447-f7eb0582-90d6-49d3-a666-2b508c7d1842.png)
+
+在与教师数据图像相同的目录中创建具有相同文件名和扩展名.txt 的标记文件。
+![生成的标签文件](https://user-images.githubusercontent.com/52813779/208910534-ea514373-1185-4b7d-9ae3-61eb50bc294e.png)
+
+![标签和图片](https://user-images.githubusercontent.com/52813779/208910599-29070c15-7639-474f-b3e4-06bd5a3df29e.png)
+
+使用 thresh 选项，您可以指定确定的标签的置信度数以附加标签。默认值为 0.35，与 WD14Tagger 示例相同。较低的值给出更多的标签，但准确性较低。
+
+根据 GPU 的 VRAM 容量增加或减少 batch_size。越大越快（我认为 12GB 的 VRAM 可以多一点）。您可以使用 caption_extension 选项更改标记文件扩展名。默认为 .txt。
+
+您可以使用 model_dir 选项指定保存模型的文件夹。
+
+此外，如果指定 force_download 选项，即使有保存目标文件夹，也会重新下载模型。
+
+如果有多个教师数据文件夹，则对每个文件夹执行。
+
+## 预处理caption和标签信息
+
+将caption和标签作为元数据合并到一个文件中，以便从脚本中轻松处理。
+### caption预处理
+
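+To put captions into the metadata, run the following in your working folder (this is not needed if captions are not used for training). The command is actually entered on a single line; the same applies below. Specify the `--full_path` option to store the full path of each image file in the metadata. If this option is omitted, relative paths are recorded, and the folder must then be specified separately in the .toml file.
+```
+python merge_captions_to_metadata.py --full_path <training data folder>
+    --in_json <metadata file name to read> <metadata file name>
+```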
+
+元数据文件名是任意名称。
+如果训练数据为train_data，没有读取元数据文件，元数据文件为meta_cap.json，则会如下。
+```
+python merge_captions_to_metadata.py --full_path train_data meta_cap.json
+```
+
+您可以使用 caption_extension 选项指定标题扩展。
+
+如果有多个教师数据文件夹，请指定 full_path 参数并为每个文件夹执行。
+```
+python merge_captions_to_metadata.py --full_path 
+    train_data1 meta_cap1.json
+python merge_captions_to_metadata.py --full_path --in_json meta_cap1.json 
+    train_data2 meta_cap2.json
+```
+如果省略in_json，如果有写入目标元数据文件，将从那里读取并覆盖。
+
+__* 每次重写 in_json 选项和写入目标并写入单独的元数据文件是安全的。 __
+### 标签预处理
+
+同样，标签也收集在元数据中（如果标签不用于训练，则无需这样做）。
+```
+python merge_dd_tags_to_metadata.py --full_path <教师资料夹> 
+    --in_json <要读取的元数据文件名> <要写入的元数据文件名>
+```
+
+同样的目录结构，读取meta_cap.json和写入meta_cap_dd.json时，会是这样的。
+```
+python merge_dd_tags_to_metadata.py --full_path train_data --in_json meta_cap.json meta_cap_dd.json
+```
+
+如果有多个教师数据文件夹，请指定 full_path 参数并为每个文件夹执行。
+
+```
+python merge_dd_tags_to_metadata.py --full_path --in_json meta_cap2.json
+    train_data1 meta_cap_dd1.json
+python merge_dd_tags_to_metadata.py --full_path --in_json meta_cap_dd1.json 
+    train_data2 meta_cap_dd2.json
+```
+
+如果省略in_json，如果有写入目标元数据文件，将从那里读取并覆盖。
+__※ 通过每次重写 in_json 选项和写入目标，写入单独的元数据文件是安全的。 __
+### 标题和标签清理
+
+到目前为止，标题和DeepDanbooru标签已经被整理到元数据文件中。然而，自动标题生成的标题存在表达差异等微妙问题（※），而标签中可能包含下划线和评级（DeepDanbooru的情况下）。因此，最好使用编辑器的替换功能清理标题和标签。
+
+※例如，如果要学习动漫中的女孩，标题可能会包含girl/girls/woman/women等不同的表达方式。另外，将"anime girl"简单地替换为"girl"可能更合适。
+
+我们提供了用于清理的脚本，请根据情况编辑脚本并使用它。
+
+（不需要指定教师数据文件夹。将清理元数据中的所有数据。）
+
+```
+python clean_captions_and_tags.py <要读取的元数据文件名> <要写入的元数据文件名>
+```
+
+Note that `--in_json` is not used here. For example:
+
+```
+python clean_captions_and_tags.py meta_cap_dd.json meta_clean.json
+```
+
+标题和标签的预处理现已完成。
+
+## 预先获取 latents
+
+※ 这一步骤并非必须。即使省略此步骤，也可以在训练过程中获取 latents。但是，如果在训练时执行 `random_crop` 或 `color_aug` 等操作，则无法预先获取 latents（因为每次图像都会改变）。如果不进行预先获取，则可以使用到目前为止的元数据进行训练。
+
+提前获取图像的潜在表达并保存到磁盘上。这样可以加速训练过程。同时进行 bucketing（根据宽高比对训练数据进行分类）。
+
+请在工作文件夹中输入以下内容。
+
+```
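+python prepare_buckets_latents.py --full_path <training data folder>  
+    <metadata file name to read> <metadata file name to write> 
+    <model name or checkpoint to fine-tune> 
+    --batch_size <batch size> 
+    --max_resolution <resolution as width,height> 
+    --mixed_precision <precision>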
+```
+
+如果要从meta_clean.json中读取元数据，并将其写入meta_lat.json，使用模型model.ckpt，批处理大小为4，训练分辨率为512*512，精度为no（float32），则应如下所示。
+```
+python prepare_buckets_latents.py --full_path 
+    train_data meta_clean.json meta_lat.json model.ckpt 
+    --batch_size 4 --max_resolution 512,512 --mixed_precision no
+```
+
+教师数据文件夹中，latents以numpy的npz格式保存。
+
+您可以使用--min_bucket_reso选项指定最小分辨率大小，--max_bucket_reso指定最大大小。默认值分别为256和1024。例如，如果指定最小大小为384，则将不再使用分辨率为256 * 1024或320 * 768等。如果将分辨率增加到768 * 768等较大的值，则最好将最大大小指定为1280等。
+
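+If the --flip_aug option is specified, left-right flip augmentation is applied. This can effectively double the amount of data, but if the data is not left-right symmetric (for example character appearance or hairstyle), training may not go well.
+
+(This is a simple implementation: latents are also obtained for the flipped images and saved to \*\_flip.npz files. No special option is needed in fine_tune.py. If files with \_flip exist, files with and without flip are loaded at random.)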
+
+即使VRAM为12GB，批次大小也可以稍微增加。分辨率以“宽度，高度”的形式指定，必须是64的倍数。分辨率直接影响fine tuning时的内存大小。在12GB VRAM中，512,512似乎是极限（*）。如果有16GB，则可以将其提高到512,704或512,768。即使分辨率为256,256等，VRAM 8GB也很难承受（因为参数、优化器等与分辨率无关，需要一定的内存）。
+
+*有报道称，在batch size为1的训练中，使用12GB VRAM和640,640的分辨率。 
+
+以下是bucketing结果的显示方式。
+
+![bucketing的結果](https://user-images.githubusercontent.com/52813779/208911419-71c00fbb-2ce6-49d5-89b5-b78d7715e441.png)
+
+如果有多个教师数据文件夹，请指定 full_path 参数并为每个文件夹执行
+
+```
+python prepare_buckets_latents.py --full_path  
+    train_data1 meta_clean.json meta_lat1.json model.ckpt 
+    --batch_size 4 --max_resolution 512,512 --mixed_precision no
+
+python prepare_buckets_latents.py --full_path 
+    train_data2 meta_lat1.json meta_lat2.json model.ckpt 
+    --batch_size 4 --max_resolution 512,512 --mixed_precision no
+
+```
+可以将读取源和写入目标设为相同，但分开设定更为安全。
+
+__※建议每次更改参数并将其写入另一个元数据文件，以确保安全性。__
--- a/docs/train_SDXL-en.md
+++ b/docs/train_SDXL-en.md
@@ -0,0 +1,84 @@
+## SDXL training
+
+The documentation will be moved to the training documentation in the future. The following is a brief explanation of the training scripts for SDXL.
+
+### Training scripts for SDXL
+
+- `sdxl_train.py` is a script for SDXL fine-tuning. The usage is almost the same as `fine_tune.py`, but it also supports DreamBooth dataset.
+  - `--full_bf16` option is added. Thanks to KohakuBlueleaf!
+    - This option enables the full bfloat16 training (includes gradients). This option is useful to reduce the GPU memory usage. 
+    - The full bfloat16 training might be unstable. Please use it at your own risk.
+  - The different learning rates for each U-Net block are now supported in sdxl_train.py. Specify with `--block_lr` option. Specify 23 values separated by commas like `--block_lr 1e-3,1e-3 ... 1e-3`.
+    - 23 values correspond to `0: time/label embed, 1-9: input blocks 0-8, 10-12: mid blocks 0-2, 13-21: output blocks 0-8, 22: out`.
+- `prepare_buckets_latents.py` now supports SDXL fine-tuning.
+
+- `sdxl_train_network.py` is a script for LoRA training for SDXL. The usage is almost the same as `train_network.py`.
+
+- Both scripts have the following additional options:
+  - `--cache_text_encoder_outputs` and `--cache_text_encoder_outputs_to_disk`: Cache the outputs of the text encoders. This option is useful to reduce the GPU memory usage. This option cannot be used with options for shuffling or dropping the captions.
+  - `--no_half_vae`: Disable the half-precision (mixed-precision) VAE. VAE for SDXL seems to produce NaNs in some cases. This option is useful to avoid the NaNs.
+
+- `--weighted_captions` option is not supported yet for both scripts.
+
+- `sdxl_train_textual_inversion.py` is a script for Textual Inversion training for SDXL. The usage is almost the same as `train_textual_inversion.py`.
+  - `--cache_text_encoder_outputs` is not supported.
+  - There are two options for captions:
+    1. Training with captions. All captions must include the token string. The token string is replaced with multiple tokens.
+    2. Use `--use_object_template` or `--use_style_template` option. The captions are generated from the template. The existing captions are ignored.
+  - See below for the format of the embeddings.
+
+- `--min_timestep` and `--max_timestep` options are added to each training script. These options can be used to train U-Net with different timesteps. The default values are 0 and 1000.
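
As an illustration of the 23-value layout that `--block_lr` expects (0: time/label embed, 1-9: input blocks, 10-12: mid blocks, 13-21: output blocks, 22: out), the value string could be assembled like this. `build_block_lr` is a hypothetical helper for this sketch, not part of the scripts, and the learning rates are arbitrary examples.

```python
def build_block_lr(embed, input_blocks, mid_blocks, output_blocks, out):
    """Build the comma-separated 23-value string for --block_lr,
    using one learning rate per block group."""
    values = (
        [embed]                  # index 0: time/label embed
        + [input_blocks] * 9     # indices 1-9: input blocks 0-8
        + [mid_blocks] * 3       # indices 10-12: mid blocks 0-2
        + [output_blocks] * 9    # indices 13-21: output blocks 0-8
        + [out]                  # index 22: out
    )
    assert len(values) == 23
    return ",".join(f"{v:g}" for v in values)


if __name__ == "__main__":
    print("--block_lr", build_block_lr(1e-3, 1e-3, 5e-4, 1e-4, 1e-4))
```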
+
+### Utility scripts for SDXL
+
+- `tools/cache_latents.py` is added. This script can be used to cache the latents to disk in advance. 
+  - The options are almost the same as `sdxl_train.py`. See the help message for the usage.
+  - Please launch the script as follows:
+    `accelerate launch  --num_cpu_threads_per_process 1 tools/cache_latents.py ...`
+  - This script should work with multi-GPU, but it is not tested in my environment.
+
+- `tools/cache_text_encoder_outputs.py` is added. This script can be used to cache the text encoder outputs to disk in advance. 
+  - The options are almost the same as `cache_latents.py` and `sdxl_train.py`. See the help message for the usage.
+
+- `sdxl_gen_img.py` is added. This script can be used to generate images with SDXL, including LoRA, Textual Inversion and ControlNet-LLLite. See the help message for the usage.
+
+### Tips for SDXL training
+
+- The default resolution of SDXL is 1024x1024.
+- Fine-tuning can be done with 24GB of GPU memory at a batch size of 1. The following options are recommended __for fine-tuning with 24GB of GPU memory__:
+  - Train U-Net only.
+  - Use gradient checkpointing.
+  - Use `--cache_text_encoder_outputs` option and caching latents.
+  - Use Adafactor optimizer. RMSprop 8bit or Adagrad 8bit may work. AdamW 8bit doesn't seem to work.
+- The LoRA training can be done with 8GB GPU memory (10GB recommended). For reducing the GPU memory usage, the following options are recommended:
+  - Train U-Net only.
+  - Use gradient checkpointing.
+  - Use `--cache_text_encoder_outputs` option and caching latents.
+  - Use one of 8bit optimizers or Adafactor optimizer.
+  - Use lower dim (4 to 8 for 8GB GPU).
+- `--network_train_unet_only` option is highly recommended for SDXL LoRA. Because SDXL has two text encoders, training them as well may produce unexpected results.
+- PyTorch 2 seems to use slightly less GPU memory than PyTorch 1.
+- `--bucket_reso_steps` can be set to 32 instead of the default value 64. Smaller values than 32 will not work for SDXL training.
+
+Example of the optimizer settings for Adafactor with the fixed learning rate:
+```toml
+optimizer_type = "adafactor"
+optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False" ]
+lr_scheduler = "constant_with_warmup"
+lr_warmup_steps = 100
+learning_rate = 4e-7 # SDXL original learning rate
+```
+
+### Format of Textual Inversion embeddings for SDXL
+
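+```python
+from safetensors.torch import save_file
+
+# "clip_g" holds the embeddings for the text encoder with hidden size 1280,
+# "clip_l" those for the text encoder with hidden size 768.
+state_dict = {"clip_g": embs_for_text_encoder_1280, "clip_l": embs_for_text_encoder_768}
+save_file(state_dict, file)  # `file` is the output path
+```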
+
+### ControlNet-LLLite
+
+ControlNet-LLLite, a novel method for ControlNet with SDXL, is added. See [documentation](./docs/train_lllite_README.md) for details.
+
--- a/docs/train_db_README-ja.md
+++ b/docs/train_db_README-ja.md
@@ -0,0 +1,167 @@
+DreamBoothのガイドです。
+
+[学習についての共通ドキュメント](./train_README-ja.md) もあわせてご覧ください。
+
+# 概要
+
+DreamBoothとは、画像生成モデルに特定の主題を追加学習し、それを特定の識別子で生成する技術です。[論文はこちら](https://arxiv.org/abs/2208.12242)。
+
+具体的には、Stable Diffusionのモデルにキャラや画風などを学ばせ、それを `shs` のような特定の単語で呼び出せる（生成画像に出現させる）ことができます。
+
+スクリプトは[DiffusersのDreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth)を元にしていますが、以下のような機能追加を行っています（いくつかの機能は元のスクリプト側もその後対応しています）。
+
+スクリプトの主な機能は以下の通りです。
+
+- 8bit Adam optimizerおよびlatentのキャッシュによる省メモリ化（[Shivam Shrirao氏版](https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth)と同様）。
+- xformersによる省メモリ化。
+- 512x512だけではなく任意サイズでの学習。
+- augmentationによる品質の向上。
+- DreamBoothだけではなくText Encoder+U-Netのfine tuningに対応。
+- Stable Diffusion形式でのモデルの読み書き。
+- Aspect Ratio Bucketing。
+- Stable Diffusion v2.0対応。
+
+# 学習の手順
+
+あらかじめこのリポジトリのREADMEを参照し、環境整備を行ってください。
+
+## データの準備
+
+[学習データの準備について](./train_README-ja.md) を参照してください。
+
+## 学習の実行
+
+Run the script. The command that saves the most memory looks like the following (it is actually entered on a single line). Rewrite each line as needed. It appears to run with about 12GB of VRAM.
+
+```
+accelerate launch --num_cpu_threads_per_process 1 train_db.py 
+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory> 
+    --dataset_config=<.toml file created during data preparation> 
+    --output_dir=<output folder for the trained model>  
+    --output_name=<file name for the trained model output> 
+    --save_model_as=safetensors 
+    --prior_loss_weight=1.0 
+    --max_train_steps=1600 
+    --learning_rate=1e-6 
+    --optimizer_type="AdamW8bit" 
+    --xformers 
+    --mixed_precision="fp16" 
+    --cache_latents 
+    --gradient_checkpointing
+```
+
+`num_cpu_threads_per_process` には通常は1を指定するとよいようです。
+
+`pretrained_model_name_or_path` に追加学習を行う元となるモデルを指定します。Stable Diffusionのcheckpointファイル（.ckptまたは.safetensors）、Diffusersのローカルディスクにあるモデルディレクトリ、DiffusersのモデルID（"stabilityai/stable-diffusion-2"など）が指定できます。
+
+`output_dir` に学習後のモデルを保存するフォルダを指定します。`output_name` にモデルのファイル名を拡張子を除いて指定します。`save_model_as` でsafetensors形式での保存を指定しています。
+
+`dataset_config` に `.toml` ファイルを指定します。ファイル内でのバッチサイズ指定は、当初はメモリ消費を抑えるために `1` としてください。
+
+`prior_loss_weight` は正則化画像のlossの重みです。通常は1.0を指定します。
+
+学習させるステップ数 `max_train_steps` を1600とします。学習率 `learning_rate` はここでは1e-6を指定しています。
+
+省メモリ化のため `mixed_precision="fp16"` を指定します（RTX30 シリーズ以降では `bf16` も指定できます。環境整備時にaccelerateに行った設定と合わせてください）。また `gradient_checkpointing` を指定します。
+
+オプティマイザ（モデルを学習データにあうように最適化＝学習させるクラス）にメモリ消費の少ない 8bit AdamW を使うため、 `optimizer_type="AdamW8bit"` を指定します。
+
+`xformers` オプションを指定し、xformersのCrossAttentionを用います。xformersをインストールしていない場合やエラーとなる場合（環境にもよりますが `mixed_precision="no"` の場合など）、代わりに `mem_eff_attn` オプションを指定すると省メモリ版CrossAttentionを使用します（速度は遅くなります）。
+
+省メモリ化のため `cache_latents` オプションを指定してVAEの出力をキャッシュします。
+
+ある程度メモリがある場合は、`.toml` ファイルを編集してバッチサイズをたとえば `4` くらいに増やしてください（高速化と精度向上の可能性があります）。また `cache_latents` を外すことで augmentation が可能になります。
+
+### よく使われるオプションについて
+
+以下の場合には [学習の共通ドキュメント](./train_README-ja.md) の「よく使われるオプション」を参照してください。
+
+- Stable Diffusion 2.xまたはそこからの派生モデルを学習する
+- clip skipを2以上を前提としたモデルを学習する
+- 75トークンを超えたキャプションで学習する
+
+### DreamBoothでのステップ数について
+
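+To save memory, this script performs half as many training updates per step as the original script (because the target images and the regularization images are split into separate batches rather than trained in the same batch).
+
+To train roughly the same amount as the original Diffusers version or XavierXiao's Stable Diffusion version, double the number of steps.
+
+(Strictly speaking the data order changes because the training images and regularization images are combined and then shuffled, but this should not greatly affect training.)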
+
+### DreamBoothでのバッチサイズについて
+
+モデル全体を学習するためLoRA等の学習に比べるとメモリ消費量は多くなります（fine tuningと同じ）。
+
+### 学習率について
+
+Diffusers版では5e-6ですがStable Diffusion版は1e-6ですので、上のサンプルでは1e-6を指定しています。
+
+### 以前の形式のデータセット指定をした場合のコマンドライン
+
+解像度やバッチサイズをオプションで指定します。コマンドラインの例は以下の通りです。
+
+```
+accelerate launch --num_cpu_threads_per_process 1 train_db.py 
+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory> 
+    --train_data_dir=<training data directory> 
+    --reg_data_dir=<regularization image directory> 
+    --output_dir=<output directory for the trained model> 
+    --output_name=<file name for the trained model output> 
+    --prior_loss_weight=1.0 
+    --resolution=512 
+    --train_batch_size=1 
+    --learning_rate=1e-6 
+    --max_train_steps=1600 
+    --use_8bit_adam 
+    --xformers 
+    --mixed_precision="bf16" 
+    --cache_latents 
+    --gradient_checkpointing
+```
+
+## 学習したモデルで画像生成する
+
+学習が終わると指定したフォルダに指定した名前でsafetensorsファイルが出力されます。
+
+v1.4/1.5およびその他の派生モデルの場合、このモデルでAutomatic1111氏のWebUIなどで推論できます。models\Stable-diffusionフォルダに置いてください。
+
+v2.xモデルでWebUIで画像生成する場合、モデルの仕様が記述された.yamlファイルが別途必要になります。v2.x baseの場合はv2-inference.yamlを、768/vの場合はv2-inference-v.yamlを、同じフォルダに置き、拡張子の前の部分をモデルと同じ名前にしてください。
+
+![image](https://user-images.githubusercontent.com/52813779/210776915-061d79c3-6582-42c2-8884-8b91d2f07313.png)
+
+各yamlファイルは[Stability AIのSD2.0のリポジトリ](https://github.com/Stability-AI/stablediffusion/tree/main/configs/stable-diffusion)にあります。
+
+# DreamBooth特有のその他の主なオプション
+
+すべてのオプションについては別文書を参照してください。
+
+## Text Encoderの学習を途中から行わない --stop_text_encoder_training
+
+stop_text_encoder_trainingオプションに数値を指定すると、そのステップ数以降はText Encoderの学習を行わずU-Netだけ学習します。場合によっては精度の向上が期待できるかもしれません。
+
+（恐らくText Encoderだけ先に過学習することがあり、それを防げるのではないかと推測していますが、詳細な影響は不明です。）
+
+## Tokenizerのパディングをしない --no_token_padding
+no_token_paddingオプションを指定するとTokenizerの出力をpaddingしません（Diffusers版の旧DreamBoothと同じ動きになります）。
+
+
+<!-- 
+bucketing（後述）を利用しかつaugmentation（後述）を使う場合の例は以下のようになります。
+
+```
+accelerate launch --num_cpu_threads_per_process 8 train_db.py 
+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory> 
+    --train_data_dir=<training data directory> 
+    --reg_data_dir=<regularization image directory> 
+    --output_dir=<output directory for the trained model> 
+    --resolution=768,512 
+    --train_batch_size=20 --learning_rate=5e-6 --max_train_steps=800 
+    --use_8bit_adam --xformers --mixed_precision="bf16" 
+    --save_every_n_epochs=1 --save_state --save_precision="bf16" 
+    --logging_dir=logs 
+    --enable_bucket --min_bucket_reso=384 --max_bucket_reso=1280 
+    --color_aug --flip_aug --gradient_checkpointing --seed 42
+```
+
+
+-->
--- a/docs/train_db_README-zh.md
+++ b/docs/train_db_README-zh.md
@@ -0,0 +1,162 @@
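+This is the guide to DreamBooth.
+
+Please also see the [common document on training](./train_README-zh.md).
+
+# Overview
+
+DreamBooth is a technique that additionally trains an image generation model on a specific subject and then generates that subject using a specific identifier. Paper link.
+
+Specifically, you can teach a Stable Diffusion model a character, an art style, and so on, and then call it up (make it appear in generated images) with a specific word such as `shs`.
+
+The script is based on Diffusers' DreamBooth, with the following features added (some of them have since been supported by the original script as well).
+
+The main features of the script are:
+
+- Saves memory with the 8-bit Adam optimizer and latent caching (similar to Shivam Shrirao's version).
+- Saves memory with xformers.
+- Supports training at arbitrary sizes, not only 512x512.
+- Improves quality through data augmentation.
+- Supports fine-tuning of Text Encoder + U-Net in addition to DreamBooth.
+- Supports reading and writing models in Stable Diffusion format.
+- Supports Aspect Ratio Bucketing.
+- Supports Stable Diffusion v2.0.
+
+# Training procedure
+
+First, see this repository's README and set up the environment.
+
+## Preparing the data
+
+See [Preparing the training data](./train_README-zh.md).
+
+## Running training
+
+Run the script. The following command saves as much memory as possible (in practice, enter it on a single line). Adjust each line as needed. It appears to run with about 12GB of VRAM.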
+```
+accelerate launch --num_cpu_threads_per_process 1 train_db.py 
+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory>
+    --dataset_config=<.toml file created during data preparation>
+    --output_dir=<output directory for the trained model>
+    --output_name=<file name for the trained model>
+    --save_model_as=safetensors 
+    --prior_loss_weight=1.0 
+    --max_train_steps=1600 
+    --learning_rate=1e-6 
+    --optimizer_type="AdamW8bit" 
+    --xformers 
+    --mixed_precision="fp16" 
+    --cache_latents 
+    --gradient_checkpointing
+```
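+`num_cpu_threads_per_process` should normally be set to 1.
+
+`pretrained_model_name_or_path` specifies the base model for additional training. You can specify a Stable Diffusion checkpoint file (.ckpt or .safetensors), a local Diffusers model directory, or a model ID (such as "stabilityai/stable-diffusion-2").
+
+`output_dir` specifies the folder where the trained model is saved. Specify the model file name, without the extension, in `output_name`. `save_model_as` specifies saving in safetensors format.
+
+Specify the `.toml` file in `dataset_config`. Start with a batch size of `1` to keep memory consumption low.
+
+`prior_loss_weight` is the weight of the loss for regularization images. Normally set it to 1.0.
+
+The number of training steps, `max_train_steps`, is set to 1600. The learning rate, `learning_rate`, is set to 1e-6 here.
+
+To save memory, `mixed_precision="fp16"` is specified (on RTX 30 series and later you can also use `bf16`). `gradient_checkpointing` is specified as well.
+
+To use the memory-efficient 8-bit AdamW as the optimizer (the component that fits the model to the training data), specify `optimizer_type="AdamW8bit"`.
+
+The `xformers` option is specified to use xformers' CrossAttention. If xformers is not installed or causes an error (depending on the environment, for example with `mixed_precision="no"`), you can specify the `mem_eff_attn` option instead to use the memory-efficient CrossAttention (this is slower).
+
+To save memory, the `cache_latents` option is specified to cache the VAE output.
+
+If you have enough memory, edit the `.toml` file and increase the batch size to around `4` (this may improve speed and accuracy). Removing the `cache_latents` option also makes data augmentation possible.
+
+### Commonly used options
+
+See the "Commonly used options" section in the following cases:
+
+- Training Stable Diffusion 2.x or a model derived from it.
+- Training a model that assumes clip skip of 2 or more.
+- Training with captions longer than 75 tokens.
+
+### About the number of steps in DreamBooth
+
+To save memory, this script performs half as much training per step (because training images and regularization images are placed in separate batches during training).
+
+To train about as much as the original Diffusers version or XavierXiao's Stable Diffusion version, double the number of steps.
+
+(The training images and regularization images are combined and then shuffled, but this probably has little effect on training.)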
+
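+### About the batch size in DreamBooth
+
+Because the whole model is trained, memory consumption is higher than for training such as LoRA (it is the same as fine-tuning).
+
+### About the learning rate
+
+The Diffusers version uses 5e-6, while the Stable Diffusion version uses 1e-6, so the example above specifies 1e-6.
+
+### Command line when using the legacy dataset specification
+
+Specify the resolution and batch size as options. An example command line follows.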
+```
+accelerate launch --num_cpu_threads_per_process 1 train_db.py 
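+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory> 
+    --train_data_dir=<training data directory> 
+    --reg_data_dir=<regularization image directory> 
+    --output_dir=<output directory for the trained model> 
+    --output_name=<file name for the trained model> 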
+    --prior_loss_weight=1.0 
+    --resolution=512 
+    --train_batch_size=1 
+    --learning_rate=1e-6 
+    --max_train_steps=1600 
+    --use_8bit_adam 
+    --xformers 
+    --mixed_precision="bf16" 
+    --cache_latents
+    --gradient_checkpointing
+```
+
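+## Generating images with the trained model
+
+When training finishes, a safetensors file with the specified name is written to the specified folder.
+
+For v1.4/1.5 and other derived models, you can run inference with this model in AUTOMATIC1111's Web UI. Place it in the models\Stable-diffusion folder.
+
+To generate images with a v2.x model in the Web UI, a separate .yaml file describing the model specification is required. Use v2-inference.yaml for v2.x base and v2-inference-v.yaml for 768/v. Place it in the same folder and give it the same name as the model, apart from the extension.
+
+![image](https://user-images.githubusercontent.com/52813779/210776915-061d79c3-6582-42c2-8884-8b91d2f07313.png)
+
+The yaml files are available in [Stability AI's SD2.0 repository](https://github.com/Stability-AI/stablediffusion/tree/main/configs/stable-diffusion).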
+
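+# Other major DreamBooth options
+
+See the separate document for details on all options.
+
+## Stop training the Text Encoder partway through --stop_text_encoder_training
+
+If you pass a number to the stop_text_encoder_training option, the Text Encoder is no longer trained after that step and only the U-Net is trained. In some cases this may improve accuracy.
+
+(We suspect the Text Encoder sometimes overfits on its own and that this option can prevent it, but the detailed effects are unknown.)
+
+## Do not pad the tokenizer output --no_token_padding
+
+If you specify the no_token_padding option, the tokenizer output is not padded (the same behavior as the old Diffusers version of DreamBooth).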
+
+<!-- 
+An example that uses both bucketing and augmentation looks like this:
+```
+accelerate launch --num_cpu_threads_per_process 8 train_db.py 
+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory> 
+    --train_data_dir=<training data directory> 
+    --reg_data_dir=<regularization image directory> 
+    --output_dir=<output directory for the trained model>
+    --resolution=768,512 
+    --train_batch_size=20 --learning_rate=5e-6 --max_train_steps=800 
+    --use_8bit_adam --xformers --mixed_precision="bf16" 
+    --save_every_n_epochs=1 --save_state --save_precision="bf16" 
+    --logging_dir=logs 
+    --enable_bucket --min_bucket_reso=384 --max_bucket_reso=1280 
+    --color_aug --flip_aug --gradient_checkpointing --seed 42
+```
+
+
+-->
--- a/docs/train_lllite_README-ja.md
+++ b/docs/train_lllite_README-ja.md
@@ -0,0 +1,218 @@
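+# About ControlNet-LLLite
+
+__This is an extremely experimental implementation and may change significantly in the future.__
+
+## Overview
+ControlNet-LLLite is a lightweight version of [ControlNet](https://github.com/lllyasviel/ControlNet). The name stands for "LoRA Like Lite": it is a lightweight ControlNet whose structure is inspired by LoRA. Currently, only SDXL is supported.
+
+## Sample weight files and inference
+
+Sample weight files are available here: https://huggingface.co/kohya-ss/controlnet-lllite
+
+A custom node for ComfyUI is available: https://github.com/kohya-ss/ControlNet-LLLite-ComfyUI
+
+Sample images are at the end of this page.
+
+## Model structure
+A single LLLite module consists of a conditioning image embedding, which maps the control image (hereafter "conditioning image") to latent space, and a small network with a structure somewhat similar to LoRA. LLLite modules are added to the U-Net's Linear and Conv layers in the same way as LoRA. See the source code for details.
+
+Due to limitations of the inference environment, the modules are currently added only to CrossAttention (attn1 q/k/v and attn2 q).
+
+## Model training
+
+### Preparing the dataset
+Use a DreamBooth-method dataset and store the conditioning images in the directory specified by `conditioning_data_dir`.
+
+(The finetuning-method dataset is not supported.)
+
+Each conditioning image must have the same basename as its training image. Conditioning images are automatically resized to the same size as the training images. Conditioning images do not need caption files.
+
+For example, when captions come from caption files rather than folder names, the configuration file looks like this: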
+
+```toml
+[[datasets.subsets]]
+image_dir = "path/to/image/dir"
+caption_extension = ".txt"
+conditioning_data_dir = "path/to/conditioning/image/dir"
+```
+
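+At the moment, random_crop cannot be used.
+
+For training data, the easiest approach is a synthetic dataset in which images generated by the original model serve as training images and processed versions of them serve as conditioning images (the dataset quality may be an issue). How to synthesize such a dataset is described below.
+
+Note that if the training images are in a different art style from the original model, the module has to learn that style in addition to the control. ControlNet-LLLite has little capacity, so it is not well suited to learning art styles. In such cases, increase the numbers of dimensions described below.
+
+### Training
+Run `sdxl_train_control_net_lllite.py`. `--cond_emb_dim` specifies the dimension of the conditioning image embedding, and `--network_dim` specifies the rank of the LoRA-like module. Other options follow `sdxl_train_network.py`, except that `--network_module` is not needed.
+
+Training uses a large amount of memory, so enable memory-saving options such as caching and gradient checkpointing. Using BFloat16 with the `--full_bf16` option is also effective (an RTX 30 series or later GPU is required). Training has been confirmed to work with 24GB of VRAM.
+
+The sample Canny model uses 32 for the dimension of the conditioning image embedding and 64 for the rank of the LoRA-like module. Adjust these to suit the characteristics of the conditioning images you are targeting.
+
+(Canny, as in the sample, is probably a fairly difficult condition. For conditions such as depth, about half these values may be enough.)
+
+The following is an example .toml configuration.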
+
+```toml
+pretrained_model_name_or_path = "/path/to/model_trained_on.safetensors"
+max_train_epochs = 12
+max_data_loader_n_workers = 4
+persistent_data_loader_workers = true
+seed = 42
+gradient_checkpointing = true
+mixed_precision = "bf16"
+save_precision = "bf16"
+full_bf16 = true
+optimizer_type = "adamw8bit"
+learning_rate = 2e-4
+xformers = true
+output_dir = "/path/to/output/dir"
+output_name = "output_name"
+save_every_n_epochs = 1
+save_model_as = "safetensors"
+vae_batch_size = 4
+cache_latents = true
+cache_latents_to_disk = true
+cache_text_encoder_outputs = true
+cache_text_encoder_outputs_to_disk = true
+network_dim = 64
+cond_emb_dim = 32
+dataset_config = "/path/to/dataset.toml"
+```
+
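+### Inference
+
+To generate images with a script, run `sdxl_gen_img.py`. `--control_net_lllite_models` specifies the LLLite model file. The dimensions are read automatically from the model file.
+
+Specify the conditioning image used for inference with `--guide_image_path`. No preprocessing is performed, so for Canny, for example, specify an image that has already been Canny-processed (white lines on a black background). `--control_net_preps`, `--control_net_weights`, and `--control_net_ratios` are not supported.
+
+## How to synthesize a dataset
+
+### Generating training images
+
+Generate images with the model that will be the base for training, using Web UI, ComfyUI, or similar. The model's default image size (such as 1024x1024) is probably fine. Bucketing can also be used; in that case, generate at suitable resolutions.
+
+The captions and other settings used for generation should probably match the images you want to generate when using ControlNet-LLLite.
+
+Save the generated images in any directory. Specify this directory in the dataset configuration file.
+
+You can also generate them with `sdxl_gen_img.py` in this repository. For example, run it as follows.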
+
+```dos
+python sdxl_gen_img.py --ckpt path/to/model.safetensors --n_iter 1 --scale 10 --steps 36 --outdir path/to/output/dir --xformers --W 1024 --H 1024 --original_width 2048 --original_height 2048 --bf16 --sampler ddim --batch_size 4 --vae_batch_size 2 --images_per_prompt 512 --max_embeddings_multiples 1 --prompt "{portrait|digital art|anime screen cap|detailed illustration} of 1girl, {standing|sitting|walking|running|dancing} on {classroom|street|town|beach|indoors|outdoors}, {looking at viewer|looking away|looking at another}, {in|wearing} {shirt and skirt|school uniform|casual wear} { |, dynamic pose}, (solo), teen age, {0-1$$smile,|blush,|kind smile,|expression less,|happy,|sadness,} {0-1$$upper body,|full body,|cowboy shot,|face focus,} trending on pixiv, {0-2$$depth of fields,|8k wallpaper,|highly detailed,|pov,} {0-1$$summer, |winter, |spring, |autumn, } beautiful face { |, from below|, from above|, from side|, from behind|, from back} --n nsfw, bad face, lowres, low quality, worst quality, low effort, watermark, signature, ugly, poorly drawn"
+```
+
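+These settings are for 24GB of VRAM. Adjust `--batch_size` and `--vae_batch_size` to your VRAM size.
+
+`--prompt` uses wildcards to generate images randomly. Adjust as needed.
+
+### Processing the images
+
+Process the generated images with an external program and save the results in any directory. These become the conditioning images.
+
+For Canny, for example, a script like the following can be used.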
+
+```python
+import glob
+import os
+import random
+import cv2
+import numpy as np
+
+IMAGES_DIR = "path/to/generated/images"
+CANNY_DIR = "path/to/canny/images"
+
+os.makedirs(CANNY_DIR, exist_ok=True)
+img_files = glob.glob(IMAGES_DIR + "/*.png")
+for img_file in img_files:
+    can_file = CANNY_DIR + "/" + os.path.basename(img_file)
+    if os.path.exists(can_file):
+        print("Skip: " + img_file)
+        continue
+
+    print(img_file)
+
+    img = cv2.imread(img_file)
+
+    # random threshold
+    # while True:
+    #     threshold1 = random.randint(0, 127)
+    #     threshold2 = random.randint(128, 255)
+    #     if threshold2 - threshold1 > 80:
+    #         break
+
+    # fixed threshold
+    threshold1 = 100
+    threshold2 = 200
+
+    img = cv2.Canny(img, threshold1, threshold2)
+
+    cv2.imwrite(can_file, img)
+```
+
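+### Creating caption files
+
+Create a caption file for each image, with the same basename as the training image. Reusing the prompt from generation as-is is probably fine.
+
+If you generated the images with `sdxl_gen_img.py`, the prompt is recorded in each image's metadata, so a script like the following can create the caption files (extension `.txt`) in the same directory as the training images.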
+
+```python
+import glob
+import os
+from PIL import Image
+
+IMAGES_DIR = "path/to/generated/images"
+
+img_files = glob.glob(IMAGES_DIR + "/*.png")
+for img_file in img_files:
+    cap_file = img_file.replace(".png", ".txt")
+    if os.path.exists(cap_file):
+        print(f"Skip: {img_file}")
+        continue
+    print(img_file)
+
+    img = Image.open(img_file)
+    prompt = img.text["prompt"] if "prompt" in img.text else ""
+    if prompt == "":
+        print(f"Prompt not found in {img_file}")
+
+    with open(cap_file, "w") as f:
+        f.write(prompt + "\n")
+```
+
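+### Creating a dataset configuration file
+
+The directory can also be given as a command-line option, but if you create a `.toml` file, specify the directory holding the processed images in `conditioning_data_dir`.
+
+The following is an example configuration file.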
+
+```toml
+[general]
+flip_aug = false
+color_aug = false
+resolution = [1024,1024]
+
+[[datasets]]
+batch_size = 8
+enable_bucket = false
+
+    [[datasets.subsets]]
+    image_dir = "path/to/generated/image/dir"
+    caption_extension = ".txt"
+    conditioning_data_dir = "path/to/canny/image/dir"
+```
+
+## Acknowledgments
+
+Thanks to lllyasviel, the author of ControlNet; to furusu, for implementation advice and help with troubleshooting; and to ddPn08, who implemented the ControlNet dataset.
+
+## Samples
+Canny
+![kohya_ss_girl_standing_at_classroom_smiling_to_the_viewer_class_78976b3e-0d4d-4ea0-b8e3-053ae493abbc](https://github.com/kohya-ss/sd-scripts/assets/52813779/37e9a736-649b-4c0f-ab26-880a1bf319b5)
+
+![im_20230820104253_000_1](https://github.com/kohya-ss/sd-scripts/assets/52813779/c8896900-ab86-4120-932f-6e2ae17b77c0)
+
+![im_20230820104302_000_1](https://github.com/kohya-ss/sd-scripts/assets/52813779/b12457a0-ee3c-450e-ba9a-b712d0fe86bb)
+
+![im_20230820104310_000_1](https://github.com/kohya-ss/sd-scripts/assets/52813779/8845b8d9-804a-44ac-9618-113a28eac8a1)
+
--- a/docs/train_lllite_README.md
+++ b/docs/train_lllite_README.md
@@ -0,0 +1,219 @@
+# About ControlNet-LLLite
+
+__This is an extremely experimental implementation and may change significantly in the future.__
+
+The Japanese version is [here](./train_lllite_README-ja.md).
+
+## Overview
+
+ControlNet-LLLite is a lightweight version of [ControlNet](https://github.com/lllyasviel/ControlNet). The name stands for "LoRA Like Lite": it is a lightweight ControlNet whose structure is inspired by LoRA. Currently, only SDXL is supported.
+
+## Sample weight file and inference
+
+Sample weight files are available here: https://huggingface.co/kohya-ss/controlnet-lllite
+
+A custom node for ComfyUI is available: https://github.com/kohya-ss/ControlNet-LLLite-ComfyUI
+
+Sample images are at the end of this page.
+
+## Model structure
+
+A single LLLite module consists of a conditioning image embedding that maps a conditioning image to a latent space and a small network with a structure similar to LoRA. The LLLite module is added to U-Net's Linear and Conv in the same way as LoRA. Please refer to the source code for details.
+
+Due to limitations of the inference environment, the module is currently added only to CrossAttention (attn1 q/k/v and attn2 q).
+
+## Model training
+
+### Preparing the dataset
+
+In addition to the normal DreamBooth method dataset, please store the conditioning image in the directory specified by `conditioning_data_dir`. The conditioning image must have the same basename as the training image. The conditioning image will be automatically resized to the same size as the training image. The conditioning image does not require a caption file.
+
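+(The finetuning method dataset is not supported.)
+
+For example, when captions come from caption files rather than folder names, the configuration file looks like this: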
+
+```toml
+[[datasets.subsets]]
+image_dir = "path/to/image/dir"
+caption_extension = ".txt"
+conditioning_data_dir = "path/to/conditioning/image/dir"
+```
+
+At the moment, random_crop cannot be used.
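The basename requirement can be checked before training with a short script. The following is a minimal sketch, not part of this repository: `find_missing_conditioning` is a hypothetical helper, the list of image extensions is an assumption, and "same basename" is taken to mean the same file name without its extension.

```python
import os


def find_missing_conditioning(image_dir, conditioning_dir):
    """Return training image names that have no conditioning image with the same basename."""
    image_exts = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}  # assumed set of extensions
    cond_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(conditioning_dir)
        if os.path.splitext(name)[1].lower() in image_exts
    }
    missing = []
    for name in sorted(os.listdir(image_dir)):
        stem, ext = os.path.splitext(name)
        # caption files and other non-image files are ignored
        if ext.lower() in image_exts and stem not in cond_stems:
            missing.append(name)
    return missing
```

An empty list from `find_missing_conditioning("path/to/image/dir", "path/to/conditioning/image/dir")` would mean every training image has a partner.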
+
+For training data, it is easiest to use a synthetic dataset with the original model-generated images as training images and processed images as conditioning images (the quality of the dataset may be problematic). See below for specific methods of synthesizing datasets.
+
+Note that if you use an image with a different art style than the original model as a training image, the model will have to learn not only the control but also the art style. ControlNet-LLLite has a small capacity, so it is not suitable for learning art styles. In such cases, increase the number of dimensions as described below.
+
+### Training
+
+Run `sdxl_train_control_net_lllite.py`. You can specify the dimension of the conditioning image embedding with `--cond_emb_dim`. You can specify the rank of the LoRA-like module with `--network_dim`. Other options are the same as `sdxl_train_network.py`, but `--network_module` is not required.
+
+Since a large amount of memory is used during training, please enable memory-saving options such as cache and gradient checkpointing. It is also effective to use BFloat16 with the `--full_bf16` option (requires RTX 30 series or later GPU). It has been confirmed to work with 24GB VRAM.
+
+For the sample Canny, the dimension of the conditioning image embedding is 32 and the rank of the LoRA-like module is 64. Adjust both according to the features of the conditioning image you are targeting.
+
+(Canny, as in the sample, is probably a fairly difficult condition. For conditions such as depth, about half these values may be enough.)
+
+The following is an example of a .toml configuration.
+
+```toml
+pretrained_model_name_or_path = "/path/to/model_trained_on.safetensors"
+max_train_epochs = 12
+max_data_loader_n_workers = 4
+persistent_data_loader_workers = true
+seed = 42
+gradient_checkpointing = true
+mixed_precision = "bf16"
+save_precision = "bf16"
+full_bf16 = true
+optimizer_type = "adamw8bit"
+learning_rate = 2e-4
+xformers = true
+output_dir = "/path/to/output/dir"
+output_name = "output_name"
+save_every_n_epochs = 1
+save_model_as = "safetensors"
+vae_batch_size = 4
+cache_latents = true
+cache_latents_to_disk = true
+cache_text_encoder_outputs = true
+cache_text_encoder_outputs_to_disk = true
+network_dim = 64
+cond_emb_dim = 32
+dataset_config = "/path/to/dataset.toml"
+```
+
+### Inference
+
+If you want to generate images with a script, run `sdxl_gen_img.py`. You can specify the LLLite model file with `--control_net_lllite_models`. The dimension is automatically obtained from the model file.
+
+Specify the conditioning image to be used for inference with `--guide_image_path`. Since preprocess is not performed, if it is Canny, specify an image processed with Canny (white line on black background). `--control_net_preps`, `--control_net_weights`, and `--control_net_ratios` are not supported.
+
+## How to synthesize a dataset
+
+### Generating training images
+
+Generate images with the base model for training, using Web UI, ComfyUI, or similar. The model's default image size (1024x1024, etc.) is probably fine. You can also use bucketing; in that case, generate at suitable resolutions.
+
+The captions and other settings used for generation should match the images you want to generate with the trained ControlNet-LLLite model.
+
+Save the generated images in an arbitrary directory. Specify this directory in the dataset configuration file.
+
+
+You can also generate them with `sdxl_gen_img.py` in this repository. For example, run as follows:
+
+```dos
+python sdxl_gen_img.py --ckpt path/to/model.safetensors --n_iter 1 --scale 10 --steps 36 --outdir path/to/output/dir --xformers --W 1024 --H 1024 --original_width 2048 --original_height 2048 --bf16 --sampler ddim --batch_size 4 --vae_batch_size 2 --images_per_prompt 512 --max_embeddings_multiples 1 --prompt "{portrait|digital art|anime screen cap|detailed illustration} of 1girl, {standing|sitting|walking|running|dancing} on {classroom|street|town|beach|indoors|outdoors}, {looking at viewer|looking away|looking at another}, {in|wearing} {shirt and skirt|school uniform|casual wear} { |, dynamic pose}, (solo), teen age, {0-1$$smile,|blush,|kind smile,|expression less,|happy,|sadness,} {0-1$$upper body,|full body,|cowboy shot,|face focus,} trending on pixiv, {0-2$$depth of fields,|8k wallpaper,|highly detailed,|pov,} {0-1$$summer, |winter, |spring, |autumn, } beautiful face { |, from below|, from above|, from side|, from behind|, from back} --n nsfw, bad face, lowres, low quality, worst quality, low effort, watermark, signature, ugly, poorly drawn"
+```
+
+This is a setting for VRAM 24GB. Adjust `--batch_size` and `--vae_batch_size` according to the VRAM size.
+
+The images are generated randomly using wildcards in `--prompt`. Adjust as necessary.
+
+### Processing images
+
+Use an external program to process the generated images. Save the processed images in an arbitrary directory. These will be the conditioning images.
+
+For example, you can use the following script to process the images with Canny.
+
+```python
+import glob
+import os
+import random
+import cv2
+import numpy as np
+
+IMAGES_DIR = "path/to/generated/images"
+CANNY_DIR = "path/to/canny/images"
+
+os.makedirs(CANNY_DIR, exist_ok=True)
+img_files = glob.glob(IMAGES_DIR + "/*.png")
+for img_file in img_files:
+    can_file = CANNY_DIR + "/" + os.path.basename(img_file)
+    if os.path.exists(can_file):
+        print("Skip: " + img_file)
+        continue
+
+    print(img_file)
+
+    img = cv2.imread(img_file)
+
+    # random threshold
+    # while True:
+    #     threshold1 = random.randint(0, 127)
+    #     threshold2 = random.randint(128, 255)
+    #     if threshold2 - threshold1 > 80:
+    #         break
+
+    # fixed threshold
+    threshold1 = 100
+    threshold2 = 200
+
+    img = cv2.Canny(img, threshold1, threshold2)
+
+    cv2.imwrite(can_file, img)
+```
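The commented-out "random threshold" branch in the script above can be pulled out into a small function. This is a sketch of that loop only; `random_canny_thresholds` and its `min_gap` parameter are hypothetical names, with the gap of 80 taken from the commented-out condition.

```python
import random


def random_canny_thresholds(rng=random, min_gap=80):
    """Draw (threshold1, threshold2) the way the commented-out loop does."""
    while True:
        threshold1 = rng.randint(0, 127)
        threshold2 = rng.randint(128, 255)
        # retry until the two thresholds are far enough apart
        if threshold2 - threshold1 > min_gap:
            return threshold1, threshold2
```

To vary edge density across the dataset, the fixed values could be replaced with `threshold1, threshold2 = random_canny_thresholds()` inside the loop. Passing a seeded `random.Random` makes the draw reproducible.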
+
+### Creating caption files
+
+Create a caption file for each image with the same basename as the training image. It is fine to use the same caption as the one used when generating the image. 
+
+If you generated the images with `sdxl_gen_img.py`, you can use the following script to create the caption files (`*.txt`) from the metadata in the generated images.
+
+```python
+import glob
+import os
+from PIL import Image
+
+IMAGES_DIR = "path/to/generated/images"
+
+img_files = glob.glob(IMAGES_DIR + "/*.png")
+for img_file in img_files:
+    cap_file = img_file.replace(".png", ".txt")
+    if os.path.exists(cap_file):
+        print(f"Skip: {img_file}")
+        continue
+    print(img_file)
+
+    img = Image.open(img_file)
+    prompt = img.text["prompt"] if "prompt" in img.text else ""
+    if prompt == "":
+        print(f"Prompt not found in {img_file}")
+
+    with open(cap_file, "w") as f:
+        f.write(prompt + "\n")
+```
+
+### Creating a dataset configuration file
+
+You can use the command line argument `--conditioning_data_dir` of `sdxl_train_control_net_lllite.py` to specify the conditioning image directory. However, if you want to use a `.toml` file, specify the conditioning image directory in `conditioning_data_dir`.
+
+```toml
+[general]
+flip_aug = false
+color_aug = false
+resolution = [1024,1024]
+
+[[datasets]]
+batch_size = 8
+enable_bucket = false
+
+    [[datasets.subsets]]
+    image_dir = "path/to/generated/image/dir"
+    caption_extension = ".txt"
+    conditioning_data_dir = "path/to/canny/image/dir"
+```
+
+## Credit
+
+I would like to thank lllyasviel, the author of ControlNet; furusu, who provided advice on implementation and helped me solve problems; and ddPn08, who implemented the ControlNet dataset.
+
+## Sample
+
+Canny
+![kohya_ss_girl_standing_at_classroom_smiling_to_the_viewer_class_78976b3e-0d4d-4ea0-b8e3-053ae493abbc](https://github.com/kohya-ss/sd-scripts/assets/52813779/37e9a736-649b-4c0f-ab26-880a1bf319b5)
+
+![im_20230820104253_000_1](https://github.com/kohya-ss/sd-scripts/assets/52813779/c8896900-ab86-4120-932f-6e2ae17b77c0)
+
+![im_20230820104302_000_1](https://github.com/kohya-ss/sd-scripts/assets/52813779/b12457a0-ee3c-450e-ba9a-b712d0fe86bb)
+
+![im_20230820104310_000_1](https://github.com/kohya-ss/sd-scripts/assets/52813779/8845b8d9-804a-44ac-9618-113a28eac8a1)
--- a/docs/train_network_README-ja.md
+++ b/docs/train_network_README-ja.md
@@ -0,0 +1,491 @@
+# About LoRA training
+
+This applies [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) (arxiv) and [LoRA](https://github.com/microsoft/LoRA) (github) to Stable Diffusion.
+
+The implementation draws heavily on [cloneofsimo's repository](https://github.com/cloneofsimo/lora). Thank you.
+
+Ordinary LoRA is applied only to Linear layers and to Conv2d layers with a 1x1 kernel, but it can also be extended to Conv2d layers with a 3x3 kernel.
+
+The extension to Conv2d 3x3 was first released by [cloneofsimo](https://github.com/cloneofsimo/lora), and its effectiveness was demonstrated by KohakuBlueleaf in [LoCon](https://github.com/KohakuBlueleaf/LoCon). Many thanks to KohakuBlueleaf.
+
+It appears to run, just barely, even with 8GB of VRAM.
+
+Please also see the [common document on training](./train_README-ja.md).
+
+# Types of LoRA that can be trained
+
+The following two types are supported. The names are specific to this repository.
+
+1. __LoRA-LierLa__ : (LoRA for __Li__ n __e__ a __r__  __La__ yers, pronounced "ri-e-ra")
+
+    LoRA applied to Linear layers and to Conv2d layers with a 1x1 kernel
+
+2. __LoRA-C3Lier__ : (LoRA for __C__ onvolutional layers with __3__ x3 Kernel and  __Li__ n __e__ a __r__ layers, pronounced "se-ri-a")
+
+    In addition to 1., LoRA applied to Conv2d layers with a 3x3 kernel
+
+Compared with LoRA-LierLa, LoRA-C3Lier is applied to more layers, so higher accuracy may be possible.
+
+__DyLoRA__ can also be used during training (described later).
+
+## Notes on trained models
+
+LoRA-LierLa can be used with the LoRA feature of AUTOMATIC1111's Web UI.
+
+To generate with LoRA-C3Lier in the Web UI, use this [extension for the Web UI](https://github.com/kohya-ss/sd-webui-additional-networks).
+
+In either case, the trained LoRA model can also be merged into a Stable Diffusion model in advance with a script in this repository.
+
+At present these models are not compatible with cloneofsimo's repository or with d8ahazard's [Dreambooth Extension for Stable-Diffusion-WebUI](https://github.com/d8ahazard/sd_dreambooth_extension), because several extensions have been made (described later).
+
+# Training procedure
+
+First, see this repository's README and set up the environment.
+
+## Preparing the data
+
+See [Preparing the training data](./train_README-ja.md).
+
+
+## Running training
+
+Use `train_network.py`.
+
+In `train_network.py`, specify the name of the module to train with the `--network_module` option. The module for LoRA is `networks.lora`, so specify that.
+
+A learning rate higher than for ordinary DreamBooth or fine tuning, around `1e-4` to `1e-3`, seems to work well.
+
+The following is an example command line.
+
+```
+accelerate launch --num_cpu_threads_per_process 1 train_network.py 
+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory> 
+    --dataset_config=<.toml file created during data preparation> 
+    --output_dir=<output folder for the trained model>  
+    --output_name=<file name for the trained model> 
+    --save_model_as=safetensors 
+    --prior_loss_weight=1.0 
+    --max_train_steps=400 
+    --learning_rate=1e-4 
+    --optimizer_type="AdamW8bit" 
+    --xformers 
+    --mixed_precision="fp16" 
+    --cache_latents 
+    --gradient_checkpointing
+    --save_every_n_epochs=1 
+    --network_module=networks.lora
+```
+
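+This command line trains LoRA-LierLa.
+
+The LoRA model is saved in the folder specified by the `--output_dir` option. For other options, optimizers, and so on, see also "Commonly used options" in the [common document on training](./train_README-ja.md).
+
+The following options can also be specified.
+
+* `--network_dim`
+  * Specifies the rank of the LoRA (for example ``--network_dim=4``). The default is 4. A larger value increases expressive power but also the memory and time needed for training. Increasing it blindly does not seem to help.
+* `--network_alpha`
+  * Specifies the ``alpha`` value used to prevent underflow and train stably. The default is 1. Specifying the same value as ``network_dim`` gives the same behavior as earlier versions.
+* `--persistent_data_loader_workers`
+  * On Windows, specifying this greatly shortens the wait between epochs.
+* `--max_data_loader_n_workers`
+  * Specifies the number of data loading processes. More processes load data faster and use the GPU more efficiently, but consume main memory. The default is the smaller of `8` and `number of concurrent CPU threads - 1`, so if main memory is tight, or if GPU utilization is already around 90% or higher, lower it to around `2` or `1` while watching those figures.
+* `--network_weights`
+  * Loads the weights of a trained LoRA before training and continues training from them.
+* `--network_train_unet_only`
+  * Enables only the LoRA modules related to the U-Net. This may be useful for fine-tuning-style training.
+* `--network_train_text_encoder_only`
+  * Enables only the LoRA modules related to the Text Encoder. An effect similar to Textual Inversion may be expected.
+* `--unet_lr`
+  * Specify this to use a learning rate for the U-Net LoRA modules that differs from the normal learning rate (set with the --learning_rate option).
+* `--text_encoder_lr`
+  * Specify this to use a learning rate for the Text Encoder LoRA modules that differs from the normal learning rate (set with the --learning_rate option). Some say a slightly lower learning rate (such as 5e-5) is better for the Text Encoder.
+* `--network_args`
+  * Accepts multiple arguments. Described later.
+* `--alpha_mask`
+  * Uses the image's alpha channel as a mask. Use this when training on transparent images. [PR #1223](https://github.com/kohya-ss/sd-scripts/pull/1223)
+
+If neither `--network_train_unet_only` nor `--network_train_text_encoder_only` is specified (the default), the LoRA modules of both the Text Encoder and the U-Net are enabled.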
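How `--network_dim` and `--network_alpha` interact can be illustrated with a small pure-Python sketch. It assumes the common LoRA convention that the low-rank update is multiplied by `alpha / dim`; this document does not spell out the formula, so treat the scale as an assumption. The function names are hypothetical.

```python
def lora_scale(network_dim, network_alpha=1.0):
    """Multiplier applied to the LoRA update, assuming scale = alpha / dim."""
    if network_dim <= 0:
        raise ValueError("network_dim must be positive")
    return network_alpha / network_dim


def lora_delta(up, down, network_alpha=1.0):
    """Scaled low-rank update (alpha / dim) * up @ down for list-of-list matrices.

    up is (out_features x dim), down is (dim x in_features).
    """
    dim = len(down)
    scale = lora_scale(dim, network_alpha)
    return [
        [scale * sum(up_row[k] * down[k][j] for k in range(dim)) for j in range(len(down[0]))]
        for up_row in up
    ]
```

Under this assumption, setting `network_alpha` equal to `network_dim` gives a scale of 1, which is consistent with the note that it reproduces the behavior of earlier versions. The default alpha of 1 with dim 4 gives a scale of 0.25.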
+
+# Other training methods
+
+## Training LoRA-C3Lier
+
+Specify `--network_args` as follows. Use `conv_dim` for the rank of Conv2d (3x3) and `conv_alpha` for its alpha.
+
+```
+--network_args "conv_dim=4" "conv_alpha=1"
+```
+
+If alpha is omitted, as below, it defaults to 1.
+
+```
+--network_args "conv_dim=4"
+```
+
+## DyLoRA
+
+DyLoRA was proposed in this paper: [DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation](https://arxiv.org/abs/2210.07558). The official implementation is [here](https://github.com/huawei-noah/KD-NLP/tree/main/DyLoRA).
+
+According to the paper, a higher LoRA rank is not always better, and a suitable rank has to be found for the target model, dataset, task, and so on. DyLoRA trains LoRAs at many ranks up to the specified dim (rank) at the same time, which removes the need to train separately at each rank to find the best one.
+
+The implementation in this repository is based on the official implementation with its own extensions (so it may contain bugs).
+
+### Features of DyLoRA in this repository
+
+A trained DyLoRA model file is compatible with LoRA. LoRAs of multiple dims at or below the specified dim (rank) can also be extracted from the model file.
+
+Both DyLoRA-LierLa and DyLoRA-C3Lier can be trained.
+
+### Training with DyLoRA
+
+Specify `networks.dylora`, the module for DyLoRA, as in `--network_module=networks.dylora`.
+
+Also specify `unit` in `--network_args`, for example `--network_args "unit=4"`. `unit` is the step by which the rank is divided. A full example is `--network_dim=16 --network_args "unit=4"`. `unit` must divide `network_dim` evenly (`network_dim` must be a multiple of `unit`).
+
+If `unit` is not specified, it is treated as `unit=1`.
+
+Examples:
+
+```
+--network_module=networks.dylora --network_dim=16 --network_args "unit=4"
+
+--network_module=networks.dylora --network_dim=32 --network_alpha=16 --network_args "unit=4"
+```
+
+For DyLoRA-C3Lier, specify `conv_dim` in `--network_args`, for example `"conv_dim=4"`. Unlike ordinary LoRA, `conv_dim` must have the same value as `network_dim`. Examples:
+
+```
+--network_module=networks.dylora --network_dim=16 --network_args "conv_dim=16" "unit=4"
+
+--network_module=networks.dylora --network_dim=32 --network_alpha=16 --network_args "conv_dim=32" "conv_alpha=16" "unit=8"
+```
+
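+For example, training with dim=16 and unit=4 lets you train and extract LoRAs at four ranks: 4, 8, 12, and 16. You can then generate images with each extracted model and compare them to choose the LoRA with the best rank.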
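The relation between `network_dim`, `unit`, and the extractable ranks can be written down in a few lines. This is an illustrative sketch only; `dylora_ranks` is a hypothetical helper, not a function of this repository.

```python
def dylora_ranks(network_dim, unit=1):
    """Ranks that can be extracted from a DyLoRA trained with the given dim and unit."""
    if network_dim <= 0 or unit <= 0 or network_dim % unit != 0:
        raise ValueError("network_dim must be a positive multiple of unit")
    # every multiple of unit up to and including network_dim
    return list(range(unit, network_dim + 1, unit))
```

With the values from the text, `dylora_ranks(16, 4)` gives the ranks 4, 8, 12, and 16. A `unit` that does not divide `network_dim` is rejected.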
+
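+The other options are the same as for ordinary LoRA.
+
+Note: `unit` is an extension specific to this repository. DyLoRA is expected to take longer to train than ordinary LoRA of the same dim (rank), so the division step was made larger.
+
+### Extracting LoRA models from a DyLoRA model
+
+Use `extract_lora_from_dylora.py` in the `networks` folder. It extracts LoRA models from a DyLoRA model at the specified `unit` step.
+
+The command line looks like this, for example.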
+
+```powershell
+python networks\extract_lora_from_dylora.py --model "foldername/dylora-model.safetensors" --save_to "foldername/dylora-model-split.safetensors" --unit 4
+```
+
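+`--model` specifies the DyLoRA model file. `--save_to` specifies the file name for the extracted models (the rank is appended to the file name). `--unit` specifies the `unit` used when training the DyLoRA.
+
+## Block-wise learning rates
+
+See [PR #355](https://github.com/kohya-ss/sd-scripts/pull/355) for details.
+
+You can specify weights for the 25 blocks of the full model. No LoRA corresponds to the first block, but 25 values are used for compatibility with block-wise LoRA application and similar features. Some blocks also have no LoRA when it is not extended to conv2d 3x3, but always specify 25 values to keep the notation uniform.
+
+For SDXL, specify 9 values each for down and up, and 3 for middle.
+
+Specify the following arguments in `--network_args`.
+
+- `down_lr_weight` : Specifies the learning rate weights for the U-Net down blocks. The following forms are accepted.
+  - Per-block weights : Specify 12 numbers (9 for SDXL), as in `"down_lr_weight=0,0,0,0,0,0,1,1,1,1,1,1"`.
+  - From a preset : Specify as in `"down_lr_weight=sine"` (the weights follow a sine curve). sine, cosine, linear, reverse_linear, and zeros are available. Appending `+number`, as in `"down_lr_weight=cosine+.25"`, adds that number to every weight (giving 0.25 to 1.25).
+- `mid_lr_weight` : Specifies the learning rate weight for the U-Net mid block. Specify a single number, as in `"mid_lr_weight=0.5"` (3 numbers for SDXL).
+- `up_lr_weight` : Specifies the learning rate weights for the U-Net up blocks. Same as down_lr_weight.
+- Any part left unspecified is treated as 1.0. If a weight is 0, the LoRA module for that block is not created.
+- `block_lr_zero_threshold` : If a weight is at or below this value, the LoRA module is not created. The default is 0.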
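The two accepted forms of an lr-weight value (an explicit list, or a preset with an optional `+offset`) can be sketched as a small parser. This is not the repository's implementation: `parse_lr_weight` is a hypothetical name, and the exact preset curves are an assumption, chosen only so that each preset spans 0 to 1 before the offset is added.

```python
import math


def parse_lr_weight(spec, num_blocks):
    """Expand an lr-weight value into a list of num_blocks floats.

    Accepts comma-separated numbers, or "preset" / "preset+offset".
    The preset curves below are assumptions that span 0..1.
    """
    if "," in spec or spec.replace(".", "", 1).isdigit():
        weights = [float(v) for v in spec.split(",")]
        if len(weights) != num_blocks:
            raise ValueError(f"expected {num_blocks} values, got {len(weights)}")
        return weights

    name, _, offset_text = spec.partition("+")
    offset = float(offset_text) if offset_text else 0.0
    # position of each block from 0.0 (first) to 1.0 (last)
    t = [i / (num_blocks - 1) if num_blocks > 1 else 0.0 for i in range(num_blocks)]
    curves = {
        "sine": [math.sin(math.pi / 2 * x) for x in t],
        "cosine": [math.cos(math.pi / 2 * x) for x in t],
        "linear": t,
        "reverse_linear": [1.0 - x for x in t],
        "zeros": [0.0 for _ in t],
    }
    if name not in curves:
        raise ValueError(f"unknown preset: {name}")
    return [w + offset for w in curves[name]]
```

For instance, `parse_lr_weight("cosine+.25", 12)` yields twelve weights between 0.25 and 1.25, matching the range stated for that example.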
+
+### Block-wise learning rates, command-line example:
+
+```powershell
+--network_args "down_lr_weight=0.5,0.5,0.5,0.5,1.0,1.0,1.0,1.0,1.5,1.5,1.5,1.5" "mid_lr_weight=2.0" "up_lr_weight=1.5,1.5,1.5,1.5,1.0,1.0,1.0,1.0,0.5,0.5,0.5,0.5"
+
+--network_args "block_lr_zero_threshold=0.1" "down_lr_weight=sine+.5" "mid_lr_weight=1.5" "up_lr_weight=cosine+.5"
+```
+
+### Block-wise learning rates, toml file example:
+
+```toml
+network_args = [ "down_lr_weight=0.5,0.5,0.5,0.5,1.0,1.0,1.0,1.0,1.5,1.5,1.5,1.5", "mid_lr_weight=2.0", "up_lr_weight=1.5,1.5,1.5,1.5,1.0,1.0,1.0,1.0,0.5,0.5,0.5,0.5",]
+
+network_args = [ "block_lr_zero_threshold=0.1", "down_lr_weight=sine+.5", "mid_lr_weight=1.5", "up_lr_weight=cosine+.5", ]
+```
+
+## Block-wise dim (rank)
+
+You can specify the dim (rank) for each of the 25 blocks of the full model. As with block-wise learning rates, some blocks may have no LoRA, but always specify 25 values.
+
+For SDXL, specify 23 values. Some blocks have no LoRA, but this keeps compatibility with the [block-wise learning rates](./train_SDXL-en.md) of `sdxl_train.py`.
+The mapping is `0: time/label embed, 1-9: input blocks 0-8, 10-12: mid blocks 0-2, 13-21: output blocks 0-8, 22: out`.
+
+Specify the following arguments in `--network_args`.
+
+- `block_dims` : Specifies the dim (rank) of each block. Specify 25 numbers, as in `"block_dims=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2"`.
+- `block_alphas` : Specifies the alpha of each block. Specify 25 numbers, as with block_dims. If omitted, the value of network_alpha is used.
+- `conv_block_dims` : Extends LoRA to Conv2d 3x3 and specifies the dim (rank) of each block.
+- `conv_block_alphas` : Specifies the alpha of each block when LoRA is extended to Conv2d 3x3. If omitted, the value of conv_alpha is used.
+
+### Block-wise dim (rank): command-line examples
+
+```powershell
+--network_args "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2"
+
+--network_args "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2" "conv_block_dims=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2"
+
+--network_args "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2" "block_alphas=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2"
+```
+
+### Block-wise dim (rank): toml file examples
+
+```toml
+network_args = [ "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2",]
+  
+network_args = [ "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2", "block_alphas=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2",]
+```
+
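+# Other scripts
+
+These are LoRA-related scripts for merging and similar tasks.
+
+## About the merge script
+
+`merge_lora.py` can merge LoRA training results into a Stable Diffusion model, or merge multiple LoRA models together.
+
+For SDXL, `sdxl_merge_lora.py` is provided. Its options are the same, so read `merge_lora.py` below as `sdxl_merge_lora.py`.
+
+### Merging a LoRA model into a Stable Diffusion model
+
+The merged model can be handled like an ordinary Stable Diffusion ckpt. For example, the command line looks like this.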
+
+```
+python networks\merge_lora.py --sd_model ..\model\model.ckpt 
+    --save_to ..\lora_train1\model-char1-merged.safetensors 
+    --models ..\lora_train1\last.safetensors --ratios 0.8
+```
+
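+If you trained on a Stable Diffusion v2.x model and are merging into it, specify the `--v2` option.
+
+Specify the Stable Diffusion model file to merge into with the `--sd_model` option (only .ckpt and .safetensors are supported; Diffusers is not supported for now).
+
+Specify where to save the merged model with the `--save_to` option (.ckpt or .safetensors, detected automatically from the extension).
+
+Specify the trained LoRA model files with `--models`. Multiple files can be given, in which case they are merged in order.
+
+Specify the application ratio of each model (how strongly its weights are reflected in the base model) with `--ratios`, as a number from 0 to 1.0. For example, if the LoRA is close to overfitting, lowering the ratio may improve the result. Specify as many ratios as there are models.
+
+With multiple models, the command looks like this.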
+
+```
+python networks\merge_lora.py --sd_model ..\model\model.ckpt 
+    --save_to ..\lora_train1\model-char1-merged.safetensors 
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors --ratios 0.8 0.5
+```
+
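+### Merging multiple LoRA models
+
+With the `--concat` option, multiple LoRAs are simply concatenated into a new LoRA model. The file size (and the dim/rank) becomes the sum of the specified LoRAs (to change the dim (rank) while merging, use `svd_merge_lora.py`).
+
+For example, the command line looks like this.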
+
+```
+python networks\merge_lora.py --save_precision bf16 
+    --save_to ..\lora_train1\model-char1-style1-merged.safetensors 
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors 
+    --ratios 1.0 -1.0 --concat --shuffle
+```
+
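+Specify the `--concat` option.
+
+Also add the `--shuffle` option to shuffle the weights. Without shuffling, the original LoRAs can be recovered from the merged LoRA, which would expose the source training data in cases such as copy-machine training. Please be careful.
+
+Specify where to save the merged LoRA model with the `--save_to` option (.ckpt or .safetensors, detected automatically from the extension).
+
+Specify the trained LoRA model files with `--models`. Three or more can be given.
+
+Specify the ratio of each model (how strongly its weights are reflected in the base model) with `--ratios`, as a number from 0 to 1.0. To merge two models one-to-one, use "0.5 0.5". With "1.0 1.0" the total weight becomes too large, and the result is likely to be undesirable.
+
+A LoRA trained on v1 cannot be merged with one trained on v2, nor can LoRAs of different rank (number of dimensions). A U-Net-only LoRA and a U-Net + Text Encoder LoRA should be mergeable, but the result is unknown.
+
+### Other options
+
+* precision
+  * The precision for the merge computation, chosen from float, fp16, and bf16. If omitted, float is used to preserve accuracy. Specify fp16/bf16 to reduce memory usage.
+* save_precision
+  * The precision for saving the model, chosen from float, fp16, and bf16. If omitted, it is the same as precision.
+
+There are several other options; check them with `--help`.
+
+## Merging multiple LoRA models of different rank
+
+This approximates multiple LoRAs with a single LoRA (exact reproduction is not possible). Use `svd_merge_lora.py`. For example, the command line looks like this.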
+
+```
+python networks\svd_merge_lora.py 
+    --save_to ..\lora_train1\model-char1-style1-merged.safetensors 
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors 
+    --ratios 0.6 0.4 --new_rank 32 --device cuda
+```
+
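+The main options are the same as for `merge_lora.py`. The following options are added.
+
+- `--new_rank`
+  - Specifies the rank of the LoRA to create.
+- `--new_conv_rank`
+  - Specifies the rank of the Conv2d 3x3 LoRA to create. If omitted, it is the same as `new_rank`.
+- `--device`
+  - With `--device cuda`, the computation runs on the GPU, which speeds up processing.
+
+## Generating with the image generation script in this repository
+
+Add the `--network_module` and `--network_weights` options to gen_img_diffusers.py. Their meaning is the same as in training.
+
+With the `--network_mul` option, you can change the LoRA application ratio by specifying a number from 0 to 1.0.
+
+## Generating with a Diffusers pipeline
+
+Refer to the following example. The only file required is networks/lora.py. It may not work with Diffusers versions other than 0.10.2.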
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+from networks.lora import LoRAModule, create_network_from_weights
+from safetensors.torch import load_file
+
+# if the ckpt is CompVis based, convert it to Diffusers beforehand with tools/convert_diffusers20_original_sd.py. See --help for more details.
+
+model_id_or_dir = r"model_id_on_hugging_face_or_dir"
+device = "cuda"
+
+# create pipe
+print(f"creating pipe from {model_id_or_dir}...")
+pipe = StableDiffusionPipeline.from_pretrained(model_id_or_dir, revision="fp16", torch_dtype=torch.float16)
+pipe = pipe.to(device)
+vae = pipe.vae
+text_encoder = pipe.text_encoder
+unet = pipe.unet
+
+# load lora networks
+print(f"loading lora networks...")
+
+lora_path1 = r"lora1.safetensors"
+sd = load_file(lora_path1)   # If the file is .ckpt, use torch.load instead.
+network1, sd = create_network_from_weights(0.5, None, vae, text_encoder,unet, sd)
+network1.apply_to(text_encoder, unet)
+network1.load_state_dict(sd)
+network1.to(device, dtype=torch.float16)
+
+# # You can merge weights instead of apply_to+load_state_dict. network.set_multiplier does not work
+# network.merge_to(text_encoder, unet, sd)
+
+lora_path2 = r"lora2.safetensors"
+sd = load_file(lora_path2) 
+network2, sd = create_network_from_weights(0.7, None, vae, text_encoder,unet, sd)
+network2.apply_to(text_encoder, unet)
+network2.load_state_dict(sd)
+network2.to(device, dtype=torch.float16)
+
+lora_path3 = r"lora3.safetensors"
+sd = load_file(lora_path3)
+network3, sd = create_network_from_weights(0.5, None, vae, text_encoder,unet, sd)
+network3.apply_to(text_encoder, unet)
+network3.load_state_dict(sd)
+network3.to(device, dtype=torch.float16)
+
+# prompts
+prompt = "masterpiece, best quality, 1girl, in white shirt, looking at viewer"
+negative_prompt = "bad quality, worst quality, bad anatomy, bad hands"
+
+# exec pipe
+print("generating image...")
+with torch.autocast("cuda"):
+    image = pipe(prompt, guidance_scale=7.5, negative_prompt=negative_prompt).images[0]
+
+# if not merged, you can use set_multiplier
+# network1.set_multiplier(0.8)
+# and generate image again...
+
+# save image
+image.save(r"by_diffusers.png")
+```
+
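+## Creating a LoRA model from the difference between two models
+
+This was implemented with reference to [this discussion](https://github.com/cloneofsimo/lora/discussions/56). The formulas are used as-is (I do not fully understand them, but singular value decomposition appears to be used for the approximation).
+
+The difference between two models (for example, the base model and the model after fine tuning) is approximated with LoRA.
+
+### How to run the script
+
+Specify it as follows.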
+```
+python networks\extract_lora_from_models.py --model_org base-model.ckpt
+    --model_tuned fine-tuned-model.ckpt 
+    --save_to lora-weights.safetensors --dim 4
+```
+
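+Specify the original Stable Diffusion model with the `--model_org` option. When you apply the created LoRA model, you apply it to this model. .ckpt or .safetensors can be specified.
+
+Specify the Stable Diffusion model from which to extract the difference with the `--model_tuned` option, for example a model after fine tuning or DreamBooth. .ckpt or .safetensors can be specified.
+
+Specify where to save the LoRA model with `--save_to`. Specify the number of LoRA dimensions with `--dim`.
+
+The generated LoRA model can be used in the same way as a trained LoRA model.
+
+If the Text Encoder is identical in the two models, the resulting LoRA is a U-Net-only LoRA.
+
+### Other options
+
+- `--v2`
+  - Specify this when using a v2.x Stable Diffusion model.
+- `--device`
+  - With ``--device cuda``, the computation runs on the GPU, which speeds up processing (the CPU is not that slow either, so the gain appears to be about two to several times at most).
+- `--save_precision`
+  - Specifies the save format of the LoRA, from "float", "fp16", and "bf16". If omitted, float is used.
+- `--conv_dim`
+  - When specified, extends the scope of LoRA to Conv2d 3x3. Specifies the rank for Conv2d 3x3.
+
+## Image resize script
+
+(The documentation will be reorganized later; for now the explanation is written here.)
+
+With the extension to Aspect Ratio Bucketing, small images can now be used as training data as-is, without upscaling. A preprocessing script was contributed together with a report that adding downscaled copies of the original training images to the training data improved accuracy, so it has been cleaned up and added. Thanks to bmaltais.
+
+### How to run the script
+
+Specify it as follows. Both the original image and the resized images are saved to the destination folder. The resized images have the target resolution appended to the file name, such as ``+512x512`` (this differs from the actual image size). Images smaller than the target resolution are not upscaled.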
+
+```
+python tools\resize_images_to_resolution.py --max_resolution 512x512,384x384,256x256 --save_as_png 
+    --copy_associated_files <source image folder> <destination folder>
+```
+
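+Image files in the source folder are resized so that their area matches the specified resolution (multiple resolutions can be given) and saved to the destination folder. Non-image files are copied as-is.
+
+Specify the target size with the ``--max_resolution`` option as in the example. Images are resized so that their area equals that size. If multiple resolutions are given, images are resized to each of them. With ``512x512,384x384,256x256``, the destination folder holds four images per source image: the original size plus three resized sizes.
+
+With the ``--save_as_png`` option, images are saved in png format. If omitted, they are saved in jpeg format (quality=100).
+
+With the ``--copy_associated_files`` option, files with the same name as the image apart from the extension (for example, captions) are copied under the same name as the resized image file.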
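+
+Resizing "to the same area" can be made concrete with a little arithmetic. The sketch below is a minimal illustration, not the script's actual code: the helper name `fit_to_area` and the use of plain rounding are assumptions. It keeps the aspect ratio, scales the image so that width x height matches the area of the requested resolution, and leaves smaller images untouched.
+
+```python
+import math
+
+
+def fit_to_area(width: int, height: int, max_resolution: str) -> tuple[int, int]:
+    """Return the size after shrinking (width, height) to the area of e.g. "512x512".
+
+    Hypothetical helper for illustration; rounding behavior is assumed.
+    """
+    max_w, max_h = (int(v) for v in max_resolution.split("x"))
+    target_area = max_w * max_h
+    if width * height <= target_area:
+        return width, height  # smaller images are never upscaled
+    # Scale both sides by the same factor so the area matches the target.
+    scale = math.sqrt(target_area / (width * height))
+    return round(width * scale), round(height * scale)
+
+
+for resolution in ("512x512", "384x384", "256x256"):
+    print(resolution, fit_to_area(2048, 1024, resolution))
+```
+
+A 2048x1024 source keeps its 2:1 shape at every target, which is why the file-name suffix such as ``+512x512`` names the target resolution rather than the actual image size.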
+
+
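+### Other options
+
+- divisible_by
+  - Crops the center of the image so that the size of the resized image (both height and width) is divisible by this value.
+- interpolation
+  - Specifies the interpolation method for downscaling. Choose from ``area, cubic, lanczos4``; the default is ``area``.
+
+
+# Additional information
+
+## Differences from cloneofsimo's repository
+
+As of 2022/12/25, this repository extends the places where LoRA is applied to the Text Encoder's MLP, the U-Net's FFN, and the Transformer's in/out projections, which increases expressiveness. In exchange, memory usage has increased and now barely fits in 8GB.
+
+The module replacement mechanism is also completely different.
+
+## About future extensions
+
+The mechanism can support extensions other than LoRA, and those are planned to be added as well.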
--- a/docs/train_network_README-zh.md
+++ b/docs/train_network_README-zh.md
@@ -0,0 +1,468 @@
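+# About training LoRA
+
+This applies [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) (arxiv), [LoRA](https://github.com/microsoft/LoRA) (github) to Stable Diffusion.
+
+[cloneofsimo's repository](https://github.com/cloneofsimo/lora) was a great reference. Thank you very much.
+
+LoRA is normally applied only to Linear layers and to Conv2d layers with a 1x1 kernel, but it can also be extended to Conv2d layers with a 3x3 kernel.
+
+The extension to Conv2d 3x3 was first released by [cloneofsimo](https://github.com/cloneofsimo/lora),
+and KohakuBlueleaf demonstrated its effectiveness in [LoCon](https://github.com/KohakuBlueleaf/LoCon). Many thanks to KohakuBlueleaf.
+
+It appears to run, just barely, even with 8GB of VRAM.
+
+Please also see the [common documentation on training](./train_README-zh.md).
+# Types of LoRA that can be trained
+
+The following two types are supported. The names are specific to this repository.
+
+1. __LoRA-LierLa__: (LoRA for __Li__ n __e__ a __r__  __La__ yers, pronounced "Liela")
+
+    LoRA applied to Linear layers and to Conv2d layers with a 1x1 kernel
+
+2. __LoRA-C3Lier__: (LoRA for __C__ onvolutional layers with a __3__ x3 kernel and __Li__ n __e__ a __r__ layers, pronounced "Seria")
+
+    In addition to 1., LoRA applied to Conv2d layers with a 3x3 kernel
+
+Compared with LoRA-LierLa, LoRA-C3Lier may achieve higher accuracy because it is applied to more layers.
+
+__DyLoRA__ (described later) can also be used for training.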
+
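+## Notes on the trained models
+
+LoRA-LierLa can be used with the LoRA feature of AUTOMATIC1111's Web UI.
+
+To generate in the Web UI with LoRA-C3Lier, use this [extension for the Web UI](https://github.com/kohya-ss/sd-webui-additional-networks).
+
+With the scripts in this repository, you can also merge a trained LoRA model into a Stable Diffusion model in advance.
+
+Note that these models are not compatible with cloneofsimo's repository or with d8ahazard's [Dreambooth Extension for Stable-Diffusion-WebUI](https://github.com/d8ahazard/sd_dreambooth_extension), because this repository makes some functional extensions (described later).
+
+# Training procedure
+
+First refer to this repository's README and set up the environment.
+
+## Preparing the data
+
+See [Preparing training data](./train_README-zh.md).
+
+## Running the training
+
+Use `train_network.py`.
+
+In `train_network.py`, specify the name of the module to train with the `--network_module` option. For LoRA this is `networks.lora`, so specify that.
+
+Note that the learning rate should be higher than for ordinary DreamBooth or fine tuning; about `1e-4` to `1e-3` is recommended.
+
+The following is an example command line.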
+
+```
+accelerate launch --num_cpu_threads_per_process 1 train_network.py 
+    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory> 
+    --dataset_config=<.toml file for the dataset configuration> 
+    --output_dir=<output folder for the trained model>  
+    --output_name=<file name for the trained model output> 
+    --save_model_as=safetensors 
+    --prior_loss_weight=1.0 
+    --max_train_steps=400 
+    --learning_rate=1e-4 
+    --optimizer_type="AdamW8bit" 
+    --xformers 
+    --mixed_precision="fp16" 
+    --cache_latents 
+    --gradient_checkpointing
+    --save_every_n_epochs=1 
+    --network_module=networks.lora
+```
+
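+This command line trains LoRA-LierLa.
+
+The LoRA model is saved in the folder specified by the `--output_dir` option. For other options, optimizers, and so on, see "Commonly used options" in the [common documentation on training](./train_README-zh.md).
+
+The following options can also be specified:
+
+* `--network_dim`
+  * Specifies the rank of the LoRA (for example, `--network_dim=4`). The default is 4. A larger value gives more expressive power but requires more memory and time for training. Increasing it blindly is not recommended.
+* `--network_alpha`
+  * Specifies the alpha value used to prevent underflow and stabilize training. The default is 1. Specifying the same value as `network_dim` gives the same behavior as earlier versions.
+* `--persistent_data_loader_workers`
+  * On Windows, specifying this greatly shortens the wait between epochs.
+* `--max_data_loader_n_workers`
+  * Specifies the number of data-loading processes. More processes load data faster and use the GPU more efficiently, but consume main memory. The default is the smaller of `8` and `number of concurrent CPU threads - 1`, so if main memory is tight or GPU utilization is already around 90% or higher, consider lowering this to about `2` or `1`.
+* `--network_weights`
+  * Loads pretrained LoRA weights before training and continues training from them.
+* `--network_train_unet_only`
+  * Enables only the LoRA modules related to the U-Net. This may be useful for fine-tuning-style training.
+* `--network_train_text_encoder_only`
+  * Enables only the LoRA modules related to the Text Encoder. An effect similar to Textual Inversion may be expected.
+* `--unet_lr`
+  * Specify this to use a learning rate for the U-Net LoRA modules that differs from the regular learning rate (set with the `--learning_rate` option).
+* `--text_encoder_lr`
+  * Specify this to use a learning rate for the Text Encoder LoRA modules that differs from the regular learning rate (set with the `--learning_rate` option). A slightly lower learning rate for the Text Encoder (such as 5e-5) may work better.
+* `--network_args`
+  * Multiple arguments can be specified. Details are given below.
+* `--alpha_mask`
+  * Uses the alpha value of the image as a mask. This is used when training on images with transparency. [PR #1223](https://github.com/kohya-ss/sd-scripts/pull/1223)
+
+When neither `--network_train_unet_only` nor `--network_train_text_encoder_only` is specified (the default), the LoRA modules for both the Text Encoder and the U-Net are enabled.
+
+# Other training methods
+
+## Training LoRA-C3Lier
+
+Specify the following.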
+
+```
+--network_args "conv_dim=4"
+```
+
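+## DyLoRA
+
+DyLoRA was proposed in the paper [DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation](https://arxiv.org/abs/2210.07558),
+and [its official implementation is available here](https://github.com/huawei-noah/KD-NLP/tree/main/DyLoRA).
+
+According to the paper, a higher LoRA rank is not always better; a suitable rank has to be found for the model, dataset, task, and so on. With DyLoRA, LoRAs of multiple ranks up to the specified dim (rank) are trained at the same time, which removes the need to search for the best rank.
+
+The implementation in this repository is based on the official implementation with custom extensions (and may therefore contain bugs).
+
+### Features of DyLoRA in this repository
+
+A model file trained with DyLoRA is compatible with LoRA. In addition, multiple LoRAs with a dim (rank) at or below the specified one can be extracted from the model file.
+
+Both DyLoRA-LierLa and DyLoRA-C3Lier can be trained.
+
+### Training with DyLoRA
+
+Specify `networks.dylora`, the module for DyLoRA, as in `--network_module=networks.dylora`.
+
+Also pass `unit` through `--network_args`, for example `--network_args "unit=4"`. `unit` is the unit by which the rank is divided. For example, specify `--network_dim=16 --network_args "unit=4"`. Choose `unit` so that it divides `network_dim` evenly (`network_dim` is a multiple of `unit`).
+
+If `unit` is not specified, it is treated as `unit=1`.
+
+The following are example specifications.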
+
+```
+--network_module=networks.dylora --network_dim=16 --network_args "unit=4"
+
+--network_module=networks.dylora --network_dim=32 --network_alpha=16 --network_args "unit=4"
+```
+
+For DyLoRA-C3Lier, specify `conv_dim` in `--network_args`, as in `"conv_dim=4"`. Unlike ordinary LoRA, `conv_dim` must have the same value as `network_dim`. The following are example specifications:
+
+```
+--network_module=networks.dylora --network_dim=16 --network_args "conv_dim=16" "unit=4"
+
+--network_module=networks.dylora --network_dim=32 --network_alpha=16 --network_args "conv_dim=32" "conv_alpha=16" "unit=8"
+```
+
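+For example, when training with dim=16 and unit=4, LoRAs of four ranks (4, 8, 12, and 16) are trained and can be extracted. You can generate images with each extracted model and compare them to choose the LoRA with the best rank.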
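+
+The relationship between `network_dim`, `unit`, and the ranks that can be extracted is simple enough to sketch in a few lines of Python. This is an illustration of the rule stated in this document, not code from the repository; the function name `extractable_ranks` is hypothetical.
+
+```python
+def extractable_ranks(network_dim: int, unit: int = 1) -> list[int]:
+    """List the LoRA ranks a DyLoRA trained with these settings would contain.
+
+    Hypothetical helper: it only encodes the "multiples of unit up to
+    network_dim" rule described in the text.
+    """
+    if network_dim % unit != 0:
+        raise ValueError("network_dim must be a multiple of unit")
+    return list(range(unit, network_dim + 1, unit))
+
+
+print(extractable_ranks(16, 4))  # [4, 8, 12, 16]
+```
+
+With the default `unit=1`, the same settings would cover every rank from 1 to 16, which is the trade-off behind choosing a larger `unit`.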
+
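+The other options are the same as for ordinary LoRA.
+
+*`unit` is an extension unique to this repository. DyLoRA is expected to take longer to train than an ordinary LoRA of the same dim (rank), so the division unit was made larger.
+
+### Extracting LoRA models from a DyLoRA model
+
+Use `extract_lora_from_dylora.py` in the `networks` folder. It extracts LoRA models from a DyLoRA model at the specified `unit` interval.
+
+For example, the command line looks like this: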
+
+```powershell
+python networks\extract_lora_from_dylora.py --model "foldername/dylora-model.safetensors" --save_to "foldername/dylora-model-split.safetensors" --unit 4
+```
+
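+`--model` specifies the DyLoRA model file. `--save_to` specifies the file name under which to save the extracted models (the rank value is appended to the file name). `--unit` specifies the `unit` used when the DyLoRA was trained.
+
+## Block-wise learning rates
+
+See PR #355 for details.
+
+You can specify learning-rate weights for the 25 blocks of the full model. No LoRA exists for the first block, but 25 values are used for compatibility with block-wise LoRA application and similar features. Some blocks also have no LoRA when LoRA is not extended to Conv2d 3x3, but always specify 25 values to keep the notation uniform.
+
+Specify the following arguments in `--network_args`.
+
+- `down_lr_weight`: Specifies the learning-rate weights for the U-Net down blocks. The following forms are accepted:
+  - Per-block weights: specify 12 numbers, as in `"down_lr_weight=0,0,0,0,0,0,1,1,1,1,1,1"`.
+  - Preset: specify as in `"down_lr_weight=sine"` (the weights follow a sine curve). `sine`, `cosine`, `linear`, `reverse_linear`, and `zeros` are available. Appending `+number` adds that number to every weight (for example, `"down_lr_weight=cosine+.25"` gives 0.25 to 1.25).
+- `mid_lr_weight`: Specifies the learning-rate weight for the U-Net mid block. Specify a single number, as in `"mid_lr_weight=0.5"`.
+- `up_lr_weight`: Specifies the learning-rate weights for the U-Net up blocks. Same format as `down_lr_weight`.
+- Any part you omit is treated as 1.0. If a weight is 0, the LoRA module for that block is not created.
+- `block_lr_zero_threshold`: If a weight is less than or equal to this value, the LoRA module is not created. The default is 0.
+
+### Block-wise learning rate: command-line examples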
+
+
+```powershell
+--network_args "down_lr_weight=0.5,0.5,0.5,0.5,1.0,1.0,1.0,1.0,1.5,1.5,1.5,1.5" "mid_lr_weight=2.0" "up_lr_weight=1.5,1.5,1.5,1.5,1.0,1.0,1.0,1.0,0.5,0.5,0.5,0.5"
+
+--network_args "block_lr_zero_threshold=0.1" "down_lr_weight=sine+.5" "mid_lr_weight=1.5" "up_lr_weight=cosine+.5"
+```
+
+### Block-wise learning rate: toml file examples
+
+```toml
+network_args = [ "down_lr_weight=0.5,0.5,0.5,0.5,1.0,1.0,1.0,1.0,1.5,1.5,1.5,1.5", "mid_lr_weight=2.0", "up_lr_weight=1.5,1.5,1.5,1.5,1.0,1.0,1.0,1.0,0.5,0.5,0.5,0.5",]
+
+network_args = [ "block_lr_zero_threshold=0.1", "down_lr_weight=sine+.5", "mid_lr_weight=1.5", "up_lr_weight=cosine+.5", ]
+```
+
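+## Block-wise dim (rank)
+
+You can specify the dim (rank) for each of the 25 blocks of the full model. As with block-wise learning rates, some blocks may have no LoRA, but always specify 25 values.
+
+Specify the following arguments in `--network_args`:
+
+- `block_dims`: Specifies the dim (rank) of each block. Specify 25 numbers, as in `"block_dims=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2"`.
+- `block_alphas`: Specifies the alpha of each block. Specify 25 numbers, as with `block_dims`. If omitted, the value of `network_alpha` is used.
+- `conv_block_dims`: Extends LoRA to Conv2d 3x3 and specifies the dim (rank) of each block.
+- `conv_block_alphas`: Specifies the alpha of each block when LoRA is extended to Conv2d 3x3. If omitted, the value of `conv_alpha` is used.
+
+### Block-wise dim (rank): command-line examples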
+
+
+```powershell
+--network_args "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2"
+
+--network_args "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2" "conv_block_dims=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2"
+
+--network_args "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2" "block_alphas=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2"
+```
+
+### Block-wise dim (rank): toml file examples
+
+```toml
+network_args = [ "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2",]
+  
+network_args = [ "block_dims=2,4,4,4,8,8,8,8,12,12,12,12,16,12,12,12,12,8,8,8,8,4,4,4,2", "block_alphas=2,2,2,2,4,4,4,4,6,6,6,6,8,6,6,6,6,4,4,4,4,2,2,2,2",]
+```
+
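+# Other scripts
+These are LoRA-related scripts for merging and similar tasks.
+
+## About the merge script
+`merge_lora.py` can merge LoRA training results into a Stable Diffusion model, or merge multiple LoRA models together.
+
+### Merging a LoRA model into a Stable Diffusion model
+The merged model can be used like an ordinary Stable Diffusion ckpt. For example, the command line looks like this: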
+
+```
+python networks\merge_lora.py --sd_model ..\model\model.ckpt 
+    --save_to ..\lora_train1\model-char1-merged.safetensors 
+    --models ..\lora_train1\last.safetensors --ratios 0.8
+```
+
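+If you trained on a Stable Diffusion v2.x model and are merging into it, specify the `--v2` option.
+
+Specify the Stable Diffusion model file to merge into with the `--sd_model` option (only .ckpt and .safetensors are supported; Diffusers is not supported for now).
+
+Specify where to save the merged model with the `--save_to` option (.ckpt or .safetensors, detected automatically from the extension).
+
+Specify the trained LoRA model files with the `--models` option. Multiple files can be given, in which case they are merged in order.
+
+Specify the application ratio of each model (how strongly its weights are reflected in the base model) with the `--ratios` option, as a number from 0 to 1.0. For example, if the LoRA is close to overfitting, lowering the ratio may improve the result. Specify as many ratios as there are models.
+
+With multiple models, the command looks like this: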
+
+
+```
+python networks\merge_lora.py --sd_model ..\model\model.ckpt 
+    --save_to ..\lora_train1\model-char1-merged.safetensors 
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors --ratios 0.8 0.5
+```
+
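+### Merging multiple LoRA models
+
+Applying multiple LoRA models to the SD model one at a time gives slightly different results from merging the LoRA models first and then applying the merged LoRA to the SD model, because the order of computation differs.
+
+For example, the command line looks like this: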
+
+```
+python networks\merge_lora.py 
+    --save_to ..\lora_train1\model-char1-style1-merged.safetensors 
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors --ratios 0.6 0.4
+```
+
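+The `--sd_model` option does not need to be specified.
+
+Specify where to save the merged LoRA model with the `--save_to` option (.ckpt or .safetensors, detected automatically from the extension).
+
+Specify the trained LoRA model files with the `--models` option. Three or more can be given.
+
+Specify the ratio of each model (how strongly its weights are reflected) with the `--ratios` option, as a number from 0 to 1.0. To merge two models one-to-one, use "0.5 0.5". With "1.0 1.0" the total weight becomes too large, and the result is likely to be undesirable.
+
+LoRAs trained on v1 and on v2, and LoRAs with different rank (number of dimensions) or different `alpha`, cannot be merged. A U-Net-only LoRA and a U-Net + Text Encoder LoRA can be merged, but the result is unknown.
+
+### Other options
+
+* precision
+  * The precision for the merge computation, chosen from float, fp16, and bf16. The default is float, to preserve accuracy. Specify fp16/bf16 to reduce memory usage.
+* save_precision
+  * The precision for saving the model, chosen from float, fp16, and bf16. The default is the same as precision.
+
+## Merging multiple LoRA models of different rank
+
+This approximates multiple LoRAs with a single LoRA (exact reproduction is not possible). Use `svd_merge_lora.py`. For example, the command line looks like this.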
+```
+python networks\svd_merge_lora.py 
+    --save_to ..\lora_train1\model-char1-style1-merged.safetensors 
+    --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors 
+    --ratios 0.6 0.4 --new_rank 32 --device cuda
+```
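+The main options are the same as for `merge_lora.py`. The following options are added:
+
+- `--new_rank`
+  - Specifies the rank of the LoRA to create.
+- `--new_conv_rank`
+  - Specifies the rank of the Conv2d 3x3 LoRA to create. If omitted, it is the same as `new_rank`.
+- `--device`
+  - With `--device cuda`, the computation runs on the GPU, which speeds up processing.
+
+## Generating with the image generation script in this repository
+
+Add the `--network_module` and `--network_weights` options to `gen_img_diffusers.py`. Their meaning is the same as in training.
+
+With the `--network_mul` option, you can change the LoRA application ratio by specifying a number from 0 to 1.0.
+
+## Generating with a Diffusers pipeline
+
+Refer to the following example. The only file required is networks/lora.py. Note that the example may not work with Diffusers versions other than 0.10.2.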
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+from networks.lora import LoRAModule, create_network_from_weights
+from safetensors.torch import load_file
+
+# if the ckpt is CompVis based, convert it to Diffusers beforehand with tools/convert_diffusers20_original_sd.py. See --help for more details.
+
+model_id_or_dir = r"model_id_on_hugging_face_or_dir"
+device = "cuda"
+
+# create pipe
+print(f"creating pipe from {model_id_or_dir}...")
+pipe = StableDiffusionPipeline.from_pretrained(model_id_or_dir, revision="fp16", torch_dtype=torch.float16)
+pipe = pipe.to(device)
+vae = pipe.vae
+text_encoder = pipe.text_encoder
+unet = pipe.unet
+
+# load lora networks
+print(f"loading lora networks...")
+
+lora_path1 = r"lora1.safetensors"
+sd = load_file(lora_path1)   # If the file is .ckpt, use torch.load instead.
+network1, sd = create_network_from_weights(0.5, None, vae, text_encoder,unet, sd)
+network1.apply_to(text_encoder, unet)
+network1.load_state_dict(sd)
+network1.to(device, dtype=torch.float16)
+
+# # You can merge weights instead of apply_to+load_state_dict. network.set_multiplier does not work
+# network.merge_to(text_encoder, unet, sd)
+
+lora_path2 = r"lora2.safetensors"
+sd = load_file(lora_path2) 
+network2, sd = create_network_from_weights(0.7, None, vae, text_encoder,unet, sd)
+network2.apply_to(text_encoder, unet)
+network2.load_state_dict(sd)
+network2.to(device, dtype=torch.float16)
+
+lora_path3 = r"lora3.safetensors"
+sd = load_file(lora_path3)
+network3, sd = create_network_from_weights(0.5, None, vae, text_encoder,unet, sd)
+network3.apply_to(text_encoder, unet)
+network3.load_state_dict(sd)
+network3.to(device, dtype=torch.float16)
+
+# prompts
+prompt = "masterpiece, best quality, 1girl, in white shirt, looking at viewer"
+negative_prompt = "bad quality, worst quality, bad anatomy, bad hands"
+
+# exec pipe
+print("generating image...")
+with torch.autocast("cuda"):
+    image = pipe(prompt, guidance_scale=7.5, negative_prompt=negative_prompt).images[0]
+
+# if not merged, you can use set_multiplier
+# network1.set_multiplier(0.8)
+# and generate image again...
+
+# save image
+image.save(r"by_diffusers.png")
+```
+
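+## Creating a LoRA model from the difference between two models
+
+This was implemented with reference to [this discussion](https://github.com/cloneofsimo/lora/discussions/56). The formulas are used as-is (I do not fully understand them, but singular value decomposition appears to be used for the approximation).
+
+The difference between two models (for example, the base model and the model after fine tuning) is approximated with LoRA.
+
+### How to run the script
+
+Specify it as follows.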
+
+```
+python networks\extract_lora_from_models.py --model_org base-model.ckpt
+    --model_tuned fine-tuned-model.ckpt 
+    --save_to lora-weights.safetensors --dim 4
+```
+
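+The `--model_org` option specifies the original Stable Diffusion model. When you apply the created LoRA model, you apply it to this model. .ckpt or .safetensors can be specified.
+
+The `--model_tuned` option specifies the Stable Diffusion model from which to extract the difference, for example a model after fine tuning or DreamBooth. .ckpt or .safetensors can be specified.
+
+`--save_to` specifies where to save the LoRA model. `--dim` specifies the number of LoRA dimensions.
+
+The generated LoRA model can be used in the same way as a trained LoRA model.
+
+If the Text Encoder is identical in the two models, the resulting LoRA is a U-Net-only LoRA.
+
+### Other options
+
+- `--v2`
+  - Specify this when using a v2.x Stable Diffusion model.
+- `--device`
+  - With ``--device cuda``, the computation runs on the GPU, which speeds up processing (the CPU is not that slow either, so the gain is about two to several times at most).
+- `--save_precision`
+  - Specifies the save format of the LoRA, from "float", "fp16", and "bf16". If omitted, float is used.
+- `--conv_dim`
+  - When specified, extends the scope of LoRA to Conv2d 3x3. Specifies the rank for Conv2d 3x3.
+
+## Image resize script
+
+(The documentation will be reorganized later; for now the explanation is written here.)
+
+With the extension to Aspect Ratio Bucketing, small images can now be used as training data as-is, without upscaling. A preprocessing script was contributed together with a report that adding downscaled copies of the original training images to the training data improved accuracy, so it has been cleaned up and added. Thanks to bmaltais.
+
+### How to run the script
+Both the original image and the resized images are saved to the destination folder. The resized images have the target resolution appended to the file name, such as ``+512x512`` (this differs from the actual image size). Images smaller than the target resolution are not upscaled.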
+
+```
+python tools\resize_images_to_resolution.py --max_resolution 512x512,384x384,256x256 --save_as_png 
+    --copy_associated_files <source image folder> <destination folder>
+```
+
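+Image files in the source folder are resized so that their area matches the specified resolution (multiple resolutions can be given) and saved to the destination folder. Non-image files are copied as-is.
+
+Specify the target size with the ``--max_resolution`` option. Images are resized so that their area equals that size. If multiple resolutions are given, images are resized to each of them. With ``512x512,384x384,256x256``, the destination folder holds four images per source image: the original size plus three resized sizes.
+
+With the ``--save_as_png`` option, images are saved in PNG format. If omitted, they are saved in JPEG format (quality=100).
+
+With the ``--copy_associated_files`` option, files with the same name as the image apart from the extension (for example, captions) are copied under the same name as the resized image file.
+
+### Other options
+
+- divisible_by
+  - Crops the center of the image so that the size of the resized image (both height and width) is divisible by this value.
+- interpolation
+  - Specifies the interpolation method for downscaling. Choose from ``area, cubic, lanczos4``; the default is ``area``.
+
+
+# Additional information
+
+## Differences from cloneofsimo's repository
+
+As of December 25, 2022, this repository extends the places where LoRA is applied to the Text Encoder's MLP, the U-Net's FFN, and the Transformer's in/out projections, which increases expressiveness. In exchange, memory usage has increased and now barely fits in 8GB.
+
+The module replacement mechanism is also completely different.
+
+## About future extensions
+
+The mechanism can support extensions other than LoRA, and those are planned to be added as well.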
--- a/docs/train_ti_README-ja.md
+++ b/docs/train_ti_README-ja.md
@@ -0,0 +1,105 @@
+This document explains training for [Textual Inversion](https://textual-inversion.github.io/).
+
+Please also see the [common documentation on training](./train_README-ja.md).
+
+The implementation drew heavily on https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion.
+
+Trained models can be used as-is in the Web UI.
+
+# Training procedure
+
+First refer to this repository's README and set up the environment.
+
+## Preparing the data
+
+See [Preparing training data](./train_README-ja.md).
+
+## Running the training
+
+Use ``train_textual_inversion.py``. The following is an example command line (DreamBooth method).
+
+```
+accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py 
+    --dataset_config=<.toml file created during data preparation> 
+    --output_dir=<output folder for the trained model>  
+    --output_name=<file name for the trained model output> 
+    --save_model_as=safetensors 
+    --prior_loss_weight=1.0 
+    --max_train_steps=1600 
+    --learning_rate=1e-6 
+    --optimizer_type="AdamW8bit" 
+    --xformers 
+    --mixed_precision="fp16" 
+    --cache_latents 
+    --gradient_checkpointing
+    --token_string=mychar4 --init_word=cute --num_vectors_per_token=4
+```
+
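+Specify the token string used during training with ``--token_string``. __Make sure the training prompts contain this string (for example, ``mychar4 1girl`` if token_string is mychar4)__. The part of the prompt matching this string is replaced with the new Textual Inversion tokens and trained. The simplest and most reliable setup is a DreamBooth, class+identifier style dataset with `token_string` as the token string.
+
+To check whether a prompt contains the token string, use ``--debug_dataset``, which prints the token ids after replacement; look for tokens with ids of ``49408`` or higher, as shown below.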
+
+```
+input ids: tensor([[49406, 49408, 49409, 49410, 49411, 49412, 49413, 49414, 49415, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
+         49407, 49407, 49407, 49407, 49407, 49407, 49407]])
+```
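+
+The same check can be done programmatically on a list of token ids. The sketch below is illustrative only: the constant and function names are hypothetical, and the threshold `49408` is taken from this document rather than read from a tokenizer.
+
+```python
+# Threshold stated in this document: ids at or above it are newly added tokens.
+FIRST_NEW_TOKEN_ID = 49408
+
+
+def count_new_tokens(input_ids: list[int]) -> int:
+    """Count how many ids in a prompt belong to newly added tokens."""
+    return sum(1 for token_id in input_ids if token_id >= FIRST_NEW_TOKEN_ID)
+
+
+# Shortened version of the debug output: start token, new tokens, padding.
+ids = [49406, 49408, 49409, 49410, 49411, 49407, 49407]
+print(count_new_tokens(ids))  # 4
+```
+
+A count of zero would suggest that the prompt does not contain the token string, so nothing would be trained for that caption.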
+
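+Words that the tokenizer already has (common words) cannot be used.
+
+Specify the source token string to copy when initializing the embeddings with ``--init_word``. Choosing something close to the concept you want to teach seems to work well. A string that becomes two or more tokens cannot be specified.
+
+Specify how many tokens this training uses with ``--num_vectors_per_token``. More tokens give more expressive power but consume more of the prompt. For example, with num_vectors_per_token=8, the token string consumes 8 tokens (out of the usual 77-token prompt limit).
+
+Those are the main options for Textual Inversion. The rest is the same as for the other training scripts.
+
+`num_cpu_threads_per_process` should normally be set to 1.
+
+Specify the base model for additional training with `pretrained_model_name_or_path`. A Stable Diffusion checkpoint file (.ckpt or .safetensors), a Diffusers model directory on the local disk, or a Diffusers model ID (such as "stabilityai/stable-diffusion-2") can be specified.
+
+Specify the folder in which to save the trained model with `output_dir`. Specify the model file name, without extension, with `output_name`. `save_model_as` selects saving in safetensors format.
+
+Specify the `.toml` file with `dataset_config`. Set the batch size in the file to `1` at first to keep memory consumption low.
+
+The number of training steps, `max_train_steps`, is set to 1600. The learning rate `learning_rate` is set to 1e-6 here.
+
+To save memory, specify `mixed_precision="fp16"` (on RTX 30 series and later, `bf16` can also be specified; match the setting you made for accelerate during environment setup). Also specify `gradient_checkpointing`.
+
+To use the memory-efficient 8-bit AdamW as the optimizer (the class that optimizes, that is trains, the model to fit the training data), specify `optimizer_type="AdamW8bit"`.
+
+Specify the `xformers` option to use xformers' CrossAttention. If xformers is not installed or causes an error (depending on the environment, for example with `mixed_precision="no"`), specify the `mem_eff_attn` option instead to use the memory-efficient CrossAttention (which is slower).
+
+If you have memory to spare, edit the `.toml` file and increase the batch size to, for example, `8` (this may improve both speed and accuracy).
+
+### About commonly used options
+
+See the documentation on options in the following cases.
+
+- Training Stable Diffusion 2.x or a model derived from it
+- Training a model that assumes clip skip of 2 or more
+- Training with captions longer than 75 tokens
+
+### About batch size in Textual Inversion
+
+Memory usage is lower than with DreamBooth or fine tuning, which train the whole model, so the batch size can be made larger.
+
+# Other main options for Textual Inversion
+
+See the separate document for all options.
+
+* `--weights`
+  * Loads trained embeddings before training and continues training from them.
+* `--use_object_template`
+  * Trains with the default object template strings (such as ``a photo of a {}``) instead of captions. This matches the official implementation. Captions are ignored.
+* `--use_style_template`
+  * Trains with the default style template strings (such as ``a painting in the style of {}``) instead of captions. This matches the official implementation. Captions are ignored.
+
+## Generating with the image generation script in this repository
+
+Specify the trained embeddings files (multiple allowed) with the ``--textual_inversion_embeddings`` option of gen_img_diffusers.py. Using the file name of an embeddings file (without extension) in the prompt applies those embeddings.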
+
--- a/docs/wd14_tagger_README-en.md
+++ b/docs/wd14_tagger_README-en.md
@@ -0,0 +1,88 @@
+# Image Tagging using WD14Tagger
+
+This document is based on information from this GitHub page (https://github.com/toriato/stable-diffusion-webui-wd14-tagger#mrsmilingwolfs-model-aka-waifu-diffusion-14-tagger).
+
+Using onnx for inference is recommended. Please install onnx with the following command:
+
+```powershell
+pip install onnx==1.15.0 onnxruntime-gpu==1.17.1  
+```
+
+The model weights will be automatically downloaded from Hugging Face.
+
+# Usage
+
+Run the script to perform tagging.
+
+```powershell
+python finetune/tag_images_by_wd14_tagger.py --onnx --repo_id <model repo id> --batch_size <batch size> <training data folder>
+```
+
+For example, if using the repository `SmilingWolf/wd-swinv2-tagger-v3` with a batch size of 4, and the training data is located in the parent folder `train_data`, it would be:
+
+```powershell
+python tag_images_by_wd14_tagger.py --onnx --repo_id SmilingWolf/wd-swinv2-tagger-v3 --batch_size 4 ..\train_data
+```
+
+On the first run, the model files will be automatically downloaded to the `wd14_tagger_model` folder (the folder can be changed with an option). 
+
+Tag files will be created in the same directory as the training data images, with the same filename and a `.txt` extension.
+
+![Generated tag files](https://user-images.githubusercontent.com/52813779/208910534-ea514373-1185-4b7d-9ae3-61eb50bc294e.png)
+
+![Tags and image](https://user-images.githubusercontent.com/52813779/208910599-29070c15-7639-474f-b3e4-06bd5a3df29e.png)
+
+## Example
+
+To output in the Animagine XL 3.1 format, it would be as follows (enter on a single line in practice):
+
+```
+python tag_images_by_wd14_tagger.py --onnx --repo_id SmilingWolf/wd-swinv2-tagger-v3 
+    --batch_size 4  --remove_underscore --undesired_tags "PUT,YOUR,UNDESIRED,TAGS" --recursive 
+    --use_rating_tags_as_last_tag --character_tags_first --character_tag_expand 
+    --always_first_tags "1girl,1boy"  ..\train_data
+```
+
+## Available Repository IDs
+
+[SmilingWolf's V2 and V3 models](https://huggingface.co/SmilingWolf) are available for use. Specify them in the format like `SmilingWolf/wd-vit-tagger-v3`. The default when omitted is `SmilingWolf/wd-v1-4-convnext-tagger-v2`.
+
+# Options 
+
+## General Options
+
+- `--onnx`: Use ONNX for inference. If not specified, TensorFlow will be used. If using TensorFlow, please install TensorFlow separately. 
+- `--batch_size`: Number of images to process at once. Default is 1. Adjust according to VRAM capacity.
+- `--caption_extension`: File extension for caption files. Default is `.txt`.
+- `--max_data_loader_n_workers`: Maximum number of workers for DataLoader. Specifying a value of 1 or more will use DataLoader to speed up image loading. If unspecified, DataLoader will not be used.
+- `--thresh`: Confidence threshold for outputting tags. Default is 0.35. Lowering the value will assign more tags but accuracy will decrease. 
+- `--general_threshold`: Confidence threshold for general tags. If omitted, same as `--thresh`.
+- `--character_threshold`: Confidence threshold for character tags. If omitted, same as `--thresh`.
+- `--recursive`: If specified, subfolders within the specified folder will also be processed recursively.
+- `--append_tags`: Append tags to existing tag files.
+- `--frequency_tags`: Output tag frequencies.  
+- `--debug`: Debug mode. Outputs debug information if specified.
+
+## Model Download
+
+- `--model_dir`: Folder to save model files. Default is `wd14_tagger_model`.  
+- `--force_download`: Re-download model files if specified.
+
+## Tag Editing
+
+- `--remove_underscore`: Remove underscores from output tags.
+- `--undesired_tags`: Specify tags not to output. Multiple tags can be specified, separated by commas. For example, `black eyes,black hair`.
+- `--use_rating_tags`: Output rating tags at the beginning of the tags.
+- `--use_rating_tags_as_last_tag`: Add rating tags at the end of the tags.
+- `--character_tags_first`: Output character tags first.
+- `--character_tag_expand`: Expand character tag series names. For example, split the tag `chara_name_(series)` into `chara_name, series`.  
+- `--always_first_tags`: Tags that are always placed first in the output whenever they are detected in an image. Multiple tags can be specified, separated by commas. For example, `1girl,1boy`.
+- `--caption_separator`: Separate tags with this string in the output file. Default is `, `.
+- `--tag_replacement`: Perform tag replacement. Specify pairs in the format `tag1,tag2;tag3,tag4`, where each pair replaces the first tag with the second. To include a literal `,` or `;` in a tag, escape it with `\`. \
+    For example, specify `aira tsubase,aira tsubase (uniform)` (when you want to train a specific costume) or `aira tsubase,aira tsubase\, heir of shadows` (when the series name is not included in the tag).
+
+`tag_replacement` is applied after `character_tag_expand`.
+
+When specifying `remove_underscore`, specify `undesired_tags`, `always_first_tags`, and `tag_replacement` without including underscores.
+
+When specifying `caption_separator`, separate `undesired_tags` and `always_first_tags` with `caption_separator`. Always separate `tag_replacement` with `,`.
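The escaping rule for `--tag_replacement` described above can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not the code in `tag_images_by_wd14_tagger.py`: the function names `parse_tag_replacement` and `apply_replacements` are hypothetical, and the sketch assumes that each `source,target` pair replaces the first tag with the second.

```python
import re


def parse_tag_replacement(spec: str) -> list[tuple[str, str]]:
    """Split 'src,dst;src2,dst2' into (src, dst) pairs.

    A ',' or ';' preceded by a backslash is treated as a literal character.
    Hypothetical helper for illustration only.
    """
    pairs = []
    # Split rules on ';' that is not preceded by a backslash.
    for rule in re.split(r"(?<!\\);", spec):
        if not rule:
            continue
        # Split source and target on ',' that is not preceded by a backslash.
        parts = re.split(r"(?<!\\),", rule)
        if len(parts) != 2:
            raise ValueError(f"expected 'source,target', got {rule!r}")
        # Unescape the literal separators after splitting.
        src, dst = (p.replace("\\,", ",").replace("\\;", ";") for p in parts)
        pairs.append((src, dst))
    return pairs


def apply_replacements(tags: list[str], pairs: list[tuple[str, str]]) -> list[str]:
    """Replace each tag that matches a source with its target; keep the rest."""
    mapping = dict(pairs)
    return [mapping.get(tag, tag) for tag in tags]


pairs = parse_tag_replacement(r"aira tsubase,aira tsubase\, heir of shadows")
print(apply_replacements(["1girl", "aira tsubase"], pairs))
```

Splitting with a negative lookbehind keeps escaped separators inside a tag, which is why the unescaping step runs only after both splits are done.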
--- a/docs/wd14_tagger_README-ja.md
+++ b/docs/wd14_tagger_README-ja.md
@@ -0,0 +1,88 @@
+# Tagging with WD14Tagger
+
+This document draws on information from this GitHub page (https://github.com/toriato/stable-diffusion-webui-wd14-tagger#mrsmilingwolfs-model-aka-waifu-diffusion-14-tagger ).
+
+Inference with ONNX is recommended. Install ONNX with the following command.
+
+```powershell
+pip install onnx==1.15.0 onnxruntime-gpu==1.17.1
+```
+
+The model weights are downloaded automatically from Hugging Face.
+
+# Usage
+
+Run the script to perform tagging.
+```
+python finetune/tag_images_by_wd14_tagger.py --onnx --repo_id <model repo id> --batch_size <batch size> <training data folder>
+```
+
+For example, when using `SmilingWolf/wd-swinv2-tagger-v3` as the repository, a batch size of 4, and training data placed in `train_data` in the parent folder, the command is as follows.
+
+```
+python tag_images_by_wd14_tagger.py --onnx --repo_id SmilingWolf/wd-swinv2-tagger-v3 --batch_size 4 ..\train_data
+```
+
+On first run, the model files are downloaded automatically to the `wd14_tagger_model` folder (the folder can be changed with an option).
+
+Tag files are created in the same directory as the training data images, with the same filename and a `.txt` extension.
+
+![Generated tag files](https://user-images.githubusercontent.com/52813779/208910534-ea514373-1185-4b7d-9ae3-61eb50bc294e.png)
+
+![Tags and image](https://user-images.githubusercontent.com/52813779/208910599-29070c15-7639-474f-b3e4-06bd5a3df29e.png)
+
+## Example
+
+To produce output in the Animagine XL 3.1 format, run the following command (enter it on a single line in practice).
+
+```
+python tag_images_by_wd14_tagger.py --onnx --repo_id SmilingWolf/wd-swinv2-tagger-v3 
+    --batch_size 4  --remove_underscore --undesired_tags "PUT,YOUR,UNDESIRED,TAGS" --recursive 
+    --use_rating_tags_as_last_tag --character_tags_first --character_tag_expand 
+    --always_first_tags "1girl,1boy"  ..\train_data
+```
+
+## Available Repository IDs
+
+[SmilingWolf's V2 and V3 models](https://huggingface.co/SmilingWolf) are available for use. Specify one with a repository ID such as `SmilingWolf/wd-vit-tagger-v3`. The default when omitted is `SmilingWolf/wd-v1-4-convnext-tagger-v2`.
+
+# Options
+
+## General Options
+
+- `--onnx` : Use ONNX for inference. If not specified, TensorFlow is used. When using TensorFlow, install TensorFlow separately.
+- `--batch_size` : Number of images to process at once. Default is 1. Adjust according to VRAM capacity.
+- `--caption_extension` : File extension for caption files. Default is `.txt`.
+- `--max_data_loader_n_workers` : Maximum number of workers for DataLoader. Specifying a value of 1 or more uses DataLoader to speed up image loading. If unspecified, DataLoader is not used.
+- `--thresh` : Confidence threshold for outputting tags. Default is 0.35. Lowering the value assigns more tags, but accuracy decreases.
+- `--general_threshold` : Confidence threshold for general tags. If omitted, same as `--thresh`.
+- `--character_threshold` : Confidence threshold for character tags. If omitted, same as `--thresh`.
+- `--recursive` : If specified, subfolders within the specified folder are also processed recursively.
+- `--append_tags` : Append tags to existing tag files.
+- `--frequency_tags` : Output tag frequencies.
+- `--debug` : Debug mode. Outputs debug information if specified.
+
+## Model Download
+
+- `--model_dir` : Folder in which to save model files. Default is `wd14_tagger_model`.
+- `--force_download` : Re-download model files if specified.
+
+## Tag Editing
+
+- `--remove_underscore` : Remove underscores from output tags.
+- `--undesired_tags` : Specify tags not to output. Multiple tags can be specified, separated by commas. For example, `black eyes,black hair`.
+- `--use_rating_tags` : Output rating tags at the beginning of the tags.
+- `--use_rating_tags_as_last_tag` : Add rating tags at the end of the tags.
+- `--character_tags_first` : Output character tags first.
+- `--character_tag_expand` : Expand the series name in character tags. For example, split the tag `chara_name_(series)` into `chara_name, series`.
+- `--always_first_tags` : Tags that are always placed first in the output whenever they are detected in an image. Multiple tags can be specified, separated by commas. For example, `1girl,1boy`.
+- `--caption_separator` : Separate tags with this string in the output file. Default is `, `.
+- `--tag_replacement` : Perform tag replacement. Specify in the format `tag1,tag2;tag3,tag4`. To include a literal `,` or `;` in a tag, escape it with `\`. \
+    For example, specify `aira tsubase,aira tsubase (uniform)` (when you want to train a specific costume) or `aira tsubase,aira tsubase\, heir of shadows` (when the series name is not included in the tag).
+
+`tag_replacement` is applied after `character_tag_expand`.
+
+When specifying `remove_underscore`, specify `undesired_tags`, `always_first_tags`, and `tag_replacement` without underscores.
+
+When specifying `caption_separator`, separate `undesired_tags` and `always_first_tags` with `caption_separator`. Always separate `tag_replacement` with `,`.
+
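The `fine_tune.py` changes that follow derive `max_train_steps` from `--max_train_epochs` by dividing the dataloader length by both the process count and the gradient accumulation steps, and later derive the epoch count from the step count. The arithmetic can be sketched as below. The function names `steps_for_epochs` and `epochs_for_steps` are hypothetical, and the sketch assumes that the dataloader length is the total batch count before `accelerator.prepare` and the per-process batch count afterwards.

```python
import math


def steps_for_epochs(num_batches: int, num_processes: int,
                     grad_accum_steps: int, max_train_epochs: int) -> int:
    """Optimizer steps needed to cover `max_train_epochs` epochs.

    `num_batches` is the length of the dataloader before it is sharded
    across processes (assumption for this sketch).
    """
    steps_per_epoch = math.ceil(num_batches / num_processes / grad_accum_steps)
    return max_train_epochs * steps_per_epoch


def epochs_for_steps(batches_per_process: int, grad_accum_steps: int,
                     max_train_steps: int) -> int:
    """Epochs needed to reach `max_train_steps` optimizer steps.

    `batches_per_process` is the dataloader length seen by one process.
    """
    updates_per_epoch = math.ceil(batches_per_process / grad_accum_steps)
    return math.ceil(max_train_steps / updates_per_epoch)


# 100 batches, 2 processes, 4 accumulation steps, 10 epochs
steps = steps_for_epochs(100, 2, 4, 10)
print(steps, epochs_for_steps(50, 4, steps))
```

Rounding up per epoch means a final partial accumulation window still counts as one optimizer step, so the two functions round-trip for the values shown.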
--- a/fine_tune.py
+++ b/fine_tune.py
@@ -2,342 +2,555 @@
 # XXX dropped option: hypernetwork training

 import argparse
-import gc
 import math
 import os
+from multiprocessing import Value
+import toml

 from tqdm import tqdm
+
 import torch
+from library import deepspeed_utils, strategy_base
+from library.device_utils import init_ipex, clean_memory_on_device
+
+init_ipex()
+
 from accelerate.utils import set_seed
-import diffusers
 from diffusers import DDPMScheduler

+from library.utils import setup_logging, add_logging_arguments
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
 import library.train_util as train_util
-
-
-def collate_fn(examples):
-  return examples[0]
+import library.config_util as config_util
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
+import library.custom_train_functions as custom_train_functions
+from library.custom_train_functions import (
+    apply_snr_weight,
+    get_weighted_text_embeddings,
+    prepare_scheduler_for_custom_training,
+    scale_v_prediction_loss_like_noise_prediction,
+    apply_debiased_estimation,
+)
+import library.strategy_sd as strategy_sd


 def train(args):
-  train_util.verify_training_args(args)
-  train_util.prepare_dataset_args(args, True)
+    train_util.verify_training_args(args)
+    train_util.prepare_dataset_args(args, True)
+    deepspeed_utils.prepare_deepspeed_args(args)
+    setup_logging(args, reset=True)

-  cache_latents = args.cache_latents
+    cache_latents = args.cache_latents

-  if args.seed is not None:
-    set_seed(args.seed)                           # 乱数系列を初期化する
+    if args.seed is not None:
+        set_seed(args.seed)  # initialize the random number sequence

-  tokenizer = train_util.load_tokenizer(args)
+    tokenize_strategy = strategy_sd.SdTokenizeStrategy(args.v2, args.max_token_length, args.tokenizer_cache_dir)
+    strategy_base.TokenizeStrategy.set_strategy(tokenize_strategy)

-  train_dataset = train_util.FineTuningDataset(args.in_json, args.train_batch_size, args.train_data_dir,
-                                               tokenizer, args.max_token_length, args.shuffle_caption, args.keep_tokens,
-                                               args.resolution, args.enable_bucket, args.min_bucket_reso, args.max_bucket_reso,
-                                               args.flip_aug, args.color_aug, args.face_crop_aug_range, args.random_crop,
-                                               args.dataset_repeats, args.debug_dataset)
-  train_dataset.make_buckets()
+    # prepare caching strategy: this must be set before preparing the dataset, because the dataset may use this strategy for initialization.
+    if cache_latents:
+        latents_caching_strategy = strategy_sd.SdSdxlLatentsCachingStrategy(
+            False, args.cache_latents_to_disk, args.vae_batch_size, args.skip_cache_check
+        )
+        strategy_base.LatentsCachingStrategy.set_strategy(latents_caching_strategy)

-  if args.debug_dataset:
-    train_util.debug_dataset(train_dataset)
-    return
-  if len(train_dataset) == 0:
-    print("No data found. Please verify the metadata file and train_data_dir option. / 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。")
-    return
-
-  # acceleratorを準備する
-  print("prepare accelerator")
-  accelerator, unwrap_model = train_util.prepare_accelerator(args)
-
-  # mixed precisionに対応した型を用意しておき適宜castする
-  weight_dtype, save_dtype = train_util.prepare_dtype(args)
-
-  # モデルを読み込む
-  text_encoder, vae, unet, load_stable_diffusion_format = train_util.load_target_model(args, weight_dtype)
-
-  # verify load/save model formats
-  if load_stable_diffusion_format:
-    src_stable_diffusion_ckpt = args.pretrained_model_name_or_path
-    src_diffusers_model_path = None
-  else:
-    src_stable_diffusion_ckpt = None
-    src_diffusers_model_path = args.pretrained_model_name_or_path
-
-  if args.save_model_as is None:
-    save_stable_diffusion_format = load_stable_diffusion_format
-    use_safetensors = args.use_safetensors
-  else:
-    save_stable_diffusion_format = args.save_model_as.lower() == 'ckpt' or args.save_model_as.lower() == 'safetensors'
-    use_safetensors = args.use_safetensors or ("safetensors" in args.save_model_as.lower())
-
-  # Diffusers版のxformers使用フラグを設定する関数
-  def set_diffusers_xformers_flag(model, valid):
-    #   model.set_use_memory_efficient_attention_xformers(valid)            # 次のリリースでなくなりそう
-    # pipeが自動で再帰的にset_use_memory_efficient_attention_xformersを探すんだって(;´Д｀)
-    # U-Netだけ使う時にはどうすればいいのか……仕方ないからコピって使うか
-    # 0.10.2でなんか巻き戻って個別に指定するようになった(;^ω^)
-
-    # Recursively walk through all the children.
-    # Any children which exposes the set_use_memory_efficient_attention_xformers method
-    # gets the message
-    def fn_recursive_set_mem_eff(module: torch.nn.Module):
-      if hasattr(module, "set_use_memory_efficient_attention_xformers"):
-        module.set_use_memory_efficient_attention_xformers(valid)
-
-      for child in module.children():
-        fn_recursive_set_mem_eff(child)
-
-    fn_recursive_set_mem_eff(model)
-
-  # モデルに xformers とか memory efficient attention を組み込む
-  if args.diffusers_xformers:
-    print("Use xformers by Diffusers")
-    set_diffusers_xformers_flag(unet, True)
-  else:
-    # Windows版のxformersはfloatで学習できないのでxformersを使わない設定も可能にしておく必要がある
-    print("Disable Diffusers' xformers")
-    set_diffusers_xformers_flag(unet, False)
-    train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers)
-
-  # 学習を準備する
-  if cache_latents:
-    vae.to(accelerator.device, dtype=weight_dtype)
-    vae.requires_grad_(False)
-    vae.eval()
-    with torch.no_grad():
-      train_dataset.cache_latents(vae)
-    vae.to("cpu")
-    if torch.cuda.is_available():
-      torch.cuda.empty_cache()
-    gc.collect()
-
-  # 学習を準備する：モデルを適切な状態にする
-  training_models = []
-  if args.gradient_checkpointing:
-    unet.enable_gradient_checkpointing()
-  training_models.append(unet)
-
-  if args.train_text_encoder:
-    print("enable text encoder training")
-    if args.gradient_checkpointing:
-      text_encoder.gradient_checkpointing_enable()
-    training_models.append(text_encoder)
-  else:
-    text_encoder.to(accelerator.device, dtype=weight_dtype)
-    text_encoder.requires_grad_(False)             # text encoderは学習しない
-    if args.gradient_checkpointing:
-      text_encoder.gradient_checkpointing_enable()
-      text_encoder.train()                # required for gradient_checkpointing
-    else:
-      text_encoder.eval()
-
-  if not cache_latents:
-    vae.requires_grad_(False)
-    vae.eval()
-    vae.to(accelerator.device, dtype=weight_dtype)
-
-  for m in training_models:
-    m.requires_grad_(True)
-  params = []
-  for m in training_models:
-    params.extend(m.parameters())
-  params_to_optimize = params
-
-  # 学習に必要なクラスを準備する
-  print("prepare optimizer, data loader etc.")
-
-  # 8-bit Adamを使う
-  if args.use_8bit_adam:
-    try:
-      import bitsandbytes as bnb
-    except ImportError:
-      raise ImportError("No bitsand bytes / bitsandbytesがインストールされていないようです")
-    print("use 8-bit Adam optimizer")
-    optimizer_class = bnb.optim.AdamW8bit
-  else:
-    optimizer_class = torch.optim.AdamW
-
-  # betaやweight decayはdiffusers DreamBoothもDreamBooth SDもデフォルト値のようなのでオプションはとりあえず省略
-  optimizer = optimizer_class(params_to_optimize, lr=args.learning_rate)
-
-  # dataloaderを準備する
-  # DataLoaderのプロセス数：0はメインプロセスになる
-  n_workers = min(args.max_data_loader_n_workers, os.cpu_count() - 1)      # cpu_count-1 ただし最大で指定された数まで
-  train_dataloader = torch.utils.data.DataLoader(
-      train_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn, num_workers=n_workers, persistent_workers=args.persistent_data_loader_workers)
-
-  # 学習ステップ数を計算する
-  if args.max_train_epochs is not None:
-    args.max_train_steps = args.max_train_epochs * len(train_dataloader)
-    print(f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}")
-
-  # lr schedulerを用意する
-  lr_scheduler = diffusers.optimization.get_scheduler(
-      args.lr_scheduler, optimizer, num_warmup_steps=args.lr_warmup_steps, num_training_steps=args.max_train_steps * args.gradient_accumulation_steps)
-
-  # 実験的機能：勾配も含めたfp16学習を行う　モデル全体をfp16にする
-  if args.full_fp16:
-    assert args.mixed_precision == "fp16", "full_fp16 requires mixed precision='fp16' / full_fp16を使う場合はmixed_precision='fp16'を指定してください。"
-    print("enable full fp16 training.")
-    unet.to(weight_dtype)
-    text_encoder.to(weight_dtype)
-
-  # acceleratorがなんかよろしくやってくれるらしい
-  if args.train_text_encoder:
-    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
-        unet, text_encoder, optimizer, train_dataloader, lr_scheduler)
-  else:
-    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, optimizer, train_dataloader, lr_scheduler)
-
-  # 実験的機能：勾配も含めたfp16学習を行う　PyTorchにパッチを当ててfp16でのgrad scaleを有効にする
-  if args.full_fp16:
-    train_util.patch_accelerator_for_fp16_training(accelerator)
-
-  # resumeする
-  if args.resume is not None:
-    print(f"resume training from state: {args.resume}")
-    accelerator.load_state(args.resume)
-
-  # epoch数を計算する
-  num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
-  num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
-  if (args.save_n_epoch_ratio is not None) and (args.save_n_epoch_ratio > 0):
-    args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
-
-  # 学習する
-  total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
-  print("running training / 学習開始")
-  print(f"  num examples / サンプル数: {train_dataset.num_train_images}")
-  print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
-  print(f"  num epochs / epoch数: {num_train_epochs}")
-  print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
-  print(f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}")
-  print(f"  gradient ccumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
-  print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
-
-  progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
-  global_step = 0
-
-  noise_scheduler = DDPMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear",
-                                  num_train_timesteps=1000, clip_sample=False)
-
-  if accelerator.is_main_process:
-    accelerator.init_trackers("finetuning")
-
-  for epoch in range(num_train_epochs):
-    print(f"epoch {epoch+1}/{num_train_epochs}")
-    for m in training_models:
-      m.train()
-
-    loss_total = 0
-    for step, batch in enumerate(train_dataloader):
-      with accelerator.accumulate(training_models[0]):  # 複数モデルに対応していない模様だがとりあえずこうしておく
-        with torch.no_grad():
-          if "latents" in batch and batch["latents"] is not None:
-            latents = batch["latents"].to(accelerator.device)
-          else:
-            # latentに変換
-            latents = vae.encode(batch["images"].to(dtype=weight_dtype)).latent_dist.sample()
-          latents = latents * 0.18215
-        b_size = latents.shape[0]
-
-        with torch.set_grad_enabled(args.train_text_encoder):
-          # Get the text embedding for conditioning
-          input_ids = batch["input_ids"].to(accelerator.device)
-          encoder_hidden_states = train_util.get_hidden_states(
-              args, input_ids, tokenizer, text_encoder, None if not args.full_fp16 else weight_dtype)
-
-        # Sample noise that we'll add to the latents
-        noise = torch.randn_like(latents, device=latents.device)
-
-        # Sample a random timestep for each image
-        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (b_size,), device=latents.device)
-        timesteps = timesteps.long()
-
-        # Add noise to the latents according to the noise magnitude at each timestep
-        # (this is the forward diffusion process)
-        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
-
-        # Predict the noise residual
-        noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
-
-        if args.v_parameterization:
-          # v-parameterization training
-          target = noise_scheduler.get_velocity(latents, noise, timesteps)
+    # prepare the dataset
+    if args.dataset_class is None:
+        blueprint_generator = BlueprintGenerator(ConfigSanitizer(False, True, False, True))
+        if args.dataset_config is not None:
+            logger.info(f"Load dataset config from {args.dataset_config}")
+            user_config = config_util.load_user_config(args.dataset_config)
+            ignored = ["train_data_dir", "in_json"]
+            if any(getattr(args, attr) is not None for attr in ignored):
+                logger.warning(
+                    "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(
+                        ", ".join(ignored)
+                    )
+                )
        else:
-          target = noise
+            user_config = {
+                "datasets": [
+                    {
+                        "subsets": [
+                            {
+                                "image_dir": args.train_data_dir,
+                                "metadata_file": args.in_json,
+                            }
+                        ]
+                    }
+                ]
+            }

-        loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="mean")
+        blueprint = blueprint_generator.generate(user_config, args)
+        train_dataset_group, val_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+    else:
+        train_dataset_group = train_util.load_arbitrary_dataset(args)
+        val_dataset_group = None

-        accelerator.backward(loss)
-        if accelerator.sync_gradients:
-          params_to_clip = []
-          for m in training_models:
-            params_to_clip.extend(m.parameters())
-          accelerator.clip_grad_norm_(params_to_clip, 1.0)  # args.max_grad_norm)
+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collator = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)

-        optimizer.step()
-        lr_scheduler.step()
-        optimizer.zero_grad(set_to_none=True)
+    train_dataset_group.verify_bucket_reso_steps(64)

-      # Checks if the accelerator has performed an optimization step behind the scenes
-      if accelerator.sync_gradients:
-        progress_bar.update(1)
-        global_step += 1
+    if args.debug_dataset:
+        train_util.debug_dataset(train_dataset_group)
+        return
+    if len(train_dataset_group) == 0:
+        logger.error(
+            "No data found. Please verify the metadata file and train_data_dir option. / 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。"
+        )
+        return

-      current_loss = loss.detach().item()        # 平均なのでbatch sizeは関係ないはず
-      if args.logging_dir is not None:
-        logs = {"loss": current_loss, "lr": lr_scheduler.get_last_lr()[0]}
-        accelerator.log(logs, step=global_step)
+    if cache_latents:
+        assert (
+            train_dataset_group.is_latent_cacheable()
+        ), "when caching latents, neither color_aug nor random_crop can be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"

-      loss_total += current_loss
-      avr_loss = loss_total / (step+1)
-      logs = {"loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
-      progress_bar.set_postfix(**logs)
+    # prepare the accelerator
+    logger.info("prepare accelerator")
+    accelerator = train_util.prepare_accelerator(args)

-      if global_step >= args.max_train_steps:
-        break
+    # prepare dtypes for mixed precision and cast as needed
+    weight_dtype, save_dtype = train_util.prepare_dtype(args)
+    vae_dtype = torch.float32 if args.no_half_vae else weight_dtype

-    if args.logging_dir is not None:
-      logs = {"epoch_loss": loss_total / len(train_dataloader)}
-      accelerator.log(logs, step=epoch+1)
+    # load the models
+    text_encoder, vae, unet, load_stable_diffusion_format = train_util.load_target_model(args, weight_dtype, accelerator)

-    accelerator.wait_for_everyone()
+    # verify load/save model formats
+    if load_stable_diffusion_format:
+        src_stable_diffusion_ckpt = args.pretrained_model_name_or_path
+        src_diffusers_model_path = None
+    else:
+        src_stable_diffusion_ckpt = None
+        src_diffusers_model_path = args.pretrained_model_name_or_path

-    if args.save_every_n_epochs is not None:
-      src_path = src_stable_diffusion_ckpt if save_stable_diffusion_format else src_diffusers_model_path
-      train_util.save_sd_model_on_epoch_end(args, accelerator, src_path, save_stable_diffusion_format, use_safetensors,
-                                            save_dtype, epoch, num_train_epochs, global_step,  unwrap_model(text_encoder), unwrap_model(unet), vae)
+    if args.save_model_as is None:
+        save_stable_diffusion_format = load_stable_diffusion_format
+        use_safetensors = args.use_safetensors
+    else:
+        save_stable_diffusion_format = args.save_model_as.lower() == "ckpt" or args.save_model_as.lower() == "safetensors"
+        use_safetensors = args.use_safetensors or ("safetensors" in args.save_model_as.lower())

-  is_main_process = accelerator.is_main_process
-  if is_main_process:
-    unet = unwrap_model(unet)
-    text_encoder = unwrap_model(text_encoder)
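+    # function that sets the xformers flag for the Diffusers implementation
+    def set_diffusers_xformers_flag(model, valid):
+        #   model.set_use_memory_efficient_attention_xformers(valid)            # likely to be removed in the next release
+        # apparently the pipeline searches for set_use_memory_efficient_attention_xformers recursively on its own
+        # it is unclear what to do when only the U-Net is used, so the logic is copied here
+        # in 0.10.2 this was reverted, and the flag is set on each module individually again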

-  accelerator.end_training()
+        # Recursively walk through all the children.
+        # Any child that exposes the set_use_memory_efficient_attention_xformers method
+        # gets the message
+        def fn_recursive_set_mem_eff(module: torch.nn.Module):
+            if hasattr(module, "set_use_memory_efficient_attention_xformers"):
+                module.set_use_memory_efficient_attention_xformers(valid)

-  if args.save_state:
-    train_util.save_state_on_train_end(args, accelerator)
+            for child in module.children():
+                fn_recursive_set_mem_eff(child)

-  del accelerator                         # この後メモリを使うのでこれは消す
+        fn_recursive_set_mem_eff(model)

-  if is_main_process:
-    src_path = src_stable_diffusion_ckpt if save_stable_diffusion_format else src_diffusers_model_path
-    train_util.save_sd_model_on_train_end(args, src_path, save_stable_diffusion_format, use_safetensors,
-                                          save_dtype, epoch, global_step,  text_encoder, unet, vae)
-    print("model saved.")
+    # set up xformers or memory efficient attention in the model
+    if args.diffusers_xformers:
+        accelerator.print("Use xformers by Diffusers")
+        set_diffusers_xformers_flag(unet, True)
+    else:
+        # xformers on Windows cannot train in float, so it must also be possible to disable xformers
+        accelerator.print("Disable Diffusers' xformers")
+        set_diffusers_xformers_flag(unet, False)
+        train_util.replace_unet_modules(unet, args.mem_eff_attn, args.xformers, args.sdpa)
+
+    # prepare for training
+    if cache_latents:
+        vae.to(accelerator.device, dtype=vae_dtype)
+        vae.requires_grad_(False)
+        vae.eval()
+
+        train_dataset_group.new_cache_latents(vae, accelerator)
+
+        vae.to("cpu")
+        clean_memory_on_device(accelerator.device)
+
+        accelerator.wait_for_everyone()
+
+    # prepare for training: put the models into the appropriate state
+    training_models = []
+    if args.gradient_checkpointing:
+        unet.enable_gradient_checkpointing()
+    training_models.append(unet)
+
+    if args.train_text_encoder:
+        accelerator.print("enable text encoder training")
+        if args.gradient_checkpointing:
+            text_encoder.gradient_checkpointing_enable()
+        training_models.append(text_encoder)
+    else:
+        text_encoder.to(accelerator.device, dtype=weight_dtype)
+        text_encoder.requires_grad_(False)  # the text encoder is not trained
+        if args.gradient_checkpointing:
+            text_encoder.gradient_checkpointing_enable()
+            text_encoder.train()  # required for gradient_checkpointing
+        else:
+            text_encoder.eval()
+
+    text_encoding_strategy = strategy_sd.SdTextEncodingStrategy(args.clip_skip)
+    strategy_base.TextEncodingStrategy.set_strategy(text_encoding_strategy)
+
+    if not cache_latents:
+        vae.requires_grad_(False)
+        vae.eval()
+        vae.to(accelerator.device, dtype=vae_dtype)
+
+    for m in training_models:
+        m.requires_grad_(True)
+
+    trainable_params = []
+    if args.learning_rate_te is None or not args.train_text_encoder:
+        for m in training_models:
+            trainable_params.extend(m.parameters())
+    else:
+        trainable_params = [
+            {"params": list(unet.parameters()), "lr": args.learning_rate},
+            {"params": list(text_encoder.parameters()), "lr": args.learning_rate_te},
+        ]
+
+    # prepare the classes needed for training
+    accelerator.print("prepare optimizer, data loader etc.")
+    _, _, optimizer = train_util.get_optimizer(args, trainable_params=trainable_params)
+
+    # prepare dataloader
+    # strategies are set here because they cannot be referenced in another process. Copy them with the dataset
+    # some strategies can be None
+    train_dataset_group.set_current_strategies()
+
+    # number of DataLoader worker processes: note that persistent_workers cannot be used with 0
+    n_workers = min(args.max_data_loader_n_workers, os.cpu_count())  # cpu_count or max_data_loader_n_workers
+    train_dataloader = torch.utils.data.DataLoader(
+        train_dataset_group,
+        batch_size=1,
+        shuffle=True,
+        collate_fn=collator,
+        num_workers=n_workers,
+        persistent_workers=args.persistent_data_loader_workers,
+    )
+
+    # calculate the number of training steps
+    if args.max_train_epochs is not None:
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
+        )
+        accelerator.print(
+            f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}"
+        )
+
+    # also pass the number of training steps to the dataset
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
+    # prepare the lr scheduler
+    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
+
+    # experimental feature: train in fp16 including gradients; cast the whole model to fp16
+    if args.full_fp16:
+        assert (
+            args.mixed_precision == "fp16"
+        ), "full_fp16 requires mixed precision='fp16' / full_fp16を使う場合はmixed_precision='fp16'を指定してください。"
+        accelerator.print("enable full fp16 training.")
+        unet.to(weight_dtype)
+        text_encoder.to(weight_dtype)
+
+    if args.deepspeed:
+        if args.train_text_encoder:
+            ds_model = deepspeed_utils.prepare_deepspeed_model(args, unet=unet, text_encoder=text_encoder)
+        else:
+            ds_model = deepspeed_utils.prepare_deepspeed_model(args, unet=unet)
+        ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+            ds_model, optimizer, train_dataloader, lr_scheduler
+        )
+        training_models = [ds_model]
+    else:
+        # accelerator apparently takes care of the rest
+        if args.train_text_encoder:
+            unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+                unet, text_encoder, optimizer, train_dataloader, lr_scheduler
+            )
+        else:
+            unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(unet, optimizer, train_dataloader, lr_scheduler)
+
+    # experimental feature: train in fp16 including gradients; patch PyTorch to enable grad scaling in fp16
+    if args.full_fp16:
+        train_util.patch_accelerator_for_fp16_training(accelerator)
+
+    # resume training if specified
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)
+
+    # calculate the number of epochs
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
+    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
+    if (args.save_n_epoch_ratio is not None) and (args.save_n_epoch_ratio > 0):
+        args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
+
+    # train
+    total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    accelerator.print("running training / 学習開始")
+    accelerator.print(f"  num examples / サンプル数: {train_dataset_group.num_train_images}")
+    accelerator.print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
+    accelerator.print(f"  num epochs / epoch数: {num_train_epochs}")
+    accelerator.print(f"  batch size per device / バッチサイズ: {args.train_batch_size}")
+    accelerator.print(
+        f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}"
+    )
+    accelerator.print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
+    accelerator.print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
+
+    progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
+    global_step = 0
+
+    noise_scheduler = DDPMScheduler(
+        beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000, clip_sample=False
+    )
+    prepare_scheduler_for_custom_training(noise_scheduler, accelerator.device)
+    if args.zero_terminal_snr:
+        custom_train_functions.fix_noise_scheduler_betas_for_zero_terminal_snr(noise_scheduler)
+
+    if accelerator.is_main_process:
+        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs["wandb"] = {"name": args.wandb_run_name}
+        if args.log_tracker_config is not None:
+            init_kwargs = toml.load(args.log_tracker_config)
+        accelerator.init_trackers(
+            "finetuning" if args.log_tracker_name is None else args.log_tracker_name,
+            config=train_util.get_sanitized_config_or_none(args),
+            init_kwargs=init_kwargs,
+        )
+
+    # For --sample_at_first
+    train_util.sample_images(
+        accelerator, args, 0, global_step, accelerator.device, vae, tokenize_strategy.tokenizer, text_encoder, unet
+    )
+    if len(accelerator.trackers) > 0:
+        # log empty object to commit the sample images to wandb
+        accelerator.log({}, step=0)
+
+    loss_recorder = train_util.LossRecorder()
+    for epoch in range(num_train_epochs):
+        accelerator.print(f"\nepoch {epoch+1}/{num_train_epochs}")
+        current_epoch.value = epoch + 1
+
+        for m in training_models:
+            m.train()
+
+        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
+            with accelerator.accumulate(*training_models):
+                with torch.no_grad():
+                    if "latents" in batch and batch["latents"] is not None:
+                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
+                    else:
+                        # Encode images to latents
+                        latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample().to(weight_dtype)
+                    latents = latents * 0.18215
+                b_size = latents.shape[0]
+
+                with torch.set_grad_enabled(args.train_text_encoder):
+                    # Get the text embedding for conditioning
+                    if args.weighted_captions:
+                        input_ids_list, weights_list = tokenize_strategy.tokenize_with_weights(batch["captions"])
+                        encoder_hidden_states = text_encoding_strategy.encode_tokens_with_weights(
+                            tokenize_strategy, [text_encoder], input_ids_list, weights_list
+                        )[0]
+                    else:
+                        input_ids = batch["input_ids_list"][0].to(accelerator.device)
+                        encoder_hidden_states = text_encoding_strategy.encode_tokens(
+                            tokenize_strategy, [text_encoder], [input_ids]
+                        )[0]
+                    if args.full_fp16:
+                        encoder_hidden_states = encoder_hidden_states.to(weight_dtype)
+
+                # Sample noise, sample a random timestep for each image, and add noise to the latents,
+                # with noise offset and/or multires noise if specified
+                noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)
+
+                # Predict the noise residual
+                with accelerator.autocast():
+                    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
+
+                if args.v_parameterization:
+                    # v-parameterization training
+                    target = noise_scheduler.get_velocity(latents, noise, timesteps)
+                else:
+                    target = noise
+
+                huber_c = train_util.get_huber_threshold_if_needed(args, timesteps, noise_scheduler)
+                if args.min_snr_gamma or args.scale_v_pred_loss_like_noise_pred or args.debiased_estimation_loss:
+                    # do not mean over batch dimension for snr weight or scale v-pred loss
+                    loss = train_util.conditional_loss(noise_pred.float(), target.float(), args.loss_type, "none", huber_c)
+                    loss = loss.mean([1, 2, 3])
+
+                    if args.min_snr_gamma:
+                        loss = apply_snr_weight(loss, timesteps, noise_scheduler, args.min_snr_gamma, args.v_parameterization)
+                    if args.scale_v_pred_loss_like_noise_pred:
+                        loss = scale_v_prediction_loss_like_noise_prediction(loss, timesteps, noise_scheduler)
+                    if args.debiased_estimation_loss:
+                        loss = apply_debiased_estimation(loss, timesteps, noise_scheduler, args.v_parameterization)
+
+                    loss = loss.mean()  # mean over batch dimension
+                else:
+                    loss = train_util.conditional_loss(noise_pred.float(), target.float(), args.loss_type, "mean", huber_c)
+
+                accelerator.backward(loss)
+                if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                    params_to_clip = []
+                    for m in training_models:
+                        params_to_clip.extend(m.parameters())
+                    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
+
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad(set_to_none=True)
+
+            # Checks if the accelerator has performed an optimization step behind the scenes
+            if accelerator.sync_gradients:
+                progress_bar.update(1)
+                global_step += 1
+
+                train_util.sample_images(
+                    accelerator, args, None, global_step, accelerator.device, vae, tokenize_strategy.tokenizer, text_encoder, unet
+                )
+
+                # Save the model every specified number of steps
+                if args.save_every_n_steps is not None and global_step % args.save_every_n_steps == 0:
+                    accelerator.wait_for_everyone()
+                    if accelerator.is_main_process:
+                        src_path = src_stable_diffusion_ckpt if save_stable_diffusion_format else src_diffusers_model_path
+                        train_util.save_sd_model_on_epoch_end_or_stepwise(
+                            args,
+                            False,
+                            accelerator,
+                            src_path,
+                            save_stable_diffusion_format,
+                            use_safetensors,
+                            save_dtype,
+                            epoch,
+                            num_train_epochs,
+                            global_step,
+                            accelerator.unwrap_model(text_encoder),
+                            accelerator.unwrap_model(unet),
+                            vae,
+                        )
+
+            current_loss = loss.detach().item()  # this is a mean, so batch size should not matter
+            if len(accelerator.trackers) > 0:
+                logs = {"loss": current_loss}
+                train_util.append_lr_to_logs(logs, lr_scheduler, args.optimizer_type, including_unet=True)
+                accelerator.log(logs, step=global_step)
+
+            loss_recorder.add(epoch=epoch, step=step, loss=current_loss)
+            avr_loss: float = loss_recorder.moving_average
+            logs = {"avr_loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
+            progress_bar.set_postfix(**logs)
+
+            if global_step >= args.max_train_steps:
+                break
+
+        if len(accelerator.trackers) > 0:
+            logs = {"loss/epoch": loss_recorder.moving_average}
+            accelerator.log(logs, step=epoch + 1)
+
+        accelerator.wait_for_everyone()
+
+        if args.save_every_n_epochs is not None:
+            if accelerator.is_main_process:
+                src_path = src_stable_diffusion_ckpt if save_stable_diffusion_format else src_diffusers_model_path
+                train_util.save_sd_model_on_epoch_end_or_stepwise(
+                    args,
+                    True,
+                    accelerator,
+                    src_path,
+                    save_stable_diffusion_format,
+                    use_safetensors,
+                    save_dtype,
+                    epoch,
+                    num_train_epochs,
+                    global_step,
+                    accelerator.unwrap_model(text_encoder),
+                    accelerator.unwrap_model(unet),
+                    vae,
+                )
+
+        train_util.sample_images(
+            accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenize_strategy.tokenizer, text_encoder, unet
+        )
+
+    is_main_process = accelerator.is_main_process
+    if is_main_process:
+        unet = accelerator.unwrap_model(unet)
+        text_encoder = accelerator.unwrap_model(text_encoder)
+
+    accelerator.end_training()
+
+    if is_main_process and (args.save_state or args.save_state_on_train_end):
+        train_util.save_state_on_train_end(args, accelerator)
+
+    del accelerator  # free this because memory is needed afterwards
+
+    if is_main_process:
+        src_path = src_stable_diffusion_ckpt if save_stable_diffusion_format else src_diffusers_model_path
+        train_util.save_sd_model_on_train_end(
+            args, src_path, save_stable_diffusion_format, use_safetensors, save_dtype, epoch, global_step, text_encoder, unet, vae
+        )
+        logger.info("model saved.")


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()

-  train_util.add_sd_models_arguments(parser)
-  train_util.add_dataset_arguments(parser, False, True)
-  train_util.add_training_arguments(parser, False)
-  train_util.add_sd_saving_arguments(parser)
+    add_logging_arguments(parser)
+    train_util.add_sd_models_arguments(parser)
+    train_util.add_dataset_arguments(parser, False, True, True)
+    train_util.add_training_arguments(parser, False)
+    deepspeed_utils.add_deepspeed_arguments(parser)
+    train_util.add_sd_saving_arguments(parser)
+    train_util.add_optimizer_arguments(parser)
+    config_util.add_config_arguments(parser)
+    custom_train_functions.add_custom_train_arguments(parser)

-  parser.add_argument("--diffusers_xformers", action='store_true',
-                      help='use xformers by diffusers / Diffusersでxformersを使用する')
-  parser.add_argument("--train_text_encoder", action="store_true", help="train text encoder / text encoderも学習する")
+    parser.add_argument(
+        "--diffusers_xformers", action="store_true", help="use xformers by diffusers / Diffusersでxformersを使用する"
+    )
+    parser.add_argument("--train_text_encoder", action="store_true", help="train text encoder / text encoderも学習する")
+    parser.add_argument(
+        "--learning_rate_te",
+        type=float,
+        default=None,
+        help="learning rate for text encoder, default is same as unet / Text Encoderの学習率、デフォルトはunetと同じ",
+    )
+    parser.add_argument(
+        "--no_half_vae",
+        action="store_true",
+        help="do not use fp16/bf16 VAE in mixed precision (use float VAE) / mixed precisionでも fp16/bf16 VAEを使わずfloat VAEを使う",
+    )

-  args = parser.parse_args()
-  train(args)
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    train_util.verify_command_line_training_args(args)
+    args = train_util.read_config_from_file(args, parser)
+
+    train(args)
--- a/fine_tune_README_ja.md
+++ b/fine_tune_README_ja.md
@@ -1,465 +0,0 @@
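-This is a fine-tuning script that supports the training method proposed by NovelAI, automatic captioning, tagging, and Windows + 12 GB VRAM environments (for v1.4/1.5).
-
-## Overview
-It fine-tunes the U-Net of Stable Diffusion using Diffusers. It supports the following improvements from NovelAI's article (for Aspect Ratio Bucketing I referred to NovelAI's code, but the final code is all original).
-
-* Use the output of the second-to-last layer of CLIP (Text Encoder) instead of the last layer.
-* Training at non-square resolutions (Aspect Ratio Bucketing).
-* Extend the token length from 75 to 225.
-* Captioning with BLIP (automatic caption generation), and automatic tagging with DeepDanbooru or WD14Tagger.
-* Hypernetwork training is also supported.
-* Supports Stable Diffusion v2.0 (base and 768/v).
-* Obtain the VAE outputs in advance and save them to disk to reduce memory use and speed up training.
-
-By default the Text Encoder is not trained. When fine-tuning the whole model, it seems common to train only the U-Net (NovelAI apparently does so too). The Text Encoder can also be trained by specifying an option.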
-
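-## Additional features
-### Changing the CLIP output
-CLIP (the Text Encoder) converts text into features so that the prompt is reflected in the image. Stable Diffusion uses the output of CLIP's last layer, but this can be changed to the output of the second-to-last layer. According to NovelAI, this makes the prompt be reflected more accurately.
-Using the output of the last layer, as in the original, is also possible.
-Note: Stable Diffusion 2.0 uses the second-to-last layer by default. Do not specify the clip_skip option.
-
-### Training at non-square resolutions
-Stable Diffusion is trained at 512\*512, but this script additionally trains at resolutions such as 256\*1024 and 384\*640. This is expected to reduce the amount of cropping and let the model learn the relationship between prompts and images more accurately.
-Training resolutions are generated by adjusting width and height in 64-pixel steps, within a range that does not exceed the area (= memory usage) of the resolution given as a parameter.
-
-In machine learning it is common to make all input sizes the same, but there is no particular constraint; in practice it is fine as long as sizes are uniform within a batch. The bucketing NovelAI describes seems to refer to classifying the training data in advance by training resolution according to aspect ratio. Each batch is then built from images within a single bucket, so that image sizes in the batch are uniform.
-
-### Extending the token length from 75 to 225
-Stable Diffusion allows up to 75 tokens (77 including the start and end tokens); this is extended to 225 tokens.
-However, the maximum length CLIP accepts is 75 tokens, so for 225 tokens the input is simply split into three parts, CLIP is called on each, and the results are concatenated.
-
-Note: I am not entirely sure whether this is the desirable implementation. It seems to work for now. For 2.0 in particular there is no reference implementation, so it was implemented independently.
-
-Note: Automatic1111's Web UI apparently splits with commas in mind, but this implementation does not go that far and splits simply.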
-
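-## Environment setup
-
-See the [README](./README-ja.md) of this repository.
-
-## Preparing the training data
-
-Prepare the image data you want to train on and put it in any folder. No prior preparation such as resizing is required.
-However, for images smaller than the training resolution, it is recommended to upscale them beforehand while preserving quality, for example with super-resolution.
-
-Multiple training data folders are also supported. Preprocessing is run on each folder separately.
-
-For example, store the images as follows.
-
-![Screenshot of the training data folder](https://user-images.githubusercontent.com/52813779/208907739-8e89d5fa-6ca8-4b60-8927-f484d2a9ae04.png)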
-
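-## Automatic captioning
-Skip this section if you train with tags only, without captions.
-
-If you prepare captions manually, place them in the same directory as the training images, with the same file name and an extension such as .caption. Each file should be a text file containing a single line.
-
-### Captioning with BLIP
-
-In the latest version, downloading BLIP, downloading the weights, and adding a virtual environment are no longer necessary. It works as is.
-
-Run make_captions.py in the finetune folder.
-
-```
-python finetune\make_captions.py --batch_size <batch size> <training data folder>
-```
-
-With a batch size of 8 and the training data placed in train_data in the parent folder, the command is as follows.
-
-```
-python finetune\make_captions.py --batch_size 8 ..\train_data
-```
-
-Caption files are created in the same directory as the training images, with the same file name and the extension .caption.
-
-Increase or decrease batch_size according to the VRAM capacity of your GPU. Larger is faster (it can probably be raised a little further even with 12 GB of VRAM).
-The max_length option specifies the maximum caption length. The default is 75. It may be worth increasing it when training the model with a token length of 225.
-The caption_extension option changes the caption file extension. The default is .caption (using .txt conflicts with DeepDanbooru, described later).
-
-If there are multiple training data folders, run the command for each folder.
-
-Inference involves randomness, so the results change with each run. To fix them, specify a random seed with the --seed option, e.g. "--seed 42".
-
-For other options, see the help with --help (the meaning of the parameters does not seem to be documented in one place, so the only way is to read the source).
-
-By default, caption files are generated with the extension .caption.
-
-![Folder with generated captions](https://user-images.githubusercontent.com/52813779/208908845-48a9d36c-f6ee-4dae-af71-9ab462d1459e.png)
-
-For example, captions like the following are produced.
-
-![Captions and images](https://user-images.githubusercontent.com/52813779/208908947-af936957-5d73-4339-b6c8-945a52857373.png)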
-
-## Tagging with DeepDanbooru
-If you do not tag with danbooru tags at all, go on to "Preprocessing captions and tag information".
-
-Tagging is done with DeepDanbooru or WD14Tagger. WD14Tagger seems to be more accurate. To tag with WD14Tagger, go on to the next chapter.
-
-### Environment setup
-Clone DeepDanbooru https://github.com/KichangKim/DeepDanbooru  into your working folder, or download the zip and extract it. I extracted the zip.
-Also, from "DeepDanbooru Pretrained Model v3-20211112-sgd-e28" on DeepDanbooru's Releases page https://github.com/KichangKim/DeepDanbooru/releases , download deepdanbooru-v3-20211112-sgd-e28.zip from Assets and extract it into the DeepDanbooru folder.
-
-Download it from the page shown below. Click Assets to expand it, and download from there.
-
-![DeepDanbooru download page](https://user-images.githubusercontent.com/52813779/208909417-10e597df-7085-41ee-bd06-3e856a1339df.png)
-
-Arrange the directory structure as shown below.
-
-![DeepDanbooru directory structure](https://user-images.githubusercontent.com/52813779/208909486-38935d8b-8dc6-43f1-84d3-fef99bc471aa.png)
-
-Install the required libraries into the Diffusers environment. Move to the DeepDanbooru folder and install them (in practice, I think only tensorflow-io is added).
-
-```
-pip install -r requirements.txt
-```
-
-Next, install DeepDanbooru itself.
-
-```
-pip install .
-```
-
-This completes the environment setup for tagging.
-
-### Running the tagging
-Move to the DeepDanbooru folder and run deepdanbooru to tag the images.
-
-```
-deepdanbooru evaluate <training data folder> --project-path deepdanbooru-v3-20211112-sgd-e28 --allow-folder --save-txt
-```
-
-With the training data placed in train_data in the parent folder, the command is as follows.
-
-```
-deepdanbooru evaluate ../train_data --project-path deepdanbooru-v3-20211112-sgd-e28 --allow-folder --save-txt
-```
-
-Tag files are created in the same directory as the training images, with the same file name and the extension .txt. Images are processed one at a time, so it is fairly slow.
-
-If there are multiple training data folders, run the command for each folder.
-
-The files are generated as follows.
-
-![Files generated by DeepDanbooru](https://user-images.githubusercontent.com/52813779/208909855-d21b9c98-f2d3-4283-8238-5b0e5aad6691.png)
-
-Tags are attached like this (a huge amount of information...).
-
-![DeepDanbooru tags and image](https://user-images.githubusercontent.com/52813779/208909908-a7920174-266e-48d5-aaef-940aba709519.png)
-
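-## Tagging with WD14Tagger
-This is the procedure for using WD14Tagger instead of DeepDanbooru.
-
-It uses the tagger used in Automatic1111's WebUI. I referred to the information on this GitHub page (https://github.com/toriato/stable-diffusion-webui-wd14-tagger#mrsmilingwolfs-model-aka-waifu-diffusion-14-tagger ).
-
-The required modules were already installed during the initial environment setup. The weights are downloaded automatically from Hugging Face.
-
-### Running the tagging
-Run the script to tag the images.
-```
-python tag_images_by_wd14_tagger.py --batch_size <batch size> <training data folder>
-```
-
-With the training data placed in train_data in the parent folder, the command is as follows.
-```
-python tag_images_by_wd14_tagger.py --batch_size 4 ..\train_data
-```
-
-On the first run, the model files are automatically downloaded to the wd14_tagger_model folder (the folder can be changed with an option). The result looks like this.
-
-![Downloaded files](https://user-images.githubusercontent.com/52813779/208910447-f7eb0582-90d6-49d3-a666-2b508c7d1842.png)
-
-Tag files are created in the same directory as the training images, with the same file name and the extension .txt.
-
-![Generated tag files](https://user-images.githubusercontent.com/52813779/208910534-ea514373-1185-4b7d-9ae3-61eb50bc294e.png)
-
-![Tags and image](https://user-images.githubusercontent.com/52813779/208910599-29070c15-7639-474f-b3e4-06bd5a3df29e.png)
-
-The thresh option specifies the minimum confidence at which a detected tag is assigned. The default is 0.35, the same as the WD14Tagger sample. Lowering the value assigns more tags but reduces accuracy.
-Increase or decrease batch_size according to the VRAM capacity of your GPU. Larger is faster (it can probably be raised a little further even with 12 GB of VRAM). The caption_extension option changes the tag file extension. The default is .txt.
-The model_dir option specifies the folder where the model is saved.
-If the force_download option is specified, the model is re-downloaded even if the destination folder exists.
-
-If there are multiple training data folders, run the command for each folder.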
-
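-## Preprocessing captions and tag information
-
-Combine the captions and tags into a single metadata file so that the scripts can process them easily.
-
-### Preprocessing captions
-
-To put the captions into the metadata, run the following in the working folder (not needed if captions are not used for training). In practice the command is written on one line; the same applies below.
-
-```
-python merge_captions_to_metadata.py <training data folder>
-    --in_json <metadata file to read>
-    <metadata file name>
-```
-
-The metadata file name can be any name.
-With the training data in train_data, no metadata file to read, and meta_cap.json as the metadata file, the command is as follows.
-
-```
-python merge_captions_to_metadata.py train_data meta_cap.json
-```
-
-The caption_extension option specifies the caption file extension.
-
-If there are multiple training data folders, specify the full_path argument (the metadata then holds the information as full paths). Then run the command for each folder.
-
-```
-python merge_captions_to_metadata.py --full_path 
-    train_data1 meta_cap1.json
-python merge_captions_to_metadata.py --full_path --in_json meta_cap1.json 
-    train_data2 meta_cap2.json
-```
-
-If in_json is omitted and the destination metadata file already exists, it is read from there and overwritten in place.
-
-__Note: It is safer to change the in_json option and the destination each time, so that output goes to a different metadata file.__
-
-### Preprocessing tags
-
-Likewise, combine the tags into the metadata (not needed if tags are not used for training).
-```
-python merge_dd_tags_to_metadata.py <training data folder> 
-    --in_json <metadata file to read>
-    <metadata file to write>
-```
-
-With the same directory layout as before, to read meta_cap.json and write to meta_cap_dd.json, the command is as follows.
-```
-python merge_dd_tags_to_metadata.py train_data --in_json meta_cap.json meta_cap_dd.json
-```
-
-If there are multiple training data folders, specify the full_path argument. Then run the command for each folder.
-
-```
-python merge_dd_tags_to_metadata.py --full_path --in_json meta_cap2.json
-    train_data1 meta_cap_dd1.json
-python merge_dd_tags_to_metadata.py --full_path --in_json meta_cap_dd1.json 
-    train_data2 meta_cap_dd2.json
-```
-
-If in_json is omitted and the destination metadata file already exists, it is read from there and overwritten in place.
-
-__Note: It is safer to change the in_json option and the destination each time, so that output goes to a different metadata file.__
-
-### Cleaning captions and tags
-At this point the captions and DeepDanbooru tags are combined in the metadata file. However, automatically generated captions are of mixed quality, with inconsistent wording (\*), and tags contain underscores or have a rating attached (in the case of DeepDanbooru), so it is better to clean the captions and tags, for example with an editor's find-and-replace.
-
-(\*) For example, when training on anime-style girls, captions vary between girl/girls/woman/women and so on. Also, "anime girl" might be more appropriately written simply as "girl".
-
-A cleaning script is provided; edit its contents to suit your situation and use it.
-
-(Specifying the training data folder is no longer required. All data in the metadata is cleaned.)
-
-```
-python clean_captions_and_tags.py <metadata file to read> <metadata file to write>
-```
-
-Note that --in_json is not used here. For example:
-
-```
-python clean_captions_and_tags.py meta_cap_dd.json meta_clean.json
-```
-
-This completes the preprocessing of captions and tags.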
-
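-## Pre-computing latents
-
-To speed up training, obtain the latent representations of the images in advance and save them to disk. Bucketing (classifying the training data by aspect ratio) is performed at the same time.
-
-Enter the following in the working folder.
-```
-python prepare_buckets_latents.py <training data folder>  
-    <metadata file to read> <metadata file to write> 
-    <name of the model to fine-tune, or checkpoint> 
-    --batch_size <batch size> 
-    --max_resolution <resolution width,height> 
-    --mixed_precision <precision>
-```
-
-With the model model.ckpt, batch size 4, training resolution 512\*512, precision no (float32), reading metadata from meta_clean.json and writing to meta_lat.json, the command is as follows.
-
-```
-python prepare_buckets_latents.py 
-    train_data meta_clean.json meta_lat.json model.ckpt 
-    --batch_size 4 --max_resolution 512,512 --mixed_precision no
-```
-
-The latents are saved in the training data folder in NumPy npz format.
-
-When loading a Stable Diffusion 2.0 model, specify the --v2 option (--v_parameterization is not needed).
-
-The minimum resolution size can be specified with the --min_bucket_reso option and the maximum with --max_bucket_reso. The defaults are 256 and 1024 respectively. For example, if 384 is specified as the minimum size, resolutions such as 256\*1024 and 320\*768 are no longer used.
-If you raise the resolution to something like 768\*768, it is a good idea to specify a maximum size such as 1280.
-
-If the --flip_aug option is specified, left-right flip augmentation is performed. This can effectively double the amount of data, but if it is used when the data is not left-right symmetric (for example a character's appearance or hairstyle), training will not go well.
-(This is a simple implementation that also obtains latents for the flipped images and saves them as \*\_flip.npz files. No particular option is needed for fine_tune.py. If files with \_flip exist, files with and without flip are loaded at random.)
-
-The batch size can probably be raised a little further even with 12 GB of VRAM.
-Specify the resolution as "width,height", using numbers divisible by 64. The resolution directly affects memory usage during fine-tuning. With 12 GB of VRAM, 512,512 seems to be the limit (\*). With 16 GB it may be possible to go up to 512,704 or 512,768. Note that even at 256,256 and the like, 8 GB of VRAM seems to be difficult (parameters, the optimizer, and so on require a fixed amount of memory regardless of resolution).
-
-(\*) There was also a report that training with batch size 1 worked at 640,640 with 12 GB of VRAM.
-
-The bucketing results are displayed as follows.
-
-![Bucketing results](https://user-images.githubusercontent.com/52813779/208911419-71c00fbb-2ce6-49d5-89b5-b78d7715e441.png)
-
-If there are multiple training data folders, specify the full_path argument. Then run the command for each folder.
-```
-python prepare_buckets_latents.py --full_path  
-    train_data1 meta_clean.json meta_lat1.json model.ckpt 
-    --batch_size 4 --max_resolution 512,512 --mixed_precision no
-
-python prepare_buckets_latents.py --full_path 
-    train_data2 meta_lat1.json meta_lat2.json model.ckpt 
-    --batch_size 4 --max_resolution 512,512 --mixed_precision no
-
-```
-It is possible to use the same file for reading and writing, but using separate files is safer.
-
-__Note: It is safer to change the arguments each time and write to a different metadata file.__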
-
-
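-## Running the training
-For example, run as follows. The settings below are for reducing memory usage.
-```
-accelerate launch --num_cpu_threads_per_process 1 fine_tune.py 
-    --pretrained_model_name_or_path=model.ckpt 
-    --in_json meta_lat.json 
-    --train_data_dir=train_data 
-    --output_dir=fine_tuned 
-    --shuffle_caption 
-    --train_batch_size=1 --learning_rate=5e-6 --max_train_steps=10000 
-    --use_8bit_adam --xformers --gradient_checkpointing
-    --mixed_precision=bf16
-    --save_every_n_epochs=4
-```
-
-For accelerate's num_cpu_threads_per_process, specifying 1 usually seems to work well.
-
-Specify the model to train in pretrained_model_name_or_path (a Stable Diffusion checkpoint or a Diffusers model). Stable Diffusion checkpoints in .ckpt and .safetensors are supported (detected automatically from the extension).
-
-Specify in in_json the metadata file produced when the latents were cached.
-
-Specify the training data folder in train_data_dir and the output folder for the trained model in output_dir.
-
-If shuffle_caption is specified, captions and tags are shuffled in comma-separated units during training (the method used in Waifu Diffusion v1.3).
-(Some of the leading tokens can be kept fixed without shuffling. See keep_tokens under Other options.)
-
-Specify the batch size in train_batch_size. With 12 GB of VRAM, specify about 1 or 2. The usable value also depends on the resolution.
-The actual amount of data used for training is "batch size × number of steps". When the batch size is increased, the number of steps can be reduced accordingly.
-
-Specify the learning rate in learning_rate. For example, Waifu Diffusion v1.3 apparently uses 5e-6.
-Specify the number of steps in max_train_steps.
-
-If use_8bit_adam is specified, the 8-bit Adam optimizer is used. This saves memory and is faster, but accuracy may decrease.
-
-If xformers is specified, CrossAttention is replaced to save memory and speed up training.
-Note: As of 11/9, xformers raises an error in float32 training, so use bf16/fp16, or specify mem_eff_attn instead to use the memory-efficient CrossAttention (slower than xformers).
-
-gradient_checkpointing enables gradient checkpointing. It is slower but uses less memory.
-
-mixed_precision specifies whether to use mixed precision. Specifying "fp16" or "bf16" saves memory but reduces accuracy.
-"fp16" and "bf16" use almost the same amount of memory, and bf16 is said to give better training results (in my tests I did not notice much difference).
-Specifying "no" disables it (float32 is used).
-
-Note: Loading a checkpoint trained with bf16 into AUTOMATIC1111's Web UI apparently causes an error. This seems to be because the bfloat16 data type causes an error in the Web UI's model safety checker. Specify the save_precision option to save in fp16 or float32 format. Alternatively, saving in safetensors format also seems to work.
-
-If save_every_n_epochs is specified, the model being trained is saved every time that number of epochs elapses.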
-
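-### Stable Diffusion 2.0 support
-When using Hugging Face's stable-diffusion-2-base, specify the --v2 option; when using stable-diffusion-2 or 768-v-ema.ckpt, specify both the --v2 and --v_parameterization options.
-
-### Improving accuracy and speed when memory is available
-First, removing gradient_checkpointing increases speed. However, the batch size that can be set decreases, so choose the setting while weighing accuracy against speed.
-
-Increasing the batch size improves speed and accuracy. Increase it within the range that memory allows while checking the speed per data item (when memory is nearly full, speed can actually drop).
-
-### Changing which CLIP output is used
-If 2 is specified for the clip_skip option, the output of the second-to-last layer is used. If 1 is specified or the option is omitted, the last layer is used.
-The trained model should be usable for inference in Automatic1111's Web UI.
-
-Note: SD2.0 uses the second-to-last layer by default, so do not specify this option when training SD2.0.
-
-If the model to be trained was originally trained to use the second-to-last layer, specifying 2 is a good choice.
-
-If it instead used the last layer, the whole model was trained on that assumption. Training it again using the second-to-last layer may therefore require a fair number of training images and longer training to get the desired results.
-
-### Extending the token length
-Specifying 150 or 225 for max_token_length lets you train with an extended token length.
-The trained model should be usable for inference in Automatic1111's Web UI.
-
-As with clip_skip, training with a length different from the one the model was trained with will likely require a fair number of training images and a longer training time.
-
-### Saving training logs
-Specify the log folder in the logging_dir option. Logs in TensorBoard format are saved.
-
-For example, specifying --logging_dir=logs creates a logs folder in the working folder, and logs are saved in a date-time folder inside it.
-Also, if the --log_prefix option is specified, the given string is added before the date-time. Use something like "--logging_dir=logs --log_prefix=fine_tune_style1" for identification.
-
-To view the logs in TensorBoard, open another command prompt and enter the following in the working folder (tensorboard should be installed along with Diffusers, but if it is not, install it with pip install tensorboard).
-```
-tensorboard --logdir=logs
-```
-
-### Hypernetwork training
-To be explained in a separate article.
-
-### Training with fp16 gradients (experimental)
-If the full_fp16 option is specified, gradients are changed from the usual float32 to float16 (fp16) for training (this appears to be full fp16 training rather than mixed precision). This reportedly allows training with less than 8 GB of VRAM for SD1.x at 512\*512 and less than 12 GB for SD2.x at 512\*512.
-
-Specify fp16 in accelerate config beforehand, and set mixed_precision="fp16" as an option (it does not work with bf16).
-
-To minimize memory usage, specify the xformers, use_8bit_adam, and gradient_checkpointing options and set train_batch_size to 1.
-(If there is headroom, gradually increasing train_batch_size should improve accuracy slightly.)
-
-This is achieved forcibly by patching the PyTorch source (verified with PyTorch 1.12.1 and 1.13.0). Accuracy drops considerably, and the probability of training failing partway also increases. The learning rate and step count settings also seem to be sensitive. Use it at your own risk with these points in mind.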
-
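-### Other options
-
-#### keep_tokens
-If a number is specified, that many tokens (comma-separated strings) from the beginning of the caption are kept fixed and not shuffled.
-
-When both a caption and tags exist, the training prompt is concatenated as "caption,tag1,tag2...", so with "--keep_tokens=1" the caption always comes first during training.
-
-#### dataset_repeats
-When the dataset has very few images, an epoch ends almost immediately (and each epoch boundary takes some time), so specify a number to multiply the data and lengthen the epoch.
-
-#### train_text_encoder
-Also trains the Text Encoder. Memory usage increases slightly.
-
-In ordinary fine-tuning the Text Encoder is not trained (probably because the U-Net is trained to follow the Text Encoder's output), but when the amount of training data is small, training the Text Encoder as DreamBooth does also seems to be effective.
-
-#### save_precision
-Specifies the data format used when saving checkpoints: float, fp16, or bf16 (if unspecified, the same as the data format during training). This saves disk space, but the images generated by the model will differ. Also, specifying float or fp16 should make the checkpoint readable in AUTOMATIC1111's Web UI.
-
-Note: The VAE keeps the data format of the original checkpoint, so even with fp16 the model size may not shrink to just over 2 GB.
-
-#### save_model_as
-Specifies the format in which the model is saved. Specify one of ckpt, safetensors, diffusers, or diffusers_safetensors.
-
-When a Stable Diffusion format (ckpt or safetensors) is loaded and saved in Diffusers format, missing information is filled in by downloading the v1.5 or v2.1 information from Hugging Face.
-
-#### use_safetensors
-If this option is specified, checkpoints are saved in safetensors format. The save format remains the default (the same as the format that was loaded).
-
-#### save_state and resume
-With the save_state option, the training state such as the optimizer is saved to a folder in addition to the checkpoint, at intermediate saves and at the final save. This avoids the loss of accuracy when training is interrupted and resumed (the optimizer optimizes while keeping state, so if that state is reset it has to start optimizing from the initial state again). Note that, due to how Accelerate works, the step count is not saved.
-
-When launching the script, specify the folder where the state was saved with the resume option to resume.
-
-The training state is about 5 GB per save, so watch your disk space.
-
-#### gradient_accumulation_steps
-Accumulates gradients and updates once per the specified number of steps. This has an effect similar to increasing the batch size, but consumes slightly more memory.
-
-Note: Accelerate reportedly does not support this when multiple models are being trained, so specifying a value of 2 or more for this option while training the Text Encoder may cause an error.
-
-#### lr_scheduler / lr_warmup_steps
-The lr_scheduler option selects the learning rate scheduler from linear, cosine, cosine_with_restarts, polynomial, constant, and constant_with_warmup. The default is constant.
-
-lr_warmup_steps specifies the number of scheduler warmup steps (during which the learning rate is gradually changed). Look up the details yourself.
-
-#### diffusers_xformers
-Uses Diffusers' xformers feature instead of the script's own xformers replacement. Hypernetwork training is not possible with this option.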
--- a/finetune/blip/blip.py
+++ b/finetune/blip/blip.py
@@ -21,6 +21,10 @@ import torch.nn.functional as F
 import os
 from urllib.parse import urlparse
 from timm.models.hub import download_cached_file
+from library.utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)

 class BLIP_Base(nn.Module):
    def __init__(self,                 
@@ -130,8 +134,9 @@ class BLIP_Decoder(nn.Module):
    def generate(self, image, sample=False, num_beams=3, max_length=30, min_length=10, top_p=0.9, repetition_penalty=1.0):
        image_embeds = self.visual_encoder(image)

-        if not sample:
-            image_embeds = image_embeds.repeat_interleave(num_beams,dim=0)
+        # recent version of transformers seems to do repeat_interleave automatically
+        # if not sample:
+        #     image_embeds = image_embeds.repeat_interleave(num_beams,dim=0)
            
        image_atts = torch.ones(image_embeds.size()[:-1],dtype=torch.long).to(image.device)
        model_kwargs = {"encoder_hidden_states": image_embeds, "encoder_attention_mask":image_atts}
@@ -235,6 +240,6 @@ def load_checkpoint(model,url_or_filename):
                del state_dict[key]
    
    msg = model.load_state_dict(state_dict,strict=False)
-    print('load checkpoint from %s'%url_or_filename)  
+    logger.info('load checkpoint from %s'%url_or_filename)  
    return model,msg
    
--- a/finetune/clean_captions_and_tags.py
+++ b/finetune/clean_captions_and_tags.py
@@ -8,6 +8,10 @@ import json
 import re

 from tqdm import tqdm
+from library.utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)

 PATTERN_HAIR_LENGTH = re.compile(r', (long|short|medium) hair, ')
 PATTERN_HAIR_CUT = re.compile(r', (bob|hime) cut, ')
@@ -36,13 +40,13 @@ def clean_tags(image_key, tags):
  tokens = tags.split(", rating")
  if len(tokens) == 1:
    # This branch is taken for the WD14 tagger, so no message is printed
-    # print("no rating:")
-    # print(f"{image_key} {tags}")
+    # logger.info("no rating:")
+    # logger.info(f"{image_key} {tags}")
    pass
  else:
    if len(tokens) > 2:
-      print("multiple ratings:")
-      print(f"{image_key} {tags}")
+      logger.info("multiple ratings:")
+      logger.info(f"{image_key} {tags}")
    tags = tokens[0]

  tags = ", " + tags.replace(", ", ", , ") + ", "     # blunt workaround so that searches can include the surrounding commas
@@ -124,58 +128,64 @@ def clean_caption(caption):

 def main(args):
  if os.path.exists(args.in_json):
-    print(f"loading existing metadata: {args.in_json}")
+    logger.info(f"loading existing metadata: {args.in_json}")
    with open(args.in_json, "rt", encoding='utf-8') as f:
      metadata = json.load(f)
  else:
-    print("no metadata / メタデータファイルがありません")
+    logger.error("no metadata / メタデータファイルがありません")
    return

-  print("cleaning captions and tags.")
+  logger.info("cleaning captions and tags.")
  image_keys = list(metadata.keys())
  for image_key in tqdm(image_keys):
    tags = metadata[image_key].get('tags')
    if tags is None:
-      print(f"image does not have tags / メタデータにタグがありません: {image_key}")
+      logger.error(f"image does not have tags / メタデータにタグがありません: {image_key}")
    else:
      org = tags
      tags = clean_tags(image_key, tags)
      metadata[image_key]['tags'] = tags
      if args.debug and org != tags:
-        print("FROM: " + org)
-        print("TO:   " + tags)
+        logger.info("FROM: " + org)
+        logger.info("TO:   " + tags)

    caption = metadata[image_key].get('caption')
    if caption is None:
-      print(f"image does not have caption / メタデータにキャプションがありません: {image_key}")
+      logger.error(f"image does not have caption / メタデータにキャプションがありません: {image_key}")
    else:
      org = caption
      caption = clean_caption(caption)
      metadata[image_key]['caption'] = caption
      if args.debug and org != caption:
-        print("FROM: " + org)
-        print("TO:   " + caption)
+        logger.info("FROM: " + org)
+        logger.info("TO:   " + caption)

  # metadataを書き出して終わり
-  print(f"writing metadata: {args.out_json}")
+  logger.info(f"writing metadata: {args.out_json}")
  with open(args.out_json, "wt", encoding='utf-8') as f:
    json.dump(metadata, f, indent=2)
-  print("done!")
+  logger.info("done!")


-if __name__ == '__main__':
+def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser()
  # parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
  parser.add_argument("in_json", type=str, help="metadata file to input / 読み込むメタデータファイル")
  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
  parser.add_argument("--debug", action="store_true", help="debug mode")

+  return parser
+
+
+if __name__ == '__main__':
+  parser = setup_parser()
+
  args, unknown = parser.parse_known_args()
  if len(unknown) == 1:
-    print("WARNING: train_data_dir argument is removed. This script will not work with three arguments in future. Please specify two arguments: in_json and out_json.")
-    print("All captions and tags in the metadata are processed.")
-    print("警告: train_data_dir引数は不要になりました。将来的には三つの引数を指定すると動かなくなる予定です。読み込み元のメタデータと書き出し先の二つの引数だけ指定してください。")
-    print("メタデータ内のすべてのキャプションとタグが処理されます。")
+    logger.warning("WARNING: train_data_dir argument has been removed. This script will stop working with three arguments in the future. Please specify two arguments: in_json and out_json.")
+    logger.warning("All captions and tags in the metadata are processed.")
+    logger.warning("警告: train_data_dir引数は不要になりました。将来的には三つの引数を指定すると動かなくなる予定です。読み込み元のメタデータと書き出し先の二つの引数だけ指定してください。")
+    logger.warning("メタデータ内のすべてのキャプションとタグが処理されます。")
    args.in_json = args.out_json
    args.out_json = unknown[0]
  elif len(unknown) > 0:
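Note on the backward-compatibility branch above: parse_known_args returns surplus positionals in a separate list, so the legacy three-argument call (train_data_dir in_json out_json) can be detected and shifted into the two-argument form. A minimal sketch, assuming a reduced parser with only the two positionals; parse_with_legacy_support is a hypothetical name, and raising ValueError for other leftovers is an assumption because the original branch body is not visible in this hunk.

```python
import argparse


def setup_parser() -> argparse.ArgumentParser:
    # Reduced version of the parser in the diff: only the two positionals.
    parser = argparse.ArgumentParser()
    parser.add_argument("in_json", type=str)
    parser.add_argument("out_json", type=str)
    return parser


def parse_with_legacy_support(argv):
    """Accept `in_json out_json` and the legacy `train_data_dir in_json out_json`."""
    args, unknown = setup_parser().parse_known_args(argv)
    if len(unknown) == 1:
        # Legacy form: the first positional was train_data_dir, so shift left by one.
        args.in_json = args.out_json
        args.out_json = unknown[0]
    elif len(unknown) > 0:
        # Assumption: anything else is treated as an error in this sketch.
        raise ValueError(f"unrecognized arguments: {unknown}")
    return args
```

Calling parse_with_legacy_support(["images", "meta_in.json", "meta_out.json"]) yields the same in_json and out_json as the two-argument call without the directory.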
--- a/finetune/make_captions.py
+++ b/finetune/make_captions.py
@@ -3,160 +3,208 @@ import glob
 import os
 import json
 import random
+import sys

+from pathlib import Path
 from PIL import Image
 from tqdm import tqdm
 import numpy as np
+
 import torch
+from library.device_utils import init_ipex, get_preferred_device
+init_ipex()
+
 from torchvision import transforms
 from torchvision.transforms.functional import InterpolationMode
-from blip.blip import blip_decoder
+sys.path.append(os.path.dirname(__file__))
+from blip.blip import blip_decoder, is_url
 import library.train_util as train_util
+from library.utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)

-DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+DEVICE = get_preferred_device()


 IMAGE_SIZE = 384

 # 正方形でいいのか？　という気がするがソースがそうなので
-IMAGE_TRANSFORM = transforms.Compose([
-    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE), interpolation=InterpolationMode.BICUBIC),
-    transforms.ToTensor(),
-    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
-])
+IMAGE_TRANSFORM = transforms.Compose(
+    [
+        transforms.Resize((IMAGE_SIZE, IMAGE_SIZE), interpolation=InterpolationMode.BICUBIC),
+        transforms.ToTensor(),
+        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
+    ]
+)
+

 # 共通化したいが微妙に処理が異なる……
 class ImageLoadingTransformDataset(torch.utils.data.Dataset):
-  def __init__(self, image_paths):
-    self.images = image_paths
+    def __init__(self, image_paths):
+        self.images = image_paths

-  def __len__(self):
-    return len(self.images)
+    def __len__(self):
+        return len(self.images)

-  def __getitem__(self, idx):
-    img_path = self.images[idx]
+    def __getitem__(self, idx):
+        img_path = self.images[idx]

-    try:
-      image = Image.open(img_path).convert("RGB")
-      # convert to tensor temporarily so dataloader will accept it
-      tensor = IMAGE_TRANSFORM(image)
-    except Exception as e:
-      print(f"Could not load image path / 画像を読み込めません: {img_path}, error: {e}")
-      return None
+        try:
+            image = Image.open(img_path).convert("RGB")
+            # convert to tensor temporarily so dataloader will accept it
+            tensor = IMAGE_TRANSFORM(image)
+        except Exception as e:
+            logger.error(f"Could not load image path / 画像を読み込めません: {img_path}, error: {e}")
+            return None

-    return (tensor, img_path)
+        return (tensor, img_path)


 def collate_fn_remove_corrupted(batch):
-  """Collate function that allows to remove corrupted examples in the
-  dataloader. It expects that the dataloader returns 'None' when that occurs.
-  The 'None's in the batch are removed.
-  """
-  # Filter out all the Nones (corrupted examples)
-  batch = list(filter(lambda x: x is not None, batch))
-  return batch
+    """Collate function that removes corrupted examples from a batch.
+    It expects the dataset to return None for an example that failed to load.
+    All None entries in the batch are dropped.
+    """
+    # Filter out all the Nones (corrupted examples)
+    batch = list(filter(lambda x: x is not None, batch))
+    return batch


 def main(args):
-  # fix the seed for reproducibility
-  seed = args.seed  # + utils.get_rank()
-  torch.manual_seed(seed)
-  np.random.seed(seed)
-  random.seed(seed)
+    # fix the seed for reproducibility
+    seed = args.seed  # + utils.get_rank()
+    torch.manual_seed(seed)
+    np.random.seed(seed)
+    random.seed(seed)

-  if not os.path.exists("blip"):
-    args.train_data_dir = os.path.abspath(args.train_data_dir)        # convert to absolute path
+    if not os.path.exists("blip"):
+        args.train_data_dir = os.path.abspath(args.train_data_dir)  # convert to absolute path

-    cwd = os.getcwd()
-    print('Current Working Directory is: ', cwd)
-    os.chdir('finetune')
+        cwd = os.getcwd()
+        logger.info(f"Current Working Directory is: {cwd}")
+        os.chdir("finetune")
+        if not is_url(args.caption_weights) and not os.path.isfile(args.caption_weights):
+            args.caption_weights = os.path.join("..", args.caption_weights)

-  print(f"load images from {args.train_data_dir}")
-  image_paths = train_util.glob_images(args.train_data_dir)
-  print(f"found {len(image_paths)} images.")
+    logger.info(f"load images from {args.train_data_dir}")
+    train_data_dir_path = Path(args.train_data_dir)
+    image_paths = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+    logger.info(f"found {len(image_paths)} images.")

-  print(f"loading BLIP caption: {args.caption_weights}")
-  model = blip_decoder(pretrained=args.caption_weights, image_size=IMAGE_SIZE, vit='large', med_config="./blip/med_config.json")
-  model.eval()
-  model = model.to(DEVICE)
-  print("BLIP loaded")
+    logger.info(f"loading BLIP caption: {args.caption_weights}")
+    model = blip_decoder(pretrained=args.caption_weights, image_size=IMAGE_SIZE, vit="large", med_config="./blip/med_config.json")
+    model.eval()
+    model = model.to(DEVICE)
+    logger.info("BLIP loaded")

-  # captioningする
-  def run_batch(path_imgs):
-    imgs = torch.stack([im for _, im in path_imgs]).to(DEVICE)
+    # run captioning
+    def run_batch(path_imgs):
+        imgs = torch.stack([im for _, im in path_imgs]).to(DEVICE)

-    with torch.no_grad():
-      if args.beam_search:
-        captions = model.generate(imgs, sample=False, num_beams=args.num_beams,
-                                  max_length=args.max_length, min_length=args.min_length)
-      else:
-        captions = model.generate(imgs, sample=True, top_p=args.top_p, max_length=args.max_length, min_length=args.min_length)
+        with torch.no_grad():
+            if args.beam_search:
+                captions = model.generate(
+                    imgs, sample=False, num_beams=args.num_beams, max_length=args.max_length, min_length=args.min_length
+                )
+            else:
+                captions = model.generate(
+                    imgs, sample=True, top_p=args.top_p, max_length=args.max_length, min_length=args.min_length
+                )

-    for (image_path, _), caption in zip(path_imgs, captions):
-      with open(os.path.splitext(image_path)[0] + args.caption_extension, "wt", encoding='utf-8') as f:
-        f.write(caption + "\n")
-        if args.debug:
-          print(image_path, caption)
+        for (image_path, _), caption in zip(path_imgs, captions):
+            with open(os.path.splitext(image_path)[0] + args.caption_extension, "wt", encoding="utf-8") as f:
+                f.write(caption + "\n")
+                if args.debug:
+                    logger.info(f'{image_path} {caption}')

-  # 読み込みの高速化のためにDataLoaderを使うオプション
-  if args.max_data_loader_n_workers is not None:
-    dataset = ImageLoadingTransformDataset(image_paths)
-    data = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=False,
-                                      num_workers=args.max_data_loader_n_workers, collate_fn=collate_fn_remove_corrupted, drop_last=False)
-  else:
-    data = [[(None, ip)] for ip in image_paths]
+    # option to use a DataLoader to speed up image loading
+    if args.max_data_loader_n_workers is not None:
+        dataset = ImageLoadingTransformDataset(image_paths)
+        data = torch.utils.data.DataLoader(
+            dataset,
+            batch_size=args.batch_size,
+            shuffle=False,
+            num_workers=args.max_data_loader_n_workers,
+            collate_fn=collate_fn_remove_corrupted,
+            drop_last=False,
+        )
+    else:
+        data = [[(None, ip)] for ip in image_paths]

-  b_imgs = []
-  for data_entry in tqdm(data, smoothing=0.0):
-    for data in data_entry:
-      if data is None:
-        continue
+    b_imgs = []
+    for data_entry in tqdm(data, smoothing=0.0):
+        for data in data_entry:
+            if data is None:
+                continue

-      img_tensor, image_path = data
-      if img_tensor is None:
-        try:
-          raw_image = Image.open(image_path)
-          if raw_image.mode != 'RGB':
-            raw_image = raw_image.convert("RGB")
-          img_tensor = IMAGE_TRANSFORM(raw_image)
-        except Exception as e:
-          print(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
-          continue
+            img_tensor, image_path = data
+            if img_tensor is None:
+                try:
+                    raw_image = Image.open(image_path)
+                    if raw_image.mode != "RGB":
+                        raw_image = raw_image.convert("RGB")
+                    img_tensor = IMAGE_TRANSFORM(raw_image)
+                except Exception as e:
+                    logger.error(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
+                    continue

-      b_imgs.append((image_path, img_tensor))
-      if len(b_imgs) >= args.batch_size:
+            b_imgs.append((image_path, img_tensor))
+            if len(b_imgs) >= args.batch_size:
+                run_batch(b_imgs)
+                b_imgs.clear()
+    if len(b_imgs) > 0:
        run_batch(b_imgs)
-        b_imgs.clear()
-  if len(b_imgs) > 0:
-    run_batch(b_imgs)

-  print("done!")
+    logger.info("done!")


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
-  parser.add_argument("--caption_weights", type=str, default="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth",
-                      help="BLIP caption weights (model_large_caption.pth) / BLIP captionの重みファイル(model_large_caption.pth)")
-  parser.add_argument("--caption_extention", type=str, default=None,
-                      help="extension of caption file (for backward compatibility) / 出力されるキャプションファイルの拡張子（スペルミスしていたのを残してあります）")
-  parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 出力されるキャプションファイルの拡張子")
-  parser.add_argument("--beam_search", action="store_true",
-                      help="use beam search (default Nucleus sampling) / beam searchを使う（このオプション未指定時はNucleus sampling）")
-  parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
-  parser.add_argument("--max_data_loader_n_workers", type=int, default=None,
-                      help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）")
-  parser.add_argument("--num_beams", type=int, default=1, help="num of beams in beam search /beam search時のビーム数（多いと精度が上がるが時間がかかる）")
-  parser.add_argument("--top_p", type=float, default=0.9, help="top_p in Nucleus sampling / Nucleus sampling時のtop_p")
-  parser.add_argument("--max_length", type=int, default=75, help="max length of caption / captionの最大長")
-  parser.add_argument("--min_length", type=int, default=5, help="min length of caption / captionの最小長")
-  parser.add_argument('--seed', default=42, type=int, help='seed for reproducibility / 再現性を確保するための乱数seed')
-  parser.add_argument("--debug", action="store_true", help="debug mode")
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+    parser.add_argument(
+        "--caption_weights",
+        type=str,
+        default="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth",
+        help="BLIP caption weights (model_large_caption.pth) / BLIP captionの重みファイル(model_large_caption.pth)",
+    )
+    parser.add_argument(
+        "--caption_extention",
+        type=str,
+        default=None,
+        help="extension of caption file (for backward compatibility) / 出力されるキャプションファイルの拡張子（スペルミスしていたのを残してあります）",
+    )
+    parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 出力されるキャプションファイルの拡張子")
+    parser.add_argument(
+        "--beam_search",
+        action="store_true",
+        help="use beam search (default Nucleus sampling) / beam searchを使う（このオプション未指定時はNucleus sampling）",
+    )
+    parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
+    parser.add_argument(
+        "--max_data_loader_n_workers",
+        type=int,
+        default=None,
+        help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）",
+    )
+    parser.add_argument("--num_beams", type=int, default=1, help="number of beams in beam search / beam search時のビーム数（多いと精度が上がるが時間がかかる）")
+    parser.add_argument("--top_p", type=float, default=0.9, help="top_p in Nucleus sampling / Nucleus sampling時のtop_p")
+    parser.add_argument("--max_length", type=int, default=75, help="max length of caption / captionの最大長")
+    parser.add_argument("--min_length", type=int, default=5, help="min length of caption / captionの最小長")
+    parser.add_argument("--seed", default=42, type=int, help="seed for reproducibility / 再現性を確保するための乱数seed")
+    parser.add_argument("--debug", action="store_true", help="debug mode")
+    parser.add_argument("--recursive", action="store_true", help="search for images in subfolders recursively / サブフォルダを再帰的に検索する")

-  args = parser.parse_args()
+    return parser

-  # スペルミスしていたオプションを復元する
-  if args.caption_extention is not None:
-    args.caption_extension = args.caption_extention

-  main(args)
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+
+    # fall back to the misspelled option (kept for backward compatibility)
+    if args.caption_extention is not None:
+        args.caption_extension = args.caption_extention
+
+    main(args)
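Note on the batching loop in main() above: items are accumulated in b_imgs, a batch is run once batch_size is reached, and the remainder is flushed after the loop, while None entries (images that failed to load) are skipped. A minimal sketch of the same control flow in plain Python, so it runs without torch; batched_skipping_none is a hypothetical name and not part of the script.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def batched_skipping_none(items: Iterable[Optional[T]], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items, dropping None entries (failed loads)."""
    batch: List[T] = []
    for item in items:
        if item is None:
            continue  # same role as collate_fn_remove_corrupted: ignore corrupted examples
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []  # fresh list, so batches already yielded are not mutated
    if batch:
        yield batch  # final, possibly smaller batch
```

The script itself reuses one list and calls b_imgs.clear() after each run_batch; the generator form here allocates a new list instead, which is a design choice of the sketch and avoids handing out a list that is later emptied.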
--- a/finetune/make_captions_by_git.py
+++ b/finetune/make_captions_by_git.py
@@ -2,144 +2,182 @@ import argparse
 import os
 import re

+from pathlib import Path
 from PIL import Image
 from tqdm import tqdm
+
 import torch
+from library.device_utils import init_ipex, get_preferred_device
+init_ipex()
+
 from transformers import AutoProcessor, AutoModelForCausalLM
 from transformers.generation.utils import GenerationMixin

 import library.train_util as train_util
+from library.utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)

-
-DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

 PATTERN_REPLACE = [
    re.compile(r'(has|with|and) the (words?|letters?|name) (" ?[^"]*"|\w+)( ?(is )?(on|in) (the |her |their |him )?\w+)?'),
    re.compile(r'(with a sign )?that says ?(" ?[^"]*"|\w+)( ?on it)?'),
    re.compile(r"(with a sign )?that says ?(' ?(i'm)?[^']*'|\w+)( ?on it)?"),
-    re.compile(r'with the number \d+ on (it|\w+ \w+)'),
+    re.compile(r"with the number \d+ on (it|\w+ \w+)"),
    re.compile(r'with the words "'),
-    re.compile(r'word \w+ on it'),
-    re.compile(r'that says the word \w+ on it'),
-    re.compile('that says\'the word "( on it)?'),
+    re.compile(r"word \w+ on it"),
+    re.compile(r"that says the word \w+ on it"),
+    re.compile("that says'the word \"( on it)?"),
 ]

 # 誤検知しまくりの with the word xxxx を消す


 def remove_words(captions, debug):
-  removed_caps = []
-  for caption in captions:
-    cap = caption
-    for pat in PATTERN_REPLACE:
-      cap = pat.sub("", cap)
-    if debug and cap != caption:
-      print(caption)
-      print(cap)
-    removed_caps.append(cap)
-  return removed_caps
+    removed_caps = []
+    for caption in captions:
+        cap = caption
+        for pat in PATTERN_REPLACE:
+            cap = pat.sub("", cap)
+        if debug and cap != caption:
+            logger.info(caption)
+            logger.info(cap)
+        removed_caps.append(cap)
+    return removed_caps


 def collate_fn_remove_corrupted(batch):
-  """Collate function that allows to remove corrupted examples in the
-  dataloader. It expects that the dataloader returns 'None' when that occurs.
-  The 'None's in the batch are removed.
-  """
-  # Filter out all the Nones (corrupted examples)
-  batch = list(filter(lambda x: x is not None, batch))
-  return batch
+    """Collate function that removes corrupted examples from a batch.
+    It expects the dataset to return None for an example that failed to load.
+    All None entries in the batch are dropped.
+    """
+    # Filter out all the Nones (corrupted examples)
+    batch = list(filter(lambda x: x is not None, batch))
+    return batch


 def main(args):
-  # GITにバッチサイズが1より大きくても動くようにパッチを当てる: transformers 4.26.0用
-  org_prepare_input_ids_for_generation = GenerationMixin._prepare_input_ids_for_generation
-  curr_batch_size = [args.batch_size]         # ループの最後で件数がbatch_size未満になるので入れ替えられるように
+    r"""
+    transformers 4.30.2 works with batch size > 1 as is, so the patch below is commented out

-  # input_idsがバッチサイズと同じ件数である必要がある：バッチサイズはこの関数から参照できないので外から渡す
-  # ここより上で置き換えようとするとすごく大変
-  def _prepare_input_ids_for_generation_patch(self, bos_token_id, encoder_outputs):
-    input_ids = org_prepare_input_ids_for_generation(self, bos_token_id, encoder_outputs)
-    if input_ids.size()[0] != curr_batch_size[0]:
-      input_ids = input_ids.repeat(curr_batch_size[0], 1)
-    return input_ids
-  GenerationMixin._prepare_input_ids_for_generation = _prepare_input_ids_for_generation_patch
+    # patch GIT so that it works with batch size > 1: for transformers 4.26.0
+    org_prepare_input_ids_for_generation = GenerationMixin._prepare_input_ids_for_generation
+    curr_batch_size = [args.batch_size]  # kept in a list so it can be replaced when the last batch holds fewer than batch_size items

-  print(f"load images from {args.train_data_dir}")
-  image_paths = train_util.glob_images(args.train_data_dir)
-  print(f"found {len(image_paths)} images.")
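+    # input_ids must have as many rows as the batch size; the batch size is not visible inside this function, so it is passed in from outside
+    # patching at a higher level than this would be very hard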
+    def _prepare_input_ids_for_generation_patch(self, bos_token_id, encoder_outputs):
+        input_ids = org_prepare_input_ids_for_generation(self, bos_token_id, encoder_outputs)
+        if input_ids.size()[0] != curr_batch_size[0]:
+            input_ids = input_ids.repeat(curr_batch_size[0], 1)
+        return input_ids

-  # できればcacheに依存せず明示的にダウンロードしたい
-  print(f"loading GIT: {args.model_id}")
-  git_processor = AutoProcessor.from_pretrained(args.model_id)
-  git_model = AutoModelForCausalLM.from_pretrained(args.model_id).to(DEVICE)
-  print("GIT loaded")
+    GenerationMixin._prepare_input_ids_for_generation = _prepare_input_ids_for_generation_patch
+    """

-  # captioningする
-  def run_batch(path_imgs):
-    imgs = [im for _, im in path_imgs]
+    logger.info(f"load images from {args.train_data_dir}")
+    train_data_dir_path = Path(args.train_data_dir)
+    image_paths = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+    logger.info(f"found {len(image_paths)} images.")

-    curr_batch_size[0] = len(path_imgs)
-    inputs = git_processor(images=imgs, return_tensors="pt").to(DEVICE)           # 画像はpil形式
-    generated_ids = git_model.generate(pixel_values=inputs.pixel_values, max_length=args.max_length)
-    captions = git_processor.batch_decode(generated_ids, skip_special_tokens=True)
+    # ideally download explicitly instead of relying on the cache
+    logger.info(f"loading GIT: {args.model_id}")
+    git_processor = AutoProcessor.from_pretrained(args.model_id)
+    git_model = AutoModelForCausalLM.from_pretrained(args.model_id).to(DEVICE)
+    logger.info("GIT loaded")

-    if args.remove_words:
-      captions = remove_words(captions, args.debug)
+    # run captioning
+    def run_batch(path_imgs):
+        imgs = [im for _, im in path_imgs]

-    for (image_path, _), caption in zip(path_imgs, captions):
-      with open(os.path.splitext(image_path)[0] + args.caption_extension, "wt", encoding='utf-8') as f:
-        f.write(caption + "\n")
-        if args.debug:
-          print(image_path, caption)
+        # curr_batch_size[0] = len(path_imgs)
+        inputs = git_processor(images=imgs, return_tensors="pt").to(DEVICE)  # images are in PIL format
+        generated_ids = git_model.generate(pixel_values=inputs.pixel_values, max_length=args.max_length)
+        captions = git_processor.batch_decode(generated_ids, skip_special_tokens=True)

-  # 読み込みの高速化のためにDataLoaderを使うオプション
-  if args.max_data_loader_n_workers is not None:
-    dataset = train_util.ImageLoadingDataset(image_paths)
-    data = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=False,
-                                       num_workers=args.max_data_loader_n_workers, collate_fn=collate_fn_remove_corrupted, drop_last=False)
-  else:
-    data = [[(None, ip)] for ip in image_paths]
+        if args.remove_words:
+            captions = remove_words(captions, args.debug)

-  b_imgs = []
-  for data_entry in tqdm(data, smoothing=0.0):
-    for data in data_entry:
-      if data is None:
-        continue
+        for (image_path, _), caption in zip(path_imgs, captions):
+            with open(os.path.splitext(image_path)[0] + args.caption_extension, "wt", encoding="utf-8") as f:
+                f.write(caption + "\n")
+                if args.debug:
+                    logger.info(f"{image_path} {caption}")

-      image, image_path = data
-      if image is None:
-        try:
-          image = Image.open(image_path)
-          if image.mode != 'RGB':
-            image = image.convert("RGB")
-        except Exception as e:
-          print(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
-          continue
+    # option to use a DataLoader to speed up image loading
+    if args.max_data_loader_n_workers is not None:
+        dataset = train_util.ImageLoadingDataset(image_paths)
+        data = torch.utils.data.DataLoader(
+            dataset,
+            batch_size=args.batch_size,
+            shuffle=False,
+            num_workers=args.max_data_loader_n_workers,
+            collate_fn=collate_fn_remove_corrupted,
+            drop_last=False,
+        )
+    else:
+        data = [[(None, ip)] for ip in image_paths]

-      b_imgs.append((image_path, image))
-      if len(b_imgs) >= args.batch_size:
+    b_imgs = []
+    for data_entry in tqdm(data, smoothing=0.0):
+        for data in data_entry:
+            if data is None:
+                continue
+
+            image, image_path = data
+            if image is None:
+                try:
+                    image = Image.open(image_path)
+                    if image.mode != "RGB":
+                        image = image.convert("RGB")
+                except Exception as e:
+                    logger.error(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
+                    continue
+
+            b_imgs.append((image_path, image))
+            if len(b_imgs) >= args.batch_size:
+                run_batch(b_imgs)
+                b_imgs.clear()
+
+    if len(b_imgs) > 0:
        run_batch(b_imgs)
-        b_imgs.clear()

-  if len(b_imgs) > 0:
-    run_batch(b_imgs)
-
-  print("done!")
+    logger.info("done!")


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
-  parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 出力されるキャプションファイルの拡張子")
-  parser.add_argument("--model_id", type=str, default="microsoft/git-large-textcaps",
-                      help="model id for GIT in Hugging Face / 使用するGITのHugging FaceのモデルID")
-  parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
-  parser.add_argument("--max_data_loader_n_workers", type=int, default=None,
-                      help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）")
-  parser.add_argument("--max_length", type=int, default=50, help="max length of caption / captionの最大長")
-  parser.add_argument("--remove_words", action="store_true",
-                      help="remove like `with the words xxx` from caption / `with the words xxx`のような部分をキャプションから削除する")
-  parser.add_argument("--debug", action="store_true", help="debug mode")
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+    parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 出力されるキャプションファイルの拡張子")
+    parser.add_argument(
+        "--model_id",
+        type=str,
+        default="microsoft/git-large-textcaps",
+        help="model id for GIT in Hugging Face / 使用するGITのHugging FaceのモデルID",
+    )
+    parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
+    parser.add_argument(
+        "--max_data_loader_n_workers",
+        type=int,
+        default=None,
+        help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）",
+    )
+    parser.add_argument("--max_length", type=int, default=50, help="max length of caption / captionの最大長")
+    parser.add_argument(
+        "--remove_words",
+        action="store_true",
+        help="remove phrases like `with the words xxx` from caption / `with the words xxx`のような部分をキャプションから削除する",
+    )
+    parser.add_argument("--debug", action="store_true", help="debug mode")
+    parser.add_argument("--recursive", action="store_true", help="search for images in subfolders recursively / サブフォルダを再帰的に検索する")

-  args = parser.parse_args()
-  main(args)
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    main(args)
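Note on remove_words above: each caption is passed through every regex in PATTERN_REPLACE and the matched phrase is replaced with an empty string. A minimal sketch using two of those patterns copied verbatim from the diff; strip_text_phrases is a hypothetical name, and the final strip() is an addition of this sketch (remove_words leaves surrounding whitespace untouched).

```python
import re

# Two of the patterns from PATTERN_REPLACE, copied verbatim from the diff.
PATTERNS = [
    re.compile(r'(with a sign )?that says ?(" ?[^"]*"|\w+)( ?on it)?'),
    re.compile(r"with the number \d+ on (it|\w+ \w+)"),
]


def strip_text_phrases(caption: str) -> str:
    """Remove 'that says ...' and 'with the number ...' phrases from one caption."""
    for pat in PATTERNS:
        caption = pat.sub("", caption)
    # Not in the original: trim the whitespace left behind by the removal.
    return caption.strip()
```

Captions that contain none of the phrases pass through unchanged, so the function is safe to apply to every caption in a batch.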
--- a/finetune/merge_captions_to_metadata.py
+++ b/finetune/merge_captions_to_metadata.py
@@ -4,64 +4,97 @@ from pathlib import Path
 from typing import List
 from tqdm import tqdm
 import library.train_util as train_util
+import os
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)


 def main(args):
-  assert not args.recursive or (args.recursive and args.full_path), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"
+    assert not args.recursive or (
+        args.recursive and args.full_path
+    ), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"

-  train_data_dir_path = Path(args.train_data_dir)
-  image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
-  print(f"found {len(image_paths)} images.")
+    train_data_dir_path = Path(args.train_data_dir)
+    image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+    logger.info(f"found {len(image_paths)} images.")

-  if args.in_json is None and Path(args.out_json).is_file():
-    args.in_json = args.out_json
+    if args.in_json is None and Path(args.out_json).is_file():
+        args.in_json = args.out_json

-  if args.in_json is not None:
-    print(f"loading existing metadata: {args.in_json}")
-    metadata = json.loads(Path(args.in_json).read_text(encoding='utf-8'))
-    print("captions for existing images will be overwritten / 既存の画像のキャプションは上書きされます")
-  else:
-    print("new metadata will be created / 新しいメタデータファイルが作成されます")
-    metadata = {}
+    if args.in_json is not None:
+        logger.info(f"loading existing metadata: {args.in_json}")
+        metadata = json.loads(Path(args.in_json).read_text(encoding="utf-8"))
+        logger.warning("captions for existing images will be overwritten / 既存の画像のキャプションは上書きされます")
+    else:
+        logger.info("new metadata will be created / 新しいメタデータファイルが作成されます")
+        metadata = {}

-  print("merge caption texts to metadata json.")
-  for image_path in tqdm(image_paths):
-    caption_path = image_path.with_suffix(args.caption_extension)
-    caption = caption_path.read_text(encoding='utf-8').strip()
+    logger.info("merge caption texts to metadata json.")
+    for image_path in tqdm(image_paths):
+        caption_path = image_path.with_suffix(args.caption_extension)
+        if not caption_path.exists():
+            caption_path = Path(str(image_path) + args.caption_extension)  # assumed fallback, e.g. img.png.caption

-    image_key = str(image_path) if args.full_path else image_path.stem
-    if image_key not in metadata:
-      metadata[image_key] = {}
+        caption = caption_path.read_text(encoding="utf-8").strip()

-    metadata[image_key]['caption'] = caption
-    if args.debug:
-      print(image_key, caption)
+        image_key = str(image_path) if args.full_path else image_path.stem
+        if image_key not in metadata:
+            metadata[image_key] = {}

-  # metadataを書き出して終わり
-  print(f"writing metadata: {args.out_json}")
-  Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding='utf-8')
-  print("done!")
+        metadata[image_key]["caption"] = caption
+        if args.debug:
+            logger.info(f"{image_key} {caption}")
+
+    # write out the metadata and finish
+    logger.info(f"writing metadata: {args.out_json}")
+    Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding="utf-8")
+    logger.info("done!")


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
-  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
-  parser.add_argument("--in_json", type=str,
-                      help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）")
-  parser.add_argument("--caption_extention", type=str, default=None,
-                      help="extension of caption file (for backward compatibility) / 読み込むキャプションファイルの拡張子（スペルミスしていたのを残してあります）")
-  parser.add_argument("--caption_extension", type=str, default=".caption", help="extension of caption file / 読み込むキャプションファイルの拡張子")
-  parser.add_argument("--full_path", action="store_true",
-                      help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）")
-  parser.add_argument("--recursive", action="store_true",
-                      help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す")
-  parser.add_argument("--debug", action="store_true", help="debug mode")
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+    parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+    parser.add_argument(
+        "--in_json",
+        type=str,
+        help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）",
+    )
+    parser.add_argument(
+        "--caption_extention",
+        type=str,
+        default=None,
+        help="extension of caption file (for backward compatibility) / 読み込むキャプションファイルの拡張子（スペルミスしていたのを残してあります）",
+    )
+    parser.add_argument(
+        "--caption_extension", type=str, default=".caption", help="extension of caption file / 読み込むキャプションファイルの拡張子"
+    )
+    parser.add_argument(
+        "--full_path",
+        action="store_true",
+        help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）",
+    )
+    parser.add_argument(
+        "--recursive",
+        action="store_true",
+        help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す",
+    )
+    parser.add_argument("--debug", action="store_true", help="debug mode")

-  args = parser.parse_args()
+    return parser

-  # スペルミスしていたオプションを復元する
-  if args.caption_extention is not None:
-    args.caption_extension = args.caption_extention

-  main(args)
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+
+    # restore the value passed via the misspelled option (kept for backward compatibility)
+    if args.caption_extention is not None:
+        args.caption_extension = args.caption_extention
+
+    main(args)
--- a/finetune/merge_dd_tags_to_metadata.py
+++ b/finetune/merge_dd_tags_to_metadata.py
@@ -4,59 +4,90 @@ from pathlib import Path
 from typing import List
 from tqdm import tqdm
 import library.train_util as train_util
+import os
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)


 def main(args):
-  assert not args.recursive or (args.recursive and args.full_path), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"
+    assert not args.recursive or (
+        args.recursive and args.full_path
+    ), "recursive requires full_path / recursiveはfull_pathと同時に指定してください"

-  train_data_dir_path = Path(args.train_data_dir)
-  image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
-  print(f"found {len(image_paths)} images.")
+    train_data_dir_path = Path(args.train_data_dir)
+    image_paths: List[Path] = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+    logger.info(f"found {len(image_paths)} images.")

-  if args.in_json is None and Path(args.out_json).is_file():
-    args.in_json = args.out_json
+    if args.in_json is None and Path(args.out_json).is_file():
+        args.in_json = args.out_json

-  if args.in_json is not None:
-    print(f"loading existing metadata: {args.in_json}")
-    metadata = json.loads(Path(args.in_json).read_text(encoding='utf-8'))
-    print("tags data for existing images will be overwritten / 既存の画像のタグは上書きされます")
-  else:
-    print("new metadata will be created / 新しいメタデータファイルが作成されます")
-    metadata = {}
+    if args.in_json is not None:
+        logger.info(f"loading existing metadata: {args.in_json}")
+        metadata = json.loads(Path(args.in_json).read_text(encoding="utf-8"))
+        logger.warning("tags data for existing images will be overwritten / 既存の画像のタグは上書きされます")
+    else:
+        logger.info("new metadata will be created / 新しいメタデータファイルが作成されます")
+        metadata = {}

-  print("merge tags to metadata json.")
-  for image_path in tqdm(image_paths):
-    tags_path = image_path.with_suffix(args.caption_extension)
-    tags = tags_path.read_text(encoding='utf-8').strip()
+    logger.info("merge tags to metadata json.")
+    for image_path in tqdm(image_paths):
+        tags_path = image_path.with_suffix(args.caption_extension)
+        if not os.path.exists(tags_path):
+            tags_path = Path(str(image_path) + args.caption_extension)  # assumed fallback: extension appended to the full image file name

-    image_key = str(image_path) if args.full_path else image_path.stem
-    if image_key not in metadata:
-      metadata[image_key] = {}
+        tags = tags_path.read_text(encoding="utf-8").strip()

-    metadata[image_key]['tags'] = tags
-    if args.debug:
-      print(image_key, tags)
+        image_key = str(image_path) if args.full_path else image_path.stem
+        if image_key not in metadata:
+            metadata[image_key] = {}

-  # metadataを書き出して終わり
-  print(f"writing metadata: {args.out_json}")
-  Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding='utf-8')
+        metadata[image_key]["tags"] = tags
+        if args.debug:
+            logger.info(f"{image_key} {tags}")

-  print("done!")
+    # write out the metadata and finish
+    logger.info(f"writing metadata: {args.out_json}")
+    Path(args.out_json).write_text(json.dumps(metadata, indent=2), encoding="utf-8")
+
+    logger.info("done!")


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
-  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
-  parser.add_argument("--in_json", type=str,
-                      help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）")
-  parser.add_argument("--full_path", action="store_true",
-                      help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）")
-  parser.add_argument("--recursive", action="store_true",
-                      help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す")
-  parser.add_argument("--caption_extension", type=str, default=".txt",
-                      help="extension of caption (tag) file / 読み込むキャプション（タグ）ファイルの拡張子")
-  parser.add_argument("--debug", action="store_true", help="debug mode, print tags")
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+    parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+    parser.add_argument(
+        "--in_json",
+        type=str,
+        help="metadata file to input (if omitted and out_json exists, existing out_json is read) / 読み込むメタデータファイル（省略時、out_jsonが存在すればそれを読み込む）",
+    )
+    parser.add_argument(
+        "--full_path",
+        action="store_true",
+        help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）",
+    )
+    parser.add_argument(
+        "--recursive",
+        action="store_true",
+        help="recursively look for training tags in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習タグを再帰的に探す",
+    )
+    parser.add_argument(
+        "--caption_extension",
+        type=str,
+        default=".txt",
+        help="extension of caption (tag) file / 読み込むキャプション（タグ）ファイルの拡張子",
+    )
+    parser.add_argument("--debug", action="store_true", help="debug mode, print tags")

-  args = parser.parse_args()
-  main(args)
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    main(args)
--- a/finetune/prepare_buckets_latents.py
+++ b/finetune/prepare_buckets_latents.py
@@ -2,17 +2,30 @@ import argparse
 import os
 import json

+from pathlib import Path
+from typing import List
 from tqdm import tqdm
 import numpy as np
 from PIL import Image
 import cv2
+
 import torch
+from library.device_utils import init_ipex, get_preferred_device
+
+init_ipex()
+
 from torchvision import transforms

 import library.model_util as model_util
 import library.train_util as train_util
+from library.utils import setup_logging

-DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+DEVICE = get_preferred_device()

 IMAGE_TRANSFORMS = transforms.Compose(
    [
@@ -23,221 +36,251 @@ IMAGE_TRANSFORMS = transforms.Compose(


 def collate_fn_remove_corrupted(batch):
-  """Collate function that allows to remove corrupted examples in the
-  dataloader. It expects that the dataloader returns 'None' when that occurs.
-  The 'None's in the batch are removed.
-  """
-  # Filter out all the Nones (corrupted examples)
-  batch = list(filter(lambda x: x is not None, batch))
-  return batch
+    """Collate function that removes corrupted examples from a batch.
+    It expects the dataset to return None for an example that failed to load.
+    Any None entries in the batch are dropped.
+    """
+    # Filter out all the Nones (corrupted examples)
+    batch = list(filter(lambda x: x is not None, batch))
+    return batch


-def get_latents(vae, images, weight_dtype):
-  img_tensors = [IMAGE_TRANSFORMS(image) for image in images]
-  img_tensors = torch.stack(img_tensors)
-  img_tensors = img_tensors.to(DEVICE, weight_dtype)
-  with torch.no_grad():
-    latents = vae.encode(img_tensors).latent_dist.sample().float().to("cpu").numpy()
-  return latents
+def get_npz_filename(data_dir, image_key, is_full_path, recursive):
+    if is_full_path:
+        base_name = os.path.splitext(os.path.basename(image_key))[0]
+        relative_path = os.path.relpath(os.path.dirname(image_key), data_dir)
+    else:
+        base_name = image_key
+        relative_path = ""

-
-def get_npz_filename_wo_ext(data_dir, image_key, is_full_path, flip):
-  if is_full_path:
-    base_name = os.path.splitext(os.path.basename(image_key))[0]
-  else:
-    base_name = image_key
-  if flip:
-    base_name += '_flip'
-  return os.path.join(data_dir, base_name)
+    if recursive and relative_path:
+        return os.path.join(data_dir, relative_path, base_name) + ".npz"
+    else:
+        return os.path.join(data_dir, base_name) + ".npz"


 def main(args):
-  image_paths = train_util.glob_images(args.train_data_dir)
-  print(f"found {len(image_paths)} images.")
+    # assert args.bucket_reso_steps % 8 == 0, f"bucket_reso_steps must be divisible by 8 / bucket_reso_stepは8で割り切れる必要があります"
+    if args.bucket_reso_steps % 8 > 0:
+        logger.warning("bucket resolutions at training time will be multiples of 8 / 学習時の各bucketの解像度は8単位になります")
+    if args.bucket_reso_steps % 32 > 0:
+        logger.warning(
+            "bucket_reso_steps is not divisible by 32. It will not work with SDXL / bucket_reso_stepsが32で割り切れません。SDXLでは動作しません"
+        )

-  if os.path.exists(args.in_json):
-    print(f"loading existing metadata: {args.in_json}")
-    with open(args.in_json, "rt", encoding='utf-8') as f:
-      metadata = json.load(f)
-  else:
-    print(f"no metadata / メタデータファイルがありません: {args.in_json}")
-    return
+    train_data_dir_path = Path(args.train_data_dir)
+    image_paths: List[str] = [str(p) for p in train_util.glob_images_pathlib(train_data_dir_path, args.recursive)]
+    logger.info(f"found {len(image_paths)} images.")

-  weight_dtype = torch.float32
-  if args.mixed_precision == "fp16":
-    weight_dtype = torch.float16
-  elif args.mixed_precision == "bf16":
-    weight_dtype = torch.bfloat16
-
-  vae = model_util.load_vae(args.model_name_or_path, weight_dtype)
-  vae.eval()
-  vae.to(DEVICE, dtype=weight_dtype)
-
-  # bucketのサイズを計算する
-  max_reso = tuple([int(t) for t in args.max_resolution.split(',')])
-  assert len(max_reso) == 2, f"illegal resolution (not 'width,height') / 画像サイズに誤りがあります。'幅,高さ'で指定してください: {args.max_resolution}"
-
-  bucket_resos, bucket_aspect_ratios = model_util.make_bucket_resolutions(
-      max_reso, args.min_bucket_reso, args.max_bucket_reso)
-
-  # 画像をひとつずつ適切なbucketに割り当てながらlatentを計算する
-  bucket_aspect_ratios = np.array(bucket_aspect_ratios)
-  buckets_imgs = [[] for _ in range(len(bucket_resos))]
-  bucket_counts = [0 for _ in range(len(bucket_resos))]
-  img_ar_errors = []
-
-  def process_batch(is_last):
-    for j in range(len(buckets_imgs)):
-      bucket = buckets_imgs[j]
-      if (is_last and len(bucket) > 0) or len(bucket) >= args.batch_size:
-        latents = get_latents(vae, [img for _, _, img in bucket], weight_dtype)
-
-        for (image_key, _, _), latent in zip(bucket, latents):
-          npz_file_name = get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, False)
-          np.savez(npz_file_name, latent)
-
-        # flip
-        if args.flip_aug:
-          latents = get_latents(vae, [img[:, ::-1].copy() for _, _, img in bucket], weight_dtype)   # copyがないとTensor変換できない
-
-          for (image_key, _, _), latent in zip(bucket, latents):
-            npz_file_name = get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, True)
-            np.savez(npz_file_name, latent)
-
-        bucket.clear()
-
-  # 読み込みの高速化のためにDataLoaderを使うオプション
-  if args.max_data_loader_n_workers is not None:
-    dataset = train_util.ImageLoadingDataset(image_paths)
-    data = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False,
-                                       num_workers=args.max_data_loader_n_workers, collate_fn=collate_fn_remove_corrupted, drop_last=False)
-  else:
-    data = [[(None, ip)] for ip in image_paths]
-
-  for data_entry in tqdm(data, smoothing=0.0):
-    if data_entry[0] is None:
-      continue
-
-    img_tensor, image_path = data_entry[0]
-    if img_tensor is not None:
-      image = transforms.functional.to_pil_image(img_tensor)
+    if os.path.exists(args.in_json):
+        logger.info(f"loading existing metadata: {args.in_json}")
+        with open(args.in_json, "rt", encoding="utf-8") as f:
+            metadata = json.load(f)
    else:
-      try:
-        image = Image.open(image_path)
-        if image.mode != 'RGB':
-          image = image.convert("RGB")
-      except Exception as e:
-        print(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
-        continue
+        logger.error(f"no metadata / メタデータファイルがありません: {args.in_json}")
+        return

-    image_key = image_path if args.full_path else os.path.splitext(os.path.basename(image_path))[0]
-    if image_key not in metadata:
-      metadata[image_key] = {}
+    weight_dtype = torch.float32
+    if args.mixed_precision == "fp16":
+        weight_dtype = torch.float16
+    elif args.mixed_precision == "bf16":
+        weight_dtype = torch.bfloat16

-    # 本当はこの部分もDataSetに持っていけば高速化できるがいろいろ大変
-    aspect_ratio = image.width / image.height
-    ar_errors = bucket_aspect_ratios - aspect_ratio
-    bucket_id = np.abs(ar_errors).argmin()
-    reso = bucket_resos[bucket_id]
-    ar_error = ar_errors[bucket_id]
-    img_ar_errors.append(abs(ar_error))
+    vae = model_util.load_vae(args.model_name_or_path, weight_dtype)
+    vae.eval()
+    vae.to(DEVICE, dtype=weight_dtype)

-    # どのサイズにリサイズするか→トリミングする方向で
-    if ar_error <= 0:                   # 横が長い→縦を合わせる
-      scale = reso[1] / image.height
+    # compute the bucket sizes
+    max_reso = tuple([int(t) for t in args.max_resolution.split(",")])
+    assert (
+        len(max_reso) == 2
+    ), f"illegal resolution (not 'width,height') / 画像サイズに誤りがあります。'幅,高さ'で指定してください: {args.max_resolution}"
+
+    bucket_manager = train_util.BucketManager(
+        args.bucket_no_upscale, max_reso, args.min_bucket_reso, args.max_bucket_reso, args.bucket_reso_steps
+    )
+    if not args.bucket_no_upscale:
+        bucket_manager.make_buckets()
    else:
-      scale = reso[0] / image.width
+        logger.warning(
+            "min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます"
+        )

-    resized_size = (int(image.width * scale + .5), int(image.height * scale + .5))
+    # assign each image to the appropriate bucket, computing latents as we go
+    img_ar_errors = []

-    # print(image.width, image.height, bucket_id, bucket_resos[bucket_id], ar_errors[bucket_id], resized_size,
-    #       bucket_resos[bucket_id][0] - resized_size[0], bucket_resos[bucket_id][1] - resized_size[1])
+    def process_batch(is_last):
+        for bucket in bucket_manager.buckets:
+            if (is_last and len(bucket) > 0) or len(bucket) >= args.batch_size:
+                train_util.cache_batch_latents(vae, True, bucket, args.flip_aug, args.alpha_mask, False)
+                bucket.clear()

-    assert resized_size[0] == reso[0] or resized_size[1] == reso[
-        1], f"internal error, resized size not match: {reso}, {resized_size}, {image.width}, {image.height}"
-    assert resized_size[0] >= reso[0] and resized_size[1] >= reso[
-        1], f"internal error, resized size too small: {reso}, {resized_size}, {image.width}, {image.height}"
+    # optionally use a DataLoader to speed up image loading
+    if args.max_data_loader_n_workers is not None:
+        dataset = train_util.ImageLoadingDataset(image_paths)
+        data = torch.utils.data.DataLoader(
+            dataset,
+            batch_size=1,
+            shuffle=False,
+            num_workers=args.max_data_loader_n_workers,
+            collate_fn=collate_fn_remove_corrupted,
+            drop_last=False,
+        )
+    else:
+        data = [[(None, ip)] for ip in image_paths]

-    # 既に存在するファイルがあればshapeを確認して同じならskipする
-    if args.skip_existing:
-      npz_files = [get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, False) + ".npz"]
-      if args.flip_aug:
-        npz_files.append(get_npz_filename_wo_ext(args.train_data_dir, image_key, args.full_path, True) + ".npz")
+    bucket_counts = {}
+    for data_entry in tqdm(data, smoothing=0.0):
+        if data_entry[0] is None:
+            continue

-      found = True
-      for npz_file in npz_files:
-        if not os.path.exists(npz_file):
-          found = False
-          break
+        img_tensor, image_path = data_entry[0]
+        if img_tensor is not None:
+            image = transforms.functional.to_pil_image(img_tensor)
+        else:
+            try:
+                image = Image.open(image_path)
+                if image.mode != "RGB":
+                    image = image.convert("RGB")
+            except Exception as e:
+                logger.error(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
+                continue

-        dat = np.load(npz_file)['arr_0']
-        if dat.shape[1] != reso[1] // 8 or dat.shape[2] != reso[0] // 8:     # latentsのshapeを確認
-          found = False
-          break
-      if found:
-        continue
+        image_key = image_path if args.full_path else os.path.splitext(os.path.basename(image_path))[0]
+        if image_key not in metadata:
+            metadata[image_key] = {}

-    # 画像をリサイズしてトリミングする
-    # PILにinter_areaがないのでcv2で……
-    image = np.array(image)
-    image = cv2.resize(image, resized_size, interpolation=cv2.INTER_AREA)
-    if resized_size[0] > reso[0]:
-      trim_size = resized_size[0] - reso[0]
-      image = image[:, trim_size//2:trim_size//2 + reso[0]]
-    elif resized_size[1] > reso[1]:
-      trim_size = resized_size[1] - reso[1]
-      image = image[trim_size//2:trim_size//2 + reso[1]]
-    assert image.shape[0] == reso[1] and image.shape[1] == reso[0], f"internal error, illegal trimmed size: {image.shape}, {reso}"
+        # moving the code below into the Dataset would also speed things up, but that is a lot of work

-    # # debug
-    # cv2.imwrite(f"r:\\test\\img_{i:05d}.jpg", image[:, :, ::-1])
+        reso, resized_size, ar_error = bucket_manager.select_bucket(image.width, image.height)
+        img_ar_errors.append(abs(ar_error))
+        bucket_counts[reso] = bucket_counts.get(reso, 0) + 1

-    # バッチへ追加
-    buckets_imgs[bucket_id].append((image_key, reso, image))
-    bucket_counts[bucket_id] += 1
-    metadata[image_key]['train_resolution'] = reso
+        # the resolution recorded in the metadata is in latent units, so round down to a multiple of 8
+        metadata[image_key]["train_resolution"] = (reso[0] - reso[0] % 8, reso[1] - reso[1] % 8)

-    # バッチを推論するか判定して推論する
-    process_batch(False)
+        if not args.bucket_no_upscale:
+            # unless bucket_no_upscale is set, the resized size must match the bucket size in width or height
+            assert (
+                resized_size[0] == reso[0] or resized_size[1] == reso[1]
+            ), f"internal error, resized size not match: {reso}, {resized_size}, {image.width}, {image.height}"
+            assert (
+                resized_size[0] >= reso[0] and resized_size[1] >= reso[1]
+            ), f"internal error, resized size too small: {reso}, {resized_size}, {image.width}, {image.height}"

-  # 残りを処理する
-  process_batch(True)
+        assert (
+            resized_size[0] >= reso[0] and resized_size[1] >= reso[1]
+        ), f"internal error resized size is small: {resized_size}, {reso}"

-  for i, (reso, count) in enumerate(zip(bucket_resos, bucket_counts)):
-    print(f"bucket {i} {reso}: {count}")
-  img_ar_errors = np.array(img_ar_errors)
-  print(f"mean ar error: {np.mean(img_ar_errors)}")
+        # if the file already exists, check its shape etc. and skip the image when it matches
+        npz_file_name = get_npz_filename(args.train_data_dir, image_key, args.full_path, args.recursive)
+        if args.skip_existing:
+            if train_util.is_disk_cached_latents_is_expected(reso, npz_file_name, args.flip_aug):
+                continue

-  # metadataを書き出して終わり
-  print(f"writing metadata: {args.out_json}")
-  with open(args.out_json, "wt", encoding='utf-8') as f:
-    json.dump(metadata, f, indent=2)
-  print("done!")
+        # add to the batch
+        image_info = train_util.ImageInfo(image_key, 1, "", False, image_path)
+        image_info.latents_npz = npz_file_name
+        image_info.bucket_reso = reso
+        image_info.resized_size = resized_size
+        image_info.image = image
+        bucket_manager.add_image(reso, image_info)
+
+        # check whether a batch is ready and, if so, run inference on it
+        process_batch(False)
+
+    # process the remainder
+    process_batch(True)
+
+    bucket_manager.sort()
+    for i, reso in enumerate(bucket_manager.resos):
+        count = bucket_counts.get(reso, 0)
+        if count > 0:
+            logger.info(f"bucket {i} {reso}: {count}")
+    img_ar_errors = np.array(img_ar_errors)
+    logger.info(f"mean ar error: {np.mean(img_ar_errors)}")
+
+    # write out the metadata and finish
+    logger.info(f"writing metadata: {args.out_json}")
+    with open(args.out_json, "wt", encoding="utf-8") as f:
+        json.dump(metadata, f, indent=2)
+    logger.info("done!")


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
-  parser.add_argument("in_json", type=str, help="metadata file to input / 読み込むメタデータファイル")
-  parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
-  parser.add_argument("model_name_or_path", type=str, help="model name or path to encode latents / latentを取得するためのモデル")
-  parser.add_argument("--v2", action='store_true',
-                      help='not used (for backward compatibility) / 使用されません（互換性のため残してあります）')
-  parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
-  parser.add_argument("--max_data_loader_n_workers", type=int, default=None,
-                      help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）")
-  parser.add_argument("--max_resolution", type=str, default="512,512",
-                      help="max resolution in fine tuning (width,height) / fine tuning時の最大画像サイズ 「幅,高さ」（使用メモリ量に関係します）")
-  parser.add_argument("--min_bucket_reso", type=int, default=256, help="minimum resolution for buckets / bucketの最小解像度")
-  parser.add_argument("--max_bucket_reso", type=int, default=1024, help="maximum resolution for buckets / bucketの最小解像度")
-  parser.add_argument("--mixed_precision", type=str, default="no",
-                      choices=["no", "fp16", "bf16"], help="use mixed precision / 混合精度を使う場合、その精度")
-  parser.add_argument("--full_path", action="store_true",
-                      help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）")
-  parser.add_argument("--flip_aug", action="store_true",
-                      help="flip augmentation, save latents for flipped images / 左右反転した画像もlatentを取得、保存する")
-  parser.add_argument("--skip_existing", action="store_true",
-                      help="skip images if npz already exists (both normal and flipped exists if flip_aug is enabled) / npzが既に存在する画像をスキップする（flip_aug有効時は通常、反転の両方が存在する画像をスキップ）")
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
+    parser.add_argument("in_json", type=str, help="metadata file to input / 読み込むメタデータファイル")
+    parser.add_argument("out_json", type=str, help="metadata file to output / メタデータファイル書き出し先")
+    parser.add_argument("model_name_or_path", type=str, help="model name or path to encode latents / latentを取得するためのモデル")
+    parser.add_argument(
+        "--v2", action="store_true", help="not used (for backward compatibility) / 使用されません（互換性のため残してあります）"
+    )
+    parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
+    parser.add_argument(
+        "--max_data_loader_n_workers",
+        type=int,
+        default=None,
+        help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）",
+    )
+    parser.add_argument(
+        "--max_resolution",
+        type=str,
+        default="512,512",
+        help="max resolution in fine tuning (width,height) / fine tuning時の最大画像サイズ 「幅,高さ」（使用メモリ量に関係します）",
+    )
+    parser.add_argument("--min_bucket_reso", type=int, default=256, help="minimum resolution for buckets / bucketの最小解像度")
+    parser.add_argument("--max_bucket_reso", type=int, default=1024, help="maximum resolution for buckets / bucketの最大解像度")
+    parser.add_argument(
+        "--bucket_reso_steps",
+        type=int,
+        default=64,
+        help="steps of resolution for buckets, divisible by 8 is recommended / bucketの解像度の単位、8で割り切れる値を推奨します",
+    )
+    parser.add_argument(
+        "--bucket_no_upscale",
+        action="store_true",
+        help="make bucket for each image without upscaling / 画像を拡大せずbucketを作成します",
+    )
+    parser.add_argument(
+        "--mixed_precision",
+        type=str,
+        default="no",
+        choices=["no", "fp16", "bf16"],
+        help="use mixed precision / 混合精度を使う場合、その精度",
+    )
+    parser.add_argument(
+        "--full_path",
+        action="store_true",
+        help="use full path as image-key in metadata (supports multiple directories) / メタデータで画像キーをフルパスにする（複数の学習画像ディレクトリに対応）",
+    )
+    parser.add_argument(
+        "--flip_aug",
+        action="store_true",
+        help="flip augmentation, save latents for flipped images / 左右反転した画像もlatentを取得、保存する",
+    )
+    parser.add_argument(
+        "--alpha_mask",
+        type=str,
+        default="",
+        help="save alpha mask for images for loss calculation / 損失計算用に画像のアルファマスクを保存する",
+    )
+    parser.add_argument(
+        "--skip_existing",
+        action="store_true",
+        help="skip images if npz already exists (both normal and flipped exists if flip_aug is enabled) / npzが既に存在する画像をスキップする（flip_aug有効時は通常、反転の両方が存在する画像をスキップ）",
+    )
+    parser.add_argument(
+        "--recursive",
+        action="store_true",
+        help="recursively look for training images in all child folders of train_data_dir / train_data_dirのすべての子フォルダにある学習画像を再帰的に探す",
+    )

-  args = parser.parse_args()
-  main(args)
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    main(args)
--- a/finetune/tag_images_by_wd14_tagger.py
+++ b/finetune/tag_images_by_wd14_tagger.py
@@ -1,200 +1,516 @@
 import argparse
 import csv
-import glob
 import os
+from pathlib import Path

-from PIL import Image
 import cv2
-from tqdm import tqdm
 import numpy as np
-from tensorflow.keras.models import load_model
-from huggingface_hub import hf_hub_download
 import torch
+from huggingface_hub import hf_hub_download
+from PIL import Image
+from tqdm import tqdm

 import library.train_util as train_util
+from library.utils import setup_logging, resize_image
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)

 # from wd14 tagger
 IMAGE_SIZE = 448

 # wd-v1-4-swinv2-tagger-v2 / wd-v1-4-vit-tagger / wd-v1-4-vit-tagger-v2/ wd-v1-4-convnext-tagger / wd-v1-4-convnext-tagger-v2
-DEFAULT_WD14_TAGGER_REPO = 'SmilingWolf/wd-v1-4-convnext-tagger-v2'
+DEFAULT_WD14_TAGGER_REPO = "SmilingWolf/wd-v1-4-convnext-tagger-v2"
 FILES = ["keras_metadata.pb", "saved_model.pb", "selected_tags.csv"]
+FILES_ONNX = ["model.onnx"]
 SUB_DIR = "variables"
 SUB_DIR_FILES = ["variables.data-00000-of-00001", "variables.index"]
 CSV_FILE = FILES[-1]


 def preprocess_image(image):
-  image = np.array(image)
-  image = image[:, :, ::-1]                         # RGB->BGR
+    image = np.array(image)
+    image = image[:, :, ::-1]  # RGB->BGR

-  # pad to square
-  size = max(image.shape[0:2])
-  pad_x = size - image.shape[1]
-  pad_y = size - image.shape[0]
-  pad_l = pad_x // 2
-  pad_t = pad_y // 2
-  image = np.pad(image, ((pad_t, pad_y - pad_t), (pad_l, pad_x - pad_l), (0, 0)), mode='constant', constant_values=255)
+    # pad to square
+    size = max(image.shape[0:2])
+    pad_x = size - image.shape[1]
+    pad_y = size - image.shape[0]
+    pad_l = pad_x // 2
+    pad_t = pad_y // 2
+    image = np.pad(image, ((pad_t, pad_y - pad_t), (pad_l, pad_x - pad_l), (0, 0)), mode="constant", constant_values=255)

-  interp = cv2.INTER_AREA if size > IMAGE_SIZE else cv2.INTER_LANCZOS4
-  image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE), interpolation=interp)
+    image = resize_image(image, image.shape[0], image.shape[1], IMAGE_SIZE, IMAGE_SIZE)

-  image = image.astype(np.float32)
-  return image
+    image = image.astype(np.float32)
+    return image


 class ImageLoadingPrepDataset(torch.utils.data.Dataset):
-  def __init__(self, image_paths):
-    self.images = image_paths
+    def __init__(self, image_paths):
+        self.images = image_paths

-  def __len__(self):
-    return len(self.images)
+    def __len__(self):
+        return len(self.images)

-  def __getitem__(self, idx):
-    img_path = self.images[idx]
+    def __getitem__(self, idx):
+        img_path = str(self.images[idx])

-    try:
-      image = Image.open(img_path).convert("RGB")
-      image = preprocess_image(image)
-      tensor = torch.tensor(image)
-    except Exception as e:
-      print(f"Could not load image path / 画像を読み込めません: {img_path}, error: {e}")
-      return None
+        try:
+            image = Image.open(img_path).convert("RGB")
+            image = preprocess_image(image)
+            # tensor = torch.tensor(image)  # no need to convert this to a Tensor here
+        except Exception as e:
+            logger.error(f"Could not load image path / 画像を読み込めません: {img_path}, error: {e}")
+            return None

-    return (tensor, img_path)
+        return (image, img_path)


 def collate_fn_remove_corrupted(batch):
-  """Collate function that allows to remove corrupted examples in the
-  dataloader. It expects that the dataloader returns 'None' when that occurs.
-  The 'None's in the batch are removed.
-  """
-  # Filter out all the Nones (corrupted examples)
-  batch = list(filter(lambda x: x is not None, batch))
-  return batch
+    """Collate function that allows to remove corrupted examples in the
+    dataloader. It expects that the dataloader returns 'None' when that occurs.
+    The 'None's in the batch are removed.
+    """
+    # Filter out all the Nones (corrupted examples)
+    batch = list(filter(lambda x: x is not None, batch))
+    return batch


 def main(args):
-  # hf_hub_downloadをそのまま使うとsymlink関係で問題があるらしいので、キャッシュディレクトリとforce_filenameを指定してなんとかする
-  # depreacatedの警告が出るけどなくなったらその時
-  # https://github.com/toriato/stable-diffusion-webui-wd14-tagger/issues/22
-  if not os.path.exists(args.model_dir) or args.force_download:
-    print(f"downloading wd14 tagger model from hf_hub. id: {args.repo_id}")
-    for file in FILES:
-      hf_hub_download(args.repo_id, file, cache_dir=args.model_dir, force_download=True, force_filename=file)
-    for file in SUB_DIR_FILES:
-      hf_hub_download(args.repo_id, file, subfolder=SUB_DIR, cache_dir=os.path.join(
-          args.model_dir, SUB_DIR), force_download=True, force_filename=file)
-  else:
-    print("using existing wd14 tagger model")
+    # model location is model_dir + repo_id
+    # repo id may look like "user/repo" or "user/repo/branch", so replace the slashes to get a single directory name
+    model_location = os.path.join(args.model_dir, args.repo_id.replace("/", "_"))

-  # 画像を読み込む
-  image_paths = train_util.glob_images(args.train_data_dir)
-  print(f"found {len(image_paths)} images.")
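+    # Using hf_hub_download as-is reportedly causes symlink-related problems, so files are downloaded into an explicit local directory
+    # (the earlier workaround passed cache_dir and force_filename, which triggered a deprecation warning)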
+    # https://github.com/toriato/stable-diffusion-webui-wd14-tagger/issues/22
+    if not os.path.exists(model_location) or args.force_download:
+        os.makedirs(args.model_dir, exist_ok=True)
+        logger.info(f"downloading wd14 tagger model from hf_hub. id: {args.repo_id}")
+        files = FILES
+        if args.onnx:
+            files = ["selected_tags.csv"]
+            files += FILES_ONNX
+        else:
+            for file in SUB_DIR_FILES:
+                hf_hub_download(
+                    repo_id=args.repo_id,
+                    filename=file,
+                    subfolder=SUB_DIR,
+                    local_dir=os.path.join(model_location, SUB_DIR),
+                    force_download=True,
+                )
+        for file in files:
+            hf_hub_download(
+                repo_id=args.repo_id,
+                filename=file,
+                local_dir=model_location,
+                force_download=True,
+            )
+    else:
+        logger.info("using existing wd14 tagger model")

-  print("loading model and labels")
-  model = load_model(args.model_dir)
+    # load the model
+    if args.onnx:
+        import onnx
+        import onnxruntime as ort

-  # label_names = pd.read_csv("2022_0000_0899_6549/selected_tags.csv")
-  # 依存ライブラリを増やしたくないので自力で読むよ
-  with open(os.path.join(args.model_dir, CSV_FILE), "r", encoding="utf-8") as f:
-    reader = csv.reader(f)
-    l = [row for row in reader]
-    header = l[0]             # tag_id,name,category,count
-    rows = l[1:]
-  assert header[0] == 'tag_id' and header[1] == 'name' and header[2] == 'category', f"unexpected csv format: {header}"
+        onnx_path = f"{model_location}/model.onnx"
+        logger.info("Running wd14 tagger with onnx")
+        logger.info(f"loading onnx model: {onnx_path}")

-  tags = [row[1] for row in rows[1:] if row[2] == '0']      # categoryが0、つまり通常のタグのみ
+        if not os.path.exists(onnx_path):
+            raise Exception(
+                f"onnx model not found: {onnx_path}, please redownload the model with --force_download"
+                + " / onnxモデルが見つかりませんでした。--force_downloadで再ダウンロードしてください"
+            )

-  # 推論する
-  def run_batch(path_imgs):
-    imgs = np.array([im for _, im in path_imgs])
-
-    probs = model(imgs, training=False)
-    probs = probs.numpy()
-
-    for (image_path, _), prob in zip(path_imgs, probs):
-      # 最初の4つはratingなので無視する
-      # # First 4 labels are actually ratings: pick one with argmax
-      # ratings_names = label_names[:4]
-      # rating_index = ratings_names["probs"].argmax()
-      # found_rating = ratings_names[rating_index: rating_index + 1][["name", "probs"]]
-
-      # それ以降はタグなのでconfidenceがthresholdより高いものを追加する
-      # Everything else is tags: pick any where prediction confidence > threshold
-      tag_text = ""
-      for i, p in enumerate(prob[4:]):                # numpyとか使うのが良いけど、まあそれほど数も多くないのでループで
-        if p >= args.thresh and i < len(tags):
-          tag_text += ", " + tags[i]
-
-      if len(tag_text) > 0:
-        tag_text = tag_text[2:]                   # 最初の ", " を消す
-
-      with open(os.path.splitext(image_path)[0] + args.caption_extension, "wt", encoding='utf-8') as f:
-        f.write(tag_text + '\n')
-        if args.debug:
-          print(image_path, tag_text)
-
-  # 読み込みの高速化のためにDataLoaderを使うオプション
-  if args.max_data_loader_n_workers is not None:
-    dataset = ImageLoadingPrepDataset(image_paths)
-    data = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=False,
-                                       num_workers=args.max_data_loader_n_workers, collate_fn=collate_fn_remove_corrupted, drop_last=False)
-  else:
-    data = [[(None, ip)] for ip in image_paths]
-
-  b_imgs = []
-  for data_entry in tqdm(data, smoothing=0.0):
-    for data in data_entry:
-      if data is None:
-        continue
-
-      image, image_path = data
-      if image is not None:
-        image = image.detach().numpy()
-      else:
+        model = onnx.load(onnx_path)
+        input_name = model.graph.input[0].name
        try:
-          image = Image.open(image_path)
-          if image.mode != 'RGB':
-            image = image.convert("RGB")
-          image = preprocess_image(image)
-        except Exception as e:
-          print(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
-          continue
-      b_imgs.append((image_path, image))
+            batch_size = model.graph.input[0].type.tensor_type.shape.dim[0].dim_value
+        except Exception:
+            batch_size = model.graph.input[0].type.tensor_type.shape.dim[0].dim_param

-      if len(b_imgs) >= args.batch_size:
+        if args.batch_size != batch_size and not isinstance(batch_size, str) and batch_size > 0:
+            # some re-batched models use a symbolic name such as 'N' for a dynamic batch axis; those are left alone
+            logger.warning(
+                f"Batch size {args.batch_size} doesn't match onnx model batch size {batch_size}, use model batch size {batch_size}"
+            )
+            args.batch_size = batch_size
+
+        del model
+
+        if "OpenVINOExecutionProvider" in ort.get_available_providers():
+            # requires provider options for gpu support
+            # fp16 causes nonsense outputs
+            ort_sess = ort.InferenceSession(
+                onnx_path,
+                providers=["OpenVINOExecutionProvider"],
+                provider_options=[{"device_type": "GPU", "precision": "FP32"}],
+            )
+        else:
+            ort_sess = ort.InferenceSession(
+                onnx_path,
+                providers=(
+                    ["CUDAExecutionProvider"] if "CUDAExecutionProvider" in ort.get_available_providers() else
+                    ["ROCMExecutionProvider"] if "ROCMExecutionProvider" in ort.get_available_providers() else
+                    ["CPUExecutionProvider"]
+                ),
+            )
+    else:
+        from tensorflow.keras.models import load_model
+
+        model = load_model(f"{model_location}")
+
+    # label_names = pd.read_csv("2022_0000_0899_6549/selected_tags.csv")
+    # read the CSV manually to avoid adding another dependency
+
+    with open(os.path.join(model_location, CSV_FILE), "r", encoding="utf-8") as f:
+        reader = csv.reader(f)
+        line = [row for row in reader]
+        header = line[0]  # tag_id,name,category,count
+        rows = line[1:]
+    assert header[0] == "tag_id" and header[1] == "name" and header[2] == "category", f"unexpected csv format: {header}"
+
+    rating_tags = [row[1] for row in rows[0:] if row[2] == "9"]
+    general_tags = [row[1] for row in rows[0:] if row[2] == "0"]
+    character_tags = [row[1] for row in rows[0:] if row[2] == "4"]
+
+    # preprocess tags in advance
+    if args.character_tag_expand:
+        for i, tag in enumerate(character_tags):
+            if tag.endswith(")"):
+                # chara_name_(series) -> chara_name, series
+                # chara_name_(costume)_(series) -> chara_name_(costume), series
+                tags = tag.split("(")
+                character_tag = "(".join(tags[:-1])
+                if character_tag.endswith("_"):
+                    character_tag = character_tag[:-1]
+                series_tag = tags[-1].replace(")", "")
+                character_tags[i] = character_tag + args.caption_separator + series_tag
+
+    if args.remove_underscore:
+        rating_tags = [tag.replace("_", " ") if len(tag) > 3 else tag for tag in rating_tags]
+        general_tags = [tag.replace("_", " ") if len(tag) > 3 else tag for tag in general_tags]
+        character_tags = [tag.replace("_", " ") if len(tag) > 3 else tag for tag in character_tags]
+
+    if args.tag_replacement is not None:
+        # escape , and ; in tag_replacement: wd14 tag names may contain , and ;
+        escaped_tag_replacements = args.tag_replacement.replace("\\,", "@@@@").replace("\\;", "####")
+        tag_replacements = escaped_tag_replacements.split(";")
+        for tag_replacement in tag_replacements:
+            tags = tag_replacement.split(",")  # source, target
+            assert len(tags) == 2, f"tag replacement must be in the format of `source,target` / タグの置換は `置換元,置換先` の形式で指定してください: {args.tag_replacement}"
+
+            source, target = [tag.replace("@@@@", ",").replace("####", ";") for tag in tags]
+            logger.info(f"replacing tag: {source} -> {target}")
+
+            if source in general_tags:
+                general_tags[general_tags.index(source)] = target
+            elif source in character_tags:
+                character_tags[character_tags.index(source)] = target
+            elif source in rating_tags:
+                rating_tags[rating_tags.index(source)] = target
+
+    # collect the image paths
+    train_data_dir_path = Path(args.train_data_dir)
+    image_paths = train_util.glob_images_pathlib(train_data_dir_path, args.recursive)
+    logger.info(f"found {len(image_paths)} images.")
+
+    tag_freq = {}
+
+    caption_separator = args.caption_separator
+    stripped_caption_separator = caption_separator.strip()
+    undesired_tags = args.undesired_tags.split(stripped_caption_separator)
+    undesired_tags = set([tag.strip() for tag in undesired_tags if tag.strip() != ""])
+
+    always_first_tags = None
+    if args.always_first_tags is not None:
+        always_first_tags = [tag.strip() for tag in args.always_first_tags.split(stripped_caption_separator) if tag.strip() != ""]
+
+    def run_batch(path_imgs):
+        imgs = np.array([im for _, im in path_imgs])
+
+        if args.onnx:
+            # if len(imgs) < args.batch_size:
+            #     imgs = np.concatenate([imgs, np.zeros((args.batch_size - len(imgs), IMAGE_SIZE, IMAGE_SIZE, 3))], axis=0)
+            probs = ort_sess.run(None, {input_name: imgs})[0]  # onnx output numpy
+            probs = probs[: len(path_imgs)]
+        else:
+            probs = model(imgs, training=False)
+            probs = probs.numpy()
+
+        for (image_path, _), prob in zip(path_imgs, probs):
+            combined_tags = []
+            rating_tag_text = ""
+            character_tag_text = ""
+            general_tag_text = ""
+
+            # 最初の4つ以降はタグなのでconfidenceがthreshold以上のものを追加する
+            # First 4 labels are ratings, the rest are tags: pick any where prediction confidence >= threshold
+            for i, p in enumerate(prob[4:]):
+                if i < len(general_tags) and p >= args.general_threshold:
+                    tag_name = general_tags[i]
+
+                    if tag_name not in undesired_tags:
+                        tag_freq[tag_name] = tag_freq.get(tag_name, 0) + 1
+                        general_tag_text += caption_separator + tag_name
+                        combined_tags.append(tag_name)
+                elif i >= len(general_tags) and p >= args.character_threshold:
+                    tag_name = character_tags[i - len(general_tags)]
+
+                    if tag_name not in undesired_tags:
+                        tag_freq[tag_name] = tag_freq.get(tag_name, 0) + 1
+                        character_tag_text += caption_separator + tag_name
+                        if args.character_tags_first: # insert to the beginning
+                            combined_tags.insert(0, tag_name)
+                        else:
+                            combined_tags.append(tag_name)
+
+            # 最初の4つはratingなのでargmaxで選ぶ
+            # First 4 labels are actually ratings: pick one with argmax
+            if args.use_rating_tags or args.use_rating_tags_as_last_tag:
+                ratings_probs = prob[:4]
+                rating_index = ratings_probs.argmax()
+                found_rating = rating_tags[rating_index]
+
+                if found_rating not in undesired_tags:
+                    tag_freq[found_rating] = tag_freq.get(found_rating, 0) + 1
+                    rating_tag_text = found_rating
+                    if args.use_rating_tags:
+                        combined_tags.insert(0, found_rating) # insert to the beginning
+                    else:
+                        combined_tags.append(found_rating)
+
+            # 一番最初に置くタグを指定する
+            # Always put some tags at the beginning
+            if always_first_tags is not None:
+                for tag in always_first_tags:
+                    if tag in combined_tags:
+                        combined_tags.remove(tag)
+                        combined_tags.insert(0, tag)
+
+            # strip the leading separator
+            if len(general_tag_text) > 0:
+                general_tag_text = general_tag_text[len(caption_separator) :]
+            if len(character_tag_text) > 0:
+                character_tag_text = character_tag_text[len(caption_separator) :]
+
+            caption_file = os.path.splitext(image_path)[0] + args.caption_extension
+
+            tag_text = caption_separator.join(combined_tags)
+
+            if args.append_tags:
+                # Check if file exists
+                if os.path.exists(caption_file):
+                    with open(caption_file, "rt", encoding="utf-8") as f:
+                        # Read file and remove new lines
+                        existing_content = f.read().strip("\n")  # Remove newlines
+
+                    # Split the content into tags and store them in a list
+                    existing_tags = [tag.strip() for tag in existing_content.split(stripped_caption_separator) if tag.strip()]
+
+                    # Check and remove repeating tags in tag_text
+                    new_tags = [tag for tag in combined_tags if tag not in existing_tags]
+
+                    # Create new tag_text
+                    tag_text = caption_separator.join(existing_tags + new_tags)
+
+            with open(caption_file, "wt", encoding="utf-8") as f:
+                f.write(tag_text + "\n")
+                if args.debug:
+                    logger.info("")
+                    logger.info(f"{image_path}:")
+                    logger.info(f"\tRating tags: {rating_tag_text}")
+                    logger.info(f"\tCharacter tags: {character_tag_text}")
+                    logger.info(f"\tGeneral tags: {general_tag_text}")
+
+    # optionally use a DataLoader to speed up image loading
+    if args.max_data_loader_n_workers is not None:
+        dataset = ImageLoadingPrepDataset(image_paths)
+        data = torch.utils.data.DataLoader(
+            dataset,
+            batch_size=args.batch_size,
+            shuffle=False,
+            num_workers=args.max_data_loader_n_workers,
+            collate_fn=collate_fn_remove_corrupted,
+            drop_last=False,
+        )
+    else:
+        data = [[(None, ip)] for ip in image_paths]
+
+    b_imgs = []
+    for data_entry in tqdm(data, smoothing=0.0):
+        for data in data_entry:
+            if data is None:
+                continue
+
+            image, image_path = data
+            if image is None:
+                try:
+                    image = Image.open(image_path)
+                    if image.mode != "RGB":
+                        image = image.convert("RGB")
+                    image = preprocess_image(image)
+                except Exception as e:
+                    logger.error(f"Could not load image path / 画像を読み込めません: {image_path}, error: {e}")
+                    continue
+            b_imgs.append((image_path, image))
+
+            if len(b_imgs) >= args.batch_size:
+                b_imgs = [(str(image_path), image) for image_path, image in b_imgs]  # Convert image_path to string
+                run_batch(b_imgs)
+                b_imgs.clear()
+
+    if len(b_imgs) > 0:
+        b_imgs = [(str(image_path), image) for image_path, image in b_imgs]  # Convert image_path to string
        run_batch(b_imgs)
-        b_imgs.clear()

-  if len(b_imgs) > 0:
-    run_batch(b_imgs)
+    if args.frequency_tags:
+        sorted_tags = sorted(tag_freq.items(), key=lambda x: x[1], reverse=True)
+        print("Tag frequencies:")
+        for tag, freq in sorted_tags:
+            print(f"{tag}: {freq}")

-  print("done!")
+    logger.info("done!")


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument("train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ")
-  parser.add_argument("--repo_id", type=str, default=DEFAULT_WD14_TAGGER_REPO,
-                      help="repo id for wd14 tagger on Hugging Face / Hugging Faceのwd14 taggerのリポジトリID")
-  parser.add_argument("--model_dir", type=str, default="wd14_tagger_model",
-                      help="directory to store wd14 tagger model / wd14 taggerのモデルを格納するディレクトリ")
-  parser.add_argument("--force_download", action='store_true',
-                      help="force downloading wd14 tagger models / wd14 taggerのモデルを再ダウンロードします")
-  parser.add_argument("--thresh", type=float, default=0.35, help="threshold of confidence to add a tag / タグを追加するか判定する閾値")
-  parser.add_argument("--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ")
-  parser.add_argument("--max_data_loader_n_workers", type=int, default=None,
-                      help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）")
-  parser.add_argument("--caption_extention", type=str, default=None,
-                      help="extension of caption file (for backward compatibility) / 出力されるキャプションファイルの拡張子（スペルミスしていたのを残してあります）")
-  parser.add_argument("--caption_extension", type=str, default=".txt", help="extension of caption file / 出力されるキャプションファイルの拡張子")
-  parser.add_argument("--debug", action="store_true", help="debug mode")
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "train_data_dir", type=str, help="directory for train images / 学習画像データのディレクトリ"
+    )
+    parser.add_argument(
+        "--repo_id",
+        type=str,
+        default=DEFAULT_WD14_TAGGER_REPO,
+        help="repo id for wd14 tagger on Hugging Face / Hugging Faceのwd14 taggerのリポジトリID",
+    )
+    parser.add_argument(
+        "--model_dir",
+        type=str,
+        default="wd14_tagger_model",
+        help="directory to store wd14 tagger model / wd14 taggerのモデルを格納するディレクトリ",
+    )
+    parser.add_argument(
+        "--force_download",
+        action="store_true",
+        help="force downloading wd14 tagger models / wd14 taggerのモデルを再ダウンロードします",
+    )
+    parser.add_argument(
+        "--batch_size", type=int, default=1, help="batch size in inference / 推論時のバッチサイズ"
+    )
+    parser.add_argument(
+        "--max_data_loader_n_workers",
+        type=int,
+        default=None,
+        help="enable image reading by DataLoader with this number of workers (faster) / DataLoaderによる画像読み込みを有効にしてこのワーカー数を適用する（読み込みを高速化）",
+    )
+    parser.add_argument(
+        "--caption_extention",
+        type=str,
+        default=None,
+        help="extension of caption file (for backward compatibility) / 出力されるキャプションファイルの拡張子（スペルミスしていたのを残してあります）",
+    )
+    parser.add_argument(
+        "--caption_extension", type=str, default=".txt", help="extension of caption file / 出力されるキャプションファイルの拡張子"
+    )
+    parser.add_argument(
+        "--thresh", type=float, default=0.35, help="threshold of confidence to add a tag / タグを追加するか判定する閾値"
+    )
+    parser.add_argument(
+        "--general_threshold",
+        type=float,
+        default=None,
+        help="threshold of confidence to add a tag for general category, same as --thresh if omitted / generalカテゴリのタグを追加するための確信度の閾値、省略時は --thresh と同じ",
+    )
+    parser.add_argument(
+        "--character_threshold",
+        type=float,
+        default=None,
+        help="threshold of confidence to add a tag for character category, same as --thresh if omitted / characterカテゴリのタグを追加するための確信度の閾値、省略時は --thresh と同じ",
+    )
+    parser.add_argument(
+        "--recursive", action="store_true", help="search for images in subfolders recursively / サブフォルダを再帰的に検索する"
+    )
+    parser.add_argument(
+        "--remove_underscore",
+        action="store_true",
+        help="replace underscores with spaces in the output tags / 出力されるタグのアンダースコアをスペースに置き換える",
+    )
+    parser.add_argument(
+        "--debug", action="store_true", help="debug mode"
+    )
+    parser.add_argument(
+        "--undesired_tags",
+        type=str,
+        default="",
+        help="comma-separated list of undesired tags to remove from the output / 出力から除外したいタグのカンマ区切りのリスト",
+    )
+    parser.add_argument(
+        "--frequency_tags", action="store_true", help="Show frequency of tags for images / タグの出現頻度を表示する"
+    )
+    parser.add_argument(
+        "--onnx", action="store_true", help="use onnx model for inference / onnxモデルを推論に使用する"
+    )
+    parser.add_argument(
+        "--append_tags", action="store_true", help="Append captions instead of overwriting / 上書きではなくキャプションを追記する"
+    )
+    parser.add_argument(
+        "--use_rating_tags", action="store_true", help="Adds rating tags as the first tag / レーティングタグを最初のタグとして追加する",
+    )
+    parser.add_argument(
+        "--use_rating_tags_as_last_tag", action="store_true", help="Adds rating tags as the last tag / レーティングタグを最後のタグとして追加する",
+    )
+    parser.add_argument(
+        "--character_tags_first", action="store_true", help="Always inserts character tags before the general tags / characterタグを常にgeneralタグの前に出力する",
+    )
+    parser.add_argument(
+        "--always_first_tags",
+        type=str,
+        default=None,
+        help="comma-separated list of tags to always put at the beginning, e.g. `1girl,1boy`"
+        + " / 必ず先頭に置くタグのカンマ区切りリスト、例 : `1girl,1boy`",
+    )
+    parser.add_argument(
+        "--caption_separator",
+        type=str,
+        default=", ",
+        help="Separator for captions, include space if needed / キャプションの区切り文字、必要ならスペースを含めてください",
+    )
+    parser.add_argument(
+        "--tag_replacement",
+        type=str,
+        default=None,
+        help="tag replacement in the format of `source1,target1;source2,target2; ...`. Escape `,` and `;` with `\\`. e.g. `tag1,tag2;tag3,tag4`"
+        + " / タグの置換を `置換元1,置換先1;置換元2,置換先2; ...`で指定する。`\\` で `,` と `;` をエスケープできる。例: `tag1,tag2;tag3,tag4`",
+    )
+    parser.add_argument(
+        "--character_tag_expand",
+        action="store_true",
+        help="expand tag tail parenthesis to another tag for character tags. `chara_name_(series)` becomes `chara_name, series`"
+        + " / キャラクタタグの末尾の括弧を別のタグに展開する。`chara_name_(series)` は `chara_name, series` になる",
+    )

-  args = parser.parse_args()
+    return parser

-  # スペルミスしていたオプションを復元する
-  if args.caption_extention is not None:
-    args.caption_extension = args.caption_extention

-  main(args)
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+
+    # honor the value passed via the misspelled legacy option
+    if args.caption_extention is not None:
+        args.caption_extension = args.caption_extention
+
+    if args.general_threshold is None:
+        args.general_threshold = args.thresh
+    if args.character_threshold is None:
+        args.character_threshold = args.thresh
+
+    main(args)
--- a/flux_minimal_inference.py
+++ b/flux_minimal_inference.py
@@ -0,0 +1,576 @@
+# Minimum Inference Code for FLUX
+
+import argparse
+import datetime
+import math
+import os
+import random
+from typing import Callable, List, Optional
+import einops
+import numpy as np
+
+import torch
+from tqdm import tqdm
+from PIL import Image
+import accelerate
+from transformers import CLIPTextModel
+from safetensors.torch import load_file
+
+from library import device_utils
+from library.device_utils import init_ipex, get_preferred_device
+from networks import oft_flux
+
+init_ipex()
+
+
+from library.utils import setup_logging, str_to_dtype
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+import networks.lora_flux as lora_flux
+from library import flux_models, flux_utils, sd3_utils, strategy_flux
+
+
+def time_shift(mu: float, sigma: float, t: torch.Tensor):
+    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)
+
+
+def get_lin_function(x1: float = 256, y1: float = 0.5, x2: float = 4096, y2: float = 1.15) -> Callable[[float], float]:
+    m = (y2 - y1) / (x2 - x1)
+    b = y1 - m * x1
+    return lambda x: m * x + b
+
+
+def get_schedule(
+    num_steps: int,
+    image_seq_len: int,
+    base_shift: float = 0.5,
+    max_shift: float = 1.15,
+    shift: bool = True,
+) -> list[float]:
+    # extra step for zero
+    timesteps = torch.linspace(1, 0, num_steps + 1)
+
+    # shifting the schedule to favor high timesteps for higher signal images
+    if shift:
+        # estimate mu by linear interpolation between two points
+        mu = get_lin_function(y1=base_shift, y2=max_shift)(image_seq_len)
+        timesteps = time_shift(mu, 1.0, timesteps)
+
+    return timesteps.tolist()
+
+
+def denoise(
+    model: flux_models.Flux,
+    img: torch.Tensor,
+    img_ids: torch.Tensor,
+    txt: torch.Tensor,
+    txt_ids: torch.Tensor,
+    vec: torch.Tensor,
+    timesteps: list[float],
+    guidance: float = 4.0,
+    t5_attn_mask: Optional[torch.Tensor] = None,
+    neg_txt: Optional[torch.Tensor] = None,
+    neg_vec: Optional[torch.Tensor] = None,
+    neg_t5_attn_mask: Optional[torch.Tensor] = None,
+    cfg_scale: Optional[float] = None,
+):
+    # this is ignored for schnell
+    logger.info(f"guidance: {guidance}, cfg_scale: {cfg_scale}")
+    guidance_vec = torch.full((img.shape[0],), guidance, device=img.device, dtype=img.dtype)
+
+    # prepare classifier free guidance
+    if neg_txt is not None and neg_vec is not None:
+        b_img_ids = torch.cat([img_ids, img_ids], dim=0)
+        b_txt_ids = torch.cat([txt_ids, txt_ids], dim=0)
+        b_txt = torch.cat([neg_txt, txt], dim=0)
+        b_vec = torch.cat([neg_vec, vec], dim=0)
+        if t5_attn_mask is not None and neg_t5_attn_mask is not None:
+            b_t5_attn_mask = torch.cat([neg_t5_attn_mask, t5_attn_mask], dim=0)
+        else:
+            b_t5_attn_mask = None
+    else:
+        b_img_ids = img_ids
+        b_txt_ids = txt_ids
+        b_txt = txt
+        b_vec = vec
+        b_t5_attn_mask = t5_attn_mask
+
+    for t_curr, t_prev in zip(tqdm(timesteps[:-1]), timesteps[1:]):
+        t_vec = torch.full((b_img_ids.shape[0],), t_curr, dtype=img.dtype, device=img.device)
+
+        # classifier free guidance
+        if neg_txt is not None and neg_vec is not None:
+            b_img = torch.cat([img, img], dim=0)
+        else:
+            b_img = img
+
+        pred = model(
+            img=b_img,
+            img_ids=b_img_ids,
+            txt=b_txt,
+            txt_ids=b_txt_ids,
+            y=b_vec,
+            timesteps=t_vec,
+            guidance=guidance_vec,
+            txt_attention_mask=b_t5_attn_mask,
+        )
+
+        # classifier free guidance
+        if neg_txt is not None and neg_vec is not None:
+            pred_uncond, pred = torch.chunk(pred, 2, dim=0)
+            pred = pred_uncond + cfg_scale * (pred - pred_uncond)
+
+        img = img + (t_prev - t_curr) * pred
+
+    return img
+
+
+def do_sample(
+    accelerator: Optional[accelerate.Accelerator],
+    model: flux_models.Flux,
+    img: torch.Tensor,
+    img_ids: torch.Tensor,
+    l_pooled: torch.Tensor,
+    t5_out: torch.Tensor,
+    txt_ids: torch.Tensor,
+    num_steps: int,
+    guidance: float,
+    t5_attn_mask: Optional[torch.Tensor],
+    is_schnell: bool,
+    device: torch.device,
+    flux_dtype: torch.dtype,
+    neg_l_pooled: Optional[torch.Tensor] = None,
+    neg_t5_out: Optional[torch.Tensor] = None,
+    neg_t5_attn_mask: Optional[torch.Tensor] = None,
+    cfg_scale: Optional[float] = None,
+):
+    logger.info(f"num_steps: {num_steps}")
+    timesteps = get_schedule(num_steps, img.shape[1], shift=not is_schnell)
+
+    # denoise initial noise
+    if accelerator:
+        with accelerator.autocast(), torch.no_grad():
+            x = denoise(
+                model,
+                img,
+                img_ids,
+                t5_out,
+                txt_ids,
+                l_pooled,
+                timesteps,
+                guidance,
+                t5_attn_mask,
+                neg_t5_out,
+                neg_l_pooled,
+                neg_t5_attn_mask,
+                cfg_scale,
+            )
+    else:
+        with torch.autocast(device_type=device.type, dtype=flux_dtype), torch.no_grad():
+            x = denoise(
+                model,
+                img,
+                img_ids,
+                t5_out,
+                txt_ids,
+                l_pooled,
+                timesteps,
+                guidance,
+                t5_attn_mask,
+                neg_t5_out,
+                neg_l_pooled,
+                neg_t5_attn_mask,
+                cfg_scale,
+            )
+
+    return x
+
+
+def generate_image(
+    model,
+    clip_l: CLIPTextModel,
+    t5xxl,
+    ae,
+    prompt: str,
+    seed: Optional[int],
+    image_width: int,
+    image_height: int,
+    steps: Optional[int],
+    guidance: float,
+    negative_prompt: Optional[str],
+    cfg_scale: float,
+):
+    seed = seed if seed is not None else random.randint(0, 2**32 - 1)
+    logger.info(f"Seed: {seed}")
+
+    # make first noise with packed shape
+    # original: b,16,2*h//16,2*w//16, packed: b,h//16*w//16,16*2*2
+    packed_latent_height, packed_latent_width = math.ceil(image_height / 16), math.ceil(image_width / 16)
+    noise_dtype = torch.float32 if is_fp8(dtype) else dtype
+    noise = torch.randn(
+        1,
+        packed_latent_height * packed_latent_width,
+        16 * 2 * 2,
+        device=device,
+        dtype=noise_dtype,
+        generator=torch.Generator(device=device).manual_seed(seed),
+    )
+
+    # prepare img and img ids
+
+    # this is needed only for img2img
+    # img = rearrange(img, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=2, pw=2)
+    # if img.shape[0] == 1 and bs > 1:
+    #     img = repeat(img, "1 ... -> bs ...", bs=bs)
+
+    # txt2img only needs img_ids
+    img_ids = flux_utils.prepare_img_ids(1, packed_latent_height, packed_latent_width)
+
+    # prepare fp8 models
+    if is_fp8(clip_l_dtype) and (not hasattr(clip_l, "fp8_prepared") or not clip_l.fp8_prepared):
+        logger.info(f"prepare CLIP-L for fp8: set to {clip_l_dtype}, set embeddings to {torch.bfloat16}")
+        clip_l.to(clip_l_dtype)  # fp8
+        clip_l.text_model.embeddings.to(dtype=torch.bfloat16)
+        clip_l.fp8_prepared = True
+
+    if is_fp8(t5xxl_dtype) and (not hasattr(t5xxl, "fp8_prepared") or not t5xxl.fp8_prepared):
+        logger.info(f"prepare T5xxl for fp8: set to {t5xxl_dtype}")
+
+        def prepare_fp8(text_encoder, target_dtype):
+            def forward_hook(module):
+                def forward(hidden_states):
+                    hidden_gelu = module.act(module.wi_0(hidden_states))
+                    hidden_linear = module.wi_1(hidden_states)
+                    hidden_states = hidden_gelu * hidden_linear
+                    hidden_states = module.dropout(hidden_states)
+
+                    hidden_states = module.wo(hidden_states)
+                    return hidden_states
+
+                return forward
+
+            for module in text_encoder.modules():
+                if module.__class__.__name__ in ["T5LayerNorm", "Embedding"]:
+                    # print("set", module.__class__.__name__, "to", target_dtype)
+                    module.to(target_dtype)
+                if module.__class__.__name__ in ["T5DenseGatedActDense"]:
+                    # print("set", module.__class__.__name__, "hooks")
+                    module.forward = forward_hook(module)
+
+        t5xxl.to(t5xxl_dtype)
+        prepare_fp8(t5xxl.encoder, torch.bfloat16)
+        t5xxl.fp8_prepared = True
+
+    # prepare embeddings
+    logger.info("Encoding prompts...")
+    clip_l = clip_l.to(device)
+    t5xxl = t5xxl.to(device)
+
+    def encode(prpt: str):
+        tokens_and_masks = tokenize_strategy.tokenize(prpt)
+        with torch.no_grad():
+            if is_fp8(clip_l_dtype):
+                with accelerator.autocast():
+                    l_pooled, _, _, _ = encoding_strategy.encode_tokens(tokenize_strategy, [clip_l, None], tokens_and_masks)
+            else:
+                with torch.autocast(device_type=device.type, dtype=clip_l_dtype):
+                    l_pooled, _, _, _ = encoding_strategy.encode_tokens(tokenize_strategy, [clip_l, None], tokens_and_masks)
+
+            if is_fp8(t5xxl_dtype):
+                with accelerator.autocast():
+                    _, t5_out, txt_ids, t5_attn_mask = encoding_strategy.encode_tokens(
+                        tokenize_strategy, [None, t5xxl], tokens_and_masks, args.apply_t5_attn_mask
+                    )
+            else:
+                with torch.autocast(device_type=device.type, dtype=t5xxl_dtype):
+                    _, t5_out, txt_ids, t5_attn_mask = encoding_strategy.encode_tokens(
+                        tokenize_strategy, [None, t5xxl], tokens_and_masks, args.apply_t5_attn_mask
+                    )
+        return l_pooled, t5_out, txt_ids, t5_attn_mask
+
+    l_pooled, t5_out, txt_ids, t5_attn_mask = encode(prompt)
+    if negative_prompt:
+        neg_l_pooled, neg_t5_out, _, neg_t5_attn_mask = encode(negative_prompt)
+    else:
+        neg_l_pooled, neg_t5_out, neg_t5_attn_mask = None, None, None
+
+    # NaN check
+    if torch.isnan(l_pooled).any():
+        raise ValueError("NaN in l_pooled")
+    if torch.isnan(t5_out).any():
+        raise ValueError("NaN in t5_out")
+
+    if args.offload:
+        clip_l = clip_l.cpu()
+        t5xxl = t5xxl.cpu()
+    # del clip_l, t5xxl
+    device_utils.clean_memory()
+
+    # generate image
+    logger.info("Generating image...")
+    model = model.to(device)
+    if steps is None:
+        steps = 4 if is_schnell else 50
+
+    img_ids = img_ids.to(device)
+    t5_attn_mask = t5_attn_mask.to(device) if args.apply_t5_attn_mask else None
+
+    x = do_sample(
+        accelerator,
+        model,
+        noise,
+        img_ids,
+        l_pooled,
+        t5_out,
+        txt_ids,
+        steps,
+        guidance,
+        t5_attn_mask,
+        is_schnell,
+        device,
+        flux_dtype,
+        neg_l_pooled,
+        neg_t5_out,
+        neg_t5_attn_mask,
+        cfg_scale,
+    )
+    if args.offload:
+        model = model.cpu()
+    # del model
+    device_utils.clean_memory()
+
+    # unpack
+    x = x.float()
+    x = einops.rearrange(x, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=packed_latent_height, w=packed_latent_width, ph=2, pw=2)
+
+    # decode
+    logger.info("Decoding image...")
+    ae = ae.to(device)
+    with torch.no_grad():
+        if is_fp8(ae_dtype):
+            with accelerator.autocast():
+                x = ae.decode(x)
+        else:
+            with torch.autocast(device_type=device.type, dtype=ae_dtype):
+                x = ae.decode(x)
+    if args.offload:
+        ae = ae.cpu()
+
+    x = x.clamp(-1, 1)
+    x = x.permute(0, 2, 3, 1)
+    img = Image.fromarray((127.5 * (x + 1.0)).float().cpu().numpy().astype(np.uint8)[0])
+
+    # save image
+    output_dir = args.output_dir
+    os.makedirs(output_dir, exist_ok=True)
+    output_path = os.path.join(output_dir, f"{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.png")
+    img.save(output_path)
+
+    logger.info(f"Saved image to {output_path}")
+
+
+if __name__ == "__main__":
+    target_height = 768  # 1024
+    target_width = 1360  # 1024
+
+    # steps = 50  # 28  # 50
+    # guidance_scale = 5
+    # seed = 1  # None  # 1
+
+    device = get_preferred_device()
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--ckpt_path", type=str, required=True)
+    parser.add_argument("--clip_l", type=str, required=False)
+    parser.add_argument("--t5xxl", type=str, required=False)
+    parser.add_argument("--ae", type=str, required=False)
+    parser.add_argument("--apply_t5_attn_mask", action="store_true")
+    parser.add_argument("--prompt", type=str, default="A photo of a cat")
+    parser.add_argument("--output_dir", type=str, default=".")
+    parser.add_argument("--dtype", type=str, default="bfloat16", help="base dtype")
+    parser.add_argument("--clip_l_dtype", type=str, default=None, help="dtype for clip_l")
+    parser.add_argument("--ae_dtype", type=str, default=None, help="dtype for ae")
+    parser.add_argument("--t5xxl_dtype", type=str, default=None, help="dtype for t5xxl")
+    parser.add_argument("--flux_dtype", type=str, default=None, help="dtype for flux")
+    parser.add_argument("--seed", type=int, default=None)
+    parser.add_argument("--steps", type=int, default=None, help="Number of steps. Default is 4 for schnell, 50 for dev")
+    parser.add_argument("--guidance", type=float, default=3.5)
+    parser.add_argument("--negative_prompt", type=str, default=None)
+    parser.add_argument("--cfg_scale", type=float, default=1.0)
+    parser.add_argument("--offload", action="store_true", help="Offload to CPU")
+    parser.add_argument(
+        "--lora_weights",
+        type=str,
+        nargs="*",
+        default=[],
+        help="LoRA weights, only supports networks.lora_flux and networks.oft_flux, each argument is a `path;multiplier` (semicolon separated)",
+    )
+    parser.add_argument("--merge_lora_weights", action="store_true", help="Merge LoRA weights to model")
+    parser.add_argument("--width", type=int, default=target_width)
+    parser.add_argument("--height", type=int, default=target_height)
+    parser.add_argument("--interactive", action="store_true")
+    args = parser.parse_args()
+
+    seed = args.seed
+    steps = args.steps
+    guidance_scale = args.guidance
+
+    def is_fp8(dt):
+        return dt in [torch.float8_e4m3fn, torch.float8_e4m3fnuz, torch.float8_e5m2, torch.float8_e5m2fnuz]
+
+    dtype = str_to_dtype(args.dtype)
+    clip_l_dtype = str_to_dtype(args.clip_l_dtype, dtype)
+    t5xxl_dtype = str_to_dtype(args.t5xxl_dtype, dtype)
+    ae_dtype = str_to_dtype(args.ae_dtype, dtype)
+    flux_dtype = str_to_dtype(args.flux_dtype, dtype)
+
+    logger.info(f"Dtypes for clip_l, t5xxl, ae, flux: {clip_l_dtype}, {t5xxl_dtype}, {ae_dtype}, {flux_dtype}")
+
+    loading_device = "cpu" if args.offload else device
+
+    use_fp8 = [is_fp8(d) for d in [dtype, clip_l_dtype, t5xxl_dtype, ae_dtype, flux_dtype]]
+    if any(use_fp8):
+        accelerator = accelerate.Accelerator(mixed_precision="bf16")
+    else:
+        accelerator = None
+
+    # load clip_l
+    logger.info(f"Loading clip_l from {args.clip_l}...")
+    clip_l = flux_utils.load_clip_l(args.clip_l, clip_l_dtype, loading_device)
+    clip_l.eval()
+
+    logger.info(f"Loading t5xxl from {args.t5xxl}...")
+    t5xxl = flux_utils.load_t5xxl(args.t5xxl, t5xxl_dtype, loading_device)
+    t5xxl.eval()
+
+    # if is_fp8(clip_l_dtype):
+    #     clip_l = accelerator.prepare(clip_l)
+    # if is_fp8(t5xxl_dtype):
+    #     t5xxl = accelerator.prepare(t5xxl)
+
+    # DiT
+    is_schnell, model = flux_utils.load_flow_model(args.ckpt_path, None, loading_device)
+    model.eval()
+    logger.info(f"Casting model to {flux_dtype}")
+    model.to(flux_dtype)  # make sure model is dtype
+    # if is_fp8(flux_dtype):
+    #     model = accelerator.prepare(model)
+    #     if args.offload:
+    #         model = model.to("cpu")
+
+    t5xxl_max_length = 256 if is_schnell else 512
+    tokenize_strategy = strategy_flux.FluxTokenizeStrategy(t5xxl_max_length)
+    encoding_strategy = strategy_flux.FluxTextEncodingStrategy()
+
+    # AE
+    ae = flux_utils.load_ae(args.ae, ae_dtype, loading_device)
+    ae.eval()
+    # if is_fp8(ae_dtype):
+    #     ae = accelerator.prepare(ae)
+
+    # LoRA
+    lora_models: List[lora_flux.LoRANetwork] = []
+    for weights_file in args.lora_weights:
+        if ";" in weights_file:
+            weights_file, multiplier = weights_file.split(";")
+            multiplier = float(multiplier)
+        else:
+            multiplier = 1.0
+
+        weights_sd = load_file(weights_file)
+        is_lora = is_oft = False
+        for key in weights_sd.keys():
+            if key.startswith("lora"):
+                is_lora = True
+            if key.startswith("oft"):
+                is_oft = True
+            if is_lora or is_oft:
+                break
+
+        module = lora_flux if is_lora else oft_flux
+        lora_model, _ = module.create_network_from_weights(multiplier, None, ae, [clip_l, t5xxl], model, weights_sd, True)
+
+        if args.merge_lora_weights:
+            lora_model.merge_to([clip_l, t5xxl], model, weights_sd)
+        else:
+            lora_model.apply_to([clip_l, t5xxl], model)
+            info = lora_model.load_state_dict(weights_sd, strict=True)
+            logger.info(f"Loaded LoRA weights from {weights_file}: {info}")
+            lora_model.eval()
+            lora_model.to(device)
+
+        lora_models.append(lora_model)
+
+    if not args.interactive:
+        generate_image(
+            model,
+            clip_l,
+            t5xxl,
+            ae,
+            args.prompt,
+            args.seed,
+            args.width,
+            args.height,
+            args.steps,
+            args.guidance,
+            args.negative_prompt,
+            args.cfg_scale,
+        )
+    else:
+        # loop for interactive
+        width = args.width
+        height = args.height
+        steps = args.steps
+        guidance = args.guidance
+        cfg_scale = args.cfg_scale
+
+        while True:
+            print(
+                "Enter prompt (empty to exit). Options: --w <width> --h <height> --s <steps> --d <seed> --g <guidance> --m <multipliers for LoRA>"
+                " --n <negative prompt>, `-` for empty negative prompt --c <cfg_scale>"
+            )
+            prompt = input()
+            if prompt == "":
+                break
+
+            # parse options
+            options = prompt.split("--")
+            prompt = options[0].strip()
+            seed = None
+            negative_prompt = None
+            for opt in options[1:]:
+                try:
+                    opt = opt.strip()
+                    if opt.startswith("w"):
+                        width = int(opt[1:].strip())
+                    elif opt.startswith("h"):
+                        height = int(opt[1:].strip())
+                    elif opt.startswith("s"):
+                        steps = int(opt[1:].strip())
+                    elif opt.startswith("d"):
+                        seed = int(opt[1:].strip())
+                    elif opt.startswith("g"):
+                        guidance = float(opt[1:].strip())
+                    elif opt.startswith("m"):
+                        multipliers = opt[1:].strip().split(",")
+                        if len(multipliers) != len(lora_models):
+                            logger.error(f"Invalid number of multipliers, expected {len(lora_models)}")
+                            continue
+                        for i, lora_model in enumerate(lora_models):
+                            lora_model.set_multiplier(float(multipliers[i]))
+                    elif opt.startswith("n"):
+                        negative_prompt = opt[1:].strip()
+                        if negative_prompt == "-":
+                            negative_prompt = ""
+                    elif opt.startswith("c"):
+                        cfg_scale = float(opt[1:].strip())
+                except ValueError as e:
+                    logger.error(f"Invalid option: {opt}, {e}")
+
+            generate_image(model, clip_l, t5xxl, ae, prompt, seed, width, height, steps, guidance, negative_prompt, cfg_scale)
+
+    logger.info("Done!")
--- a/flux_train.py
+++ b/flux_train.py
@@ -0,0 +1,850 @@
+# training with captions
+
+# Swap blocks between CPU and GPU:
+# This implementation is inspired by and based on the work of 2kpr.
+# Many thanks to 2kpr for the original concept and implementation of memory-efficient offloading.
+# The original idea has been adapted and extended to fit the current project's needs.
+
+# Key features:
+# - CPU offloading during forward and backward passes
+# - Use of fused optimizer and grad_hook for efficient gradient processing
+# - Per-block fused optimizer instances
+
+import argparse
+from concurrent.futures import ThreadPoolExecutor
+import copy
+import math
+import os
+from multiprocessing import Value
+import time
+from typing import List, Optional, Tuple, Union
+import toml
+
+from tqdm import tqdm
+
+import torch
+import torch.nn as nn
+from library import utils
+from library.device_utils import init_ipex, clean_memory_on_device
+
+init_ipex()
+
+from accelerate.utils import set_seed
+from library import deepspeed_utils, flux_train_utils, flux_utils, strategy_base, strategy_flux
+from library.sd3_train_utils import FlowMatchEulerDiscreteScheduler
+
+import library.train_util as train_util
+
+from library.utils import setup_logging, add_logging_arguments
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+import library.config_util as config_util
+
+# import library.sdxl_train_util as sdxl_train_util
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
+from library.custom_train_functions import apply_masked_loss, add_custom_train_arguments
+
+
+def train(args):
+    train_util.verify_training_args(args)
+    train_util.prepare_dataset_args(args, True)
+    # sdxl_train_util.verify_sdxl_training_args(args)
+    deepspeed_utils.prepare_deepspeed_args(args)
+    setup_logging(args, reset=True)
+
+    # temporary: backward compatibility for deprecated options. remove in the future
+    if not args.skip_cache_check:
+        args.skip_cache_check = args.skip_latents_validity_check
+
+    # assert (
+    #     not args.weighted_captions
+    # ), "weighted_captions is not supported currently / weighted_captionsは現在サポートされていません"
+    if args.cache_text_encoder_outputs_to_disk and not args.cache_text_encoder_outputs:
+        logger.warning(
+            "cache_text_encoder_outputs_to_disk is enabled, so cache_text_encoder_outputs is also enabled / cache_text_encoder_outputs_to_diskが有効になっているため、cache_text_encoder_outputsも有効になります"
+        )
+        args.cache_text_encoder_outputs = True
+
+    if args.cpu_offload_checkpointing and not args.gradient_checkpointing:
+        logger.warning(
+            "cpu_offload_checkpointing is enabled, so gradient_checkpointing is also enabled / cpu_offload_checkpointingが有効になっているため、gradient_checkpointingも有効になります"
+        )
+        args.gradient_checkpointing = True
+
+    assert (
+        args.blocks_to_swap is None or args.blocks_to_swap == 0
+    ) or not args.cpu_offload_checkpointing, (
+        "blocks_to_swap is not supported with cpu_offload_checkpointing / blocks_to_swapはcpu_offload_checkpointingと併用できません"
+    )
+
+    cache_latents = args.cache_latents
+    use_dreambooth_method = args.in_json is None
+
+    if args.seed is not None:
+        set_seed(args.seed)  # initialize the random number sequence
+
+    # prepare caching strategy: this must be set before preparing the dataset, because the dataset may use this strategy for initialization
+    if args.cache_latents:
+        latents_caching_strategy = strategy_flux.FluxLatentsCachingStrategy(
+            args.cache_latents_to_disk, args.vae_batch_size, args.skip_cache_check
+        )
+        strategy_base.LatentsCachingStrategy.set_strategy(latents_caching_strategy)
+
+    # prepare the dataset
+    if args.dataset_class is None:
+        blueprint_generator = BlueprintGenerator(ConfigSanitizer(True, True, args.masked_loss, True))
+        if args.dataset_config is not None:
+            logger.info(f"Load dataset config from {args.dataset_config}")
+            user_config = config_util.load_user_config(args.dataset_config)
+            ignored = ["train_data_dir", "in_json"]
+            if any(getattr(args, attr) is not None for attr in ignored):
+                logger.warning(
+                    "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(
+                        ", ".join(ignored)
+                    )
+                )
+        else:
+            if use_dreambooth_method:
+                logger.info("Using DreamBooth method.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(
+                                args.train_data_dir, args.reg_data_dir
+                            )
+                        }
+                    ]
+                }
+            else:
+                logger.info("Training with captions.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": [
+                                {
+                                    "image_dir": args.train_data_dir,
+                                    "metadata_file": args.in_json,
+                                }
+                            ]
+                        }
+                    ]
+                }
+
+        blueprint = blueprint_generator.generate(user_config, args)
+        train_dataset_group, val_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+    else:
+        train_dataset_group = train_util.load_arbitrary_dataset(args)
+        val_dataset_group = None
+
+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collator = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)
+
+    train_dataset_group.verify_bucket_reso_steps(16)  # TODO confirm this is correct
+
+    _, is_schnell, _, _ = flux_utils.analyze_checkpoint_state(args.pretrained_model_name_or_path)
+    if args.debug_dataset:
+        if args.cache_text_encoder_outputs:
+            strategy_base.TextEncoderOutputsCachingStrategy.set_strategy(
+                strategy_flux.FluxTextEncoderOutputsCachingStrategy(
+                    args.cache_text_encoder_outputs_to_disk, args.text_encoder_batch_size, args.skip_cache_check, False
+                )
+            )
+        t5xxl_max_token_length = (
+            args.t5xxl_max_token_length if args.t5xxl_max_token_length is not None else (256 if is_schnell else 512)
+        )
+        strategy_base.TokenizeStrategy.set_strategy(strategy_flux.FluxTokenizeStrategy(t5xxl_max_token_length))
+
+        train_dataset_group.set_current_strategies()
+        train_util.debug_dataset(train_dataset_group, True)
+        return
+    if len(train_dataset_group) == 0:
+        logger.error(
+            "No data found. Please verify the metadata file and train_data_dir option. / 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。"
+        )
+        return
+
+    if cache_latents:
+        assert (
+            train_dataset_group.is_latent_cacheable()
+        ), "when caching latents, neither color_aug nor random_crop can be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
+
+    if args.cache_text_encoder_outputs:
+        assert (
+            train_dataset_group.is_text_encoder_output_cacheable()
+        ), "when caching text encoder output, none of caption_dropout_rate, shuffle_caption, token_warmup_step or caption_tag_dropout_rate can be used / text encoderの出力をキャッシュするときはcaption_dropout_rate, shuffle_caption, token_warmup_step, caption_tag_dropout_rateは使えません"
+
+    # prepare the accelerator
+    logger.info("prepare accelerator")
+    accelerator = train_util.prepare_accelerator(args)
+
+    # prepare dtypes that match the mixed precision setting and cast as needed
+    weight_dtype, save_dtype = train_util.prepare_dtype(args)
+
+    # load the models
+
+    # load VAE for caching latents
+    ae = None
+    if cache_latents:
+        ae = flux_utils.load_ae(args.ae, weight_dtype, "cpu", args.disable_mmap_load_safetensors)
+        ae.to(accelerator.device, dtype=weight_dtype)
+        ae.requires_grad_(False)
+        ae.eval()
+
+        train_dataset_group.new_cache_latents(ae, accelerator)
+
+        ae.to("cpu")  # if no sampling, vae can be deleted
+        clean_memory_on_device(accelerator.device)
+
+        accelerator.wait_for_everyone()
+
+    # prepare tokenize strategy
+    if args.t5xxl_max_token_length is None:
+        if is_schnell:
+            t5xxl_max_token_length = 256
+        else:
+            t5xxl_max_token_length = 512
+    else:
+        t5xxl_max_token_length = args.t5xxl_max_token_length
+
+    flux_tokenize_strategy = strategy_flux.FluxTokenizeStrategy(t5xxl_max_token_length)
+    strategy_base.TokenizeStrategy.set_strategy(flux_tokenize_strategy)
+
+    # load clip_l, t5xxl for caching text encoder outputs
+    clip_l = flux_utils.load_clip_l(args.clip_l, weight_dtype, "cpu", args.disable_mmap_load_safetensors)
+    t5xxl = flux_utils.load_t5xxl(args.t5xxl, weight_dtype, "cpu", args.disable_mmap_load_safetensors)
+    clip_l.eval()
+    t5xxl.eval()
+    clip_l.requires_grad_(False)
+    t5xxl.requires_grad_(False)
+
+    text_encoding_strategy = strategy_flux.FluxTextEncodingStrategy(args.apply_t5_attn_mask)
+    strategy_base.TextEncodingStrategy.set_strategy(text_encoding_strategy)
+
+    # cache text encoder outputs
+    sample_prompts_te_outputs = None
+    if args.cache_text_encoder_outputs:
+        # Text Encoders are in eval mode with gradients disabled here
+        clip_l.to(accelerator.device)
+        t5xxl.to(accelerator.device)
+
+        text_encoder_caching_strategy = strategy_flux.FluxTextEncoderOutputsCachingStrategy(
+            args.cache_text_encoder_outputs_to_disk, args.text_encoder_batch_size, False, False, args.apply_t5_attn_mask
+        )
+        strategy_base.TextEncoderOutputsCachingStrategy.set_strategy(text_encoder_caching_strategy)
+
+        with accelerator.autocast():
+            train_dataset_group.new_cache_text_encoder_outputs([clip_l, t5xxl], accelerator)
+
+        # cache sample prompt's embeddings to free text encoder's memory
+        if args.sample_prompts is not None:
+            logger.info(f"cache Text Encoder outputs for sample prompt: {args.sample_prompts}")
+
+            text_encoding_strategy: strategy_flux.FluxTextEncodingStrategy = strategy_base.TextEncodingStrategy.get_strategy()
+
+            prompts = train_util.load_prompts(args.sample_prompts)
+            sample_prompts_te_outputs = {}  # key: prompt, value: text encoder outputs
+            with accelerator.autocast(), torch.no_grad():
+                for prompt_dict in prompts:
+                    for p in [prompt_dict.get("prompt", ""), prompt_dict.get("negative_prompt", "")]:
+                        if p not in sample_prompts_te_outputs:
+                            logger.info(f"cache Text Encoder outputs for prompt: {p}")
+                            tokens_and_masks = flux_tokenize_strategy.tokenize(p)
+                            sample_prompts_te_outputs[p] = text_encoding_strategy.encode_tokens(
+                                flux_tokenize_strategy, [clip_l, t5xxl], tokens_and_masks, args.apply_t5_attn_mask
+                            )
+
+        accelerator.wait_for_everyone()
+
+        # now we can delete Text Encoders to free memory
+        clip_l = None
+        t5xxl = None
+        clean_memory_on_device(accelerator.device)
+
+    # load FLUX
+    _, flux = flux_utils.load_flow_model(
+        args.pretrained_model_name_or_path, weight_dtype, "cpu", args.disable_mmap_load_safetensors
+    )
+
+    if args.gradient_checkpointing:
+        flux.enable_gradient_checkpointing(cpu_offload=args.cpu_offload_checkpointing)
+
+    flux.requires_grad_(True)
+
+    # block swap
+
+    # backward compatibility
+    if args.blocks_to_swap is None:
+        blocks_to_swap = args.double_blocks_to_swap or 0
+        if args.single_blocks_to_swap is not None:
+            blocks_to_swap += args.single_blocks_to_swap // 2
+        if blocks_to_swap > 0:
+            logger.warning(
+                "double_blocks_to_swap and single_blocks_to_swap are deprecated. Use blocks_to_swap instead."
+                " / double_blocks_to_swapとsingle_blocks_to_swapは非推奨です。blocks_to_swapを使ってください。"
+            )
+            logger.info(
+                f"double_blocks_to_swap={args.double_blocks_to_swap} and single_blocks_to_swap={args.single_blocks_to_swap} are converted to blocks_to_swap={blocks_to_swap}."
+            )
+            args.blocks_to_swap = blocks_to_swap
+        del blocks_to_swap
+
+    is_swapping_blocks = args.blocks_to_swap is not None and args.blocks_to_swap > 0
+    if is_swapping_blocks:
+        # Swap blocks between CPU and GPU to reduce memory usage, in forward and backward passes.
+        # This idea is based on 2kpr's great work. Thank you!
+        logger.info(f"enable block swap: blocks_to_swap={args.blocks_to_swap}")
+        flux.enable_block_swap(args.blocks_to_swap, accelerator.device)
+
+    if not cache_latents:
+        # load VAE here if not cached
+        ae = flux_utils.load_ae(args.ae, weight_dtype, "cpu", args.disable_mmap_load_safetensors)
+        ae.requires_grad_(False)
+        ae.eval()
+        ae.to(accelerator.device, dtype=weight_dtype)
+
+    training_models = []
+    params_to_optimize = []
+    training_models.append(flux)
+    name_and_params = list(flux.named_parameters())
+    # single param group for now
+    params_to_optimize.append({"params": [p for _, p in name_and_params], "lr": args.learning_rate})
+    param_names = [[n for n, _ in name_and_params]]
+
+    # calculate number of trainable parameters
+    n_params = 0
+    for group in params_to_optimize:
+        for p in group["params"]:
+            n_params += p.numel()
+
+    accelerator.print(f"number of trainable parameters: {n_params}")
+
+    # prepare the classes needed for training
+    accelerator.print("prepare optimizer, data loader etc.")
+
+    if args.blockwise_fused_optimizers:
+        # fused backward pass: https://pytorch.org/tutorials/intermediate/optimizer_step_in_backward_tutorial.html
+        # Instead of creating an optimizer for all parameters as in the tutorial, we create an optimizer for each block of parameters.
+        # This balances memory usage and management complexity.
+
+        # split params into groups. currently different learning rates are not supported
+        grouped_params = []
+        param_group = {}
+        for group in params_to_optimize:
+            named_parameters = list(flux.named_parameters())
+            assert len(named_parameters) == len(group["params"]), "number of parameters does not match"
+            for p, np in zip(group["params"], named_parameters):
+                # determine target layer and block index for each parameter
+                block_type = "other"  # double, single or other
+                if np[0].startswith("double_blocks"):
+                    block_index = int(np[0].split(".")[1])
+                    block_type = "double"
+                elif np[0].startswith("single_blocks"):
+                    block_index = int(np[0].split(".")[1])
+                    block_type = "single"
+                else:
+                    block_index = -1
+
+                param_group_key = (block_type, block_index)
+                if param_group_key not in param_group:
+                    param_group[param_group_key] = []
+                param_group[param_group_key].append(p)
+
+        block_types_and_indices = []
+        for param_group_key, block_params in param_group.items():
+            block_types_and_indices.append(param_group_key)
+            grouped_params.append({"params": block_params, "lr": args.learning_rate})
+
+            num_params = 0
+            for p in block_params:
+                num_params += p.numel()
+            accelerator.print(f"block {param_group_key}: {num_params} parameters")
+
+        # prepare optimizers for each group
+        optimizers = []
+        for group in grouped_params:
+            _, _, optimizer = train_util.get_optimizer(args, trainable_params=[group])
+            optimizers.append(optimizer)
+        optimizer = optimizers[0]  # avoid error in the following code
+
+        logger.info(f"using {len(optimizers)} optimizers for blockwise fused optimizers")
+
+        if train_util.is_schedulefree_optimizer(optimizers[0], args):
+            raise ValueError("Schedule-free optimizer is not supported with blockwise fused optimizers")
+        optimizer_train_fn = lambda: None  # dummy function
+        optimizer_eval_fn = lambda: None  # dummy function
+    else:
+        _, _, optimizer = train_util.get_optimizer(args, trainable_params=params_to_optimize)
+        optimizer_train_fn, optimizer_eval_fn = train_util.get_optimizer_train_eval_fn(optimizer, args)
+
+    # prepare dataloader
+    # strategies are set here because they cannot be referenced in another process. Copy them with the dataset
+    # some strategies can be None
+    train_dataset_group.set_current_strategies()
+
+    # number of DataLoader worker processes: note that persistent_workers cannot be used when this is 0
+    n_workers = min(args.max_data_loader_n_workers, os.cpu_count())  # cpu_count or max_data_loader_n_workers
+    train_dataloader = torch.utils.data.DataLoader(
+        train_dataset_group,
+        batch_size=1,
+        shuffle=True,
+        collate_fn=collator,
+        num_workers=n_workers,
+        persistent_workers=args.persistent_data_loader_workers,
+    )
+
+    # calculate the number of training steps
+    if args.max_train_epochs is not None:
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
+        )
+        accelerator.print(
+            f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}"
+        )
+
+    # also send the number of training steps to the dataset
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
+    # prepare the lr scheduler
+    if args.blockwise_fused_optimizers:
+        # prepare lr schedulers for each optimizer
+        lr_schedulers = [train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes) for optimizer in optimizers]
+        lr_scheduler = lr_schedulers[0]  # avoid error in the following code
+    else:
+        lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
+
+    # experimental feature: train in fp16/bf16 including gradients; cast the whole model to fp16/bf16
+    if args.full_fp16:
+        assert (
+            args.mixed_precision == "fp16"
+        ), "full_fp16 requires mixed precision='fp16' / full_fp16を使う場合はmixed_precision='fp16'を指定してください。"
+        accelerator.print("enable full fp16 training.")
+        flux.to(weight_dtype)
+        if clip_l is not None:
+            clip_l.to(weight_dtype)
+            t5xxl.to(weight_dtype)  # TODO check works with fp16 or not
+    elif args.full_bf16:
+        assert (
+            args.mixed_precision == "bf16"
+        ), "full_bf16 requires mixed precision='bf16' / full_bf16を使う場合はmixed_precision='bf16'を指定してください。"
+        accelerator.print("enable full bf16 training.")
+        flux.to(weight_dtype)
+        if clip_l is not None:
+            clip_l.to(weight_dtype)
+            t5xxl.to(weight_dtype)
+
+    # if we don't cache text encoder outputs, move them to device
+    if not args.cache_text_encoder_outputs:
+        clip_l.to(accelerator.device)
+        t5xxl.to(accelerator.device)
+
+    clean_memory_on_device(accelerator.device)
+
+    if args.deepspeed:
+        ds_model = deepspeed_utils.prepare_deepspeed_model(args, mmdit=flux)
+        # most ZeRO stages use optimizer partitioning, so we have to prepare optimizer and ds_model at the same time. # pull/1139#issuecomment-1986790007
+        ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+            ds_model, optimizer, train_dataloader, lr_scheduler
+        )
+        training_models = [ds_model]
+
+    else:
+        # accelerator does some magic
+        # if we don't swap blocks, we can move the model to device
+        flux = accelerator.prepare(flux, device_placement=[not is_swapping_blocks])
+        if is_swapping_blocks:
+            accelerator.unwrap_model(flux).move_to_device_except_swap_blocks(accelerator.device)  # reduce peak memory usage
+        optimizer, train_dataloader, lr_scheduler = accelerator.prepare(optimizer, train_dataloader, lr_scheduler)
+
+    # experimental feature: train in fp16 including gradients; patch PyTorch to enable grad scaling in fp16
+    if args.full_fp16:
+        # During DeepSpeed training, accelerate does not handle fp16/bf16 mixed precision directly via the scaler; the DeepSpeed engine does.
+        # -> But we think it's OK to patch the accelerator even if DeepSpeed is enabled.
+        train_util.patch_accelerator_for_fp16_training(accelerator)
+
+    # resume from a saved state if specified
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)
+
+    if args.fused_backward_pass:
+        # use fused optimizer for backward pass: other optimizers will be supported in the future
+        import library.adafactor_fused
+
+        library.adafactor_fused.patch_adafactor_fused(optimizer)
+
+        for param_group, param_name_group in zip(optimizer.param_groups, param_names):
+            for parameter, param_name in zip(param_group["params"], param_name_group):
+                if parameter.requires_grad:
+
+                    def create_grad_hook(p_name, p_group):
+                        def grad_hook(tensor: torch.Tensor):
+                            if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                                accelerator.clip_grad_norm_(tensor, args.max_grad_norm)
+                            optimizer.step_param(tensor, p_group)
+                            tensor.grad = None
+
+                        return grad_hook
+
+                    parameter.register_post_accumulate_grad_hook(create_grad_hook(param_name, param_group))
+
+    elif args.blockwise_fused_optimizers:
+        # prepare for additional optimizers and lr schedulers
+        for i in range(1, len(optimizers)):
+            optimizers[i] = accelerator.prepare(optimizers[i])
+            lr_schedulers[i] = accelerator.prepare(lr_schedulers[i])
+
+        # counters are used to determine when to step the optimizer
+        global optimizer_hooked_count
+        global num_parameters_per_group
+        global parameter_optimizer_map
+
+        optimizer_hooked_count = {}
+        num_parameters_per_group = [0] * len(optimizers)
+        parameter_optimizer_map = {}
+
+        for opt_idx, optimizer in enumerate(optimizers):
+            for param_group in optimizer.param_groups:
+                for parameter in param_group["params"]:
+                    if parameter.requires_grad:
+
+                        def grad_hook(parameter: torch.Tensor):
+                            if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                                accelerator.clip_grad_norm_(parameter, args.max_grad_norm)
+
+                            i = parameter_optimizer_map[parameter]
+                            optimizer_hooked_count[i] += 1
+                            if optimizer_hooked_count[i] == num_parameters_per_group[i]:
+                                optimizers[i].step()
+                                optimizers[i].zero_grad(set_to_none=True)
+
+                        parameter.register_post_accumulate_grad_hook(grad_hook)
+                        parameter_optimizer_map[parameter] = opt_idx
+                        num_parameters_per_group[opt_idx] += 1
+
+    # calculate the number of epochs
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
+    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
+    if (args.save_n_epoch_ratio is not None) and (args.save_n_epoch_ratio > 0):
+        args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
+
+    # train
+    # total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    accelerator.print("running training / 学習開始")
+    accelerator.print(f"  num examples / サンプル数: {train_dataset_group.num_train_images}")
+    accelerator.print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
+    accelerator.print(f"  num epochs / epoch数: {num_train_epochs}")
+    accelerator.print(
+        f"  batch size per device / バッチサイズ: {', '.join([str(d.batch_size) for d in train_dataset_group.datasets])}"
+    )
+    # accelerator.print(
+    #     f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}"
+    # )
+    accelerator.print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
+    accelerator.print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
+
+    progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
+    global_step = 0
+
+    noise_scheduler = FlowMatchEulerDiscreteScheduler(num_train_timesteps=1000, shift=args.discrete_flow_shift)
+    noise_scheduler_copy = copy.deepcopy(noise_scheduler)
+
+    if accelerator.is_main_process:
+        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs["wandb"] = {"name": args.wandb_run_name}
+        if args.log_tracker_config is not None:
+            init_kwargs = toml.load(args.log_tracker_config)
+        accelerator.init_trackers(
+            "finetuning" if args.log_tracker_name is None else args.log_tracker_name,
+            config=train_util.get_sanitized_config_or_none(args),
+            init_kwargs=init_kwargs,
+        )
+
+    if is_swapping_blocks:
+        accelerator.unwrap_model(flux).prepare_block_swap_before_forward()
+
+    # For --sample_at_first
+    optimizer_eval_fn()
+    flux_train_utils.sample_images(accelerator, args, 0, global_step, flux, ae, [clip_l, t5xxl], sample_prompts_te_outputs)
+    optimizer_train_fn()
+    if len(accelerator.trackers) > 0:
+        # log empty object to commit the sample images to wandb
+        accelerator.log({}, step=0)
+
+    loss_recorder = train_util.LossRecorder()
+    epoch = 0  # avoid error when max_train_steps is 0
+    for epoch in range(num_train_epochs):
+        accelerator.print(f"\nepoch {epoch+1}/{num_train_epochs}")
+        current_epoch.value = epoch + 1
+
+        for m in training_models:
+            m.train()
+
+        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
+
+            if args.blockwise_fused_optimizers:
+                optimizer_hooked_count = {i: 0 for i in range(len(optimizers))}  # reset counter for each step
+
+            with accelerator.accumulate(*training_models):
+                if "latents" in batch and batch["latents"] is not None:
+                    latents = batch["latents"].to(accelerator.device, dtype=weight_dtype)
+                else:
+                    with torch.no_grad():
+                        # encode images to latents. images are [-1, 1]
+                        latents = ae.encode(batch["images"].to(ae.dtype)).to(accelerator.device, dtype=weight_dtype)
+
+                    # if the latents contain NaN, print a warning and replace them with zeros
+                    if torch.any(torch.isnan(latents)):
+                        accelerator.print("NaN found in latents, replacing with zeros")
+                        latents = torch.nan_to_num(latents, 0, out=latents)
+
+                text_encoder_outputs_list = batch.get("text_encoder_outputs_list", None)
+                if text_encoder_outputs_list is not None:
+                    text_encoder_conds = text_encoder_outputs_list
+                else:
+                    # not cached or training, so get from text encoders
+                    tokens_and_masks = batch["input_ids_list"]
+                    with torch.no_grad():
+                        input_ids = [ids.to(accelerator.device) for ids in batch["input_ids_list"]]
+                        text_encoder_conds = text_encoding_strategy.encode_tokens(
+                            flux_tokenize_strategy, [clip_l, t5xxl], input_ids, args.apply_t5_attn_mask
+                        )
+                        if args.full_fp16:
+                            text_encoder_conds = [c.to(weight_dtype) for c in text_encoder_conds]
+
+                # TODO support some features for noise implemented in get_noise_noisy_latents_and_timesteps
+
+                # Sample noise that we'll add to the latents
+                noise = torch.randn_like(latents)
+                bsz = latents.shape[0]
+
+                # get noisy model input and timesteps
+                noisy_model_input, timesteps, sigmas = flux_train_utils.get_noisy_model_input_and_timesteps(
+                    args, noise_scheduler_copy, latents, noise, accelerator.device, weight_dtype
+                )
+
+                # pack latents and get img_ids
+                packed_noisy_model_input = flux_utils.pack_latents(noisy_model_input)  # b, c, h*2, w*2 -> b, h*w, c*4
+                packed_latent_height, packed_latent_width = noisy_model_input.shape[2] // 2, noisy_model_input.shape[3] // 2
+                img_ids = flux_utils.prepare_img_ids(bsz, packed_latent_height, packed_latent_width).to(device=accelerator.device)
+
+                # get guidance: ensure args.guidance_scale is float
+                guidance_vec = torch.full((bsz,), float(args.guidance_scale), device=accelerator.device)
+
+                # call model
+                l_pooled, t5_out, txt_ids, t5_attn_mask = text_encoder_conds
+                if not args.apply_t5_attn_mask:
+                    t5_attn_mask = None
+
+                with accelerator.autocast():
+                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
+                    model_pred = flux(
+                        img=packed_noisy_model_input,
+                        img_ids=img_ids,
+                        txt=t5_out,
+                        txt_ids=txt_ids,
+                        y=l_pooled,
+                        timesteps=timesteps / 1000,
+                        guidance=guidance_vec,
+                        txt_attention_mask=t5_attn_mask,
+                    )
+
+                # unpack latents
+                model_pred = flux_utils.unpack_latents(model_pred, packed_latent_height, packed_latent_width)
+
+                # apply model prediction type
+                model_pred, weighting = flux_train_utils.apply_model_prediction_type(args, model_pred, noisy_model_input, sigmas)
+
+                # flow matching loss: this is different from SD3
+                target = noise - latents
+
+                # calculate loss
+                huber_c = train_util.get_huber_threshold_if_needed(args, timesteps, noise_scheduler)
+                loss = train_util.conditional_loss(model_pred.float(), target.float(), args.loss_type, "none", huber_c)
+                if weighting is not None:
+                    loss = loss * weighting
+                if args.masked_loss or ("alpha_masks" in batch and batch["alpha_masks"] is not None):
+                    loss = apply_masked_loss(loss, batch)
+                loss = loss.mean([1, 2, 3])
+
+                loss_weights = batch["loss_weights"]  # weight for each sample
+                loss = loss * loss_weights
+                loss = loss.mean()
+
+                # backward
+                accelerator.backward(loss)
+
+                if not (args.fused_backward_pass or args.blockwise_fused_optimizers):
+                    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                        params_to_clip = []
+                        for m in training_models:
+                            params_to_clip.extend(m.parameters())
+                        accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
+
+                    optimizer.step()
+                    lr_scheduler.step()
+                    optimizer.zero_grad(set_to_none=True)
+                else:
+                    # optimizer.step() and optimizer.zero_grad() are called in the optimizer hook
+                    lr_scheduler.step()
+                    if args.blockwise_fused_optimizers:
+                        for i in range(1, len(optimizers)):
+                            lr_schedulers[i].step()
+
+            # Checks if the accelerator has performed an optimization step behind the scenes
+            if accelerator.sync_gradients:
+                progress_bar.update(1)
+                global_step += 1
+
+                optimizer_eval_fn()
+                flux_train_utils.sample_images(
+                    accelerator, args, None, global_step, flux, ae, [clip_l, t5xxl], sample_prompts_te_outputs
+                )
+
+                # save the model every specified number of steps
+                if args.save_every_n_steps is not None and global_step % args.save_every_n_steps == 0:
+                    accelerator.wait_for_everyone()
+                    if accelerator.is_main_process:
+                        flux_train_utils.save_flux_model_on_epoch_end_or_stepwise(
+                            args,
+                            False,
+                            accelerator,
+                            save_dtype,
+                            epoch,
+                            num_train_epochs,
+                            global_step,
+                            accelerator.unwrap_model(flux),
+                        )
+                optimizer_train_fn()
+
+            current_loss = loss.detach().item()  # this is a mean, so batch size should not matter
+            if len(accelerator.trackers) > 0:
+                logs = {"loss": current_loss}
+                train_util.append_lr_to_logs(logs, lr_scheduler, args.optimizer_type, including_unet=True)
+
+                accelerator.log(logs, step=global_step)
+
+            loss_recorder.add(epoch=epoch, step=step, loss=current_loss)
+            avr_loss: float = loss_recorder.moving_average
+            logs = {"avr_loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
+            progress_bar.set_postfix(**logs)
+
+            if global_step >= args.max_train_steps:
+                break
+
+        if len(accelerator.trackers) > 0:
+            logs = {"loss/epoch": loss_recorder.moving_average}
+            accelerator.log(logs, step=epoch + 1)
+
+        accelerator.wait_for_everyone()
+
+        optimizer_eval_fn()
+        if args.save_every_n_epochs is not None:
+            if accelerator.is_main_process:
+                flux_train_utils.save_flux_model_on_epoch_end_or_stepwise(
+                    args,
+                    True,
+                    accelerator,
+                    save_dtype,
+                    epoch,
+                    num_train_epochs,
+                    global_step,
+                    accelerator.unwrap_model(flux),
+                )
+
+        flux_train_utils.sample_images(
+            accelerator, args, epoch + 1, global_step, flux, ae, [clip_l, t5xxl], sample_prompts_te_outputs
+        )
+        optimizer_train_fn()
+
+    is_main_process = accelerator.is_main_process
+    # if is_main_process:
+    flux = accelerator.unwrap_model(flux)
+
+    accelerator.end_training()
+    optimizer_eval_fn()
+
+    if args.save_state or args.save_state_on_train_end:
+        train_util.save_state_on_train_end(args, accelerator)
+
+    del accelerator  # delete this because memory is needed afterwards
+
+    if is_main_process:
+        flux_train_utils.save_flux_model_on_train_end(args, save_dtype, epoch, global_step, flux)
+        logger.info("model saved.")
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+
+    add_logging_arguments(parser)
+    train_util.add_sd_models_arguments(parser)  # TODO split this
+    train_util.add_dataset_arguments(parser, True, True, True)
+    train_util.add_training_arguments(parser, False)
+    train_util.add_masked_loss_arguments(parser)
+    deepspeed_utils.add_deepspeed_arguments(parser)
+    train_util.add_sd_saving_arguments(parser)
+    train_util.add_optimizer_arguments(parser)
+    config_util.add_config_arguments(parser)
+    add_custom_train_arguments(parser)  # TODO remove this from here
+    train_util.add_dit_training_arguments(parser)
+    flux_train_utils.add_flux_train_arguments(parser)
+
+    parser.add_argument(
+        "--mem_eff_save",
+        action="store_true",
+        help="[EXPERIMENTAL] use memory efficient custom model saving method / メモリ効率の良い独自のモデル保存方法を使う",
+    )
+
+    parser.add_argument(
+        "--fused_optimizer_groups",
+        type=int,
+        default=None,
+        help="**this option is not working** will be removed in the future / このオプションは動作しません。将来削除されます",
+    )
+    parser.add_argument(
+        "--blockwise_fused_optimizers",
+        action="store_true",
+        help="enable blockwise optimizers for fused backward pass and optimizer step / fused backward passとoptimizer step のためブロック単位のoptimizerを有効にする",
+    )
+    parser.add_argument(
+        "--skip_latents_validity_check",
+        action="store_true",
+        help="[Deprecated] use 'skip_cache_check' instead / 代わりに 'skip_cache_check' を使用してください",
+    )
+    parser.add_argument(
+        "--double_blocks_to_swap",
+        type=int,
+        default=None,
+        help="[Deprecated] use 'blocks_to_swap' instead / 代わりに 'blocks_to_swap' を使用してください",
+    )
+    parser.add_argument(
+        "--single_blocks_to_swap",
+        type=int,
+        default=None,
+        help="[Deprecated] use 'blocks_to_swap' instead / 代わりに 'blocks_to_swap' を使用してください",
+    )
+    parser.add_argument(
+        "--cpu_offload_checkpointing",
+        action="store_true",
+        help="[EXPERIMENTAL] enable offloading of tensors to CPU during checkpointing / チェックポイント時にテンソルをCPUにオフロードする",
+    )
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    train_util.verify_command_line_training_args(args)
+    args = train_util.read_config_from_file(args, parser)
+
+    train(args)
--- a/flux_train_control_net.py
+++ b/flux_train_control_net.py
@@ -0,0 +1,878 @@
+# training with captions
+
+# Swap blocks between CPU and GPU:
+# This implementation is inspired by and based on the work of 2kpr.
+# Many thanks to 2kpr for the original concept and implementation of memory-efficient offloading.
+# The original idea has been adapted and extended to fit the current project's needs.
+
+# Key features:
+# - CPU offloading during forward and backward passes
+# - Use of fused optimizer and grad_hook for efficient gradient processing
+# - Per-block fused optimizer instances
+
+import argparse
+import copy
+import math
+import os
+import time
+from concurrent.futures import ThreadPoolExecutor
+from multiprocessing import Value
+from typing import List, Optional, Tuple, Union
+
+import toml
+import torch
+import torch.nn as nn
+from tqdm import tqdm
+
+from library import utils
+from library.device_utils import clean_memory_on_device, init_ipex
+
+init_ipex()
+
+from accelerate.utils import set_seed
+
+import library.train_util as train_util
+from library import (
+    deepspeed_utils,
+    flux_train_utils,
+    flux_utils,
+    strategy_base,
+    strategy_flux,
+)
+from library.sd3_train_utils import FlowMatchEulerDiscreteScheduler
+from library.utils import add_logging_arguments, setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+import library.config_util as config_util
+
+# import library.sdxl_train_util as sdxl_train_util
+from library.config_util import (
+    BlueprintGenerator,
+    ConfigSanitizer,
+)
+from library.custom_train_functions import add_custom_train_arguments, apply_masked_loss
+
+
+def train(args):
+    train_util.verify_training_args(args)
+    train_util.prepare_dataset_args(args, True)
+    # sdxl_train_util.verify_sdxl_training_args(args)
+    deepspeed_utils.prepare_deepspeed_args(args)
+    setup_logging(args, reset=True)
+
+    # temporary: backward compatibility for deprecated options. remove in the future
+    if not args.skip_cache_check:
+        args.skip_cache_check = args.skip_latents_validity_check
+
+    # assert (
+    #     not args.weighted_captions
+    # ), "weighted_captions is not supported currently / weighted_captionsは現在サポートされていません"
+    if args.cache_text_encoder_outputs_to_disk and not args.cache_text_encoder_outputs:
+        logger.warning(
+            "cache_text_encoder_outputs_to_disk is enabled, so cache_text_encoder_outputs is also enabled / cache_text_encoder_outputs_to_diskが有効になっているため、cache_text_encoder_outputsも有効になります"
+        )
+        args.cache_text_encoder_outputs = True
+
+    if args.cpu_offload_checkpointing and not args.gradient_checkpointing:
+        logger.warning(
+            "cpu_offload_checkpointing is enabled, so gradient_checkpointing is also enabled / cpu_offload_checkpointingが有効になっているため、gradient_checkpointingも有効になります"
+        )
+        args.gradient_checkpointing = True
+
+    assert (
+        args.blocks_to_swap is None or args.blocks_to_swap == 0
+    ) or not args.cpu_offload_checkpointing, (
+        "blocks_to_swap is not supported with cpu_offload_checkpointing / blocks_to_swapはcpu_offload_checkpointingと併用できません"
+    )
+
+    cache_latents = args.cache_latents
+
+    if args.seed is not None:
+        set_seed(args.seed)  # initialize the random number sequence
+
+    # prepare caching strategy: this must be set before preparing the dataset, because the dataset may use this strategy for initialization.
+    if args.cache_latents:
+        latents_caching_strategy = strategy_flux.FluxLatentsCachingStrategy(
+            args.cache_latents_to_disk, args.vae_batch_size, args.skip_cache_check
+        )
+        strategy_base.LatentsCachingStrategy.set_strategy(latents_caching_strategy)
+
+    # prepare the dataset
+    if args.dataset_class is None:
+        blueprint_generator = BlueprintGenerator(ConfigSanitizer(False, False, True, True))
+        if args.dataset_config is not None:
+            logger.info(f"Load dataset config from {args.dataset_config}")
+            user_config = config_util.load_user_config(args.dataset_config)
+            ignored = ["train_data_dir", "conditioning_data_dir"]
+            if any(getattr(args, attr) is not None for attr in ignored):
+                logger.warning(
+                    "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(
+                        ", ".join(ignored)
+                    )
+                )
+        else:
+            user_config = {
+                "datasets": [
+                    {
+                        "subsets": config_util.generate_controlnet_subsets_config_by_subdirs(
+                            args.train_data_dir, args.conditioning_data_dir, args.caption_extension
+                        )
+                    }
+                ]
+            }
+
+        blueprint = blueprint_generator.generate(user_config, args)
+        train_dataset_group, val_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+    else:
+        train_dataset_group = train_util.load_arbitrary_dataset(args)
+        val_dataset_group = None
+
+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collator = train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)
+
+    train_dataset_group.verify_bucket_reso_steps(16)  # TODO check whether this is OK
+
+    _, is_schnell, _, _ = flux_utils.analyze_checkpoint_state(args.pretrained_model_name_or_path)
+    if args.debug_dataset:
+        if args.cache_text_encoder_outputs:
+            strategy_base.TextEncoderOutputsCachingStrategy.set_strategy(
+                strategy_flux.FluxTextEncoderOutputsCachingStrategy(
+                    args.cache_text_encoder_outputs_to_disk, args.text_encoder_batch_size, args.skip_cache_check, False
+                )
+            )
+        t5xxl_max_token_length = (
+            args.t5xxl_max_token_length if args.t5xxl_max_token_length is not None else (256 if is_schnell else 512)
+        )
+        strategy_base.TokenizeStrategy.set_strategy(strategy_flux.FluxTokenizeStrategy(t5xxl_max_token_length))
+
+        train_dataset_group.set_current_strategies()
+        train_util.debug_dataset(train_dataset_group, True)
+        return
+    if len(train_dataset_group) == 0:
+        logger.error(
+            "No data found. Please verify the metadata file and train_data_dir option. / 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。"
+        )
+        return
+
+    if cache_latents:
+        assert (
+            train_dataset_group.is_latent_cacheable()
+        ), "when caching latents, neither color_aug nor random_crop can be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
+
+    if args.cache_text_encoder_outputs:
+        assert (
+            train_dataset_group.is_text_encoder_output_cacheable()
+        ), "when caching text encoder output, none of caption_dropout_rate, shuffle_caption, token_warmup_step or caption_tag_dropout_rate can be used / text encoderの出力をキャッシュするときはcaption_dropout_rate, shuffle_caption, token_warmup_step, caption_tag_dropout_rateは使えません"
+
+    # prepare the accelerator
+    logger.info("prepare accelerator")
+    accelerator = train_util.prepare_accelerator(args)
+
+    # prepare dtypes matching mixed precision and cast as needed
+    weight_dtype, save_dtype = train_util.prepare_dtype(args)
+
+    # load the models
+
+    # load VAE for caching latents
+    ae = None
+    if cache_latents:
+        ae = flux_utils.load_ae(args.ae, weight_dtype, "cpu", args.disable_mmap_load_safetensors)
+        ae.to(accelerator.device, dtype=weight_dtype)
+        ae.requires_grad_(False)
+        ae.eval()
+
+        train_dataset_group.new_cache_latents(ae, accelerator)
+
+        ae.to("cpu")  # if no sampling, vae can be deleted
+        clean_memory_on_device(accelerator.device)
+
+        accelerator.wait_for_everyone()
+
+    # prepare tokenize strategy
+    if args.t5xxl_max_token_length is None:
+        if is_schnell:
+            t5xxl_max_token_length = 256
+        else:
+            t5xxl_max_token_length = 512
+    else:
+        t5xxl_max_token_length = args.t5xxl_max_token_length
+
+    flux_tokenize_strategy = strategy_flux.FluxTokenizeStrategy(t5xxl_max_token_length)
+    strategy_base.TokenizeStrategy.set_strategy(flux_tokenize_strategy)
+
+    # load clip_l, t5xxl for caching text encoder outputs
+    clip_l = flux_utils.load_clip_l(args.clip_l, weight_dtype, "cpu", args.disable_mmap_load_safetensors)
+    t5xxl = flux_utils.load_t5xxl(args.t5xxl, weight_dtype, "cpu", args.disable_mmap_load_safetensors)
+    clip_l.eval()
+    t5xxl.eval()
+    clip_l.requires_grad_(False)
+    t5xxl.requires_grad_(False)
+
+    text_encoding_strategy = strategy_flux.FluxTextEncodingStrategy(args.apply_t5_attn_mask)
+    strategy_base.TextEncodingStrategy.set_strategy(text_encoding_strategy)
+
+    # cache text encoder outputs
+    sample_prompts_te_outputs = None
+    if args.cache_text_encoder_outputs:
+        # Text Encoders are in eval mode with no grad here
+        clip_l.to(accelerator.device)
+        t5xxl.to(accelerator.device)
+
+        text_encoder_caching_strategy = strategy_flux.FluxTextEncoderOutputsCachingStrategy(
+            args.cache_text_encoder_outputs_to_disk, args.text_encoder_batch_size, False, False, args.apply_t5_attn_mask
+        )
+        strategy_base.TextEncoderOutputsCachingStrategy.set_strategy(text_encoder_caching_strategy)
+
+        with accelerator.autocast():
+            train_dataset_group.new_cache_text_encoder_outputs([clip_l, t5xxl], accelerator)
+
+        # cache sample prompt's embeddings to free text encoder's memory
+        if args.sample_prompts is not None:
+            logger.info(f"cache Text Encoder outputs for sample prompt: {args.sample_prompts}")
+
+            text_encoding_strategy: strategy_flux.FluxTextEncodingStrategy = strategy_base.TextEncodingStrategy.get_strategy()
+
+            prompts = train_util.load_prompts(args.sample_prompts)
+            sample_prompts_te_outputs = {}  # key: prompt, value: text encoder outputs
+            with accelerator.autocast(), torch.no_grad():
+                for prompt_dict in prompts:
+                    for p in [prompt_dict.get("prompt", ""), prompt_dict.get("negative_prompt", "")]:
+                        if p not in sample_prompts_te_outputs:
+                            logger.info(f"cache Text Encoder outputs for prompt: {p}")
+                            tokens_and_masks = flux_tokenize_strategy.tokenize(p)
+                            sample_prompts_te_outputs[p] = text_encoding_strategy.encode_tokens(
+                                flux_tokenize_strategy, [clip_l, t5xxl], tokens_and_masks, args.apply_t5_attn_mask
+                            )
+
+        accelerator.wait_for_everyone()
+
+        # now we can delete Text Encoders to free memory
+        clip_l = None
+        t5xxl = None
+        clean_memory_on_device(accelerator.device)
+
+    # load FLUX
+    is_schnell, flux = flux_utils.load_flow_model(
+        args.pretrained_model_name_or_path, weight_dtype, "cpu", args.disable_mmap_load_safetensors
+    )
+    flux.requires_grad_(False)
+
+    # load controlnet
+    controlnet_dtype = torch.float32 if args.deepspeed else weight_dtype
+    controlnet = flux_utils.load_controlnet(
+        args.controlnet_model_name_or_path, is_schnell, controlnet_dtype, accelerator.device, args.disable_mmap_load_safetensors
+    )
+    controlnet.train()
+
+    if args.gradient_checkpointing:
+        if not args.deepspeed:
+            flux.enable_gradient_checkpointing(cpu_offload=args.cpu_offload_checkpointing)
+        controlnet.enable_gradient_checkpointing(cpu_offload=args.cpu_offload_checkpointing)
+
+    # block swap
+
+    # backward compatibility
+    if args.blocks_to_swap is None:
+        blocks_to_swap = args.double_blocks_to_swap or 0
+        if args.single_blocks_to_swap is not None:
+            blocks_to_swap += args.single_blocks_to_swap // 2
+        if blocks_to_swap > 0:
+            logger.warning(
+                "double_blocks_to_swap and single_blocks_to_swap are deprecated. Use blocks_to_swap instead."
+                " / double_blocks_to_swapとsingle_blocks_to_swapは非推奨です。blocks_to_swapを使ってください。"
+            )
+            logger.info(
+                f"double_blocks_to_swap={args.double_blocks_to_swap} and single_blocks_to_swap={args.single_blocks_to_swap} are converted to blocks_to_swap={blocks_to_swap}."
+            )
+            args.blocks_to_swap = blocks_to_swap
+        del blocks_to_swap
+
+    is_swapping_blocks = args.blocks_to_swap is not None and args.blocks_to_swap > 0
+    if is_swapping_blocks:
+        # Swap blocks between CPU and GPU to reduce memory usage, in forward and backward passes.
+        # This idea is based on 2kpr's great work. Thank you!
+        logger.info(f"enable block swap: blocks_to_swap={args.blocks_to_swap}")
+        flux.enable_block_swap(args.blocks_to_swap, accelerator.device)
+        flux.move_to_device_except_swap_blocks(accelerator.device)  # reduce peak memory usage
+        # ControlNet only has two blocks, so we can keep it on GPU
+        # controlnet.enable_block_swap(args.blocks_to_swap, accelerator.device)
+    else:
+        flux.to(accelerator.device)
+
+    if not cache_latents:
+        # load VAE here if not cached
+        ae = flux_utils.load_ae(args.ae, weight_dtype, "cpu")
+        ae.requires_grad_(False)
+        ae.eval()
+        ae.to(accelerator.device, dtype=weight_dtype)
+
+    training_models = []
+    params_to_optimize = []
+    training_models.append(controlnet)
+    name_and_params = list(controlnet.named_parameters())
+    # single param group for now
+    params_to_optimize.append({"params": [p for _, p in name_and_params], "lr": args.learning_rate})
+    param_names = [[n for n, _ in name_and_params]]
+
+    # calculate number of trainable parameters
+    n_params = 0
+    for group in params_to_optimize:
+        for p in group["params"]:
+            n_params += p.numel()
+
+    accelerator.print(f"number of trainable parameters: {n_params}")
+
+    # prepare the classes needed for training
+    accelerator.print("prepare optimizer, data loader etc.")
+
+    if args.blockwise_fused_optimizers:
+        # fused backward pass: https://pytorch.org/tutorials/intermediate/optimizer_step_in_backward_tutorial.html
+        # Instead of creating an optimizer for all parameters as in the tutorial, we create an optimizer for each block of parameters.
+        # This balances memory usage and management complexity.
+
+        # split params into groups. currently different learning rates are not supported
+        grouped_params = []
+        params_by_block = {}  # key: (block_type, block_index), value: list of parameters
+        for group in params_to_optimize:
+            named_parameters = list(controlnet.named_parameters())
+            assert len(named_parameters) == len(group["params"]), "number of parameters does not match"
+            for p, (name, _) in zip(group["params"], named_parameters):
+                # determine target layer and block index for each parameter
+                block_type = "other"  # double, single or other
+                if name.startswith("double_blocks"):
+                    block_index = int(name.split(".")[1])
+                    block_type = "double"
+                elif name.startswith("single_blocks"):
+                    block_index = int(name.split(".")[1])
+                    block_type = "single"
+                else:
+                    block_index = -1
+
+                param_group_key = (block_type, block_index)
+                if param_group_key not in params_by_block:
+                    params_by_block[param_group_key] = []
+                params_by_block[param_group_key].append(p)
+
+        block_types_and_indices = []
+        for param_group_key, param_group in params_by_block.items():
+            block_types_and_indices.append(param_group_key)
+            grouped_params.append({"params": param_group, "lr": args.learning_rate})
+
+            num_params = 0
+            for p in param_group:
+                num_params += p.numel()
+            accelerator.print(f"block {param_group_key}: {num_params} parameters")
+
+        # prepare optimizers for each group
+        optimizers = []
+        for group in grouped_params:
+            _, _, optimizer = train_util.get_optimizer(args, trainable_params=[group])
+            optimizers.append(optimizer)
+        optimizer = optimizers[0]  # avoid error in the following code
+
+        logger.info(f"using {len(optimizers)} optimizers for blockwise fused optimizers")
+
+        if train_util.is_schedulefree_optimizer(optimizers[0], args):
+            raise ValueError("Schedule-free optimizer is not supported with blockwise fused optimizers")
+        optimizer_train_fn = lambda: None  # dummy function
+        optimizer_eval_fn = lambda: None  # dummy function
+    else:
+        _, _, optimizer = train_util.get_optimizer(args, trainable_params=params_to_optimize)
+        optimizer_train_fn, optimizer_eval_fn = train_util.get_optimizer_train_eval_fn(optimizer, args)
+
+    # prepare dataloader
+    # strategies are set here because they cannot be referenced in another process. Copy them with the dataset
+    # some strategies can be None
+    train_dataset_group.set_current_strategies()
+
+    # number of DataLoader worker processes: note that persistent_workers cannot be used with 0
+    n_workers = min(args.max_data_loader_n_workers, os.cpu_count() or 1)  # os.cpu_count() can return None, so fall back to 1
+    train_dataloader = torch.utils.data.DataLoader(
+        train_dataset_group,
+        batch_size=1,
+        shuffle=True,
+        collate_fn=collator,
+        num_workers=n_workers,
+        persistent_workers=args.persistent_data_loader_workers,
+    )
+
+    # calculate the number of training steps
+    if args.max_train_epochs is not None:
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader) / accelerator.num_processes / args.gradient_accumulation_steps
+        )
+        accelerator.print(
+            f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}"
+        )
+
+    # also send the number of training steps to the dataset
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
+    # prepare the lr scheduler
+    if args.blockwise_fused_optimizers:
+        # prepare lr schedulers for each optimizer
+        lr_schedulers = [train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes) for optimizer in optimizers]
+        lr_scheduler = lr_schedulers[0]  # avoid error in the following code
+    else:
+        lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
+
+    # experimental feature: train in fp16/bf16 including gradients; cast the whole model to fp16/bf16
+    if args.full_fp16:
+        assert (
+            args.mixed_precision == "fp16"
+        ), "full_fp16 requires mixed precision='fp16' / full_fp16を使う場合はmixed_precision='fp16'を指定してください。"
+        accelerator.print("enable full fp16 training.")
+        flux.to(weight_dtype)
+        controlnet.to(weight_dtype)
+        if clip_l is not None:
+            clip_l.to(weight_dtype)
+            t5xxl.to(weight_dtype)  # TODO check works with fp16 or not
+    elif args.full_bf16:
+        assert (
+            args.mixed_precision == "bf16"
+        ), "full_bf16 requires mixed precision='bf16' / full_bf16を使う場合はmixed_precision='bf16'を指定してください。"
+        accelerator.print("enable full bf16 training.")
+        flux.to(weight_dtype)
+        controlnet.to(weight_dtype)
+        if clip_l is not None:
+            clip_l.to(weight_dtype)
+            t5xxl.to(weight_dtype)
+
+    # if we don't cache text encoder outputs, move them to device
+    if not args.cache_text_encoder_outputs:
+        clip_l.to(accelerator.device)
+        t5xxl.to(accelerator.device)
+
+    clean_memory_on_device(accelerator.device)
+
+    if args.deepspeed:
+        ds_model = deepspeed_utils.prepare_deepspeed_model(args, mmdit=controlnet)
+        # most ZeRO stages use optimizer partitioning, so we have to prepare the optimizer and ds_model at the same time. # pull/1139#issuecomment-1986790007
+        ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+            ds_model, optimizer, train_dataloader, lr_scheduler
+        )
+        training_models = [ds_model]
+
+    else:
+        # accelerator does some magic
+        # if we don't swap blocks, we can move the model to device
+        controlnet = accelerator.prepare(controlnet)  # , device_placement=[not is_swapping_blocks])
+        optimizer, train_dataloader, lr_scheduler = accelerator.prepare(optimizer, train_dataloader, lr_scheduler)
+
+    # experimental feature: train in fp16 including gradients; patch PyTorch to enable grad scaling in fp16
+    if args.full_fp16:
+        # During DeepSpeed training, accelerate does not handle fp16/bf16 mixed precision directly via its scaler; the DeepSpeed engine does.
+        # -> But we think it's ok to patch accelerator even if deepspeed is enabled.
+        train_util.patch_accelerator_for_fp16_training(accelerator)
+
+    # resume from a saved state if specified
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)
+
+    if args.fused_backward_pass:
+        # use fused optimizer for backward pass: other optimizers will be supported in the future
+        import library.adafactor_fused
+
+        library.adafactor_fused.patch_adafactor_fused(optimizer)
+
+        for param_group, param_name_group in zip(optimizer.param_groups, param_names):
+            for parameter, param_name in zip(param_group["params"], param_name_group):
+                if parameter.requires_grad:
+
+                    def create_grad_hook(p_name, p_group):
+                        def grad_hook(tensor: torch.Tensor):
+                            if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                                accelerator.clip_grad_norm_(tensor, args.max_grad_norm)
+                            optimizer.step_param(tensor, p_group)
+                            tensor.grad = None
+
+                        return grad_hook
+
+                    parameter.register_post_accumulate_grad_hook(create_grad_hook(param_name, param_group))
+
+    elif args.blockwise_fused_optimizers:
+        # prepare for additional optimizers and lr schedulers
+        for i in range(1, len(optimizers)):
+            optimizers[i] = accelerator.prepare(optimizers[i])
+            lr_schedulers[i] = accelerator.prepare(lr_schedulers[i])
+
+        # counters are used to determine when to step the optimizer
+        global optimizer_hooked_count
+        global num_parameters_per_group
+        global parameter_optimizer_map
+
+        optimizer_hooked_count = {}
+        num_parameters_per_group = [0] * len(optimizers)
+        parameter_optimizer_map = {}
+
+        for opt_idx, optimizer in enumerate(optimizers):
+            for param_group in optimizer.param_groups:
+                for parameter in param_group["params"]:
+                    if parameter.requires_grad:
+
+                        def grad_hook(parameter: torch.Tensor):
+                            if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                                accelerator.clip_grad_norm_(parameter, args.max_grad_norm)
+
+                            i = parameter_optimizer_map[parameter]
+                            optimizer_hooked_count[i] += 1
+                            if optimizer_hooked_count[i] == num_parameters_per_group[i]:
+                                optimizers[i].step()
+                                optimizers[i].zero_grad(set_to_none=True)
+
+                        parameter.register_post_accumulate_grad_hook(grad_hook)
+                        parameter_optimizer_map[parameter] = opt_idx
+                        num_parameters_per_group[opt_idx] += 1
+
+    # calculate the number of epochs
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
+    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
+    if (args.save_n_epoch_ratio is not None) and (args.save_n_epoch_ratio > 0):
+        args.save_every_n_epochs = math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
+
+    # train
+    # total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    accelerator.print("running training / 学習開始")
+    accelerator.print(f"  num examples / サンプル数: {train_dataset_group.num_train_images}")
+    accelerator.print(f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}")
+    accelerator.print(f"  num epochs / epoch数: {num_train_epochs}")
+    accelerator.print(
+        f"  batch size per device / バッチサイズ: {', '.join([str(d.batch_size) for d in train_dataset_group.datasets])}"
+    )
+    # accelerator.print(
+    #     f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}"
+    # )
+    accelerator.print(f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}")
+    accelerator.print(f"  total optimization steps / 学習ステップ数: {args.max_train_steps}")
+
+    progress_bar = tqdm(range(args.max_train_steps), smoothing=0, disable=not accelerator.is_local_main_process, desc="steps")
+    global_step = 0
+
+    noise_scheduler = FlowMatchEulerDiscreteScheduler(num_train_timesteps=1000, shift=args.discrete_flow_shift)
+    noise_scheduler_copy = copy.deepcopy(noise_scheduler)
+
+    if accelerator.is_main_process:
+        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs["wandb"] = {"name": args.wandb_run_name}
+        if args.log_tracker_config is not None:
+            init_kwargs = toml.load(args.log_tracker_config)
+        accelerator.init_trackers(
+            "finetuning" if args.log_tracker_name is None else args.log_tracker_name,
+            config=train_util.get_sanitized_config_or_none(args),
+            init_kwargs=init_kwargs,
+        )
+
+    if is_swapping_blocks:
+        flux.prepare_block_swap_before_forward()
+
+    # For --sample_at_first
+    optimizer_eval_fn()
+    flux_train_utils.sample_images(
+        accelerator, args, 0, global_step, flux, ae, [clip_l, t5xxl], sample_prompts_te_outputs, controlnet=controlnet
+    )
+    optimizer_train_fn()
+    if len(accelerator.trackers) > 0:
+        # log empty object to commit the sample images to wandb
+        accelerator.log({}, step=0)
+
+    loss_recorder = train_util.LossRecorder()
+    epoch = 0  # avoid error when max_train_steps is 0
+    for epoch in range(num_train_epochs):
+        accelerator.print(f"\nepoch {epoch+1}/{num_train_epochs}")
+        current_epoch.value = epoch + 1
+
+        for m in training_models:
+            m.train()
+
+        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
+
+            if args.blockwise_fused_optimizers:
+                optimizer_hooked_count = {i: 0 for i in range(len(optimizers))}  # reset counter for each step
+
+            with accelerator.accumulate(*training_models):
+                if "latents" in batch and batch["latents"] is not None:
+                    latents = batch["latents"].to(accelerator.device, dtype=weight_dtype)
+                else:
+                    with torch.no_grad():
+                        # encode images to latents. images are [-1, 1]
+                        latents = ae.encode(batch["images"].to(ae.dtype)).to(accelerator.device, dtype=weight_dtype)
+
+                    # if the latents contain NaN, print a warning and replace them with zeros
+                    if torch.any(torch.isnan(latents)):
+                        accelerator.print("NaN found in latents, replacing with zeros")
+                        latents = torch.nan_to_num(latents, 0, out=latents)
+
+                text_encoder_outputs_list = batch.get("text_encoder_outputs_list", None)
+                if text_encoder_outputs_list is not None:
+                    text_encoder_conds = text_encoder_outputs_list
+                else:
+                    # not cached or training, so get from text encoders
+                    tokens_and_masks = batch["input_ids_list"]
+                    with torch.no_grad():
+                        input_ids = [ids.to(accelerator.device) for ids in batch["input_ids_list"]]
+                        text_encoder_conds = text_encoding_strategy.encode_tokens(
+                            flux_tokenize_strategy, [clip_l, t5xxl], input_ids, args.apply_t5_attn_mask
+                        )
+                text_encoder_conds = [c.to(weight_dtype) for c in text_encoder_conds]
+
+                # TODO support some features for noise implemented in get_noise_noisy_latents_and_timesteps
+
+                # Sample noise that we'll add to the latents
+                noise = torch.randn_like(latents)
+                bsz = latents.shape[0]
+
+                # get noisy model input and timesteps
+                noisy_model_input, timesteps, sigmas = flux_train_utils.get_noisy_model_input_and_timesteps(
+                    args, noise_scheduler_copy, latents, noise, accelerator.device, weight_dtype
+                )
+
+                # pack latents and get img_ids
+                packed_noisy_model_input = flux_utils.pack_latents(noisy_model_input)  # b, c, h*2, w*2 -> b, h*w, c*4
+                packed_latent_height, packed_latent_width = noisy_model_input.shape[2] // 2, noisy_model_input.shape[3] // 2
+                img_ids = (
+                    flux_utils.prepare_img_ids(bsz, packed_latent_height, packed_latent_width)
+                    .to(device=accelerator.device)
+                    .to(weight_dtype)
+                )
+
+                # get guidance: ensure args.guidance_scale is float
+                guidance_vec = torch.full((bsz,), float(args.guidance_scale), device=accelerator.device, dtype=weight_dtype)
+
+                # call model
+                l_pooled, t5_out, txt_ids, t5_attn_mask = text_encoder_conds
+                if not args.apply_t5_attn_mask:
+                    t5_attn_mask = None
+
+                with accelerator.autocast():
+                    block_samples, block_single_samples = controlnet(
+                        img=packed_noisy_model_input,
+                        img_ids=img_ids,
+                        controlnet_cond=batch["conditioning_images"].to(accelerator.device).to(weight_dtype),
+                        txt=t5_out,
+                        txt_ids=txt_ids,
+                        y=l_pooled,
+                        timesteps=timesteps / 1000,
+                        guidance=guidance_vec,
+                        txt_attention_mask=t5_attn_mask,
+                    )
+                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
+                    model_pred = flux(
+                        img=packed_noisy_model_input,
+                        img_ids=img_ids,
+                        txt=t5_out,
+                        txt_ids=txt_ids,
+                        y=l_pooled,
+                        block_controlnet_hidden_states=block_samples,
+                        block_controlnet_single_hidden_states=block_single_samples,
+                        timesteps=timesteps / 1000,
+                        guidance=guidance_vec,
+                        txt_attention_mask=t5_attn_mask,
+                    )
+
+                # unpack latents
+                model_pred = flux_utils.unpack_latents(model_pred, packed_latent_height, packed_latent_width)
+
+                # apply model prediction type
+                model_pred, weighting = flux_train_utils.apply_model_prediction_type(args, model_pred, noisy_model_input, sigmas)
+
+                # flow matching loss: this is different from SD3
+                target = noise - latents
+
+                # calculate loss
+                loss = train_util.conditional_loss(
+                    model_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=None
+                )
+                if weighting is not None:
+                    loss = loss * weighting
+                if args.masked_loss or ("alpha_masks" in batch and batch["alpha_masks"] is not None):
+                    loss = apply_masked_loss(loss, batch)
+                loss = loss.mean([1, 2, 3])
+
+                loss_weights = batch["loss_weights"]  # weight for each sample
+                loss = loss * loss_weights
+                loss = loss.mean()
+
+                # backward
+                accelerator.backward(loss)
+
+                if not (args.fused_backward_pass or args.blockwise_fused_optimizers):
+                    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                        params_to_clip = []
+                        for m in training_models:
+                            params_to_clip.extend(m.parameters())
+                        accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
+
+                    optimizer.step()
+                    lr_scheduler.step()
+                    optimizer.zero_grad(set_to_none=True)
+                else:
+                    # optimizer.step() and optimizer.zero_grad() are called in the optimizer hook
+                    lr_scheduler.step()
+                    if args.blockwise_fused_optimizers:
+                        for i in range(1, len(optimizers)):
+                            lr_schedulers[i].step()
+
+            # Checks if the accelerator has performed an optimization step behind the scenes
+            if accelerator.sync_gradients:
+                progress_bar.update(1)
+                global_step += 1
+
+                optimizer_eval_fn()
+                flux_train_utils.sample_images(
+                    accelerator,
+                    args,
+                    None,
+                    global_step,
+                    flux,
+                    ae,
+                    [clip_l, t5xxl],
+                    sample_prompts_te_outputs,
+                    controlnet=controlnet,
+                )
+
+                # save the model every specified number of steps
+                if args.save_every_n_steps is not None and global_step % args.save_every_n_steps == 0:
+                    accelerator.wait_for_everyone()
+                    if accelerator.is_main_process:
+                        flux_train_utils.save_flux_model_on_epoch_end_or_stepwise(
+                            args,
+                            False,
+                            accelerator,
+                            save_dtype,
+                            epoch,
+                            num_train_epochs,
+                            global_step,
+                            accelerator.unwrap_model(controlnet),
+                        )
+                optimizer_train_fn()
+
+            current_loss = loss.detach().item()  # this is a mean, so the batch size should not matter
+            if len(accelerator.trackers) > 0:
+                logs = {"loss": current_loss}
+                train_util.append_lr_to_logs(logs, lr_scheduler, args.optimizer_type, including_unet=True)
+
+                accelerator.log(logs, step=global_step)
+
+            loss_recorder.add(epoch=epoch, step=step, loss=current_loss)
+            avr_loss: float = loss_recorder.moving_average
+            logs = {"avr_loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
+            progress_bar.set_postfix(**logs)
+
+            if global_step >= args.max_train_steps:
+                break
+
+        if len(accelerator.trackers) > 0:
+            logs = {"loss/epoch": loss_recorder.moving_average}
+            accelerator.log(logs, step=epoch + 1)
+
+        accelerator.wait_for_everyone()
+
+        optimizer_eval_fn()
+        if args.save_every_n_epochs is not None:
+            if accelerator.is_main_process:
+                flux_train_utils.save_flux_model_on_epoch_end_or_stepwise(
+                    args,
+                    True,
+                    accelerator,
+                    save_dtype,
+                    epoch,
+                    num_train_epochs,
+                    global_step,
+                    accelerator.unwrap_model(controlnet),
+                )
+
+        flux_train_utils.sample_images(
+            accelerator, args, epoch + 1, global_step, flux, ae, [clip_l, t5xxl], sample_prompts_te_outputs, controlnet=controlnet
+        )
+        optimizer_train_fn()
+
+    is_main_process = accelerator.is_main_process
+    # if is_main_process:
+    controlnet = accelerator.unwrap_model(controlnet)
+
+    accelerator.end_training()
+    optimizer_eval_fn()
+
+    if args.save_state or args.save_state_on_train_end:
+        train_util.save_state_on_train_end(args, accelerator)
+
+    del accelerator  # delete this because memory is needed afterwards
+
+    if is_main_process:
+        flux_train_utils.save_flux_model_on_train_end(args, save_dtype, epoch, global_step, controlnet)
+        logger.info("model saved.")
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+
+    add_logging_arguments(parser)
+    train_util.add_sd_models_arguments(parser)  # TODO split this
+    train_util.add_dataset_arguments(parser, False, True, True)
+    train_util.add_training_arguments(parser, False)
+    train_util.add_masked_loss_arguments(parser)
+    deepspeed_utils.add_deepspeed_arguments(parser)
+    train_util.add_sd_saving_arguments(parser)
+    train_util.add_optimizer_arguments(parser)
+    config_util.add_config_arguments(parser)
+    add_custom_train_arguments(parser)  # TODO remove this from here
+    train_util.add_dit_training_arguments(parser)
+    flux_train_utils.add_flux_train_arguments(parser)
+
+    parser.add_argument(
+        "--mem_eff_save",
+        action="store_true",
+        help="[EXPERIMENTAL] use memory efficient custom model saving method / メモリ効率の良い独自のモデル保存方法を使う",
+    )
+
+    parser.add_argument(
+        "--fused_optimizer_groups",
+        type=int,
+        default=None,
+        help="**this option is not working** will be removed in the future / このオプションは動作しません。将来削除されます",
+    )
+    parser.add_argument(
+        "--blockwise_fused_optimizers",
+        action="store_true",
+        help="enable blockwise optimizers for fused backward pass and optimizer step / fused backward passとoptimizer step のためブロック単位のoptimizerを有効にする",
+    )
+    parser.add_argument(
+        "--skip_latents_validity_check",
+        action="store_true",
+        help="[Deprecated] use 'skip_cache_check' instead / 代わりに 'skip_cache_check' を使用してください",
+    )
+    parser.add_argument(
+        "--double_blocks_to_swap",
+        type=int,
+        default=None,
+        help="[Deprecated] use 'blocks_to_swap' instead / 代わりに 'blocks_to_swap' を使用してください",
+    )
+    parser.add_argument(
+        "--single_blocks_to_swap",
+        type=int,
+        default=None,
+        help="[Deprecated] use 'blocks_to_swap' instead / 代わりに 'blocks_to_swap' を使用してください",
+    )
+    parser.add_argument(
+        "--cpu_offload_checkpointing",
+        action="store_true",
+        help="[EXPERIMENTAL] enable offloading of tensors to CPU during checkpointing / チェックポイント時にテンソルをCPUにオフロードする",
+    )
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    train_util.verify_command_line_training_args(args)
+    args = train_util.read_config_from_file(args, parser)
+
+    train(args)
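
Note on the `--blockwise_fused_optimizers` path in the training script above: parameters are bucketed by a `(block_type, block_index)` key derived from the parameter name, and one optimizer is created per bucket. The following is a minimal standalone sketch of that keying rule only, operating on plain strings rather than tensors. The helper names `block_key` and `group_names_by_block` and the sample parameter names are hypothetical illustrations, not part of the repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BlockKey = Tuple[str, int]


def block_key(param_name: str) -> BlockKey:
    """Map a parameter name to (block_type, block_index).

    Mirrors the rule in the grouping loop: names starting with
    "double_blocks" or "single_blocks" carry the block index as their
    second dot-separated field; everything else falls into ("other", -1).
    """
    for prefix, block_type in (("double_blocks", "double"), ("single_blocks", "single")):
        if param_name.startswith(prefix):
            return block_type, int(param_name.split(".")[1])
    return "other", -1


def group_names_by_block(param_names: Iterable[str]) -> Dict[BlockKey, List[str]]:
    """Bucket parameter names by block key, preserving first-seen order."""
    groups: Dict[BlockKey, List[str]] = defaultdict(list)
    for name in param_names:
        groups[block_key(name)].append(name)
    return dict(groups)


# Hypothetical parameter names, used only to exercise the keying rule.
names = [
    "double_blocks.0.img_attn.qkv.weight",
    "double_blocks.1.img_attn.qkv.weight",
    "single_blocks.0.linear1.weight",
    "img_in.weight",
]
groups = group_names_by_block(names)
for key, members in groups.items():
    print(key, len(members))
```

In the script each bucket becomes its own optimizer, and a per-parameter hook steps that optimizer once every parameter in the bucket has accumulated its gradient, so only one block's gradients need to be resident at a time.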
--- a/flux_train_network.py
+++ b/flux_train_network.py
@@ -0,0 +1,559 @@
+import argparse
+import copy
+import math
+import random
+from typing import Any, Optional, Union
+
+import torch
+from accelerate import Accelerator
+
+from library.device_utils import clean_memory_on_device, init_ipex
+
+init_ipex()
+
+import train_network
+from library import (
+    flux_models,
+    flux_train_utils,
+    flux_utils,
+    sd3_train_utils,
+    strategy_base,
+    strategy_flux,
+    train_util,
+)
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+class FluxNetworkTrainer(train_network.NetworkTrainer):
+    def __init__(self):
+        super().__init__()
+        self.sample_prompts_te_outputs = None
+        self.is_schnell: Optional[bool] = None
+        self.is_swapping_blocks: bool = False
+
+    def assert_extra_args(
+        self,
+        args,
+        train_dataset_group: Union[train_util.DatasetGroup, train_util.MinimalDataset],
+        val_dataset_group: Optional[train_util.DatasetGroup],
+    ):
+        super().assert_extra_args(args, train_dataset_group, val_dataset_group)
+        # sdxl_train_util.verify_sdxl_training_args(args)
+
+        if args.fp8_base_unet:
+            args.fp8_base = True  # if fp8_base_unet is enabled, fp8_base is also enabled for FLUX.1
+
+        if args.cache_text_encoder_outputs_to_disk and not args.cache_text_encoder_outputs:
+            logger.warning(
+                "cache_text_encoder_outputs_to_disk is enabled, so cache_text_encoder_outputs is also enabled / cache_text_encoder_outputs_to_diskが有効になっているため、cache_text_encoder_outputsも有効になります"
+            )
+            args.cache_text_encoder_outputs = True
+
+        if args.cache_text_encoder_outputs:
+            assert (
+                train_dataset_group.is_text_encoder_output_cacheable()
+            ), "when caching Text Encoder output, either caption_dropout_rate, shuffle_caption, token_warmup_step or caption_tag_dropout_rate cannot be used / Text Encoderの出力をキャッシュするときはcaption_dropout_rate, shuffle_caption, token_warmup_step, caption_tag_dropout_rateは使えません"
+
+        # prepare CLIP-L/T5XXL training flags
+        self.train_clip_l = not args.network_train_unet_only
+        self.train_t5xxl = False  # default is False even if args.network_train_unet_only is False
+
+        if args.max_token_length is not None:
+            logger.warning("max_token_length is not used in Flux training / max_token_lengthはFluxのトレーニングでは使用されません")
+
+        assert (
+            args.blocks_to_swap is None or args.blocks_to_swap == 0
+        ) or not args.cpu_offload_checkpointing, "blocks_to_swap is not supported with cpu_offload_checkpointing / blocks_to_swapはcpu_offload_checkpointingと併用できません"
+
+        # deprecated split_mode option
+        if args.split_mode:
+            if args.blocks_to_swap is not None:
+                logger.warning(
+                    "split_mode is deprecated. Because `--blocks_to_swap` is set, `--split_mode` is ignored."
+                    " / split_modeは非推奨です。`--blocks_to_swap`が設定されているため、`--split_mode`は無視されます。"
+                )
+            else:
+                logger.warning(
+                    "split_mode is deprecated. Please use `--blocks_to_swap` instead. `--blocks_to_swap 18` is automatically set."
+                    " / split_modeは非推奨です。代わりに`--blocks_to_swap`を使用してください。`--blocks_to_swap 18`が自動的に設定されました。"
+                )
+                args.blocks_to_swap = 18  # 18 is safe for most cases
+
+        train_dataset_group.verify_bucket_reso_steps(32)  # TODO check this
+        if val_dataset_group is not None:
+            val_dataset_group.verify_bucket_reso_steps(32)  # TODO check this
+
+    def load_target_model(self, args, weight_dtype, accelerator):
+        # currently offload to cpu for some models
+
+        # if the file is fp8 and we are using fp8_base, we can load it as is (fp8)
+        loading_dtype = None if args.fp8_base else weight_dtype
+
+        # if we load to cpu, flux.to(fp8) takes a long time, so we should load to gpu in future
+        self.is_schnell, model = flux_utils.load_flow_model(
+            args.pretrained_model_name_or_path, loading_dtype, "cpu", disable_mmap=args.disable_mmap_load_safetensors
+        )
+        if args.fp8_base:
+            # check dtype of model
+            if model.dtype == torch.float8_e4m3fnuz or model.dtype == torch.float8_e5m2 or model.dtype == torch.float8_e5m2fnuz:
+                raise ValueError(f"Unsupported fp8 model dtype: {model.dtype}")
+            elif model.dtype == torch.float8_e4m3fn:
+                logger.info("Loaded fp8 FLUX model")
+            else:
+                logger.info(
+                    "Cast FLUX model to fp8. This may take a while. You can reduce the time by using fp8 checkpoint."
+                    " / FLUXモデルをfp8に変換しています。これには時間がかかる場合があります。fp8チェックポイントを使用することで時間を短縮できます。"
+                )
+                model.to(torch.float8_e4m3fn)
+
+        # if args.split_mode:
+        #     model = self.prepare_split_model(model, weight_dtype, accelerator)
+
+        self.is_swapping_blocks = args.blocks_to_swap is not None and args.blocks_to_swap > 0
+        if self.is_swapping_blocks:
+            # Swap blocks between CPU and GPU to reduce memory usage, in forward and backward passes.
+            logger.info(f"enable block swap: blocks_to_swap={args.blocks_to_swap}")
+            model.enable_block_swap(args.blocks_to_swap, accelerator.device)
+
+        clip_l = flux_utils.load_clip_l(args.clip_l, weight_dtype, "cpu", disable_mmap=args.disable_mmap_load_safetensors)
+        clip_l.eval()
+
+        # if the file is fp8 and we are using fp8_base (not unet), we can load it as is (fp8)
+        if args.fp8_base and not args.fp8_base_unet:
+            loading_dtype = None  # as is
+        else:
+            loading_dtype = weight_dtype
+
+        # loading t5xxl to cpu takes a long time, so we should load to gpu in future
+        t5xxl = flux_utils.load_t5xxl(args.t5xxl, loading_dtype, "cpu", disable_mmap=args.disable_mmap_load_safetensors)
+        t5xxl.eval()
+        if args.fp8_base and not args.fp8_base_unet:
+            # check dtype of model
+            if t5xxl.dtype == torch.float8_e4m3fnuz or t5xxl.dtype == torch.float8_e5m2 or t5xxl.dtype == torch.float8_e5m2fnuz:
+                raise ValueError(f"Unsupported fp8 model dtype: {t5xxl.dtype}")
+            elif t5xxl.dtype == torch.float8_e4m3fn:
+                logger.info("Loaded fp8 T5XXL model")
+
+        ae = flux_utils.load_ae(args.ae, weight_dtype, "cpu", disable_mmap=args.disable_mmap_load_safetensors)
+
+        return flux_utils.MODEL_VERSION_FLUX_V1, [clip_l, t5xxl], ae, model
+
+    def get_tokenize_strategy(self, args):
+        _, is_schnell, _, _ = flux_utils.analyze_checkpoint_state(args.pretrained_model_name_or_path)
+
+        if args.t5xxl_max_token_length is None:
+            if is_schnell:
+                t5xxl_max_token_length = 256
+            else:
+                t5xxl_max_token_length = 512
+        else:
+            t5xxl_max_token_length = args.t5xxl_max_token_length
+
+        logger.info(f"t5xxl_max_token_length: {t5xxl_max_token_length}")
+        return strategy_flux.FluxTokenizeStrategy(t5xxl_max_token_length, args.tokenizer_cache_dir)
+
+    def get_tokenizers(self, tokenize_strategy: strategy_flux.FluxTokenizeStrategy):
+        return [tokenize_strategy.clip_l, tokenize_strategy.t5xxl]
+
+    def get_latents_caching_strategy(self, args):
+        latents_caching_strategy = strategy_flux.FluxLatentsCachingStrategy(args.cache_latents_to_disk, args.vae_batch_size, False)
+        return latents_caching_strategy
+
+    def get_text_encoding_strategy(self, args):
+        return strategy_flux.FluxTextEncodingStrategy(apply_t5_attn_mask=args.apply_t5_attn_mask)
+
+    def post_process_network(self, args, accelerator, network, text_encoders, unet):
+        # check t5xxl is trained or not
+        self.train_t5xxl = network.train_t5xxl
+
+        if self.train_t5xxl and args.cache_text_encoder_outputs:
+            raise ValueError(
+                "T5XXL is trained, so cache_text_encoder_outputs cannot be used / T5XXL学習時はcache_text_encoder_outputsは使用できません"
+            )
+
+    def get_models_for_text_encoding(self, args, accelerator, text_encoders):
+        if args.cache_text_encoder_outputs:
+            if self.train_clip_l and not self.train_t5xxl:
+                return text_encoders[0:1]  # only CLIP-L is needed for encoding because T5XXL is cached
+            else:
+                return None  # no text encoders are needed for encoding because both are cached
+        else:
+            return text_encoders  # both CLIP-L and T5XXL are needed for encoding
+
+    def get_text_encoders_train_flags(self, args, text_encoders):
+        return [self.train_clip_l, self.train_t5xxl]
+
+    def get_text_encoder_outputs_caching_strategy(self, args):
+        if args.cache_text_encoder_outputs:
+            # if any text encoder is trained, we need tokenization, so is_partial is True
+            return strategy_flux.FluxTextEncoderOutputsCachingStrategy(
+                args.cache_text_encoder_outputs_to_disk,
+                args.text_encoder_batch_size,
+                args.skip_cache_check,
+                is_partial=self.train_clip_l or self.train_t5xxl,
+                apply_t5_attn_mask=args.apply_t5_attn_mask,
+            )
+        else:
+            return None
+
+    def cache_text_encoder_outputs_if_needed(
+        self, args, accelerator: Accelerator, unet, vae, text_encoders, dataset: train_util.DatasetGroup, weight_dtype
+    ):
+        if args.cache_text_encoder_outputs:
+            if not args.lowram:
+                # reduce memory consumption
+                logger.info("move vae and unet to cpu to save memory")
+                org_vae_device = vae.device
+                org_unet_device = unet.device
+                vae.to("cpu")
+                unet.to("cpu")
+                clean_memory_on_device(accelerator.device)
+
+            # When the TE is not trained, it is not prepared by accelerator, so we need to use explicit autocast
+            logger.info("move text encoders to gpu")
+            text_encoders[0].to(accelerator.device, dtype=weight_dtype)  # always not fp8
+            text_encoders[1].to(accelerator.device)
+
+            if text_encoders[1].dtype == torch.float8_e4m3fn:
+                # if we load fp8 weights, the model is already fp8, so we use it as is
+                self.prepare_text_encoder_fp8(1, text_encoders[1], text_encoders[1].dtype, weight_dtype)
+            else:
+                # otherwise, we need to convert it to target dtype
+                text_encoders[1].to(weight_dtype)
+
+            with accelerator.autocast():
+                dataset.new_cache_text_encoder_outputs(text_encoders, accelerator)
+
+            # cache sample prompts
+            if args.sample_prompts is not None:
+                logger.info(f"cache Text Encoder outputs for sample prompt: {args.sample_prompts}")
+
+                tokenize_strategy: strategy_flux.FluxTokenizeStrategy = strategy_base.TokenizeStrategy.get_strategy()
+                text_encoding_strategy: strategy_flux.FluxTextEncodingStrategy = strategy_base.TextEncodingStrategy.get_strategy()
+
+                prompts = train_util.load_prompts(args.sample_prompts)
+                sample_prompts_te_outputs = {}  # key: prompt, value: text encoder outputs
+                with accelerator.autocast(), torch.no_grad():
+                    for prompt_dict in prompts:
+                        for p in [prompt_dict.get("prompt", ""), prompt_dict.get("negative_prompt", "")]:
+                            if p not in sample_prompts_te_outputs:
+                                logger.info(f"cache Text Encoder outputs for prompt: {p}")
+                                tokens_and_masks = tokenize_strategy.tokenize(p)
+                                sample_prompts_te_outputs[p] = text_encoding_strategy.encode_tokens(
+                                    tokenize_strategy, text_encoders, tokens_and_masks, args.apply_t5_attn_mask
+                                )
+                self.sample_prompts_te_outputs = sample_prompts_te_outputs
+
+            accelerator.wait_for_everyone()
+
+            # move back to cpu
+            if not self.is_train_text_encoder(args):
+                logger.info("move CLIP-L back to cpu")
+                text_encoders[0].to("cpu")
+            logger.info("move t5XXL back to cpu")
+            text_encoders[1].to("cpu")
+            clean_memory_on_device(accelerator.device)
+
+            if not args.lowram:
+                logger.info("move vae and unet back to original device")
+                vae.to(org_vae_device)
+                unet.to(org_unet_device)
+        else:
+            # outputs are fetched from the Text Encoders at every step, so keep them on the GPU
+            text_encoders[0].to(accelerator.device, dtype=weight_dtype)
+            text_encoders[1].to(accelerator.device)
+
+    # def call_unet(self, args, accelerator, unet, noisy_latents, timesteps, text_conds, batch, weight_dtype):
+    #     noisy_latents = noisy_latents.to(weight_dtype)  # TODO check why noisy_latents is not weight_dtype
+
+    #     # get size embeddings
+    #     orig_size = batch["original_sizes_hw"]
+    #     crop_size = batch["crop_top_lefts"]
+    #     target_size = batch["target_sizes_hw"]
+    #     embs = sdxl_train_util.get_size_embeddings(orig_size, crop_size, target_size, accelerator.device).to(weight_dtype)
+
+    #     # concat embeddings
+    #     encoder_hidden_states1, encoder_hidden_states2, pool2 = text_conds
+    #     vector_embedding = torch.cat([pool2, embs], dim=1).to(weight_dtype)
+    #     text_embedding = torch.cat([encoder_hidden_states1, encoder_hidden_states2], dim=2).to(weight_dtype)
+
+    #     noise_pred = unet(noisy_latents, timesteps, text_embedding, vector_embedding)
+    #     return noise_pred
+
+    def sample_images(self, accelerator, args, epoch, global_step, device, ae, tokenizer, text_encoder, flux):
+        text_encoders = text_encoder  # for compatibility
+        text_encoders = self.get_models_for_text_encoding(args, accelerator, text_encoders)
+
+        flux_train_utils.sample_images(
+            accelerator, args, epoch, global_step, flux, ae, text_encoders, self.sample_prompts_te_outputs
+        )
+        # return
+
+        """
+        class FluxUpperLowerWrapper(torch.nn.Module):
+            def __init__(self, flux_upper: flux_models.FluxUpper, flux_lower: flux_models.FluxLower, device: torch.device):
+                super().__init__()
+                self.flux_upper = flux_upper
+                self.flux_lower = flux_lower
+                self.target_device = device
+
+            def prepare_block_swap_before_forward(self):
+                pass
+
+            def forward(self, img, img_ids, txt, txt_ids, timesteps, y, guidance=None, txt_attention_mask=None):
+                self.flux_lower.to("cpu")
+                clean_memory_on_device(self.target_device)
+                self.flux_upper.to(self.target_device)
+                img, txt, vec, pe = self.flux_upper(img, img_ids, txt, txt_ids, timesteps, y, guidance, txt_attention_mask)
+                self.flux_upper.to("cpu")
+                clean_memory_on_device(self.target_device)
+                self.flux_lower.to(self.target_device)
+                return self.flux_lower(img, txt, vec, pe, txt_attention_mask)
+
+        wrapper = FluxUpperLowerWrapper(self.flux_upper, flux, accelerator.device)
+        clean_memory_on_device(accelerator.device)
+        flux_train_utils.sample_images(
+            accelerator, args, epoch, global_step, wrapper, ae, text_encoders, self.sample_prompts_te_outputs
+        )
+        clean_memory_on_device(accelerator.device)
+        """
+
+    def get_noise_scheduler(self, args: argparse.Namespace, device: torch.device) -> Any:
+        noise_scheduler = sd3_train_utils.FlowMatchEulerDiscreteScheduler(num_train_timesteps=1000, shift=args.discrete_flow_shift)
+        self.noise_scheduler_copy = copy.deepcopy(noise_scheduler)
+        return noise_scheduler
+
+    def encode_images_to_latents(self, args, vae, images):
+        return vae.encode(images)
+
+    def shift_scale_latents(self, args, latents):
+        return latents
+
+    def get_noise_pred_and_target(
+        self,
+        args,
+        accelerator,
+        noise_scheduler,
+        latents,
+        batch,
+        text_encoder_conds,
+        unet: flux_models.Flux,
+        network,
+        weight_dtype,
+        train_unet,
+        is_train=True,
+    ):
+        # Sample noise that we'll add to the latents
+        noise = torch.randn_like(latents)
+        bsz = latents.shape[0]
+
+        # get noisy model input and timesteps
+        noisy_model_input, timesteps, sigmas = flux_train_utils.get_noisy_model_input_and_timesteps(
+            args, noise_scheduler, latents, noise, accelerator.device, weight_dtype
+        )
+
+        # pack latents and get img_ids
+        packed_noisy_model_input = flux_utils.pack_latents(noisy_model_input)  # b, c, h*2, w*2 -> b, h*w, c*4
+        packed_latent_height, packed_latent_width = noisy_model_input.shape[2] // 2, noisy_model_input.shape[3] // 2
+        img_ids = flux_utils.prepare_img_ids(bsz, packed_latent_height, packed_latent_width).to(device=accelerator.device)
+
+        # get guidance
+        # ensure guidance_scale in args is float
+        guidance_vec = torch.full((bsz,), float(args.guidance_scale), device=accelerator.device)
+
+        # ensure the hidden state will require grad
+        if args.gradient_checkpointing:
+            noisy_model_input.requires_grad_(True)
+            for t in text_encoder_conds:
+                if t is not None and t.dtype.is_floating_point:
+                    t.requires_grad_(True)
+            img_ids.requires_grad_(True)
+            guidance_vec.requires_grad_(True)
+
+        # Predict the noise residual
+        l_pooled, t5_out, txt_ids, t5_attn_mask = text_encoder_conds
+        if not args.apply_t5_attn_mask:
+            t5_attn_mask = None
+
+        def call_dit(img, img_ids, t5_out, txt_ids, l_pooled, timesteps, guidance_vec, t5_attn_mask):
+            # grad is enabled even if unet is not in train mode, because Text Encoder is in train mode
+            with torch.set_grad_enabled(is_train), accelerator.autocast():
+                # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
+                model_pred = unet(
+                    img=img,
+                    img_ids=img_ids,
+                    txt=t5_out,
+                    txt_ids=txt_ids,
+                    y=l_pooled,
+                    timesteps=timesteps / 1000,
+                    guidance=guidance_vec,
+                    txt_attention_mask=t5_attn_mask,
+                )
+            return model_pred
+
+        model_pred = call_dit(
+            img=packed_noisy_model_input,
+            img_ids=img_ids,
+            t5_out=t5_out,
+            txt_ids=txt_ids,
+            l_pooled=l_pooled,
+            timesteps=timesteps,
+            guidance_vec=guidance_vec,
+            t5_attn_mask=t5_attn_mask,
+        )
+
+        # unpack latents
+        model_pred = flux_utils.unpack_latents(model_pred, packed_latent_height, packed_latent_width)
+
+        # apply model prediction type
+        model_pred, weighting = flux_train_utils.apply_model_prediction_type(args, model_pred, noisy_model_input, sigmas)
+
+        # flow matching loss: this is different from SD3
+        target = noise - latents
+
+        # differential output preservation
+        if "custom_attributes" in batch:
+            diff_output_pr_indices = []
+            for i, custom_attributes in enumerate(batch["custom_attributes"]):
+                if "diff_output_preservation" in custom_attributes and custom_attributes["diff_output_preservation"]:
+                    diff_output_pr_indices.append(i)
+
+            if len(diff_output_pr_indices) > 0:
+                network.set_multiplier(0.0)
+                unet.prepare_block_swap_before_forward()
+                with torch.no_grad():
+                    model_pred_prior = call_dit(
+                        img=packed_noisy_model_input[diff_output_pr_indices],
+                        img_ids=img_ids[diff_output_pr_indices],
+                        t5_out=t5_out[diff_output_pr_indices],
+                        txt_ids=txt_ids[diff_output_pr_indices],
+                        l_pooled=l_pooled[diff_output_pr_indices],
+                        timesteps=timesteps[diff_output_pr_indices],
+                        guidance_vec=guidance_vec[diff_output_pr_indices] if guidance_vec is not None else None,
+                        t5_attn_mask=t5_attn_mask[diff_output_pr_indices] if t5_attn_mask is not None else None,
+                    )
+                network.set_multiplier(1.0)  # may be overwritten by "network_multipliers" in the next step
+
+                model_pred_prior = flux_utils.unpack_latents(model_pred_prior, packed_latent_height, packed_latent_width)
+                model_pred_prior, _ = flux_train_utils.apply_model_prediction_type(
+                    args,
+                    model_pred_prior,
+                    noisy_model_input[diff_output_pr_indices],
+                    sigmas[diff_output_pr_indices] if sigmas is not None else None,
+                )
+                target[diff_output_pr_indices] = model_pred_prior.to(target.dtype)
+
+        return model_pred, target, timesteps, weighting
+
+    def post_process_loss(self, loss, args, timesteps, noise_scheduler):
+        return loss
+
+    def get_sai_model_spec(self, args):
+        return train_util.get_sai_model_spec(None, args, False, True, False, flux="dev")
+
+    def update_metadata(self, metadata, args):
+        metadata["ss_apply_t5_attn_mask"] = args.apply_t5_attn_mask
+        metadata["ss_weighting_scheme"] = args.weighting_scheme
+        metadata["ss_logit_mean"] = args.logit_mean
+        metadata["ss_logit_std"] = args.logit_std
+        metadata["ss_mode_scale"] = args.mode_scale
+        metadata["ss_guidance_scale"] = args.guidance_scale
+        metadata["ss_timestep_sampling"] = args.timestep_sampling
+        metadata["ss_sigmoid_scale"] = args.sigmoid_scale
+        metadata["ss_model_prediction_type"] = args.model_prediction_type
+        metadata["ss_discrete_flow_shift"] = args.discrete_flow_shift
+
+    def is_text_encoder_not_needed_for_training(self, args):
+        return args.cache_text_encoder_outputs and not self.is_train_text_encoder(args)
+
+    def prepare_text_encoder_grad_ckpt_workaround(self, index, text_encoder):
+        if index == 0:  # CLIP-L
+            return super().prepare_text_encoder_grad_ckpt_workaround(index, text_encoder)
+        else:  # T5XXL
+            text_encoder.encoder.embed_tokens.requires_grad_(True)
+
+    def prepare_text_encoder_fp8(self, index, text_encoder, te_weight_dtype, weight_dtype):
+        if index == 0:  # CLIP-L
+            logger.info(f"prepare CLIP-L for fp8: set to {te_weight_dtype}, set embeddings to {weight_dtype}")
+            text_encoder.to(te_weight_dtype)  # fp8
+            text_encoder.text_model.embeddings.to(dtype=weight_dtype)
+        else:  # T5XXL
+
+            def prepare_fp8(text_encoder, target_dtype):
+                def forward_hook(module):
+                    def forward(hidden_states):
+                        hidden_gelu = module.act(module.wi_0(hidden_states))
+                        hidden_linear = module.wi_1(hidden_states)
+                        hidden_states = hidden_gelu * hidden_linear
+                        hidden_states = module.dropout(hidden_states)
+
+                        hidden_states = module.wo(hidden_states)
+                        return hidden_states
+
+                    return forward
+
+                for module in text_encoder.modules():
+                    if module.__class__.__name__ in ["T5LayerNorm", "Embedding"]:
+                        # print("set", module.__class__.__name__, "to", target_dtype)
+                        module.to(target_dtype)
+                    if module.__class__.__name__ in ["T5DenseGatedActDense"]:
+                        # print("set", module.__class__.__name__, "hooks")
+                        module.forward = forward_hook(module)
+
+            if flux_utils.get_t5xxl_actual_dtype(text_encoder) == torch.float8_e4m3fn and text_encoder.dtype == weight_dtype:
+                logger.info("T5XXL already prepared for fp8")
+            else:
+                logger.info(f"prepare T5XXL for fp8: set to {te_weight_dtype}, set embeddings to {weight_dtype}, add hooks")
+                text_encoder.to(te_weight_dtype)  # fp8
+                prepare_fp8(text_encoder, weight_dtype)
+
+    def on_validation_step_end(self, args, accelerator, network, text_encoders, unet, batch, weight_dtype):
+        if self.is_swapping_blocks:
+            # prepare for next forward: because backward pass is not called, we need to prepare it here
+            accelerator.unwrap_model(unet).prepare_block_swap_before_forward()
+
+    def prepare_unet_with_accelerator(
+        self, args: argparse.Namespace, accelerator: Accelerator, unet: torch.nn.Module
+    ) -> torch.nn.Module:
+        if not self.is_swapping_blocks:
+            return super().prepare_unet_with_accelerator(args, accelerator, unet)
+
+        # when swapping blocks, prepare the model without device placement, then move everything except the swapped blocks to the device
+        flux: flux_models.Flux = unet
+        flux = accelerator.prepare(flux, device_placement=[not self.is_swapping_blocks])
+        accelerator.unwrap_model(flux).move_to_device_except_swap_blocks(accelerator.device)  # reduce peak memory usage
+        accelerator.unwrap_model(flux).prepare_block_swap_before_forward()
+
+        return flux
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = train_network.setup_parser()
+    train_util.add_dit_training_arguments(parser)
+    flux_train_utils.add_flux_train_arguments(parser)
+
+    parser.add_argument(
+        "--split_mode",
+        action="store_true",
+        # help="[EXPERIMENTAL] use split mode for Flux model, network arg `train_blocks=single` is required"
+        # + "/[実験的] Fluxモデルの分割モードを使用する。ネットワーク引数`train_blocks=single`が必要",
+        help="[Deprecated] This option is deprecated. Please use `--blocks_to_swap` instead."
+        " / このオプションは非推奨です。代わりに`--blocks_to_swap`を使用してください。",
+    )
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    train_util.verify_command_line_training_args(args)
+    args = train_util.read_config_from_file(args, parser)
+
+    trainer = FluxNetworkTrainer()
+    trainer.train(args)
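
The flow matching target in `get_noise_pred_and_target` above is `noise - latents`. A minimal sketch of why that target fits a linear noising path, assuming the noisy input is built as `(1 - sigma) * latents + sigma * noise` (`get_noisy_model_input_and_timesteps` is not shown in this diff, so that formula is an assumption). The helper names `interpolate` and `flow_target` are hypothetical and operate on plain Python lists rather than tensors:

```python
def interpolate(latents, noise, sigma):
    """Assumed noising path: a straight line from latents (sigma=0) to noise (sigma=1)."""
    return [(1.0 - sigma) * x + sigma * n for x, n in zip(latents, noise)]


def flow_target(latents, noise):
    """Velocity of the straight path with respect to sigma, i.e. noise - latents."""
    return [n - x for x, n in zip(latents, noise)]


latents = [0.5, -1.0, 2.0]
noise = [1.5, 0.25, -0.75]
sigma = 0.3

target = flow_target(latents, noise)
noisy = interpolate(latents, noise, sigma)

# Stepping back along the velocity by sigma lands on the clean latents again.
recovered = [xt - sigma * v for xt, v in zip(noisy, target)]
```

Under this assumption the target does not depend on `sigma`, which is consistent with the trainer computing it once from `noise` and `latents` without using `timesteps`.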
--- a/gen_img.py
+++ b/gen_img.py
--- a/gen_img_diffusers.py
+++ b/gen_img_diffusers.py
--- a/library/adafactor_fused.py
+++ b/library/adafactor_fused.py
@@ -0,0 +1,138 @@
+import math
+import torch
+from transformers import Adafactor
+
+# stochastic rounding for bfloat16
+# The implementation was provided by 2kpr. Thank you very much!
+
+def copy_stochastic_(target: torch.Tensor, source: torch.Tensor):
+    """
+    copies source into target using stochastic rounding
+
+    Args:
+        target: the target tensor with dtype=bfloat16
+        source: the source tensor with dtype=float32
+    """
+    # create a random 16 bit integer
+    result = torch.randint_like(source, dtype=torch.int32, low=0, high=(1 << 16))
+
+    # add the random number to the lower 16 bit of the mantissa
+    result.add_(source.view(dtype=torch.int32))
+
+    # mask off the lower 16 bit of the mantissa
+    result.bitwise_and_(-65536)  # -65536 = FFFF0000 as a signed int32
+
+    # copy the higher 16 bit into the target tensor
+    target.copy_(result.view(dtype=torch.float32))
+
+    del result
+
+
+@torch.no_grad()
+def adafactor_step_param(self, p, group):
+    if p.grad is None:
+        return
+    grad = p.grad
+    if grad.dtype in {torch.float16, torch.bfloat16}:
+        grad = grad.float()
+    if grad.is_sparse:
+        raise RuntimeError("Adafactor does not support sparse gradients.")
+
+    state = self.state[p]
+    grad_shape = grad.shape
+
+    factored, use_first_moment = Adafactor._get_options(group, grad_shape)
+    # State Initialization
+    if len(state) == 0:
+        state["step"] = 0
+
+        if use_first_moment:
+            # Exponential moving average of gradient values
+            state["exp_avg"] = torch.zeros_like(grad)
+        if factored:
+            state["exp_avg_sq_row"] = torch.zeros(grad_shape[:-1]).to(grad)
+            state["exp_avg_sq_col"] = torch.zeros(grad_shape[:-2] + grad_shape[-1:]).to(grad)
+        else:
+            state["exp_avg_sq"] = torch.zeros_like(grad)
+
+        state["RMS"] = 0
+    else:
+        if use_first_moment:
+            state["exp_avg"] = state["exp_avg"].to(grad)
+        if factored:
+            state["exp_avg_sq_row"] = state["exp_avg_sq_row"].to(grad)
+            state["exp_avg_sq_col"] = state["exp_avg_sq_col"].to(grad)
+        else:
+            state["exp_avg_sq"] = state["exp_avg_sq"].to(grad)
+
+    p_data_fp32 = p
+    if p.dtype in {torch.float16, torch.bfloat16}:
+        p_data_fp32 = p_data_fp32.float()
+
+    state["step"] += 1
+    state["RMS"] = Adafactor._rms(p_data_fp32)
+    lr = Adafactor._get_lr(group, state)
+
+    beta2t = 1.0 - math.pow(state["step"], group["decay_rate"])
+    update = (grad**2) + group["eps"][0]
+    if factored:
+        exp_avg_sq_row = state["exp_avg_sq_row"]
+        exp_avg_sq_col = state["exp_avg_sq_col"]
+
+        exp_avg_sq_row.mul_(beta2t).add_(update.mean(dim=-1), alpha=(1.0 - beta2t))
+        exp_avg_sq_col.mul_(beta2t).add_(update.mean(dim=-2), alpha=(1.0 - beta2t))
+
+        # Approximation of exponential moving average of square of gradient
+        update = Adafactor._approx_sq_grad(exp_avg_sq_row, exp_avg_sq_col)
+        update.mul_(grad)
+    else:
+        exp_avg_sq = state["exp_avg_sq"]
+
+        exp_avg_sq.mul_(beta2t).add_(update, alpha=(1.0 - beta2t))
+        update = exp_avg_sq.rsqrt().mul_(grad)
+
+    update.div_((Adafactor._rms(update) / group["clip_threshold"]).clamp_(min=1.0))
+    update.mul_(lr)
+
+    if use_first_moment:
+        exp_avg = state["exp_avg"]
+        exp_avg.mul_(group["beta1"]).add_(update, alpha=(1 - group["beta1"]))
+        update = exp_avg
+
+    if group["weight_decay"] != 0:
+        p_data_fp32.add_(p_data_fp32, alpha=(-group["weight_decay"] * lr))
+
+    p_data_fp32.add_(-update)
+
+    # if p.dtype in {torch.float16, torch.bfloat16}:
+    #    p.copy_(p_data_fp32)
+
+    if p.dtype == torch.bfloat16:
+        copy_stochastic_(p, p_data_fp32)
+    elif p.dtype == torch.float16:
+        p.copy_(p_data_fp32)
+
+
+@torch.no_grad()
+def adafactor_step(self, closure=None):
+    """
+    Performs a single optimization step
+
+    Arguments:
+        closure (callable, optional): A closure that reevaluates the model
+            and returns the loss.
+    """
+    loss = None
+    if closure is not None:
+        loss = closure()
+
+    for group in self.param_groups:
+        for p in group["params"]:
+            adafactor_step_param(self, p, group)
+
+    return loss
+
+
+def patch_adafactor_fused(optimizer: Adafactor):
+    optimizer.step_param = adafactor_step_param.__get__(optimizer)
+    optimizer.step = adafactor_step.__get__(optimizer)
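
The bit trick in `copy_stochastic_` above can be illustrated on a single scalar. This is a minimal stdlib-only sketch, not the repository's implementation: the helper names `float32_bits`, `bits_float32` and `stochastic_round_bf16` are hypothetical, and `struct` stands in for the tensor `view(dtype=torch.int32)` reinterpretation. Adding a random 16-bit integer to the float32 bit pattern and then clearing the low 16 bits rounds up in magnitude with probability equal to the discarded fraction, so the result is unbiased on average:

```python
import random
import struct


def float32_bits(x: float) -> int:
    """Bit pattern of x (rounded to float32) as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_float32(bits: int) -> float:
    """Inverse of float32_bits."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def stochastic_round_bf16(x: float, rng: random.Random) -> float:
    """Round x to a bfloat16-representable value using stochastic rounding."""
    bits = float32_bits(x)
    # A random value in [0, 2**16) carries into the upper 16 bits with
    # probability (low 16 bits) / 2**16.
    bits = (bits + rng.randrange(1 << 16)) & 0xFFFFFFFF
    # Keep only the upper 16 bits, which is the bfloat16 payload.
    bits &= 0xFFFF0000
    return bits_float32(bits)


rng = random.Random(0)
x = 1.0 + 2.0 ** -9  # a quarter of the way between bfloat16 neighbours 1.0 and 1.0078125
samples = [stochastic_round_bf16(x, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Each sample is one of the two neighbouring bfloat16 values, while the mean stays close to `x`. Plain round-to-nearest would always return 1.0 here and silently drop small optimizer updates.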
--- a/library/attention_processors.py
+++ b/library/attention_processors.py
@@ -0,0 +1,227 @@
+import math
+from typing import Any
+from einops import rearrange
+import torch
+from diffusers.models.attention_processor import Attention
+
+
+# flash attention forwards and backwards
+
+# https://arxiv.org/abs/2205.14135
+
+EPSILON = 1e-6
+
+
+class FlashAttentionFunction(torch.autograd.function.Function):
+    @staticmethod
+    @torch.no_grad()
+    def forward(ctx, q, k, v, mask, causal, q_bucket_size, k_bucket_size):
+        """Algorithm 2 in the paper"""
+
+        device = q.device
+        dtype = q.dtype
+        max_neg_value = -torch.finfo(q.dtype).max
+        qk_len_diff = max(k.shape[-2] - q.shape[-2], 0)
+
+        o = torch.zeros_like(q)
+        all_row_sums = torch.zeros((*q.shape[:-1], 1), dtype=dtype, device=device)
+        all_row_maxes = torch.full(
+            (*q.shape[:-1], 1), max_neg_value, dtype=dtype, device=device
+        )
+
+        scale = q.shape[-1] ** -0.5
+
+        if mask is None:
+            mask = (None,) * math.ceil(q.shape[-2] / q_bucket_size)
+        else:
+            mask = rearrange(mask, "b n -> b 1 1 n")
+            mask = mask.split(q_bucket_size, dim=-1)
+
+        row_splits = zip(
+            q.split(q_bucket_size, dim=-2),
+            o.split(q_bucket_size, dim=-2),
+            mask,
+            all_row_sums.split(q_bucket_size, dim=-2),
+            all_row_maxes.split(q_bucket_size, dim=-2),
+        )
+
+        for ind, (qc, oc, row_mask, row_sums, row_maxes) in enumerate(row_splits):
+            q_start_index = ind * q_bucket_size - qk_len_diff
+
+            col_splits = zip(
+                k.split(k_bucket_size, dim=-2),
+                v.split(k_bucket_size, dim=-2),
+            )
+
+            for k_ind, (kc, vc) in enumerate(col_splits):
+                k_start_index = k_ind * k_bucket_size
+
+                attn_weights = (
+                    torch.einsum("... i d, ... j d -> ... i j", qc, kc) * scale
+                )
+
+                if row_mask is not None:
+                    attn_weights.masked_fill_(~row_mask, max_neg_value)
+
+                if causal and q_start_index < (k_start_index + k_bucket_size - 1):
+                    causal_mask = torch.ones(
+                        (qc.shape[-2], kc.shape[-2]), dtype=torch.bool, device=device
+                    ).triu(q_start_index - k_start_index + 1)
+                    attn_weights.masked_fill_(causal_mask, max_neg_value)
+
+                block_row_maxes = attn_weights.amax(dim=-1, keepdims=True)
+                attn_weights -= block_row_maxes
+                exp_weights = torch.exp(attn_weights)
+
+                if row_mask is not None:
+                    exp_weights.masked_fill_(~row_mask, 0.0)
+
+                block_row_sums = exp_weights.sum(dim=-1, keepdims=True).clamp(
+                    min=EPSILON
+                )
+
+                new_row_maxes = torch.maximum(block_row_maxes, row_maxes)
+
+                exp_values = torch.einsum(
+                    "... i j, ... j d -> ... i d", exp_weights, vc
+                )
+
+                exp_row_max_diff = torch.exp(row_maxes - new_row_maxes)
+                exp_block_row_max_diff = torch.exp(block_row_maxes - new_row_maxes)
+
+                new_row_sums = (
+                    exp_row_max_diff * row_sums
+                    + exp_block_row_max_diff * block_row_sums
+                )
+
+                oc.mul_((row_sums / new_row_sums) * exp_row_max_diff).add_(
+                    (exp_block_row_max_diff / new_row_sums) * exp_values
+                )
+
+                row_maxes.copy_(new_row_maxes)
+                row_sums.copy_(new_row_sums)
+
+        ctx.args = (causal, scale, mask, q_bucket_size, k_bucket_size)
+        ctx.save_for_backward(q, k, v, o, all_row_sums, all_row_maxes)
+
+        return o
+
+    @staticmethod
+    @torch.no_grad()
+    def backward(ctx, do):
+        """Algorithm 4 in the paper"""
+
+        causal, scale, mask, q_bucket_size, k_bucket_size = ctx.args
+        q, k, v, o, l, m = ctx.saved_tensors
+
+        device = q.device
+
+        max_neg_value = -torch.finfo(q.dtype).max
+        qk_len_diff = max(k.shape[-2] - q.shape[-2], 0)
+
+        dq = torch.zeros_like(q)
+        dk = torch.zeros_like(k)
+        dv = torch.zeros_like(v)
+
+        row_splits = zip(
+            q.split(q_bucket_size, dim=-2),
+            o.split(q_bucket_size, dim=-2),
+            do.split(q_bucket_size, dim=-2),
+            mask,
+            l.split(q_bucket_size, dim=-2),
+            m.split(q_bucket_size, dim=-2),
+            dq.split(q_bucket_size, dim=-2),
+        )
+
+        for ind, (qc, oc, doc, row_mask, lc, mc, dqc) in enumerate(row_splits):
+            q_start_index = ind * q_bucket_size - qk_len_diff
+
+            col_splits = zip(
+                k.split(k_bucket_size, dim=-2),
+                v.split(k_bucket_size, dim=-2),
+                dk.split(k_bucket_size, dim=-2),
+                dv.split(k_bucket_size, dim=-2),
+            )
+
+            for k_ind, (kc, vc, dkc, dvc) in enumerate(col_splits):
+                k_start_index = k_ind * k_bucket_size
+
+                attn_weights = (
+                    torch.einsum("... i d, ... j d -> ... i j", qc, kc) * scale
+                )
+
+                if causal and q_start_index < (k_start_index + k_bucket_size - 1):
+                    causal_mask = torch.ones(
+                        (qc.shape[-2], kc.shape[-2]), dtype=torch.bool, device=device
+                    ).triu(q_start_index - k_start_index + 1)
+                    attn_weights.masked_fill_(causal_mask, max_neg_value)
+
+                exp_attn_weights = torch.exp(attn_weights - mc)
+
+                if row_mask is not None:
+                    exp_attn_weights.masked_fill_(~row_mask, 0.0)
+
+                p = exp_attn_weights / lc
+
+                dv_chunk = torch.einsum("... i j, ... i d -> ... j d", p, doc)
+                dp = torch.einsum("... i d, ... j d -> ... i j", doc, vc)
+
+                D = (doc * oc).sum(dim=-1, keepdims=True)
+                ds = p * scale * (dp - D)
+
+                dq_chunk = torch.einsum("... i j, ... j d -> ... i d", ds, kc)
+                dk_chunk = torch.einsum("... i j, ... i d -> ... j d", ds, qc)
+
+                dqc.add_(dq_chunk)
+                dkc.add_(dk_chunk)
+                dvc.add_(dv_chunk)
+
+        return dq, dk, dv, None, None, None, None
+
+
+class FlashAttnProcessor:
+    def __call__(
+        self,
+        attn: Attention,
+        hidden_states,
+        encoder_hidden_states=None,
+        attention_mask=None,
+    ) -> Any:
+        q_bucket_size = 512
+        k_bucket_size = 1024
+
+        h = attn.heads
+        q = attn.to_q(hidden_states)
+
+        encoder_hidden_states = (
+            encoder_hidden_states
+            if encoder_hidden_states is not None
+            else hidden_states
+        )
+        encoder_hidden_states = encoder_hidden_states.to(hidden_states.dtype)
+
+        if hasattr(attn, "hypernetwork") and attn.hypernetwork is not None:
+            context_k, context_v = attn.hypernetwork.forward(
+                hidden_states, encoder_hidden_states
+            )
+            context_k = context_k.to(hidden_states.dtype)
+            context_v = context_v.to(hidden_states.dtype)
+        else:
+            context_k = encoder_hidden_states
+            context_v = encoder_hidden_states
+
+        k = attn.to_k(context_k)
+        v = attn.to_v(context_v)
+        del encoder_hidden_states, hidden_states
+
+        q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))
+
+        out = FlashAttentionFunction.apply(
+            q, k, v, attention_mask, False, q_bucket_size, k_bucket_size
+        )
+
+        out = rearrange(out, "b h n d -> b n (h d)")
+
+        out = attn.to_out[0](out)
+        out = attn.to_out[1](out)
+        return out
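Note on the attention code above: the inner loop of `FlashAttentionFunction.forward` folds each key/value bucket into running row maxima and row sums instead of materializing the full attention matrix. The following is a minimal pure-Python sketch of that update for a single query row. The names `naive_attention_row` and `streamed_attention_row` are hypothetical, scalars stand in for value vectors, and masking is omitted, so treat it as an illustration of the rescaling step rather than the implementation in this patch.

```python
import math


def naive_attention_row(scores, values):
    """Reference: softmax over all scores, then a weighted sum of values."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def streamed_attention_row(scores, values, bucket_size):
    """Same result, but visiting the keys one bucket at a time."""
    row_max = -math.inf  # running maximum of the scores seen so far
    row_sum = 0.0        # running softmax denominator, relative to row_max
    out = 0.0            # running normalized output

    for start in range(0, len(scores), bucket_size):
        block_scores = scores[start:start + bucket_size]
        block_values = values[start:start + bucket_size]

        # Per-bucket softmax statistics, shifted by the bucket's own maximum.
        block_max = max(block_scores)
        exp_weights = [math.exp(s - block_max) for s in block_scores]
        block_sum = sum(exp_weights)
        exp_values = sum(w * v for w, v in zip(exp_weights, block_values))

        # Bring the old and new statistics onto a common maximum.
        new_max = max(block_max, row_max)
        old_scale = math.exp(row_max - new_max)      # rescales earlier buckets
        block_scale = math.exp(block_max - new_max)  # rescales this bucket
        new_sum = old_scale * row_sum + block_scale * block_sum

        # Renormalize the accumulated output and add this bucket's share.
        out = out * (row_sum / new_sum) * old_scale + (block_scale / new_sum) * exp_values
        row_max, row_sum = new_max, new_sum

    return out
```

Because every bucket is rescaled to the shared running maximum before it is merged, the streamed result should agree with the one-shot softmax for any bucket size, up to floating-point rounding.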
--- a/library/chroma_models.py
+++ b/library/chroma_models.py
@@ -0,0 +1,614 @@
+# copied from the official repo: https://github.com/lodestone-rock/flow/blob/master/src/models/chroma/model.py
+# and modified
+# licensed under Apache License 2.0
+
+import math
+from dataclasses import dataclass
+
+import torch
+from einops import rearrange
+from torch import Tensor, nn
+import torch.nn.functional as F
+import torch.utils.checkpoint as ckpt
+
+from .flux_models import (
+    attention,
+    rope,
+    apply_rope,
+    EmbedND,
+    timestep_embedding,
+    MLPEmbedder,
+    RMSNorm,
+    QKNorm,
+)
+
+
+def distribute_modulations(tensor: torch.Tensor, depth_single_blocks, depth_double_blocks):
+    """
+    Distributes slices of the tensor into the block_dict as ModulationOut objects.
+
+    Args:
+        tensor (torch.Tensor): Input tensor with shape [batch_size, vectors, dim].
+    """
+    batch_size, vectors, dim = tensor.shape
+
+    block_dict = {}
+
+    # Lookup table for the generated vectors; the key names and their order are hard-coded.
+    # TODO: move this into chroma config!
+    # Add the single-block modulations (38 in the default config)
+    for i in range(depth_single_blocks):
+        key = f"single_blocks.{i}.modulation.lin"
+        block_dict[key] = None
+
+    # Add the image double-block modulations (19 in the default config)
+    for i in range(depth_double_blocks):
+        key = f"double_blocks.{i}.img_mod.lin"
+        block_dict[key] = None
+
+    # Add the text double-block modulations (19 in the default config)
+    for i in range(depth_double_blocks):
+        key = f"double_blocks.{i}.txt_mod.lin"
+        block_dict[key] = None
+
+    # Add the final layer
+    block_dict["final_layer.adaLN_modulation.1"] = None
+    # 6.2b version
+    # block_dict["lite_double_blocks.4.img_mod.lin"] = None
+    # block_dict["lite_double_blocks.4.txt_mod.lin"] = None
+
+    idx = 0  # Index to keep track of the vector slices
+
+    for key in block_dict.keys():
+        if "single_blocks" in key:
+            # Single block: 1 ModulationOut
+            block_dict[key] = ModulationOut(
+                shift=tensor[:, idx : idx + 1, :],
+                scale=tensor[:, idx + 1 : idx + 2, :],
+                gate=tensor[:, idx + 2 : idx + 3, :],
+            )
+            idx += 3  # Advance by 3 vectors
+
+        elif "img_mod" in key:
+            # Double block: List of 2 ModulationOut
+            double_block = []
+            for _ in range(2):  # Create 2 ModulationOut objects
+                double_block.append(
+                    ModulationOut(
+                        shift=tensor[:, idx : idx + 1, :],
+                        scale=tensor[:, idx + 1 : idx + 2, :],
+                        gate=tensor[:, idx + 2 : idx + 3, :],
+                    )
+                )
+                idx += 3  # Advance by 3 vectors per ModulationOut
+            block_dict[key] = double_block
+
+        elif "txt_mod" in key:
+            # Double block: List of 2 ModulationOut
+            double_block = []
+            for _ in range(2):  # Create 2 ModulationOut objects
+                double_block.append(
+                    ModulationOut(
+                        shift=tensor[:, idx : idx + 1, :],
+                        scale=tensor[:, idx + 1 : idx + 2, :],
+                        gate=tensor[:, idx + 2 : idx + 3, :],
+                    )
+                )
+                idx += 3  # Advance by 3 vectors per ModulationOut
+            block_dict[key] = double_block
+
+        elif "final_layer" in key:
+            # Final layer: a [shift, scale] pair of tensors (no gate)
+            block_dict[key] = [
+                tensor[:, idx : idx + 1, :],
+                tensor[:, idx + 1 : idx + 2, :],
+            ]
+            idx += 2  # Advance by 2 vectors
+
+    return block_dict
+
+
+class Approximator(nn.Module):
+    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int, n_layers=4):
+        super().__init__()
+        self.in_proj = nn.Linear(in_dim, hidden_dim, bias=True)
+        self.layers = nn.ModuleList([MLPEmbedder(hidden_dim, hidden_dim) for _ in range(n_layers)])
+        self.norms = nn.ModuleList([RMSNorm(hidden_dim) for _ in range(n_layers)])
+        self.out_proj = nn.Linear(hidden_dim, out_dim)
+
+    @property
+    def device(self):
+        # Get the device of the module (assumes all parameters are on the same device)
+        return next(self.parameters()).device
+
+    def forward(self, x: Tensor) -> Tensor:
+        x = self.in_proj(x)
+
+        for layer, norms in zip(self.layers, self.norms):
+            x = x + layer(norms(x))
+
+        x = self.out_proj(x)
+
+        return x
+
+
+class SelfAttention(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        num_heads: int = 8,
+        qkv_bias: bool = False,
+    ):
+        super().__init__()
+        self.num_heads = num_heads
+        head_dim = dim // num_heads
+
+        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
+        self.norm = QKNorm(head_dim)
+        self.proj = nn.Linear(dim, dim)
+
+    def forward(self, x: Tensor, pe: Tensor) -> Tensor:
+        qkv = self.qkv(x)
+        q, k, v = rearrange(qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads)
+        q, k = self.norm(q, k, v)
+        x = attention(q, k, v, pe=pe)
+        x = self.proj(x)
+        return x
+
+
+@dataclass
+class ModulationOut:
+    shift: Tensor
+    scale: Tensor
+    gate: Tensor
+
+
+def _modulation_shift_scale_fn(x, scale, shift):
+    return (1 + scale) * x + shift
+
+
+def _modulation_gate_fn(x, gate, gate_params):
+    return x + gate * gate_params
+
+
+class DoubleStreamBlock(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        mlp_ratio: float,
+        qkv_bias: bool = False,
+    ):
+        super().__init__()
+
+        mlp_hidden_dim = int(hidden_size * mlp_ratio)
+        self.num_heads = num_heads
+        self.hidden_size = hidden_size
+        self.img_norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.img_attn = SelfAttention(
+            dim=hidden_size,
+            num_heads=num_heads,
+            qkv_bias=qkv_bias,
+        )
+
+        self.img_norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.img_mlp = nn.Sequential(
+            nn.Linear(hidden_size, mlp_hidden_dim, bias=True),
+            nn.GELU(approximate="tanh"),
+            nn.Linear(mlp_hidden_dim, hidden_size, bias=True),
+        )
+
+        self.txt_norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.txt_attn = SelfAttention(
+            dim=hidden_size,
+            num_heads=num_heads,
+            qkv_bias=qkv_bias,
+        )
+
+        self.txt_norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.txt_mlp = nn.Sequential(
+            nn.Linear(hidden_size, mlp_hidden_dim, bias=True),
+            nn.GELU(approximate="tanh"),
+            nn.Linear(mlp_hidden_dim, hidden_size, bias=True),
+        )
+
+    @property
+    def device(self):
+        # Get the device of the module (assumes all parameters are on the same device)
+        return next(self.parameters()).device
+
+    def modulation_shift_scale_fn(self, x, scale, shift):
+        return _modulation_shift_scale_fn(x, scale, shift)
+
+    def modulation_gate_fn(self, x, gate, gate_params):
+        return _modulation_gate_fn(x, gate, gate_params)
+
+    def forward(
+        self,
+        img: Tensor,
+        txt: Tensor,
+        pe: Tensor,
+        distill_vec: list[list[ModulationOut]],
+        mask: Tensor,
+    ) -> tuple[Tensor, Tensor]:
+        (img_mod1, img_mod2), (txt_mod1, txt_mod2) = distill_vec
+
+        # prepare image for attention
+        img_modulated = self.img_norm1(img)
+        # replaced with compiled fn
+        # img_modulated = (1 + img_mod1.scale) * img_modulated + img_mod1.shift
+        img_modulated = self.modulation_shift_scale_fn(img_modulated, img_mod1.scale, img_mod1.shift)
+        img_qkv = self.img_attn.qkv(img_modulated)
+        img_q, img_k, img_v = rearrange(img_qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads)
+        img_q, img_k = self.img_attn.norm(img_q, img_k, img_v)
+
+        # prepare txt for attention
+        txt_modulated = self.txt_norm1(txt)
+        # replaced with compiled fn
+        # txt_modulated = (1 + txt_mod1.scale) * txt_modulated + txt_mod1.shift
+        txt_modulated = self.modulation_shift_scale_fn(txt_modulated, txt_mod1.scale, txt_mod1.shift)
+        txt_qkv = self.txt_attn.qkv(txt_modulated)
+        txt_q, txt_k, txt_v = rearrange(txt_qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads)
+        txt_q, txt_k = self.txt_attn.norm(txt_q, txt_k, txt_v)
+
+        # run actual attention
+        q = torch.cat((txt_q, img_q), dim=2)
+        k = torch.cat((txt_k, img_k), dim=2)
+        v = torch.cat((txt_v, img_v), dim=2)
+
+        attn = attention(q, k, v, pe=pe, mask=mask)
+        txt_attn, img_attn = attn[:, : txt.shape[1]], attn[:, txt.shape[1] :]
+
+        # calculate the img blocks
+        # replaced with compiled fn
+        # img = img + img_mod1.gate * self.img_attn.proj(img_attn)
+        # img = img + img_mod2.gate * self.img_mlp((1 + img_mod2.scale) * self.img_norm2(img) + img_mod2.shift)
+        img = self.modulation_gate_fn(img, img_mod1.gate, self.img_attn.proj(img_attn))
+        img = self.modulation_gate_fn(
+            img,
+            img_mod2.gate,
+            self.img_mlp(self.modulation_shift_scale_fn(self.img_norm2(img), img_mod2.scale, img_mod2.shift)),
+        )
+
+        # calculate the txt blocks
+        # replaced with compiled fn
+        # txt = txt + txt_mod1.gate * self.txt_attn.proj(txt_attn)
+        # txt = txt + txt_mod2.gate * self.txt_mlp((1 + txt_mod2.scale) * self.txt_norm2(txt) + txt_mod2.shift)
+        txt = self.modulation_gate_fn(txt, txt_mod1.gate, self.txt_attn.proj(txt_attn))
+        txt = self.modulation_gate_fn(
+            txt,
+            txt_mod2.gate,
+            self.txt_mlp(self.modulation_shift_scale_fn(self.txt_norm2(txt), txt_mod2.scale, txt_mod2.shift)),
+        )
+
+        return img, txt
+
+
+class SingleStreamBlock(nn.Module):
+    """
+    A DiT block with parallel linear layers as described in
+    https://arxiv.org/abs/2302.05442 and adapted modulation interface.
+    """
+
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        mlp_ratio: float = 4.0,
+        qk_scale: float | None = None,
+    ):
+        super().__init__()
+        self.hidden_dim = hidden_size
+        self.num_heads = num_heads
+        head_dim = hidden_size // num_heads
+        self.scale = qk_scale or head_dim**-0.5
+
+        self.mlp_hidden_dim = int(hidden_size * mlp_ratio)
+        # qkv and mlp_in
+        self.linear1 = nn.Linear(hidden_size, hidden_size * 3 + self.mlp_hidden_dim)
+        # proj and mlp_out
+        self.linear2 = nn.Linear(hidden_size + self.mlp_hidden_dim, hidden_size)
+
+        self.norm = QKNorm(head_dim)
+
+        self.hidden_size = hidden_size
+        self.pre_norm = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+
+        self.mlp_act = nn.GELU(approximate="tanh")
+
+    @property
+    def device(self):
+        # Get the device of the module (assumes all parameters are on the same device)
+        return next(self.parameters()).device
+
+    def modulation_shift_scale_fn(self, x, scale, shift):
+        return _modulation_shift_scale_fn(x, scale, shift)
+
+    def modulation_gate_fn(self, x, gate, gate_params):
+        return _modulation_gate_fn(x, gate, gate_params)
+
+    def forward(self, x: Tensor, pe: Tensor, distill_vec: ModulationOut, mask: Tensor) -> Tensor:
+        mod = distill_vec
+        # replaced with compiled fn
+        # x_mod = (1 + mod.scale) * self.pre_norm(x) + mod.shift
+        x_mod = self.modulation_shift_scale_fn(self.pre_norm(x), mod.scale, mod.shift)
+        qkv, mlp = torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)
+
+        q, k, v = rearrange(qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads)
+        q, k = self.norm(q, k, v)
+
+        # compute attention
+        attn = attention(q, k, v, pe=pe, mask=mask)
+        # compute activation in mlp stream, cat again and run second linear layer
+        output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
+        # replaced with compiled fn
+        # return x + mod.gate * output
+        return self.modulation_gate_fn(x, mod.gate, output)
+
+
+class LastLayer(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        patch_size: int,
+        out_channels: int,
+    ):
+        super().__init__()
+        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True)
+
+    @property
+    def device(self):
+        # Get the device of the module (assumes all parameters are on the same device)
+        return next(self.parameters()).device
+
+    def modulation_shift_scale_fn(self, x, scale, shift):
+        return _modulation_shift_scale_fn(x, scale, shift)
+
+    def forward(self, x: Tensor, distill_vec: list[Tensor]) -> Tensor:
+        shift, scale = distill_vec
+        shift = shift.squeeze(1)
+        scale = scale.squeeze(1)
+        # replaced with compiled fn
+        # x = (1 + scale[:, None, :]) * self.norm_final(x) + shift[:, None, :]
+        x = self.modulation_shift_scale_fn(self.norm_final(x), scale[:, None, :], shift[:, None, :])
+        x = self.linear(x)
+        return x
+
+
+@dataclass
+class ChromaParams:
+    in_channels: int
+    context_in_dim: int
+    hidden_size: int
+    mlp_ratio: float
+    num_heads: int
+    depth: int
+    depth_single_blocks: int
+    axes_dim: list[int]
+    theta: int
+    qkv_bias: bool
+    guidance_embed: bool
+    approximator_in_dim: int
+    approximator_depth: int
+    approximator_hidden_size: int
+    _use_compiled: bool
+
+
+chroma_params = ChromaParams(
+    in_channels=64,
+    context_in_dim=4096,
+    hidden_size=3072,
+    mlp_ratio=4.0,
+    num_heads=24,
+    depth=19,
+    depth_single_blocks=38,
+    axes_dim=[16, 56, 56],
+    theta=10_000,
+    qkv_bias=True,
+    guidance_embed=True,
+    approximator_in_dim=64,
+    approximator_depth=5,
+    approximator_hidden_size=5120,
+    _use_compiled=False,
+)
+
+
+def modify_mask_to_attend_padding(mask, max_seq_length, num_extra_padding=8):
+    """
+    Modifies attention mask to allow attention to a few extra padding tokens.
+
+    Args:
+        mask: Original attention mask (1 for tokens to attend to, 0 for masked tokens)
+        max_seq_length: Maximum sequence length of the model
+        num_extra_padding: Number of padding tokens to unmask
+
+    Returns:
+        Modified mask
+    """
+    # Get the actual sequence length from the mask
+    seq_length = mask.sum(dim=-1)
+    batch_size = mask.shape[0]
+
+    modified_mask = mask.clone()
+
+    for i in range(batch_size):
+        current_seq_len = int(seq_length[i].item())
+
+        # Only add extra padding tokens if there's room
+        if current_seq_len < max_seq_length:
+            # Calculate how many padding tokens we can unmask
+            available_padding = max_seq_length - current_seq_len
+            tokens_to_unmask = min(num_extra_padding, available_padding)
+
+            # Unmask the specified number of padding tokens right after the sequence
+            modified_mask[i, current_seq_len : current_seq_len + tokens_to_unmask] = 1
+
+    return modified_mask
+
+
+class Chroma(nn.Module):
+    """
+    Transformer model for flow matching on sequences.
+    """
+
+    def __init__(self, params: ChromaParams):
+        super().__init__()
+        self.params = params
+        self.in_channels = params.in_channels
+        self.out_channels = self.in_channels
+        if params.hidden_size % params.num_heads != 0:
+            raise ValueError(f"Hidden size {params.hidden_size} must be divisible by num_heads {params.num_heads}")
+        pe_dim = params.hidden_size // params.num_heads
+        if sum(params.axes_dim) != pe_dim:
+            raise ValueError(f"Got {params.axes_dim} but expected positional dim {pe_dim}")
+        self.hidden_size = params.hidden_size
+        self.num_heads = params.num_heads
+        self.pe_embedder = EmbedND(dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim)
+        self.img_in = nn.Linear(self.in_channels, self.hidden_size, bias=True)
+
+        # TODO: need proper mapping for this approximator output!
+        # currently the mapping is hardcoded in distribute_modulations function
+        self.distilled_guidance_layer = Approximator(
+            params.approximator_in_dim,
+            self.hidden_size,
+            params.approximator_hidden_size,
+            params.approximator_depth,
+        )
+        self.txt_in = nn.Linear(params.context_in_dim, self.hidden_size)
+
+        self.double_blocks = nn.ModuleList(
+            [
+                DoubleStreamBlock(
+                    self.hidden_size,
+                    self.num_heads,
+                    mlp_ratio=params.mlp_ratio,
+                    qkv_bias=params.qkv_bias,
+                )
+                for _ in range(params.depth)
+            ]
+        )
+
+        self.single_blocks = nn.ModuleList(
+            [
+                SingleStreamBlock(
+                    self.hidden_size,
+                    self.num_heads,
+                    mlp_ratio=params.mlp_ratio,
+                )
+                for _ in range(params.depth_single_blocks)
+            ]
+        )
+
+        self.final_layer = LastLayer(
+            self.hidden_size,
+            1,
+            self.out_channels,
+        )
+
+        # TODO: move this hardcoded value to config
+        # each single layer has 3 modulation vectors (shift, scale, gate)
+        # each double layer has 6 modulation vectors per stream (img and txt)
+        # the final layer has 2 modulation vectors (shift, scale)
+        self.mod_index_length = 3 * params.depth_single_blocks + 2 * 6 * params.depth + 2
+        self.depth_single_blocks = params.depth_single_blocks
+        self.depth_double_blocks = params.depth
+        # self.mod_index = torch.tensor(list(range(self.mod_index_length)), device=0)
+        self.register_buffer(
+            "mod_index",
+            torch.tensor(list(range(self.mod_index_length)), device="cpu"),
+            persistent=False,
+        )
+        self.approximator_in_dim = params.approximator_in_dim
+
+    @property
+    def device(self):
+        # Get the device of the module (assumes all parameters are on the same device)
+        return next(self.parameters()).device
+
+    def forward(
+        self,
+        img: Tensor,
+        img_ids: Tensor,
+        txt: Tensor,
+        txt_ids: Tensor,
+        txt_mask: Tensor,
+        timesteps: Tensor,
+        guidance: Tensor,
+        attn_padding: int = 1,
+    ) -> Tensor:
+        if img.ndim != 3 or txt.ndim != 3:
+            raise ValueError("Input img and txt tensors must have 3 dimensions.")
+
+        # running on sequences img
+        img = self.img_in(img)
+        txt = self.txt_in(txt)
+
+        # TODO:
+        # fix the grad accumulation issue here; for now this runs in no-grad mode.
+        # Besides, I don't want to wash out the PFP that's trained into these model weights anyway.
+        # The fan-out operation here deletes the backward graph.
+        # Alternatively, doing a forward pass for every block manually is doable but slow;
+        # a custom backward would probably be better.
+        with torch.no_grad():
+            distill_timestep = timestep_embedding(timesteps, self.approximator_in_dim // 4)
+            # TODO: need to add toggle to omit this from schnell but that's not a priority
+            distil_guidance = timestep_embedding(guidance, self.approximator_in_dim // 4)
+            # get all modulation index
+            modulation_index = timestep_embedding(self.mod_index, self.approximator_in_dim // 2)
+            # we need to broadcast the modulation index here so each batch has all of the index
+            modulation_index = modulation_index.unsqueeze(0).repeat(img.shape[0], 1, 1)
+            # and we need to broadcast timestep and guidance along too
+            timestep_guidance = (
+                torch.cat([distill_timestep, distil_guidance], dim=1).unsqueeze(1).repeat(1, self.mod_index_length, 1)
+            )
+            # then and only then we could concatenate it together
+            input_vec = torch.cat([timestep_guidance, modulation_index], dim=-1)
+            mod_vectors = self.distilled_guidance_layer(input_vec.requires_grad_(True))
+        mod_vectors_dict = distribute_modulations(mod_vectors, self.depth_single_blocks, self.depth_double_blocks)
+
+        ids = torch.cat((txt_ids, img_ids), dim=1)
+        pe = self.pe_embedder(ids)
+
+        # compute mask
+        # assume max seq length from the batched input
+
+        max_len = txt.shape[1]
+
+        # mask
+        with torch.no_grad():
+            txt_mask_w_padding = modify_mask_to_attend_padding(txt_mask, max_len, attn_padding)
+            txt_img_mask = torch.cat(
+                [
+                    txt_mask_w_padding,
+                    torch.ones([img.shape[0], img.shape[1]], device=txt_mask.device),
+                ],
+                dim=1,
+            )
+            txt_img_mask = txt_img_mask.float().T @ txt_img_mask.float()
+            txt_img_mask = txt_img_mask[None, None, ...].repeat(txt.shape[0], self.num_heads, 1, 1).int().bool()
+            # txt_mask_w_padding[txt_mask_w_padding==False] = True
+
+        for i, block in enumerate(self.double_blocks):
+            # the guidance replaced by FFN output
+            img_mod = mod_vectors_dict[f"double_blocks.{i}.img_mod.lin"]
+            txt_mod = mod_vectors_dict[f"double_blocks.{i}.txt_mod.lin"]
+            double_mod = [img_mod, txt_mod]
+
+            # gradient-checkpoint each block during training; call it directly otherwise
+            if self.training:
+                img, txt = ckpt.checkpoint(block, img, txt, pe, double_mod, txt_img_mask)
+            else:
+                img, txt = block(img=img, txt=txt, pe=pe, distill_vec=double_mod, mask=txt_img_mask)
+
+        img = torch.cat((txt, img), 1)
+        for i, block in enumerate(self.single_blocks):
+            single_mod = mod_vectors_dict[f"single_blocks.{i}.modulation.lin"]
+            if self.training:
+                img = ckpt.checkpoint(block, img, pe, single_mod, txt_img_mask)
+            else:
+                img = block(img, pe=pe, distill_vec=single_mod, mask=txt_img_mask)
+        img = img[:, txt.shape[1] :, ...]
+        final_mod = mod_vectors_dict["final_layer.adaLN_modulation.1"]
+        img = self.final_layer(img, distill_vec=final_mod)  # (N, T, patch_size ** 2 * out_channels)
+        return img
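Note on the modulation bookkeeping above: `distribute_modulations` consumes the approximator output in a fixed order (single blocks, then image double blocks, then text double blocks, then the final layer), and `mod_index_length` in `Chroma.__init__` has to equal the number of vectors that walk consumes. The sketch below reproduces only the index arithmetic in plain Python so the two can be checked against each other. `modulation_layout` is a hypothetical helper introduced for illustration; it is not part of `library/chroma_models.py`.

```python
def modulation_layout(depth_single_blocks, depth_double_blocks):
    """Map each modulation key to the (start, stop) range of vectors it consumes."""
    layout = {}
    idx = 0

    # Single blocks: one (shift, scale, gate) triple each.
    for i in range(depth_single_blocks):
        layout[f"single_blocks.{i}.modulation.lin"] = (idx, idx + 3)
        idx += 3

    # Double blocks: two triples per stream, all image keys first, then all text keys.
    for stream in ("img_mod", "txt_mod"):
        for i in range(depth_double_blocks):
            layout[f"double_blocks.{i}.{stream}.lin"] = (idx, idx + 6)
            idx += 6

    # Final layer: a (shift, scale) pair with no gate.
    layout["final_layer.adaLN_modulation.1"] = (idx, idx + 2)
    idx += 2

    return layout, idx


# Default Chroma config: 38 single blocks, 19 double blocks.
layout, total = modulation_layout(depth_single_blocks=38, depth_double_blocks=19)
mod_index_length = 3 * 38 + 2 * 6 * 19 + 2  # formula used in Chroma.__init__
print(total, mod_index_length)
```

With the default depths both numbers come out to 344, so the approximator is asked for exactly as many vectors as the slicing walk uses. If the key order in `distribute_modulations` changes, the ranges shift but the total stays the same.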
--- a/library/config_util.py
+++ b/library/config_util.py
@@ -0,0 +1,743 @@
+import argparse
+from dataclasses import (
+    asdict,
+    dataclass,
+)
+import functools
+import random
+from textwrap import dedent, indent
+import json
+from pathlib import Path
+
+# from toolz import curry
+from typing import Dict, List, Optional, Sequence, Tuple, Union
+
+import toml
+import voluptuous
+from voluptuous import (
+    Any,
+    ExactSequence,
+    MultipleInvalid,
+    Object,
+    Required,
+    Schema,
+)
+from transformers import CLIPTokenizer
+
+from . import train_util
+from .train_util import (
+    DreamBoothSubset,
+    FineTuningSubset,
+    ControlNetSubset,
+    DreamBoothDataset,
+    FineTuningDataset,
+    ControlNetDataset,
+    DatasetGroup,
+)
+from .utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def add_config_arguments(parser: argparse.ArgumentParser):
+    parser.add_argument(
+        "--dataset_config", type=Path, default=None, help="config file for detail settings / 詳細な設定用の設定ファイル"
+    )
+
+
+# TODO: inherit Params class in Subset, Dataset
+
+
+@dataclass
+class BaseSubsetParams:
+    image_dir: Optional[str] = None
+    num_repeats: int = 1
+    shuffle_caption: bool = False
+    caption_separator: str = ","
+    keep_tokens: int = 0
+    keep_tokens_separator: Optional[str] = None
+    secondary_separator: Optional[str] = None
+    enable_wildcard: bool = False
+    color_aug: bool = False
+    flip_aug: bool = False
+    face_crop_aug_range: Optional[Tuple[float, float]] = None
+    random_crop: bool = False
+    caption_prefix: Optional[str] = None
+    caption_suffix: Optional[str] = None
+    caption_dropout_rate: float = 0.0
+    caption_dropout_every_n_epochs: int = 0
+    caption_tag_dropout_rate: float = 0.0
+    token_warmup_min: int = 1
+    token_warmup_step: float = 0
+    custom_attributes: Optional[Dict[str, Any]] = None
+    validation_seed: int = 0
+    validation_split: float = 0.0
+    resize_interpolation: Optional[str] = None
+
+
+@dataclass
+class DreamBoothSubsetParams(BaseSubsetParams):
+    is_reg: bool = False
+    class_tokens: Optional[str] = None
+    caption_extension: str = ".caption"
+    cache_info: bool = False
+    alpha_mask: bool = False
+
+
+@dataclass
+class FineTuningSubsetParams(BaseSubsetParams):
+    metadata_file: Optional[str] = None
+    alpha_mask: bool = False
+
+
+@dataclass
+class ControlNetSubsetParams(BaseSubsetParams):
+    conditioning_data_dir: Optional[str] = None
+    caption_extension: str = ".caption"
+    cache_info: bool = False
+
+
+@dataclass
+class BaseDatasetParams:
+    resolution: Optional[Tuple[int, int]] = None
+    network_multiplier: float = 1.0
+    debug_dataset: bool = False
+    validation_seed: Optional[int] = None
+    validation_split: float = 0.0
+    resize_interpolation: Optional[str] = None
+
+@dataclass
+class DreamBoothDatasetParams(BaseDatasetParams):
+    batch_size: int = 1
+    enable_bucket: bool = False
+    min_bucket_reso: int = 256
+    max_bucket_reso: int = 1024
+    bucket_reso_steps: int = 64
+    bucket_no_upscale: bool = False
+    prior_loss_weight: float = 1.0
+    
+@dataclass
+class FineTuningDatasetParams(BaseDatasetParams):
+    batch_size: int = 1
+    enable_bucket: bool = False
+    min_bucket_reso: int = 256
+    max_bucket_reso: int = 1024
+    bucket_reso_steps: int = 64
+    bucket_no_upscale: bool = False
+
+
+@dataclass
+class ControlNetDatasetParams(BaseDatasetParams):
+    batch_size: int = 1
+    enable_bucket: bool = False
+    min_bucket_reso: int = 256
+    max_bucket_reso: int = 1024
+    bucket_reso_steps: int = 64
+    bucket_no_upscale: bool = False
+
+
+@dataclass
+class SubsetBlueprint:
+    params: Union[DreamBoothSubsetParams, FineTuningSubsetParams, ControlNetSubsetParams]
+
+
+@dataclass
+class DatasetBlueprint:
+    is_dreambooth: bool
+    is_controlnet: bool
+    params: Union[DreamBoothDatasetParams, FineTuningDatasetParams, ControlNetDatasetParams]
+    subsets: Sequence[SubsetBlueprint]
+
+
+@dataclass
+class DatasetGroupBlueprint:
+    datasets: Sequence[DatasetBlueprint]
+
+
+@dataclass
+class Blueprint:
+    dataset_group: DatasetGroupBlueprint
+
+
+class ConfigSanitizer:
+    # @curry
+    @staticmethod
+    def __validate_and_convert_twodim(klass, value: Sequence) -> Tuple:
+        Schema(ExactSequence([klass, klass]))(value)
+        return tuple(value)
+
+    # @curry
+    @staticmethod
+    def __validate_and_convert_scalar_or_twodim(klass, value: Union[float, Sequence]) -> Tuple:
+        Schema(Any(klass, ExactSequence([klass, klass])))(value)
+        try:
+            Schema(klass)(value)
+            return (value, value)
+        except Exception:
+            return ConfigSanitizer.__validate_and_convert_twodim(klass, value)
+
+    # subset schema
+    SUBSET_ASCENDABLE_SCHEMA = {
+        "color_aug": bool,
+        "face_crop_aug_range": functools.partial(__validate_and_convert_twodim.__func__, float),
+        "flip_aug": bool,
+        "num_repeats": int,
+        "random_crop": bool,
+        "shuffle_caption": bool,
+        "keep_tokens": int,
+        "keep_tokens_separator": str,
+        "secondary_separator": str,
+        "caption_separator": str,
+        "enable_wildcard": bool,
+        "token_warmup_min": int,
+        "token_warmup_step": Any(float, int),
+        "caption_prefix": str,
+        "caption_suffix": str,
+        "custom_attributes": dict,
+        "resize_interpolation": str,
+    }
+    # DO means DropOut
+    DO_SUBSET_ASCENDABLE_SCHEMA = {
+        "caption_dropout_every_n_epochs": int,
+        "caption_dropout_rate": Any(float, int),
+        "caption_tag_dropout_rate": Any(float, int),
+    }
+    # DB means DreamBooth
+    DB_SUBSET_ASCENDABLE_SCHEMA = {
+        "caption_extension": str,
+        "class_tokens": str,
+        "cache_info": bool,
+    }
+    DB_SUBSET_DISTINCT_SCHEMA = {
+        Required("image_dir"): str,
+        "is_reg": bool,
+        "alpha_mask": bool,
+    }
+    # FT means FineTuning
+    FT_SUBSET_DISTINCT_SCHEMA = {
+        Required("metadata_file"): str,
+        "image_dir": str,
+        "alpha_mask": bool,
+    }
+    CN_SUBSET_ASCENDABLE_SCHEMA = {
+        "caption_extension": str,
+        "cache_info": bool,
+    }
+    CN_SUBSET_DISTINCT_SCHEMA = {
+        Required("image_dir"): str,
+        Required("conditioning_data_dir"): str,
+    }
+
+    # datasets schema
+    DATASET_ASCENDABLE_SCHEMA = {
+        "batch_size": int,
+        "bucket_no_upscale": bool,
+        "bucket_reso_steps": int,
+        "enable_bucket": bool,
+        "max_bucket_reso": int,
+        "min_bucket_reso": int,
+        "validation_seed": int,
+        "validation_split": float,
+        "resolution": functools.partial(__validate_and_convert_scalar_or_twodim.__func__, int),
+        "network_multiplier": float,
+        "resize_interpolation": str,
+    }
+
+    # options handled by argparse but not handled by user config
+    ARGPARSE_SPECIFIC_SCHEMA = {
+        "debug_dataset": bool,
+        "max_token_length": Any(None, int),
+        "prior_loss_weight": Any(float, int),
+    }
+    # for handling default None value of argparse
+    ARGPARSE_NULLABLE_OPTNAMES = [
+        "face_crop_aug_range",
+        "resolution",
+    ]
+    # map option names, since they may differ between argparse and the user config
+    ARGPARSE_OPTNAME_TO_CONFIG_OPTNAME = {
+        "train_batch_size": "batch_size",
+        "dataset_repeats": "num_repeats",
+    }
+
+    def __init__(self, support_dreambooth: bool, support_finetuning: bool, support_controlnet: bool, support_dropout: bool) -> None:
+        assert support_dreambooth or support_finetuning or support_controlnet, (
+            "Neither DreamBooth mode nor fine tuning mode nor controlnet mode specified. Please specify one mode or more."
+            + " / DreamBooth モードか fine tuning モードか controlnet モードのどれも指定されていません。1つ以上指定してください。"
+        )
+
+        self.db_subset_schema = self.__merge_dict(
+            self.SUBSET_ASCENDABLE_SCHEMA,
+            self.DB_SUBSET_DISTINCT_SCHEMA,
+            self.DB_SUBSET_ASCENDABLE_SCHEMA,
+            self.DO_SUBSET_ASCENDABLE_SCHEMA if support_dropout else {},
+        )
+
+        self.ft_subset_schema = self.__merge_dict(
+            self.SUBSET_ASCENDABLE_SCHEMA,
+            self.FT_SUBSET_DISTINCT_SCHEMA,
+            self.DO_SUBSET_ASCENDABLE_SCHEMA if support_dropout else {},
+        )
+
+        self.cn_subset_schema = self.__merge_dict(
+            self.SUBSET_ASCENDABLE_SCHEMA,
+            self.CN_SUBSET_DISTINCT_SCHEMA,
+            self.CN_SUBSET_ASCENDABLE_SCHEMA,
+            self.DO_SUBSET_ASCENDABLE_SCHEMA if support_dropout else {},
+        )
+
+        self.db_dataset_schema = self.__merge_dict(
+            self.DATASET_ASCENDABLE_SCHEMA,
+            self.SUBSET_ASCENDABLE_SCHEMA,
+            self.DB_SUBSET_ASCENDABLE_SCHEMA,
+            self.DO_SUBSET_ASCENDABLE_SCHEMA if support_dropout else {},
+            {"subsets": [self.db_subset_schema]},
+        )
+
+        self.ft_dataset_schema = self.__merge_dict(
+            self.DATASET_ASCENDABLE_SCHEMA,
+            self.SUBSET_ASCENDABLE_SCHEMA,
+            self.DO_SUBSET_ASCENDABLE_SCHEMA if support_dropout else {},
+            {"subsets": [self.ft_subset_schema]},
+        )
+
+        self.cn_dataset_schema = self.__merge_dict(
+            self.DATASET_ASCENDABLE_SCHEMA,
+            self.SUBSET_ASCENDABLE_SCHEMA,
+            self.CN_SUBSET_ASCENDABLE_SCHEMA,
+            self.DO_SUBSET_ASCENDABLE_SCHEMA if support_dropout else {},
+            {"subsets": [self.cn_subset_schema]},
+        )
+
+        if support_dreambooth and support_finetuning:
+
+            def validate_flex_dataset(dataset_config: dict):
+                subsets_config = dataset_config.get("subsets", [])
+
+                if support_controlnet and all(["conditioning_data_dir" in subset for subset in subsets_config]):
+                    return Schema(self.cn_dataset_schema)(dataset_config)
+                # check dataset meets FT style
+                # NOTE: all FT subsets should have "metadata_file"
+                elif all(["metadata_file" in subset for subset in subsets_config]):
+                    return Schema(self.ft_dataset_schema)(dataset_config)
+                # check dataset meets DB style
+                # NOTE: all DB subsets should have no "metadata_file"
+                elif all(["metadata_file" not in subset for subset in subsets_config]):
+                    return Schema(self.db_dataset_schema)(dataset_config)
+                else:
+                    raise voluptuous.Invalid(
+                        "DreamBooth subset and fine tuning subset cannot be mixed in the same dataset. Please split them into separate datasets. / DreamBoothのサブセットとfine tuningのサブセットを同一のデータセットに混在させることはできません。別々のデータセットに分割してください。"
+                    )
+
+            self.dataset_schema = validate_flex_dataset
+        elif support_dreambooth:
+            if support_controlnet:
+                self.dataset_schema = self.cn_dataset_schema
+            else:
+                self.dataset_schema = self.db_dataset_schema
+        elif support_finetuning:
+            self.dataset_schema = self.ft_dataset_schema
+        elif support_controlnet:
+            self.dataset_schema = self.cn_dataset_schema
+
+        self.general_schema = self.__merge_dict(
+            self.DATASET_ASCENDABLE_SCHEMA,
+            self.SUBSET_ASCENDABLE_SCHEMA,
+            self.DB_SUBSET_ASCENDABLE_SCHEMA if support_dreambooth else {},
+            self.CN_SUBSET_ASCENDABLE_SCHEMA if support_controlnet else {},
+            self.DO_SUBSET_ASCENDABLE_SCHEMA if support_dropout else {},
+        )
+
+        self.user_config_validator = Schema(
+            {
+                "general": self.general_schema,
+                "datasets": [self.dataset_schema],
+            }
+        )
+
+        self.argparse_schema = self.__merge_dict(
+            self.general_schema,
+            self.ARGPARSE_SPECIFIC_SCHEMA,
+            {optname: Any(None, self.general_schema[optname]) for optname in self.ARGPARSE_NULLABLE_OPTNAMES},
+            {a_name: self.general_schema[c_name] for a_name, c_name in self.ARGPARSE_OPTNAME_TO_CONFIG_OPTNAME.items()},
+        )
+
+        self.argparse_config_validator = Schema(Object(self.argparse_schema), extra=voluptuous.ALLOW_EXTRA)
+
+    def sanitize_user_config(self, user_config: dict) -> dict:
+        try:
+            return self.user_config_validator(user_config)
+        except MultipleInvalid:
+            # TODO: make the error message easier to understand
+            logger.error("Invalid user config / ユーザ設定の形式が正しくないようです")
+            raise
+
+    # NOTE: Strictly speaking, the argument parser result does not need to be sanitized.
+    #   However, doing so helps us detect program bugs.
+    def sanitize_argparse_namespace(self, argparse_namespace: argparse.Namespace) -> argparse.Namespace:
+        try:
+            return self.argparse_config_validator(argparse_namespace)
+        except MultipleInvalid:
+            # XXX: reaching this point most likely indicates a bug
+            logger.error(
+                "Invalid cmdline parsed arguments. This is most likely a bug. / コマンドラインのパース結果が正しくないようです。プログラムのバグの可能性が高いです。"
+            )
+            raise
+
+    # NOTE: if a key appears in more than one dict, the value from the later dict wins
+    @staticmethod
+    def __merge_dict(*dict_list: dict) -> dict:
+        merged = {}
+        for schema in dict_list:
+            # merged |= schema
+            for k, v in schema.items():
+                merged[k] = v
+        return merged
+
+
+class BlueprintGenerator:
+    BLUEPRINT_PARAM_NAME_TO_CONFIG_OPTNAME = {}
+
+    def __init__(self, sanitizer: ConfigSanitizer):
+        self.sanitizer = sanitizer
+
+    # runtime_params is for parameters that can only be configured at runtime, such as the tokenizer
+    def generate(self, user_config: dict, argparse_namespace: argparse.Namespace, **runtime_params) -> Blueprint:
+        sanitized_user_config = self.sanitizer.sanitize_user_config(user_config)
+        sanitized_argparse_namespace = self.sanitizer.sanitize_argparse_namespace(argparse_namespace)
+
+        # convert argparse namespace to dict like config
+        # NOTE: it is ok to have extra entries in dict
+        optname_map = self.sanitizer.ARGPARSE_OPTNAME_TO_CONFIG_OPTNAME
+        argparse_config = {
+            optname_map.get(optname, optname): value for optname, value in vars(sanitized_argparse_namespace).items()
+        }
+
+        general_config = sanitized_user_config.get("general", {})
+
+        dataset_blueprints = []
+        for dataset_config in sanitized_user_config.get("datasets", []):
+            # NOTE: if subsets have no "metadata_file", these are DreamBooth datasets/subsets
+            subsets = dataset_config.get("subsets", [])
+            is_dreambooth = all(["metadata_file" not in subset for subset in subsets])
+            is_controlnet = all(["conditioning_data_dir" in subset for subset in subsets])
+            if is_controlnet:
+                subset_params_klass = ControlNetSubsetParams
+                dataset_params_klass = ControlNetDatasetParams
+            elif is_dreambooth:
+                subset_params_klass = DreamBoothSubsetParams
+                dataset_params_klass = DreamBoothDatasetParams
+            else:
+                subset_params_klass = FineTuningSubsetParams
+                dataset_params_klass = FineTuningDatasetParams
+
+            subset_blueprints = []
+            for subset_config in subsets:
+                params = self.generate_params_by_fallbacks(
+                    subset_params_klass, [subset_config, dataset_config, general_config, argparse_config, runtime_params]
+                )
+                subset_blueprints.append(SubsetBlueprint(params))
+
+            params = self.generate_params_by_fallbacks(
+                dataset_params_klass, [dataset_config, general_config, argparse_config, runtime_params]
+            )
+            dataset_blueprints.append(DatasetBlueprint(is_dreambooth, is_controlnet, params, subset_blueprints))
+
+        dataset_group_blueprint = DatasetGroupBlueprint(dataset_blueprints)
+
+        return Blueprint(dataset_group_blueprint)
+
+    @staticmethod
+    def generate_params_by_fallbacks(param_klass, fallbacks: Sequence[dict]):
+        name_map = BlueprintGenerator.BLUEPRINT_PARAM_NAME_TO_CONFIG_OPTNAME
+        search_value = BlueprintGenerator.search_value
+        default_params = asdict(param_klass())
+        param_names = default_params.keys()
+
+        params = {name: search_value(name_map.get(name, name), fallbacks, default_params.get(name)) for name in param_names}
+
+        return param_klass(**params)
+
+    @staticmethod
+    def search_value(key: str, fallbacks: Sequence[dict], default_value=None):
+        for cand in fallbacks:
+            value = cand.get(key)
+            if value is not None:
+                return value
+
+        return default_value
+
+def generate_dataset_group_by_blueprint(dataset_group_blueprint: DatasetGroupBlueprint) -> Tuple[DatasetGroup, Optional[DatasetGroup]]:
+    datasets: List[Union[DreamBoothDataset, FineTuningDataset, ControlNetDataset]] = []
+
+    for dataset_blueprint in dataset_group_blueprint.datasets:
+        extra_dataset_params = {}
+
+        if dataset_blueprint.is_controlnet:
+            subset_klass = ControlNetSubset
+            dataset_klass = ControlNetDataset
+        elif dataset_blueprint.is_dreambooth:
+            subset_klass = DreamBoothSubset
+            dataset_klass = DreamBoothDataset
+            # DreamBooth datasets support splitting training and validation datasets
+            extra_dataset_params = {"is_training_dataset": True}
+        else:
+            subset_klass = FineTuningSubset
+            dataset_klass = FineTuningDataset
+
+        subsets = [subset_klass(**asdict(subset_blueprint.params)) for subset_blueprint in dataset_blueprint.subsets]
+        dataset = dataset_klass(subsets=subsets, **asdict(dataset_blueprint.params), **extra_dataset_params)
+        datasets.append(dataset)
+
+    val_datasets: List[Union[DreamBoothDataset, FineTuningDataset, ControlNetDataset]] = []
+    for dataset_blueprint in dataset_group_blueprint.datasets:
+        if dataset_blueprint.params.validation_split < 0.0 or dataset_blueprint.params.validation_split > 1.0:
+            logger.warning(f"Dataset param `validation_split` ({dataset_blueprint.params.validation_split}) is not a valid number between 0.0 and 1.0, skipping validation split...")
+            continue
+
+        # if the dataset isn't setting a validation split, there is no current validation dataset
+        if dataset_blueprint.params.validation_split == 0.0:
+            continue
+
+        extra_dataset_params = {}
+        if dataset_blueprint.is_controlnet:
+            subset_klass = ControlNetSubset
+            dataset_klass = ControlNetDataset
+        elif dataset_blueprint.is_dreambooth:
+            subset_klass = DreamBoothSubset
+            dataset_klass = DreamBoothDataset
+            # DreamBooth datasets support splitting training and validation datasets
+            extra_dataset_params = {"is_training_dataset": False}
+        else:
+            subset_klass = FineTuningSubset
+            dataset_klass = FineTuningDataset
+
+        subsets = [subset_klass(**asdict(subset_blueprint.params)) for subset_blueprint in dataset_blueprint.subsets]
+        dataset = dataset_klass(subsets=subsets, **asdict(dataset_blueprint.params), **extra_dataset_params)
+        val_datasets.append(dataset)
+
+    def print_info(_datasets, dataset_type: str):
+        info = ""
+        for i, dataset in enumerate(_datasets):
+            is_dreambooth = isinstance(dataset, DreamBoothDataset)
+            is_controlnet = isinstance(dataset, ControlNetDataset)
+            info += dedent(f"""\
+                [{dataset_type} {i}]
+                  batch_size: {dataset.batch_size}
+                  resolution: {(dataset.width, dataset.height)}
+                  resize_interpolation: {dataset.resize_interpolation}
+                  enable_bucket: {dataset.enable_bucket}
+            """)
+
+            if dataset.enable_bucket:
+                info += indent(dedent(f"""\
+                  min_bucket_reso: {dataset.min_bucket_reso}
+                  max_bucket_reso: {dataset.max_bucket_reso}
+                  bucket_reso_steps: {dataset.bucket_reso_steps}
+                  bucket_no_upscale: {dataset.bucket_no_upscale}
+                \n"""), "  ")
+            else:
+                info += "\n"
+
+            for j, subset in enumerate(dataset.subsets):
+                info += indent(dedent(f"""\
+                  [Subset {j} of {dataset_type} {i}]
+                    image_dir: "{subset.image_dir}"
+                    image_count: {subset.img_count}
+                    num_repeats: {subset.num_repeats}
+                    shuffle_caption: {subset.shuffle_caption}
+                    keep_tokens: {subset.keep_tokens}
+                    caption_dropout_rate: {subset.caption_dropout_rate}
+                    caption_dropout_every_n_epochs: {subset.caption_dropout_every_n_epochs}
+                    caption_tag_dropout_rate: {subset.caption_tag_dropout_rate}
+                    caption_prefix: {subset.caption_prefix}
+                    caption_suffix: {subset.caption_suffix}
+                    color_aug: {subset.color_aug}
+                    flip_aug: {subset.flip_aug}
+                    face_crop_aug_range: {subset.face_crop_aug_range}
+                    random_crop: {subset.random_crop}
+                    token_warmup_min: {subset.token_warmup_min}
+                    token_warmup_step: {subset.token_warmup_step}
+                    alpha_mask: {subset.alpha_mask}
+                    resize_interpolation: {subset.resize_interpolation}
+                    custom_attributes: {subset.custom_attributes}
+                """), "  ")
+
+                if is_dreambooth:
+                    info += indent(dedent(f"""\
+                        is_reg: {subset.is_reg}
+                        class_tokens: {subset.class_tokens}
+                        caption_extension: {subset.caption_extension}
+                    \n"""), "    ")
+                elif not is_controlnet:
+                    info += indent(dedent(f"""\
+                        metadata_file: {subset.metadata_file}
+                    \n"""), "    ")
+
+        logger.info(info)
+
+    print_info(datasets, "Dataset")
+
+    if len(val_datasets) > 0:
+        print_info(val_datasets, "Validation Dataset")
+
+    # make buckets first because it determines the length of dataset
+    # and set the same seed for all datasets
+    seed = random.randint(0, 2**31)  # actual seed is seed + epoch_no
+
+    for i, dataset in enumerate(datasets):
+        logger.info(f"[Prepare dataset {i}]")
+        dataset.make_buckets()
+        dataset.set_seed(seed)
+
+    for i, dataset in enumerate(val_datasets):
+        logger.info(f"[Prepare validation dataset {i}]")
+        dataset.make_buckets()
+        dataset.set_seed(seed)
+
+    return (
+        DatasetGroup(datasets),
+        DatasetGroup(val_datasets) if val_datasets else None
+    )
+
+
+def generate_dreambooth_subsets_config_by_subdirs(train_data_dir: Optional[str] = None, reg_data_dir: Optional[str] = None):
+    def extract_dreambooth_params(name: str) -> Tuple[int, str]:
+        tokens = name.split("_")
+        try:
+            n_repeats = int(tokens[0])
+        except ValueError:
+            logger.warning(f"ignore directory without repeats / 繰り返し回数のないディレクトリを無視します: {name}")
+            return 0, ""
+        caption_by_folder = "_".join(tokens[1:])
+        return n_repeats, caption_by_folder
+
+    def generate(base_dir: Optional[str], is_reg: bool):
+        if base_dir is None:
+            return []
+
+        base_dir: Path = Path(base_dir)
+        if not base_dir.is_dir():
+            return []
+
+        subsets_config = []
+        for subdir in base_dir.iterdir():
+            if not subdir.is_dir():
+                continue
+
+            num_repeats, class_tokens = extract_dreambooth_params(subdir.name)
+            if num_repeats < 1:
+                continue
+
+            subset_config = {"image_dir": str(subdir), "num_repeats": num_repeats, "is_reg": is_reg, "class_tokens": class_tokens}
+            subsets_config.append(subset_config)
+
+        return subsets_config
+
+    subsets_config = []
+    subsets_config += generate(train_data_dir, False)
+    subsets_config += generate(reg_data_dir, True)
+
+    return subsets_config
+
+
+def generate_controlnet_subsets_config_by_subdirs(
+    train_data_dir: Optional[str] = None, conditioning_data_dir: Optional[str] = None, caption_extension: str = ".txt"
+):
+    def generate(base_dir: Optional[str]):
+        if base_dir is None:
+            return []
+
+        base_dir: Path = Path(base_dir)
+        if not base_dir.is_dir():
+            return []
+
+        subsets_config = []
+        subset_config = {
+            "image_dir": train_data_dir,
+            "conditioning_data_dir": conditioning_data_dir,
+            "caption_extension": caption_extension,
+            "num_repeats": 1,
+        }
+        subsets_config.append(subset_config)
+
+        return subsets_config
+
+    subsets_config = []
+    subsets_config += generate(train_data_dir)
+
+    return subsets_config
+
+
+def load_user_config(file: str) -> dict:
+    file: Path = Path(file)
+    if not file.is_file():
+        raise ValueError(f"file not found / ファイルが見つかりません: {file}")
+
+    if file.name.lower().endswith(".json"):
+        try:
+            with open(file, "r") as f:
+                config = json.load(f)
+        except Exception:
+            logger.error(
+                f"Error on parsing JSON config file. Please check the format. / JSON 形式の設定ファイルの読み込みに失敗しました。文法が正しいか確認してください。: {file}"
+            )
+            raise
+    elif file.name.lower().endswith(".toml"):
+        try:
+            config = toml.load(file)
+        except Exception:
+            logger.error(
+                f"Error on parsing TOML config file. Please check the format. / TOML 形式の設定ファイルの読み込みに失敗しました。文法が正しいか確認してください。: {file}"
+            )
+            raise
+    else:
+        raise ValueError(f"not supported config file format / 対応していない設定ファイルの形式です: {file}")
+
+    return config
+
+
+# for config test
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--support_dreambooth", action="store_true")
+    parser.add_argument("--support_finetuning", action="store_true")
+    parser.add_argument("--support_controlnet", action="store_true")
+    parser.add_argument("--support_dropout", action="store_true")
+    parser.add_argument("dataset_config")
+    config_args, remain = parser.parse_known_args()
+
+    parser = argparse.ArgumentParser()
+    train_util.add_dataset_arguments(
+        parser, config_args.support_dreambooth, config_args.support_finetuning, config_args.support_dropout
+    )
+    train_util.add_training_arguments(parser, config_args.support_dreambooth)
+    argparse_namespace = parser.parse_args(remain)
+    train_util.prepare_dataset_args(argparse_namespace, config_args.support_finetuning)
+
+    logger.info("[argparse_namespace]")
+    logger.info(f"{vars(argparse_namespace)}")
+
+    user_config = load_user_config(config_args.dataset_config)
+
+    logger.info("")
+    logger.info("[user_config]")
+    logger.info(f"{user_config}")
+
+    sanitizer = ConfigSanitizer(
+        config_args.support_dreambooth, config_args.support_finetuning, config_args.support_controlnet, config_args.support_dropout
+    )
+    sanitized_user_config = sanitizer.sanitize_user_config(user_config)
+
+    logger.info("")
+    logger.info("[sanitized_user_config]")
+    logger.info(f"{sanitized_user_config}")
+
+    blueprint = BlueprintGenerator(sanitizer).generate(user_config, argparse_namespace)
+
+    logger.info("")
+    logger.info("[blueprint]")
+    logger.info(f"{blueprint}")
--- a/library/custom_offloading_utils.py
+++ b/library/custom_offloading_utils.py
@@ -0,0 +1,231 @@
+from concurrent.futures import ThreadPoolExecutor
+import time
+from typing import Optional, Union, Callable, Tuple
+import torch
+import torch.nn as nn
+
+from library.device_utils import clean_memory_on_device
+
+
+def synchronize_device(device: torch.device):
+    if device.type == "cuda":
+        torch.cuda.synchronize()
+    elif device.type == "xpu":
+        torch.xpu.synchronize()
+    elif device.type == "mps":
+        torch.mps.synchronize()
+
+
+def swap_weight_devices_cuda(device: torch.device, layer_to_cpu: nn.Module, layer_to_cuda: nn.Module):
+    assert layer_to_cpu.__class__ == layer_to_cuda.__class__
+
+    weight_swap_jobs: list[Tuple[nn.Module, nn.Module, torch.Tensor, torch.Tensor]] = []
+
+    # The zip-based matching below does not work in all cases (e.g. SD3), so we look up the corresponding modules by name instead
+    # for module_to_cpu, module_to_cuda in zip(layer_to_cpu.modules(), layer_to_cuda.modules()):
+    #     print(module_to_cpu.__class__, module_to_cuda.__class__)
+    #     if hasattr(module_to_cpu, "weight") and module_to_cpu.weight is not None:
+    #         weight_swap_jobs.append((module_to_cpu, module_to_cuda, module_to_cpu.weight.data, module_to_cuda.weight.data))
+
+    modules_to_cpu = {k: v for k, v in layer_to_cpu.named_modules()}
+    for module_to_cuda_name, module_to_cuda in layer_to_cuda.named_modules():
+        if hasattr(module_to_cuda, "weight") and module_to_cuda.weight is not None:
+            module_to_cpu = modules_to_cpu.get(module_to_cuda_name, None)
+            if module_to_cpu is not None and module_to_cpu.weight.shape == module_to_cuda.weight.shape:
+                weight_swap_jobs.append((module_to_cpu, module_to_cuda, module_to_cpu.weight.data, module_to_cuda.weight.data))
+            else:
+                if module_to_cuda.weight.data.device.type != device.type:
+                    # print(
+                    #     f"Module {module_to_cuda_name} not found in CPU model or shape mismatch, so not swapping and moving to device"
+                    # )
+                    module_to_cuda.weight.data = module_to_cuda.weight.data.to(device)
+
+    torch.cuda.current_stream().synchronize()  # without this sync, the loss value can become invalid
+
+    stream = torch.Stream(device="cuda")
+    with torch.cuda.stream(stream):
+        # cuda to cpu
+        for module_to_cpu, module_to_cuda, cuda_data_view, cpu_data_view in weight_swap_jobs:
+            cuda_data_view.record_stream(stream)
+            module_to_cpu.weight.data = cuda_data_view.data.to("cpu", non_blocking=True)
+
+        stream.synchronize()
+
+        # cpu to cuda
+        for module_to_cpu, module_to_cuda, cuda_data_view, cpu_data_view in weight_swap_jobs:
+            cuda_data_view.copy_(module_to_cuda.weight.data, non_blocking=True)
+            module_to_cuda.weight.data = cuda_data_view
+
+    stream.synchronize()
+    torch.cuda.current_stream().synchronize()  # without this sync, the loss value can become invalid
+
+
+def swap_weight_devices_no_cuda(device: torch.device, layer_to_cpu: nn.Module, layer_to_cuda: nn.Module):
+    """
+    Non-CUDA counterpart of swap_weight_devices_cuda (e.g. for XPU or MPS). Not tested.
+    """
+    assert layer_to_cpu.__class__ == layer_to_cuda.__class__
+
+    weight_swap_jobs: list[Tuple[nn.Module, nn.Module, torch.Tensor, torch.Tensor]] = []
+    for module_to_cpu, module_to_cuda in zip(layer_to_cpu.modules(), layer_to_cuda.modules()):
+        if hasattr(module_to_cpu, "weight") and module_to_cpu.weight is not None:
+            weight_swap_jobs.append((module_to_cpu, module_to_cuda, module_to_cpu.weight.data, module_to_cuda.weight.data))
+
+
+    # device to cpu
+    for module_to_cpu, module_to_cuda, cuda_data_view, cpu_data_view in weight_swap_jobs:
+        module_to_cpu.weight.data = cuda_data_view.data.to("cpu", non_blocking=True)
+
+    synchronize_device(device)
+
+    # cpu to device
+    for module_to_cpu, module_to_cuda, cuda_data_view, cpu_data_view in weight_swap_jobs:
+        cuda_data_view.copy_(module_to_cuda.weight.data, non_blocking=True)
+        module_to_cuda.weight.data = cuda_data_view
+
+    synchronize_device(device)
+
+
+def weighs_to_device(layer: nn.Module, device: torch.device):
+    for module in layer.modules():
+        if hasattr(module, "weight") and module.weight is not None:
+            module.weight.data = module.weight.data.to(device, non_blocking=True)
+
+
+class Offloader:
+    """
+    Common base class for offloading: swaps block weights between the CPU and the device on a single worker thread.
+    """
+
+    def __init__(self, num_blocks: int, blocks_to_swap: int, device: torch.device, debug: bool = False):
+        self.num_blocks = num_blocks
+        self.blocks_to_swap = blocks_to_swap
+        self.device = device
+        self.debug = debug
+
+        self.thread_pool = ThreadPoolExecutor(max_workers=1)
+        self.futures = {}
+        self.cuda_available = device.type == "cuda"
+
+    def swap_weight_devices(self, block_to_cpu: nn.Module, block_to_cuda: nn.Module):
+        if self.cuda_available:
+            swap_weight_devices_cuda(self.device, block_to_cpu, block_to_cuda)
+        else:
+            swap_weight_devices_no_cuda(self.device, block_to_cpu, block_to_cuda)
+
+    def _submit_move_blocks(self, blocks, block_idx_to_cpu, block_idx_to_cuda):
+        def move_blocks(bidx_to_cpu, block_to_cpu, bidx_to_cuda, block_to_cuda):
+            if self.debug:
+                start_time = time.perf_counter()
+                print(f"Move block {bidx_to_cpu} to CPU and block {bidx_to_cuda} to {'CUDA' if self.cuda_available else 'device'}")
+
+            self.swap_weight_devices(block_to_cpu, block_to_cuda)
+
+            if self.debug:
+                print(f"Moved blocks {bidx_to_cpu} and {bidx_to_cuda} in {time.perf_counter()-start_time:.2f}s")
+            return bidx_to_cpu, bidx_to_cuda  # , event
+
+        block_to_cpu = blocks[block_idx_to_cpu]
+        block_to_cuda = blocks[block_idx_to_cuda]
+
+        self.futures[block_idx_to_cuda] = self.thread_pool.submit(
+            move_blocks, block_idx_to_cpu, block_to_cpu, block_idx_to_cuda, block_to_cuda
+        )
+
+    def _wait_blocks_move(self, block_idx):
+        if block_idx not in self.futures:
+            return
+
+        if self.debug:
+            print(f"Wait for block {block_idx}")
+            start_time = time.perf_counter()
+
+        future = self.futures.pop(block_idx)
+        _, bidx_to_cuda = future.result()
+
+        assert block_idx == bidx_to_cuda, f"Block index mismatch: {block_idx} != {bidx_to_cuda}"
+
+        if self.debug:
+            print(f"Waited for block {block_idx}: {time.perf_counter()-start_time:.2f}s")
+
+
+# Gradient tensors
+_grad_t = Union[tuple[torch.Tensor, ...], torch.Tensor]
+
+class ModelOffloader(Offloader):
+    """
+    Supports offloading during the forward pass; backward hooks swap the blocks back during the backward pass.
+    """
+
+    def __init__(self, blocks: Union[list[nn.Module], nn.ModuleList], blocks_to_swap: int, device: torch.device, debug: bool = False):
+        super().__init__(len(blocks), blocks_to_swap, device, debug)
+
+        # register backward hooks
+        self.remove_handles = []
+        for i, block in enumerate(blocks):
+            hook = self.create_backward_hook(blocks, i)
+            if hook is not None:
+                handle = block.register_full_backward_hook(hook)
+                self.remove_handles.append(handle)
+
+    def __del__(self):
+        for handle in self.remove_handles:
+            handle.remove()
+
+    def create_backward_hook(self, blocks: Union[list[nn.Module], nn.ModuleList], block_index: int) -> Optional[Callable[[nn.Module, _grad_t, _grad_t], Union[None, _grad_t]]]:
+        # -1 for 0-based index
+        num_blocks_propagated = self.num_blocks - block_index - 1
+        swapping = num_blocks_propagated > 0 and num_blocks_propagated <= self.blocks_to_swap
+        waiting = block_index > 0 and block_index <= self.blocks_to_swap
+
+        if not swapping and not waiting:
+            return None
+
+        # create hook
+        block_idx_to_cpu = self.num_blocks - num_blocks_propagated
+        block_idx_to_cuda = self.blocks_to_swap - num_blocks_propagated
+        block_idx_to_wait = block_index - 1
+
+        def backward_hook(module: nn.Module, grad_input: _grad_t, grad_output: _grad_t):
+            if self.debug:
+                print(f"Backward hook for block {block_index}")
+
+            if swapping:
+                self._submit_move_blocks(blocks, block_idx_to_cpu, block_idx_to_cuda)
+            if waiting:
+                self._wait_blocks_move(block_idx_to_wait)
+            return None
+
+        return backward_hook
+
+    def prepare_block_devices_before_forward(self, blocks: Union[list[nn.Module], nn.ModuleList]):
+        if self.blocks_to_swap is None or self.blocks_to_swap == 0:
+            return
+
+        if self.debug:
+            print("Prepare block devices before forward")
+
+        for b in blocks[0 : self.num_blocks - self.blocks_to_swap]:
+            b.to(self.device)
+            weighs_to_device(b, self.device)  # make sure weights are on device
+
+        for b in blocks[self.num_blocks - self.blocks_to_swap :]:
+            b.to(self.device)  # move block to device first
+            weighs_to_device(b, torch.device("cpu"))  # make sure weights are on cpu
+
+        synchronize_device(self.device)
+        clean_memory_on_device(self.device)
+
+    def wait_for_block(self, block_idx: int):
+        if self.blocks_to_swap is None or self.blocks_to_swap == 0:
+            return
+        self._wait_blocks_move(block_idx)
+
+    def submit_move_blocks(self, blocks: Union[list[nn.Module], nn.ModuleList], block_idx: int):
+        if self.blocks_to_swap is None or self.blocks_to_swap == 0:
+            return
+        if block_idx >= self.blocks_to_swap:
+            return
+        block_idx_to_cpu = block_idx
+        block_idx_to_cuda = self.num_blocks - self.blocks_to_swap + block_idx
+        self._submit_move_blocks(blocks, block_idx_to_cpu, block_idx_to_cuda)
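Note on the block-swap index arithmetic in `ModelOffloader` above: the mapping between the block that just finished and the pair of blocks that get exchanged is easy to misread, so here is a minimal, torch-free sketch that reproduces only the index calculations from `submit_move_blocks` and `create_backward_hook`. The function names `forward_swap_plan` and `backward_swap_plan` are hypothetical helpers introduced for illustration; they are not part of the patch and move no tensors.

```python
from typing import List, Optional, Tuple


def forward_swap_plan(num_blocks: int, blocks_to_swap: int) -> List[Tuple[int, int]]:
    """Return (block_idx_to_cpu, block_idx_to_cuda) pairs submitted during forward."""
    plan = []
    for block_idx in range(num_blocks):
        if block_idx >= blocks_to_swap:
            continue  # mirrors the early return in submit_move_blocks
        plan.append((block_idx, num_blocks - blocks_to_swap + block_idx))
    return plan


def backward_swap_plan(
    num_blocks: int, blocks_to_swap: int
) -> List[Tuple[int, Optional[Tuple[int, int]], Optional[int]]]:
    """Return (hooked_block, (to_cpu, to_cuda) or None, block_to_wait or None).

    Entries are listed in the order the backward hooks fire (last block first).
    """
    plan = []
    for block_index in reversed(range(num_blocks)):
        propagated = num_blocks - block_index - 1
        swapping = 0 < propagated <= blocks_to_swap
        waiting = 0 < block_index <= blocks_to_swap
        if not swapping and not waiting:
            continue  # create_backward_hook returns None for these blocks
        move = (num_blocks - propagated, blocks_to_swap - propagated) if swapping else None
        wait = block_index - 1 if waiting else None
        plan.append((block_index, move, wait))
    return plan


if __name__ == "__main__":
    print(forward_swap_plan(6, 2))
    print(backward_swap_plan(6, 2))
```

With 6 blocks and 2 swapped, the forward pass sends blocks 0 and 1 to the CPU while bringing blocks 4 and 5 onto the device, and the backward pass reverses that exchange before the early blocks are needed again.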
--- a/library/custom_train_functions.py
+++ b/library/custom_train_functions.py
@@ -0,0 +1,561 @@
+from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
+import torch
+import argparse
+import random
+import re
+from torch.types import Number
+from typing import List, Optional, Union
+from .utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def prepare_scheduler_for_custom_training(noise_scheduler, device):
+    if hasattr(noise_scheduler, "all_snr"):
+        return
+
+    alphas_cumprod = noise_scheduler.alphas_cumprod
+    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
+    sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)
+    alpha = sqrt_alphas_cumprod
+    sigma = sqrt_one_minus_alphas_cumprod
+    all_snr = (alpha / sigma) ** 2
+
+    noise_scheduler.all_snr = all_snr.to(device)
+
+
+def fix_noise_scheduler_betas_for_zero_terminal_snr(noise_scheduler):
+    # fix beta: zero terminal SNR
+    logger.info("fix noise scheduler betas: https://arxiv.org/abs/2305.08891")
+
+    def enforce_zero_terminal_snr(betas):
+        # Convert betas to alphas_bar_sqrt
+        alphas = 1 - betas
+        alphas_bar = alphas.cumprod(0)
+        alphas_bar_sqrt = alphas_bar.sqrt()
+
+        # Store old values.
+        alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone()
+        alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone()
+        # Shift so last timestep is zero.
+        alphas_bar_sqrt -= alphas_bar_sqrt_T
+        # Scale so first timestep is back to old value.
+        alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T)
+
+        # Convert alphas_bar_sqrt to betas
+        alphas_bar = alphas_bar_sqrt**2
+        alphas = alphas_bar[1:] / alphas_bar[:-1]
+        alphas = torch.cat([alphas_bar[0:1], alphas])
+        betas = 1 - alphas
+        return betas
+
+    betas = noise_scheduler.betas
+    betas = enforce_zero_terminal_snr(betas)
+    alphas = 1.0 - betas
+    alphas_cumprod = torch.cumprod(alphas, dim=0)
+
+    # logger.info(f"original: {noise_scheduler.betas}")
+    # logger.info(f"fixed: {betas}")
+
+    noise_scheduler.betas = betas
+    noise_scheduler.alphas = alphas
+    noise_scheduler.alphas_cumprod = alphas_cumprod
+
+
+def apply_snr_weight(loss: torch.Tensor, timesteps: torch.IntTensor, noise_scheduler: DDPMScheduler, gamma: Number, v_prediction=False):
+    snr = torch.stack([noise_scheduler.all_snr[t] for t in timesteps])
+    min_snr_gamma = torch.minimum(snr, torch.full_like(snr, gamma))
+    if v_prediction:
+        snr_weight = torch.div(min_snr_gamma, snr + 1).float().to(loss.device)
+    else:
+        snr_weight = torch.div(min_snr_gamma, snr).float().to(loss.device)
+    loss = loss * snr_weight
+    return loss
+
+
+def scale_v_prediction_loss_like_noise_prediction(loss: torch.Tensor, timesteps: torch.IntTensor, noise_scheduler: DDPMScheduler):
+    scale = get_snr_scale(timesteps, noise_scheduler)
+    loss = loss * scale
+    return loss
+
+
+def get_snr_scale(timesteps: torch.IntTensor, noise_scheduler: DDPMScheduler):
+    snr_t = torch.stack([noise_scheduler.all_snr[t] for t in timesteps])  # batch_size
+    snr_t = torch.minimum(snr_t, torch.ones_like(snr_t) * 1000)  # if timestep is 0, snr_t is inf, so limit it to 1000
+    scale = snr_t / (snr_t + 1)
+    # # show debug info
+    # logger.info(f"timesteps: {timesteps}, snr_t: {snr_t}, scale: {scale}")
+    return scale
+
+
+def add_v_prediction_like_loss(loss: torch.Tensor, timesteps: torch.IntTensor, noise_scheduler: DDPMScheduler, v_pred_like_loss: torch.Tensor):
+    scale = get_snr_scale(timesteps, noise_scheduler)
+    # logger.info(f"add v-prediction like loss: {v_pred_like_loss}, scale: {scale}, loss: {loss}, time: {timesteps}")
+    loss = loss + loss / scale * v_pred_like_loss
+    return loss
+
+
+def apply_debiased_estimation(loss: torch.Tensor, timesteps: torch.IntTensor, noise_scheduler: DDPMScheduler, v_prediction=False):
+    snr_t = torch.stack([noise_scheduler.all_snr[t] for t in timesteps])  # batch_size
+    snr_t = torch.minimum(snr_t, torch.ones_like(snr_t) * 1000)  # if timestep is 0, snr_t is inf, so limit it to 1000
+    if v_prediction:
+        weight = 1 / (snr_t + 1)
+    else:
+        weight = 1 / torch.sqrt(snr_t)
+    loss = weight * loss
+    return loss
+
+
+# TODO: this logic is split between this module and train_util; consolidate it in one place
+
+
+def add_custom_train_arguments(parser: argparse.ArgumentParser, support_weighted_captions: bool = True):
+    parser.add_argument(
+        "--min_snr_gamma",
+        type=float,
+        default=None,
+        help="gamma for reducing the weight of high-loss timesteps. Lower values have a stronger effect; the paper recommends 5.",
+    )
+    parser.add_argument(
+        "--scale_v_pred_loss_like_noise_pred",
+        action="store_true",
+        help="scale v-prediction loss like noise prediction loss",
+    )
+    parser.add_argument(
+        "--v_pred_like_loss",
+        type=float,
+        default=None,
+        help="add v-prediction like loss multiplied by this value",
+    )
+    parser.add_argument(
+        "--debiased_estimation_loss",
+        action="store_true",
+        help="enable debiased estimation loss",
+    )
+    if support_weighted_captions:
+        parser.add_argument(
+            "--weighted_captions",
+            action="store_true",
+            default=False,
+            help="Enable weighted captions in the standard style, e.g. [token], (token), (token:1.3). Do not put commas inside the brackets, or shuffle/dropout may break the weighting.",
+        )
+
+
+re_attention = re.compile(
+    r"""
+\\\(|
+\\\)|
+\\\[|
+\\]|
+\\\\|
+\\|
+\(|
+\[|
+:([+-]?[.\d]+)\)|
+\)|
+]|
+[^\\()\[\]:]+|
+:
+""",
+    re.X,
+)
+
+
+def parse_prompt_attention(text):
+    r"""
+    Parses a string with attention tokens and returns a list of pairs: text and its associated weight.
+    Accepted tokens are:
+      (abc) - increases attention to abc by a multiplier of 1.1
+      (abc:3.12) - increases attention to abc by a multiplier of 3.12
+      [abc] - decreases attention to abc by a factor of 1.1 (multiplies the weight by 1 / 1.1)
+      \( - literal character '('
+      \[ - literal character '['
+      \) - literal character ')'
+      \] - literal character ']'
+      \\ - literal character '\'
+      anything else - just text
+    >>> parse_prompt_attention('normal text')
+    [['normal text', 1.0]]
+    >>> parse_prompt_attention('an (important) word')
+    [['an ', 1.0], ['important', 1.1], [' word', 1.0]]
+    >>> parse_prompt_attention('(unbalanced')
+    [['unbalanced', 1.1]]
+    >>> parse_prompt_attention('\(literal\]')
+    [['(literal]', 1.0]]
+    >>> parse_prompt_attention('(unnecessary)(parens)')
+    [['unnecessaryparens', 1.1]]
+    >>> parse_prompt_attention('a (((house:1.3)) [on] a (hill:0.5), sun, (((sky))).')
+    [['a ', 1.0],
+     ['house', 1.5730000000000004],
+     [' ', 1.1],
+     ['on', 1.0],
+     [' a ', 1.1],
+     ['hill', 0.55],
+     [', sun, ', 1.1],
+     ['sky', 1.4641000000000006],
+     ['.', 1.1]]
+    """
+
+    res = []
+    round_brackets = []
+    square_brackets = []
+
+    round_bracket_multiplier = 1.1
+    square_bracket_multiplier = 1 / 1.1
+
+    def multiply_range(start_position, multiplier):
+        for p in range(start_position, len(res)):
+            res[p][1] *= multiplier
+
+    for m in re_attention.finditer(text):
+        text = m.group(0)
+        weight = m.group(1)
+
+        if text.startswith("\\"):
+            res.append([text[1:], 1.0])
+        elif text == "(":
+            round_brackets.append(len(res))
+        elif text == "[":
+            square_brackets.append(len(res))
+        elif weight is not None and len(round_brackets) > 0:
+            multiply_range(round_brackets.pop(), float(weight))
+        elif text == ")" and len(round_brackets) > 0:
+            multiply_range(round_brackets.pop(), round_bracket_multiplier)
+        elif text == "]" and len(square_brackets) > 0:
+            multiply_range(square_brackets.pop(), square_bracket_multiplier)
+        else:
+            res.append([text, 1.0])
+
+    for pos in round_brackets:
+        multiply_range(pos, round_bracket_multiplier)
+
+    for pos in square_brackets:
+        multiply_range(pos, square_bracket_multiplier)
+
+    if len(res) == 0:
+        res = [["", 1.0]]
+
+    # merge runs of identical weights
+    i = 0
+    while i + 1 < len(res):
+        if res[i][1] == res[i + 1][1]:
+            res[i][0] += res[i + 1][0]
+            res.pop(i + 1)
+        else:
+            i += 1
+
+    return res
+
+
+def get_prompts_with_weights(tokenizer, prompt: List[str], max_length: int):
+    r"""
+    Tokenize a list of prompts and return its tokens with weights of each token.
+
+    No padding, starting or ending token is included.
+    """
+    tokens = []
+    weights = []
+    truncated = False
+    for text in prompt:
+        texts_and_weights = parse_prompt_attention(text)
+        text_token = []
+        text_weight = []
+        for word, weight in texts_and_weights:
+            # tokenize and discard the starting and the ending token
+            token = tokenizer(word).input_ids[1:-1]
+            text_token += token
+            # copy the weight by length of token
+            text_weight += [weight] * len(token)
+            # stop if the text is too long (longer than truncation limit)
+            if len(text_token) > max_length:
+                truncated = True
+                break
+        # truncate
+        if len(text_token) > max_length:
+            truncated = True
+            text_token = text_token[:max_length]
+            text_weight = text_weight[:max_length]
+        tokens.append(text_token)
+        weights.append(text_weight)
+    if truncated:
+        logger.warning("Prompt was truncated. Try to shorten the prompt or increase max_embeddings_multiples")
+    return tokens, weights
+
+
+def pad_tokens_and_weights(tokens, weights, max_length, bos, eos, no_boseos_middle=True, chunk_length=77):
+    r"""
+    Pad the tokens (with starting and ending tokens) and weights (with 1.0) to max_length.
+    """
+    max_embeddings_multiples = (max_length - 2) // (chunk_length - 2)
+    weights_length = max_length if no_boseos_middle else max_embeddings_multiples * chunk_length
+    for i in range(len(tokens)):
+        tokens[i] = [bos] + tokens[i] + [eos] * (max_length - 1 - len(tokens[i]))
+        if no_boseos_middle:
+            weights[i] = [1.0] + weights[i] + [1.0] * (max_length - 1 - len(weights[i]))
+        else:
+            w = []
+            if len(weights[i]) == 0:
+                w = [1.0] * weights_length
+            else:
+                for j in range(max_embeddings_multiples):
+                    w.append(1.0)  # weight for starting token in this chunk
+                    w += weights[i][j * (chunk_length - 2) : min(len(weights[i]), (j + 1) * (chunk_length - 2))]
+                    w.append(1.0)  # weight for ending token in this chunk
+                w += [1.0] * (weights_length - len(w))
+            weights[i] = w[:]
+
+    return tokens, weights
+
+
+def get_unweighted_text_embeddings(
+    tokenizer,
+    text_encoder,
+    text_input: torch.Tensor,
+    chunk_length: int,
+    clip_skip: int,
+    eos: int,
+    pad: int,
+    no_boseos_middle: Optional[bool] = True,
+):
+    """
+    When the length of tokens is a multiple of the capacity of the text encoder,
+    it should be split into chunks and sent to the text encoder individually.
+    """
+    max_embeddings_multiples = (text_input.shape[1] - 2) // (chunk_length - 2)
+    if max_embeddings_multiples > 1:
+        text_embeddings = []
+        for i in range(max_embeddings_multiples):
+            # extract the i-th chunk
+            text_input_chunk = text_input[:, i * (chunk_length - 2) : (i + 1) * (chunk_length - 2) + 2].clone()
+
+            # cover the head and the tail by the starting and the ending tokens
+            text_input_chunk[:, 0] = text_input[0, 0]
+            if pad == eos:  # v1
+                text_input_chunk[:, -1] = text_input[0, -1]
+            else:  # v2
+                for j in range(len(text_input_chunk)):
+                    if text_input_chunk[j, -1] != eos and text_input_chunk[j, -1] != pad:  # the chunk ends with an ordinary token
+                        text_input_chunk[j, -1] = eos
+                    if text_input_chunk[j, 1] == pad:  # only BOS, followed by PAD
+                        text_input_chunk[j, 1] = eos
+
+            if clip_skip is None or clip_skip == 1:
+                text_embedding = text_encoder(text_input_chunk)[0]
+            else:
+                enc_out = text_encoder(text_input_chunk, output_hidden_states=True, return_dict=True)
+                text_embedding = enc_out["hidden_states"][-clip_skip]
+                text_embedding = text_encoder.text_model.final_layer_norm(text_embedding)
+
+            if no_boseos_middle:
+                if i == 0:
+                    # discard the ending token
+                    text_embedding = text_embedding[:, :-1]
+                elif i == max_embeddings_multiples - 1:
+                    # discard the starting token
+                    text_embedding = text_embedding[:, 1:]
+                else:
+                    # discard both starting and ending tokens
+                    text_embedding = text_embedding[:, 1:-1]
+
+            text_embeddings.append(text_embedding)
+        text_embeddings = torch.concat(text_embeddings, axis=1)
+    else:
+        if clip_skip is None or clip_skip == 1:
+            text_embeddings = text_encoder(text_input)[0]
+        else:
+            enc_out = text_encoder(text_input, output_hidden_states=True, return_dict=True)
+            text_embeddings = enc_out["hidden_states"][-clip_skip]
+            text_embeddings = text_encoder.text_model.final_layer_norm(text_embeddings)
+    return text_embeddings
+
+
+def get_weighted_text_embeddings(
+    tokenizer,
+    text_encoder,
+    prompt: Union[str, List[str]],
+    device,
+    max_embeddings_multiples: Optional[int] = 3,
+    no_boseos_middle: Optional[bool] = False,
+    clip_skip=None,
+):
+    r"""
+    Prompts can be assigned with local weights using brackets. For example,
+    prompt 'A (very beautiful) masterpiece' highlights the words 'very beautiful',
+    and the embedding tokens corresponding to the words get multiplied by a constant, 1.1.
+
+    Also, to regularize the embedding, the weighted embedding is scaled to preserve the original mean.
+
+    Args:
+        prompt (`str` or `List[str]`):
+            The prompt or prompts to guide the image generation.
+        max_embeddings_multiples (`int`, *optional*, defaults to `3`):
+            The max multiple length of prompt embeddings compared to the max output length of text encoder.
+        no_boseos_middle (`bool`, *optional*, defaults to `False`):
+            If the token length spans multiple chunks of the text encoder's capacity, whether to drop the
+            starting and ending tokens at the chunk boundaries in the middle.
+        clip_skip (`int`, *optional*, defaults to `None`):
+            If greater than 1, take the `clip_skip`-th hidden state from the end of the text encoder
+            and pass it through the final layer norm.
+            If `None` or 1, use the encoder's last layer output.
+    """
+    max_length = (tokenizer.model_max_length - 2) * max_embeddings_multiples + 2
+    if isinstance(prompt, str):
+        prompt = [prompt]
+
+    prompt_tokens, prompt_weights = get_prompts_with_weights(tokenizer, prompt, max_length - 2)
+
+    # round up the longest length of tokens to a multiple of (model_max_length - 2)
+    max_length = max([len(token) for token in prompt_tokens])
+
+    max_embeddings_multiples = min(
+        max_embeddings_multiples,
+        (max_length - 1) // (tokenizer.model_max_length - 2) + 1,
+    )
+    max_embeddings_multiples = max(1, max_embeddings_multiples)
+    max_length = (tokenizer.model_max_length - 2) * max_embeddings_multiples + 2
+
+    # pad the length of tokens and weights
+    bos = tokenizer.bos_token_id
+    eos = tokenizer.eos_token_id
+    pad = tokenizer.pad_token_id
+    prompt_tokens, prompt_weights = pad_tokens_and_weights(
+        prompt_tokens,
+        prompt_weights,
+        max_length,
+        bos,
+        eos,
+        no_boseos_middle=no_boseos_middle,
+        chunk_length=tokenizer.model_max_length,
+    )
+    prompt_tokens = torch.tensor(prompt_tokens, dtype=torch.long, device=device)
+
+    # get the embeddings
+    text_embeddings = get_unweighted_text_embeddings(
+        tokenizer,
+        text_encoder,
+        prompt_tokens,
+        tokenizer.model_max_length,
+        clip_skip,
+        eos,
+        pad,
+        no_boseos_middle=no_boseos_middle,
+    )
+    prompt_weights = torch.tensor(prompt_weights, dtype=text_embeddings.dtype, device=device)
+
+    # assign weights to the prompts and normalize in the sense of mean
+    previous_mean = text_embeddings.float().mean(axis=[-2, -1]).to(text_embeddings.dtype)
+    text_embeddings = text_embeddings * prompt_weights.unsqueeze(-1)
+    current_mean = text_embeddings.float().mean(axis=[-2, -1]).to(text_embeddings.dtype)
+    text_embeddings = text_embeddings * (previous_mean / current_mean).unsqueeze(-1).unsqueeze(-1)
+
+    return text_embeddings
+
+
+# https://wandb.ai/johnowhitaker/multires_noise/reports/Multi-Resolution-Noise-for-Diffusion-Model-Training--VmlldzozNjYyOTU2
+def pyramid_noise_like(noise, device, iterations=6, discount=0.4) -> torch.FloatTensor:
+    b, c, w, h = noise.shape  # EDIT: w and h get over-written, rename for a different variant!
+    u = torch.nn.Upsample(size=(w, h), mode="bilinear").to(device)
+    for i in range(iterations):
+        r = random.random() * 2 + 2  # Rather than always going 2x,
+        wn, hn = max(1, int(w / (r**i))), max(1, int(h / (r**i)))
+        noise += u(torch.randn(b, c, wn, hn).to(device)) * discount**i
+        if wn == 1 or hn == 1:
+            break  # Lowest resolution is 1x1
+    return noise / noise.std()  # Scaled back to roughly unit variance
+
+
+# https://www.crosslabs.org//blog/diffusion-with-offset-noise
+def apply_noise_offset(latents, noise, noise_offset, adaptive_noise_scale) -> torch.FloatTensor:
+    if noise_offset is None:
+        return noise
+    if adaptive_noise_scale is not None:
+        # latent shape: (batch_size, channels, height, width)
+        # abs mean value for each channel
+        latent_mean = torch.abs(latents.mean(dim=(2, 3), keepdim=True))
+
+        # multiply adaptive noise scale to the mean value and add it to the noise offset
+        noise_offset = noise_offset + adaptive_noise_scale * latent_mean
+        noise_offset = torch.clamp(noise_offset, 0.0, None)  # in case the adaptive noise scale is negative
+
+    noise = noise + noise_offset * torch.randn((latents.shape[0], latents.shape[1], 1, 1), device=latents.device)
+    return noise
+
+
+def apply_masked_loss(loss, batch) -> torch.FloatTensor:
+    if "conditioning_images" in batch:
+        # conditioning image is -1 to 1. we need to convert it to 0 to 1
+        mask_image = batch["conditioning_images"].to(dtype=loss.dtype)[:, 0].unsqueeze(1)  # use R channel
+        mask_image = mask_image / 2 + 0.5
+        # print(f"conditioning_image: {mask_image.shape}")
+    elif "alpha_masks" in batch and batch["alpha_masks"] is not None:
+        # alpha mask is 0 to 1
+        mask_image = batch["alpha_masks"].to(dtype=loss.dtype).unsqueeze(1) # add channel dimension
+        # print(f"mask_image: {mask_image.shape}, {mask_image.mean()}")
+    else:
+        return loss
+
+    # resize to the same size as the loss
+    mask_image = torch.nn.functional.interpolate(mask_image, size=loss.shape[2:], mode="area")
+    loss = loss * mask_image
+    return loss
+
+
+"""
+##########################################
+# Perlin Noise
+def rand_perlin_2d(device, shape, res, fade=lambda t: 6 * t**5 - 15 * t**4 + 10 * t**3):
+    delta = (res[0] / shape[0], res[1] / shape[1])
+    d = (shape[0] // res[0], shape[1] // res[1])
+
+    grid = (
+        torch.stack(
+            torch.meshgrid(torch.arange(0, res[0], delta[0], device=device), torch.arange(0, res[1], delta[1], device=device)),
+            dim=-1,
+        )
+        % 1
+    )
+    angles = 2 * torch.pi * torch.rand(res[0] + 1, res[1] + 1, device=device)
+    gradients = torch.stack((torch.cos(angles), torch.sin(angles)), dim=-1)
+
+    tile_grads = (
+        lambda slice1, slice2: gradients[slice1[0] : slice1[1], slice2[0] : slice2[1]]
+        .repeat_interleave(d[0], 0)
+        .repeat_interleave(d[1], 1)
+    )
+    dot = lambda grad, shift: (
+        torch.stack((grid[: shape[0], : shape[1], 0] + shift[0], grid[: shape[0], : shape[1], 1] + shift[1]), dim=-1)
+        * grad[: shape[0], : shape[1]]
+    ).sum(dim=-1)
+
+    n00 = dot(tile_grads([0, -1], [0, -1]), [0, 0])
+    n10 = dot(tile_grads([1, None], [0, -1]), [-1, 0])
+    n01 = dot(tile_grads([0, -1], [1, None]), [0, -1])
+    n11 = dot(tile_grads([1, None], [1, None]), [-1, -1])
+    t = fade(grid[: shape[0], : shape[1]])
+    return 1.414 * torch.lerp(torch.lerp(n00, n10, t[..., 0]), torch.lerp(n01, n11, t[..., 0]), t[..., 1])
+
+
+def rand_perlin_2d_octaves(device, shape, res, octaves=1, persistence=0.5):
+    noise = torch.zeros(shape, device=device)
+    frequency = 1
+    amplitude = 1
+    for _ in range(octaves):
+        noise += amplitude * rand_perlin_2d(device, shape, (frequency * res[0], frequency * res[1]))
+        frequency *= 2
+        amplitude *= persistence
+    return noise
+
+
+def perlin_noise(noise, device, octaves):
+    _, c, w, h = noise.shape
+    perlin = lambda: rand_perlin_2d_octaves(device, (w, h), (4, 4), octaves)
+    noise_perlin = []
+    for _ in range(c):
+        noise_perlin.append(perlin())
+    noise_perlin = torch.stack(noise_perlin).unsqueeze(0)   # (1, c, w, h)
+    noise += noise_perlin # broadcast for each batch
+    return noise / noise.std()  # Scaled back to roughly unit variance
+"""
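Note on `apply_snr_weight` above: the per-timestep weight it applies can be checked by hand. The following is a minimal scalar sketch of the same formula, assuming the SNR value has already been looked up for the timestep; `min_snr_weight` is a hypothetical helper name used only for illustration and does not touch the scheduler or any tensors.

```python
def min_snr_weight(snr: float, gamma: float, v_prediction: bool = False) -> float:
    """Scalar version of the weight computed in apply_snr_weight.

    The SNR is clipped at gamma, then divided by snr (noise prediction)
    or by snr + 1 (v-prediction), matching the two branches in the patch.
    """
    clipped = min(snr, gamma)
    if v_prediction:
        return clipped / (snr + 1.0)
    return clipped / snr


if __name__ == "__main__":
    # High-SNR (low-noise) timesteps are down-weighted; low-SNR ones keep weight 1.
    for snr in (0.5, 2.0, 5.0, 20.0, 100.0):
        print(snr, min_snr_weight(snr, gamma=5.0), min_snr_weight(snr, gamma=5.0, v_prediction=True))
```

For noise prediction the weight stays at 1 while the SNR is below `gamma` and decays as `gamma / snr` above it, which is why a lower `gamma` has a stronger effect.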
--- a/library/deepspeed_utils.py
+++ b/library/deepspeed_utils.py
@@ -0,0 +1,180 @@
+import os
+import argparse
+import torch
+from accelerate import DeepSpeedPlugin, Accelerator
+
+from .utils import setup_logging
+
+from .device_utils import get_preferred_device
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def add_deepspeed_arguments(parser: argparse.ArgumentParser):
+    # DeepSpeed Arguments. https://huggingface.co/docs/accelerate/usage_guides/deepspeed
+    parser.add_argument("--deepspeed", action="store_true", help="enable deepspeed training")
+    parser.add_argument("--zero_stage", type=int, default=2, choices=[0, 1, 2, 3], help="Possible options are 0,1,2,3.")
+    parser.add_argument(
+        "--offload_optimizer_device",
+        type=str,
+        default=None,
+        choices=[None, "cpu", "nvme"],
+        help="Possible options are none|cpu|nvme. Only applicable with ZeRO Stages 2 and 3.",
+    )
+    parser.add_argument(
+        "--offload_optimizer_nvme_path",
+        type=str,
+        default=None,
+        help="Possible options are /nvme|/local_nvme. Only applicable with ZeRO Stage 3.",
+    )
+    parser.add_argument(
+        "--offload_param_device",
+        type=str,
+        default=None,
+        choices=[None, "cpu", "nvme"],
+        help="Possible options are none|cpu|nvme. Only applicable with ZeRO Stage 3.",
+    )
+    parser.add_argument(
+        "--offload_param_nvme_path",
+        type=str,
+        default=None,
+        help="Possible options are /nvme|/local_nvme. Only applicable with ZeRO Stage 3.",
+    )
+    parser.add_argument(
+        "--zero3_init_flag",
+        action="store_true",
+        help="Flag to indicate whether to enable `deepspeed.zero.Init` for constructing massive models. "
+        "Only applicable with ZeRO Stage-3.",
+    )
+    parser.add_argument(
+        "--zero3_save_16bit_model",
+        action="store_true",
+        help="Flag to indicate whether to save 16-bit model. Only applicable with ZeRO Stage-3.",
+    )
+    parser.add_argument(
+        "--fp16_master_weights_and_gradients",
+        action="store_true",
+        help="fp16_master_and_gradients requires optimizer to support keeping fp16 master and gradients while keeping the optimizer states in fp32.",
+    )
+
+
+def prepare_deepspeed_args(args: argparse.Namespace):
+    if not args.deepspeed:
+        return
+
+    # To avoid RuntimeError: DataLoader worker exited unexpectedly with exit code 1.
+    args.max_data_loader_n_workers = 1
+
+
+def prepare_deepspeed_plugin(args: argparse.Namespace):
+    if not args.deepspeed:
+        return None
+
+    try:
+        import deepspeed
+    except ImportError as e:
+        logger.error(
+            "deepspeed is not installed. please install deepspeed in your environment with following command. DS_BUILD_OPS=0 pip install deepspeed"
+        )
+        exit(1)
+
+    deepspeed_plugin = DeepSpeedPlugin(
+        zero_stage=args.zero_stage,
+        gradient_accumulation_steps=args.gradient_accumulation_steps,
+        gradient_clipping=args.max_grad_norm,
+        offload_optimizer_device=args.offload_optimizer_device,
+        offload_optimizer_nvme_path=args.offload_optimizer_nvme_path,
+        offload_param_device=args.offload_param_device,
+        offload_param_nvme_path=args.offload_param_nvme_path,
+        zero3_init_flag=args.zero3_init_flag,
+        zero3_save_16bit_model=args.zero3_save_16bit_model,
+    )
+    deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"] = args.train_batch_size
+    deepspeed_plugin.deepspeed_config["train_batch_size"] = (
+        args.train_batch_size * args.gradient_accumulation_steps * int(os.environ["WORLD_SIZE"])
+    )
+    
+    deepspeed_plugin.set_mixed_precision(args.mixed_precision)
+    if args.mixed_precision.lower() == "fp16":
+        deepspeed_plugin.deepspeed_config["fp16"]["initial_scale_power"] = 0  # preventing overflow.
+    if args.full_fp16 or args.fp16_master_weights_and_gradients:
+        if args.offload_optimizer_device == "cpu" and args.zero_stage == 2:
+            deepspeed_plugin.deepspeed_config["fp16"]["fp16_master_weights_and_grads"] = True
+            logger.info("[DeepSpeed] full fp16 enabled.")
+        else:
+            logger.info(
+                "[DeepSpeed] full fp16, fp16_master_weights_and_grads currently only supported using ZeRO-Offload with DeepSpeedCPUAdam on ZeRO-2 stage."
+            )
+
+    if args.offload_optimizer_device is not None:
+        logger.info("[DeepSpeed] start to manually build cpu_adam.")
+        deepspeed.ops.op_builder.CPUAdamBuilder().load()
+        logger.info("[DeepSpeed] building cpu_adam done.")
+
+    return deepspeed_plugin
+
+
+# Accelerate library does not support multiple models for deepspeed. So, we need to wrap multiple models into a single model.
+def prepare_deepspeed_model(args: argparse.Namespace, **models):
+    # remove None from models
+    models = {k: v for k, v in models.items() if v is not None}
+
+    class DeepSpeedWrapper(torch.nn.Module):
+        def __init__(self, **kw_models) -> None:
+            super().__init__()
+            
+            self.models = torch.nn.ModuleDict()
+            
+            wrap_model_forward_with_torch_autocast = args.mixed_precision != "no"
+
+            for key, model in kw_models.items():
+                if isinstance(model, list):
+                    model = torch.nn.ModuleList(model)
+                                            
+                if wrap_model_forward_with_torch_autocast:
+                    model = self.__wrap_model_with_torch_autocast(model)  
+                
+                assert isinstance(
+                    model, torch.nn.Module
+                ), f"model must be an instance of torch.nn.Module, but got {key} is {type(model)}"
+
+                self.models.update(torch.nn.ModuleDict({key: model}))
+
+        def __wrap_model_with_torch_autocast(self, model):
+            if isinstance(model, torch.nn.ModuleList):
+                model = torch.nn.ModuleList([self.__wrap_model_forward_with_torch_autocast(m) for m in model])
+            else:
+                model = self.__wrap_model_forward_with_torch_autocast(model)
+            return model
+
+        def __wrap_model_forward_with_torch_autocast(self, model):
+            
+            assert hasattr(model, "forward"), "model must have a forward method."
+
+            forward_fn = model.forward
+
+            def forward(*args, **kwargs):
+                try:
+                    device_type = model.device.type
+                except AttributeError:
+                    logger.warning(
+                            "[DeepSpeed] model.device is not available. Using get_preferred_device() "
+                            "to determine the device_type for torch.autocast()."
+                    )                    
+                    device_type = get_preferred_device().type
+
+                with torch.autocast(device_type = device_type):
+                    return forward_fn(*args, **kwargs)
+
+            model.forward = forward
+            return model
+        
+        def get_models(self):
+            return self.models
+        
+
+    ds_model = DeepSpeedWrapper(**models)
+    return ds_model
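Note on `__wrap_model_forward_with_torch_autocast` above: the wrapper replaces `model.forward` on the instance with a closure that enters a context manager before delegating to the original method. The sketch below illustrates that pattern with the standard library only; `TinyModel`, `wrap_forward_with_context`, and `recording_context` are hypothetical names, and the recording context manager merely stands in for `torch.autocast`.

```python
import contextlib


class TinyModel:
    """Hypothetical stand-in for a torch.nn.Module; only forward() matters here."""

    def forward(self, x):
        return x * 2


def wrap_forward_with_context(model, context_factory):
    """Patch model.forward so every call runs inside context_factory()."""
    forward_fn = model.forward  # keep a reference to the original bound method

    def forward(*args, **kwargs):
        with context_factory():
            return forward_fn(*args, **kwargs)

    model.forward = forward  # the instance attribute shadows the class method
    return model


events = []


@contextlib.contextmanager
def recording_context():
    # Stand-in for torch.autocast: records when the context is entered and left.
    events.append("enter")
    try:
        yield
    finally:
        events.append("exit")


model = wrap_forward_with_context(TinyModel(), recording_context)
result = model.forward(21)
print(result, events)
```

Capturing `forward_fn` before reassigning `model.forward` is what keeps the closure from calling itself recursively.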
--- a/library/device_utils.py
+++ b/library/device_utils.py
@@ -0,0 +1,89 @@
+import functools
+import gc
+
+import torch
+try:
+    # intel gpu support for pytorch older than 2.5
+    # ipex is not needed after pytorch 2.5
+    import intel_extension_for_pytorch as ipex  # noqa
+except Exception:
+    pass
+
+
+try:
+    HAS_CUDA = torch.cuda.is_available()
+except Exception:
+    HAS_CUDA = False
+
+try:
+    HAS_MPS = torch.backends.mps.is_available()
+except Exception:
+    HAS_MPS = False
+
+try:
+    HAS_XPU = torch.xpu.is_available()
+except Exception:
+    HAS_XPU = False
+
+
+def clean_memory():
+    gc.collect()
+    if HAS_CUDA:
+        torch.cuda.empty_cache()
+    if HAS_XPU:
+        torch.xpu.empty_cache()
+    if HAS_MPS:
+        torch.mps.empty_cache()
+
+
+def clean_memory_on_device(device: torch.device):
+    r"""
+    Clean memory on the specified device, will be called from training scripts.
+    """
+    gc.collect()
+
+    # device may be "cuda" or "cuda:0", so we check device.type instead of comparing the device itself
+    if device.type == "cuda":
+        torch.cuda.empty_cache()
+    if device.type == "xpu":
+        torch.xpu.empty_cache()
+    if device.type == "mps":
+        torch.mps.empty_cache()
+
+
+@functools.lru_cache(maxsize=None)
+def get_preferred_device() -> torch.device:
+    r"""
+    Do not call this function from training scripts. Use accelerator.device instead.
+    """
+    if HAS_CUDA:
+        device = torch.device("cuda")
+    elif HAS_XPU:
+        device = torch.device("xpu")
+    elif HAS_MPS:
+        device = torch.device("mps")
+    else:
+        device = torch.device("cpu")
+    print(f"get_preferred_device() -> {device}")
+    return device
+
+
+def init_ipex():
+    """
+    Apply IPEX to CUDA hijacks using `library.ipex.ipex_init`.
+
+    This function should run right after importing torch and before doing anything else.
+
+    If xpu is not available, this function does nothing.
+    """
+    try:
+        if HAS_XPU:
+            from library.ipex import ipex_init
+
+            is_initialized, error_message = ipex_init()
+            if not is_initialized:
+                print("failed to initialize ipex:", error_message)
+        else:
+            return
+    except Exception as e:
+        print("failed to initialize ipex:", e)
--- a/library/flux_models.py
+++ b/library/flux_models.py
--- a/library/flux_train_utils.py
+++ b/library/flux_train_utils.py
@@ -0,0 +1,682 @@
+import argparse
+import math
+import os
+import numpy as np
+import toml
+import json
+import time
+from typing import Callable, Dict, List, Optional, Tuple, Union
+
+import torch
+from accelerate import Accelerator, PartialState
+from transformers import CLIPTextModel
+from tqdm import tqdm
+from PIL import Image
+from safetensors.torch import save_file
+
+from library import flux_models, flux_utils, strategy_base, train_util
+from library.device_utils import init_ipex, clean_memory_on_device
+
+init_ipex()
+
+from .utils import setup_logging, mem_eff_save_file
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+# region sample images
+
+
+def sample_images(
+    accelerator: Accelerator,
+    args: argparse.Namespace,
+    epoch,
+    steps,
+    flux,
+    ae,
+    text_encoders,
+    sample_prompts_te_outputs,
+    prompt_replacement=None,
+    controlnet=None,
+):
+    if steps == 0:
+        if not args.sample_at_first:
+            return
+    else:
+        if args.sample_every_n_steps is None and args.sample_every_n_epochs is None:
+            return
+        if args.sample_every_n_epochs is not None:
+            # sample_every_n_steps is ignored
+            if epoch is None or epoch % args.sample_every_n_epochs != 0:
+                return
+        else:
+            if steps % args.sample_every_n_steps != 0 or epoch is not None:  # steps is not divisible or end of epoch
+                return
+
+    logger.info("")
+    logger.info(f"generating sample images at step / サンプル画像生成 ステップ: {steps}")
+    if not os.path.isfile(args.sample_prompts) and sample_prompts_te_outputs is None:
+        logger.error(f"No prompt file / プロンプトファイルがありません: {args.sample_prompts}")
+        return
+
+    distributed_state = PartialState()  # for multi gpu distributed inference. this is a singleton, so it's safe to use it here
+
+    # unwrap unet and text_encoder(s)
+    flux = accelerator.unwrap_model(flux)
+    if text_encoders is not None:
+        text_encoders = [(accelerator.unwrap_model(te) if te is not None else None) for te in text_encoders]
+    if controlnet is not None:
+        controlnet = accelerator.unwrap_model(controlnet)
+    # print([(te.parameters().__next__().device if te is not None else None) for te in text_encoders])
+
+    prompts = train_util.load_prompts(args.sample_prompts)
+
+    save_dir = args.output_dir + "/sample"
+    os.makedirs(save_dir, exist_ok=True)
+
+    # save random state to restore later
+    rng_state = torch.get_rng_state()
+    cuda_rng_state = None
+    try:
+        cuda_rng_state = torch.cuda.get_rng_state() if torch.cuda.is_available() else None
+    except Exception:
+        pass
+
+    if distributed_state.num_processes <= 1:
+        # If only one device is available, just use the original prompt list. We don't need to care about the distribution of prompts.
+        with torch.no_grad(), accelerator.autocast():
+            for prompt_dict in prompts:
+                sample_image_inference(
+                    accelerator,
+                    args,
+                    flux,
+                    text_encoders,
+                    ae,
+                    save_dir,
+                    prompt_dict,
+                    epoch,
+                    steps,
+                    sample_prompts_te_outputs,
+                    prompt_replacement,
+                    controlnet,
+                )
+    else:
+        # Creating list with N elements, where each element is a list of prompt_dicts, and N is the number of processes available (number of devices available)
+        # prompt_dicts are assigned to lists based on order of processes, to attempt to time the image creation time to match enum order. Probably only works when steps and sampler are identical.
+        per_process_prompts = []  # list of lists
+        for i in range(distributed_state.num_processes):
+            per_process_prompts.append(prompts[i :: distributed_state.num_processes])
+
+        with torch.no_grad():
+            with distributed_state.split_between_processes(per_process_prompts) as prompt_dict_lists:
+                for prompt_dict in prompt_dict_lists[0]:
+                    sample_image_inference(
+                        accelerator,
+                        args,
+                        flux,
+                        text_encoders,
+                        ae,
+                        save_dir,
+                        prompt_dict,
+                        epoch,
+                        steps,
+                        sample_prompts_te_outputs,
+                        prompt_replacement,
+                        controlnet,
+                    )
+
+    torch.set_rng_state(rng_state)
+    if cuda_rng_state is not None:
+        torch.cuda.set_rng_state(cuda_rng_state)
+
+    clean_memory_on_device(accelerator.device)
+
+
+def sample_image_inference(
+    accelerator: Accelerator,
+    args: argparse.Namespace,
+    flux: flux_models.Flux,
+    text_encoders: Optional[List[CLIPTextModel]],
+    ae: flux_models.AutoEncoder,
+    save_dir,
+    prompt_dict,
+    epoch,
+    steps,
+    sample_prompts_te_outputs,
+    prompt_replacement,
+    controlnet,
+):
+    assert isinstance(prompt_dict, dict)
+    negative_prompt = prompt_dict.get("negative_prompt")
+    sample_steps = prompt_dict.get("sample_steps", 20)
+    width = prompt_dict.get("width", 512)
+    height = prompt_dict.get("height", 512)
+    # TODO refactor variable names
+    cfg_scale = prompt_dict.get("guidance_scale", 1.0)
+    emb_guidance_scale = prompt_dict.get("scale", 3.5)
+    seed = prompt_dict.get("seed")
+    controlnet_image = prompt_dict.get("controlnet_image")
+    prompt: str = prompt_dict.get("prompt", "")
+    # sampler_name: str = prompt_dict.get("sample_sampler", args.sample_sampler)
+
+    if prompt_replacement is not None:
+        prompt = prompt.replace(prompt_replacement[0], prompt_replacement[1])
+        if negative_prompt is not None:
+            negative_prompt = negative_prompt.replace(prompt_replacement[0], prompt_replacement[1])
+
+    if seed is not None:
+        torch.manual_seed(seed)
+        torch.cuda.manual_seed(seed)
+    else:
+        # True random sample image generation
+        torch.seed()
+        torch.cuda.seed()
+
+    if negative_prompt is None:
+        negative_prompt = ""
+    height = max(64, height - height % 16)  # round to divisible by 16
+    width = max(64, width - width % 16)  # round to divisible by 16
+    logger.info(f"prompt: {prompt}")
+    if cfg_scale != 1.0:
+        logger.info(f"negative_prompt: {negative_prompt}")
+    elif negative_prompt != "":
+        logger.info("negative prompt is ignored because CFG scale is 1.0")
+    logger.info(f"height: {height}")
+    logger.info(f"width: {width}")
+    logger.info(f"sample_steps: {sample_steps}")
+    logger.info(f"embedded guidance scale: {emb_guidance_scale}")
+    if cfg_scale != 1.0:
+        logger.info(f"CFG scale: {cfg_scale}")
+    # logger.info(f"sample_sampler: {sampler_name}")
+    if seed is not None:
+        logger.info(f"seed: {seed}")
+
+    # encode prompts
+    tokenize_strategy = strategy_base.TokenizeStrategy.get_strategy()
+    encoding_strategy = strategy_base.TextEncodingStrategy.get_strategy()
+
+    def encode_prompt(prpt):
+        text_encoder_conds = []
+        if sample_prompts_te_outputs and prpt in sample_prompts_te_outputs:
+            text_encoder_conds = sample_prompts_te_outputs[prpt]
+            print(f"Using cached text encoder outputs for prompt: {prpt}")
+        if text_encoders is not None:
+            print(f"Encoding prompt: {prpt}")
+            tokens_and_masks = tokenize_strategy.tokenize(prpt)
+            # strategy has apply_t5_attn_mask option
+            encoded_text_encoder_conds = encoding_strategy.encode_tokens(tokenize_strategy, text_encoders, tokens_and_masks)
+
+            # if text_encoder_conds is not cached, use encoded_text_encoder_conds
+            if len(text_encoder_conds) == 0:
+                text_encoder_conds = encoded_text_encoder_conds
+            else:
+                # if encoded_text_encoder_conds is not None, update cached text_encoder_conds
+                for i in range(len(encoded_text_encoder_conds)):
+                    if encoded_text_encoder_conds[i] is not None:
+                        text_encoder_conds[i] = encoded_text_encoder_conds[i]
+        return text_encoder_conds
+
+    l_pooled, t5_out, txt_ids, t5_attn_mask = encode_prompt(prompt)
+    # encode negative prompts
+    if cfg_scale != 1.0:
+        neg_l_pooled, neg_t5_out, _, neg_t5_attn_mask = encode_prompt(negative_prompt)
+        neg_t5_attn_mask = (
+            neg_t5_attn_mask.to(accelerator.device) if args.apply_t5_attn_mask and neg_t5_attn_mask is not None else None
+        )
+        neg_cond = (cfg_scale, neg_l_pooled, neg_t5_out, neg_t5_attn_mask)
+    else:
+        neg_cond = None
+
+    # sample image
+    weight_dtype = ae.dtype  # TODO give dtype as argument
+    packed_latent_height = height // 16
+    packed_latent_width = width // 16
+    noise = torch.randn(
+        1,
+        packed_latent_height * packed_latent_width,
+        16 * 2 * 2,
+        device=accelerator.device,
+        dtype=weight_dtype,
+        generator=torch.Generator(device=accelerator.device).manual_seed(seed) if seed is not None else None,
+    )
+    timesteps = get_schedule(sample_steps, noise.shape[1], shift=True)  # FLUX.1 dev -> shift=True
+    img_ids = flux_utils.prepare_img_ids(1, packed_latent_height, packed_latent_width).to(accelerator.device, weight_dtype)
+    t5_attn_mask = t5_attn_mask.to(accelerator.device) if args.apply_t5_attn_mask and t5_attn_mask is not None else None
+
+    if controlnet_image is not None:
+        controlnet_image = Image.open(controlnet_image).convert("RGB")
+        controlnet_image = controlnet_image.resize((width, height), Image.LANCZOS)
+        controlnet_image = torch.from_numpy((np.array(controlnet_image) / 127.5) - 1)
+        controlnet_image = controlnet_image.permute(2, 0, 1).unsqueeze(0).to(weight_dtype).to(accelerator.device)
+
+    with accelerator.autocast(), torch.no_grad():
+        x = denoise(
+            flux,
+            noise,
+            img_ids,
+            t5_out,
+            txt_ids,
+            l_pooled,
+            timesteps=timesteps,
+            guidance=emb_guidance_scale,
+            t5_attn_mask=t5_attn_mask,
+            controlnet=controlnet,
+            controlnet_img=controlnet_image,
+            neg_cond=neg_cond,
+        )
+
+    x = flux_utils.unpack_latents(x, packed_latent_height, packed_latent_width)
+
+    # latent to image
+    clean_memory_on_device(accelerator.device)
+    org_vae_device = ae.device  # will be on cpu
+    ae.to(accelerator.device)  # distributed_state.device is same as accelerator.device
+    with accelerator.autocast(), torch.no_grad():
+        x = ae.decode(x)
+    ae.to(org_vae_device)
+    clean_memory_on_device(accelerator.device)
+
+    x = x.clamp(-1, 1)
+    x = x.permute(0, 2, 3, 1)
+    image = Image.fromarray((127.5 * (x + 1.0)).float().cpu().numpy().astype(np.uint8)[0])
+
+    # adding accelerator.wait_for_everyone() here should sync up and ensure that sample images are saved in the same order as the original prompt list
+    # but adding 'enum' to the filename should be enough
+
+    ts_str = time.strftime("%Y%m%d%H%M%S", time.localtime())
+    num_suffix = f"e{epoch:06d}" if epoch is not None else f"{steps:06d}"
+    seed_suffix = "" if seed is None else f"_{seed}"
+    i: int = prompt_dict["enum"]
+    img_filename = f"{'' if args.output_name is None else args.output_name + '_'}{num_suffix}_{i:02d}_{ts_str}{seed_suffix}.png"
+    image.save(os.path.join(save_dir, img_filename))
+
+    # send images to wandb if enabled
+    if "wandb" in [tracker.name for tracker in accelerator.trackers]:
+        wandb_tracker = accelerator.get_tracker("wandb")
+
+        import wandb
+
+        # not to commit images to avoid inconsistency between training and logging steps
+        wandb_tracker.log({f"sample_{i}": wandb.Image(image, caption=prompt)}, commit=False)  # positive prompt as a caption
+
+
+def time_shift(mu: float, sigma: float, t: torch.Tensor):
+    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)
+
+
+def get_lin_function(x1: float = 256, y1: float = 0.5, x2: float = 4096, y2: float = 1.15) -> Callable[[float], float]:
+    m = (y2 - y1) / (x2 - x1)
+    b = y1 - m * x1
+    return lambda x: m * x + b
+
+
+def get_schedule(
+    num_steps: int,
+    image_seq_len: int,
+    base_shift: float = 0.5,
+    max_shift: float = 1.15,
+    shift: bool = True,
+) -> list[float]:
+    # extra step for zero
+    timesteps = torch.linspace(1, 0, num_steps + 1)
+
+    # shifting the schedule to favor high timesteps for higher signal images
+    if shift:
+        # estimate mu by linear interpolation between two points
+        mu = get_lin_function(y1=base_shift, y2=max_shift)(image_seq_len)
+        timesteps = time_shift(mu, 1.0, timesteps)
+
+    return timesteps.tolist()
+
+
+def denoise(
+    model: flux_models.Flux,
+    img: torch.Tensor,
+    img_ids: torch.Tensor,
+    txt: torch.Tensor,  # t5_out
+    txt_ids: torch.Tensor,
+    vec: torch.Tensor,  # l_pooled
+    timesteps: list[float],
+    guidance: float = 4.0,
+    t5_attn_mask: Optional[torch.Tensor] = None,
+    controlnet: Optional[flux_models.ControlNetFlux] = None,
+    controlnet_img: Optional[torch.Tensor] = None,
+    neg_cond: Optional[Tuple[float, torch.Tensor, torch.Tensor, torch.Tensor]] = None,
+):
+    # this is ignored for schnell
+    guidance_vec = torch.full((img.shape[0],), guidance, device=img.device, dtype=img.dtype)
+    do_cfg = neg_cond is not None
+
+    for t_curr, t_prev in zip(tqdm(timesteps[:-1]), timesteps[1:]):
+        t_vec = torch.full((img.shape[0],), t_curr, dtype=img.dtype, device=img.device)
+        model.prepare_block_swap_before_forward()
+
+        if controlnet is not None:
+            block_samples, block_single_samples = controlnet(
+                img=img,
+                img_ids=img_ids,
+                controlnet_cond=controlnet_img,
+                txt=txt,
+                txt_ids=txt_ids,
+                y=vec,
+                timesteps=t_vec,
+                guidance=guidance_vec,
+                txt_attention_mask=t5_attn_mask,
+            )
+        else:
+            block_samples = None
+            block_single_samples = None
+
+        if not do_cfg:
+            pred = model(
+                img=img,
+                img_ids=img_ids,
+                txt=txt,
+                txt_ids=txt_ids,
+                y=vec,
+                block_controlnet_hidden_states=block_samples,
+                block_controlnet_single_hidden_states=block_single_samples,
+                timesteps=t_vec,
+                guidance=guidance_vec,
+                txt_attention_mask=t5_attn_mask,
+            )
+
+            img = img + (t_prev - t_curr) * pred
+        else:
+            cfg_scale, neg_l_pooled, neg_t5_out, neg_t5_attn_mask = neg_cond
+            nc_c_t5_attn_mask = None if t5_attn_mask is None else torch.cat([neg_t5_attn_mask, t5_attn_mask], dim=0)
+
+            # TODO is it ok to use the same block samples for both cond and uncond?
+            block_samples = None if block_samples is None else torch.cat([block_samples, block_samples], dim=0)
+            block_single_samples = (
+                None if block_single_samples is None else torch.cat([block_single_samples, block_single_samples], dim=0)
+            )
+
+            nc_c_pred = model(
+                img=torch.cat([img, img], dim=0),
+                img_ids=torch.cat([img_ids, img_ids], dim=0),
+                txt=torch.cat([neg_t5_out, txt], dim=0),
+                txt_ids=torch.cat([txt_ids, txt_ids], dim=0),
+                y=torch.cat([neg_l_pooled, vec], dim=0),
+                block_controlnet_hidden_states=block_samples,
+                block_controlnet_single_hidden_states=block_single_samples,
+                timesteps=t_vec,
+                guidance=guidance_vec,
+                txt_attention_mask=nc_c_t5_attn_mask,
+            )
+            neg_pred, pred = torch.chunk(nc_c_pred, 2, dim=0)
+            pred = neg_pred + (pred - neg_pred) * cfg_scale
+
+            img = img + (t_prev - t_curr) * pred
+
+    model.prepare_block_swap_before_forward()
+    return img
+
+
+# endregion
+
+
+# region train
+def get_sigmas(noise_scheduler, timesteps, device, n_dim=4, dtype=torch.float32):
+    sigmas = noise_scheduler.sigmas.to(device=device, dtype=dtype)
+    schedule_timesteps = noise_scheduler.timesteps.to(device)
+    timesteps = timesteps.to(device)
+    step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
+
+    sigma = sigmas[step_indices].flatten()
+    return sigma
+
+
+def compute_density_for_timestep_sampling(
+    weighting_scheme: str, batch_size: int, logit_mean: float = None, logit_std: float = None, mode_scale: float = None
+):
+    """Compute the density for sampling the timesteps when doing SD3 training.
+
+    Courtesy: This was contributed by Rafie Walker in https://github.com/huggingface/diffusers/pull/8528.
+
+    SD3 paper reference: https://arxiv.org/abs/2403.03206v1.
+    """
+    if weighting_scheme == "logit_normal":
+        # See 3.1 in the SD3 paper ($rf/lognorm(0.00,1.00)$).
+        u = torch.normal(mean=logit_mean, std=logit_std, size=(batch_size,), device="cpu")
+        u = torch.nn.functional.sigmoid(u)
+    elif weighting_scheme == "mode":
+        u = torch.rand(size=(batch_size,), device="cpu")
+        u = 1 - u - mode_scale * (torch.cos(math.pi * u / 2) ** 2 - 1 + u)
+    else:
+        u = torch.rand(size=(batch_size,), device="cpu")
+    return u
+
+
+def compute_loss_weighting_for_sd3(weighting_scheme: str, sigmas=None):
+    """Computes loss weighting scheme for SD3 training.
+
+    Courtesy: This was contributed by Rafie Walker in https://github.com/huggingface/diffusers/pull/8528.
+
+    SD3 paper reference: https://arxiv.org/abs/2403.03206v1.
+    """
+    if weighting_scheme == "sigma_sqrt":
+        weighting = (sigmas**-2.0).float()
+    elif weighting_scheme == "cosmap":
+        bot = 1 - 2 * sigmas + 2 * sigmas**2
+        weighting = 2 / (math.pi * bot)
+    else:
+        weighting = torch.ones_like(sigmas)
+    return weighting
+
+
+def get_noisy_model_input_and_timesteps(
+    args, noise_scheduler, latents: torch.Tensor, noise: torch.Tensor, device, dtype
+) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    bsz, _, h, w = latents.shape
+    assert bsz > 0, "Batch size not large enough"
+    num_timesteps = noise_scheduler.config.num_train_timesteps
+    if args.timestep_sampling == "uniform" or args.timestep_sampling == "sigmoid":
+        # Simple random sigma-based noise sampling
+        if args.timestep_sampling == "sigmoid":
+            # https://github.com/XLabs-AI/x-flux/tree/main
+            sigmas = torch.sigmoid(args.sigmoid_scale * torch.randn((bsz,), device=device))
+        else:
+            sigmas = torch.rand((bsz,), device=device)
+
+        timesteps = sigmas * num_timesteps
+    elif args.timestep_sampling == "shift":
+        shift = args.discrete_flow_shift
+        sigmas = torch.randn(bsz, device=device)
+        sigmas = sigmas * args.sigmoid_scale  # larger scale for more uniform sampling
+        sigmas = sigmas.sigmoid()
+        sigmas = (sigmas * shift) / (1 + (shift - 1) * sigmas)
+        timesteps = sigmas * num_timesteps
+    elif args.timestep_sampling == "flux_shift":
+        sigmas = torch.randn(bsz, device=device)
+        sigmas = sigmas * args.sigmoid_scale  # larger scale for more uniform sampling
+        sigmas = sigmas.sigmoid()
+        mu = get_lin_function(y1=0.5, y2=1.15)((h // 2) * (w // 2))  # we are pre-packed so must adjust for packed size
+        sigmas = time_shift(mu, 1.0, sigmas)
+        timesteps = sigmas * num_timesteps
+    else:
+        # Sample a random timestep for each image
+        # for weighting schemes where we sample timesteps non-uniformly
+        u = compute_density_for_timestep_sampling(
+            weighting_scheme=args.weighting_scheme,
+            batch_size=bsz,
+            logit_mean=args.logit_mean,
+            logit_std=args.logit_std,
+            mode_scale=args.mode_scale,
+        )
+        indices = (u * num_timesteps).long()
+        timesteps = noise_scheduler.timesteps[indices].to(device=device)
+        sigmas = get_sigmas(noise_scheduler, timesteps, device, n_dim=latents.ndim, dtype=dtype)
+
+    # Broadcast sigmas to latent shape
+    sigmas = sigmas.view(-1, 1, 1, 1)
+
+    # Add noise to the latents according to the noise magnitude at each timestep
+    # (this is the forward diffusion process)
+    if args.ip_noise_gamma:
+        xi = torch.randn_like(latents, device=latents.device, dtype=dtype)
+        if args.ip_noise_gamma_random_strength:
+            ip_noise_gamma = torch.rand(1, device=latents.device, dtype=dtype) * args.ip_noise_gamma
+        else:
+            ip_noise_gamma = args.ip_noise_gamma
+        noisy_model_input = (1.0 - sigmas) * latents + sigmas * (noise + ip_noise_gamma * xi)
+    else:
+        noisy_model_input = (1.0 - sigmas) * latents + sigmas * noise
+
+    return noisy_model_input.to(dtype), timesteps.to(dtype), sigmas
+
+
+def apply_model_prediction_type(args, model_pred, noisy_model_input, sigmas):
+    weighting = None
+    if args.model_prediction_type == "raw":
+        pass
+    elif args.model_prediction_type == "additive":
+        # add the model_pred to the noisy_model_input
+        model_pred = model_pred + noisy_model_input
+    elif args.model_prediction_type == "sigma_scaled":
+        # apply sigma scaling
+        model_pred = model_pred * (-sigmas) + noisy_model_input
+
+        # these weighting schemes use a uniform timestep sampling
+        # and instead post-weight the loss
+        weighting = compute_loss_weighting_for_sd3(weighting_scheme=args.weighting_scheme, sigmas=sigmas)
+
+    return model_pred, weighting
+
+
+def save_models(
+    ckpt_path: str,
+    flux: flux_models.Flux,
+    sai_metadata: Optional[dict],
+    save_dtype: Optional[torch.dtype] = None,
+    use_mem_eff_save: bool = False,
+):
+    state_dict = {}
+
+    def update_sd(prefix, sd):
+        for k, v in sd.items():
+            key = prefix + k
+            if save_dtype is not None and v.dtype != save_dtype:
+                v = v.detach().clone().to("cpu").to(save_dtype)
+            state_dict[key] = v
+
+    update_sd("", flux.state_dict())
+
+    if not use_mem_eff_save:
+        save_file(state_dict, ckpt_path, metadata=sai_metadata)
+    else:
+        mem_eff_save_file(state_dict, ckpt_path, metadata=sai_metadata)
+
+
+def save_flux_model_on_train_end(
+    args: argparse.Namespace, save_dtype: torch.dtype, epoch: int, global_step: int, flux: flux_models.Flux
+):
+    def sd_saver(ckpt_file, epoch_no, global_step):
+        sai_metadata = train_util.get_sai_model_spec(None, args, False, False, False, is_stable_diffusion_ckpt=True, flux="dev")
+        save_models(ckpt_file, flux, sai_metadata, save_dtype, args.mem_eff_save)
+
+    train_util.save_sd_model_on_train_end_common(args, True, True, epoch, global_step, sd_saver, None)
+
+
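+# Saving per epoch and per step is handled by one function: the metadata includes epoch/step and the arguments are the same
+# on_epoch_end: True when called at the end of an epoch, False when called after a number of steps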
+def save_flux_model_on_epoch_end_or_stepwise(
+    args: argparse.Namespace,
+    on_epoch_end: bool,
+    accelerator,
+    save_dtype: torch.dtype,
+    epoch: int,
+    num_train_epochs: int,
+    global_step: int,
+    flux: flux_models.Flux,
+):
+    def sd_saver(ckpt_file, epoch_no, global_step):
+        sai_metadata = train_util.get_sai_model_spec(None, args, False, False, False, is_stable_diffusion_ckpt=True, flux="dev")
+        save_models(ckpt_file, flux, sai_metadata, save_dtype, args.mem_eff_save)
+
+    train_util.save_sd_model_on_epoch_end_or_stepwise_common(
+        args,
+        on_epoch_end,
+        accelerator,
+        True,
+        True,
+        epoch,
+        num_train_epochs,
+        global_step,
+        sd_saver,
+        None,
+    )
+
+
+# endregion
+
+
+def add_flux_train_arguments(parser: argparse.ArgumentParser):
+    parser.add_argument(
+        "--clip_l",
+        type=str,
+        help="path to clip_l (*.sft or *.safetensors), should be float16 / clip_lのパス（*.sftまたは*.safetensors）、float16が前提",
+    )
+    parser.add_argument(
+        "--t5xxl",
+        type=str,
+        help="path to t5xxl (*.sft or *.safetensors), should be float16 / t5xxlのパス（*.sftまたは*.safetensors）、float16が前提",
+    )
+    parser.add_argument("--ae", type=str, help="path to ae (*.sft or *.safetensors) / aeのパス（*.sftまたは*.safetensors）")
+    parser.add_argument(
+        "--controlnet_model_name_or_path",
+        type=str,
+        default=None,
+        help="path to controlnet (*.sft or *.safetensors) / controlnetのパス（*.sftまたは*.safetensors）",
+    )
+    parser.add_argument(
+        "--t5xxl_max_token_length",
+        type=int,
+        default=None,
+        help="maximum token length for T5-XXL. if omitted, 256 for schnell and 512 for dev"
+        " / T5-XXLの最大トークン長。省略された場合、schnellの場合は256、devの場合は512",
+    )
+    parser.add_argument(
+        "--apply_t5_attn_mask",
+        action="store_true",
+        help="apply attention mask to T5-XXL encode and FLUX double blocks / T5-XXLエンコードとFLUXダブルブロックにアテンションマスクを適用する",
+    )
+
+    parser.add_argument(
+        "--guidance_scale",
+        type=float,
+        default=3.5,
+        help="the FLUX.1 dev variant is a guidance distilled model",
+    )
+
+    parser.add_argument(
+        "--timestep_sampling",
+        choices=["sigma", "uniform", "sigmoid", "shift", "flux_shift"],
+        default="sigma",
+        help="Method to sample timesteps: sigma-based, uniform random, sigmoid of random normal, shift of sigmoid and FLUX.1 shifting."
+        " / タイムステップをサンプリングする方法：sigma、random uniform、random normalのsigmoid、sigmoidのシフト、FLUX.1のシフト。",
+    )
+    parser.add_argument(
+        "--sigmoid_scale",
+        type=float,
+        default=1.0,
+        help='Scale factor applied to the random normal before the sigmoid in timestep sampling (used when timestep_sampling is "sigmoid", "shift" or "flux_shift"). / timestepサンプリングでsigmoidを適用する前の正規乱数に掛ける倍率（timestep_samplingが"sigmoid"、"shift"、"flux_shift"の場合に有効）。',
+    )
+    parser.add_argument(
+        "--model_prediction_type",
+        choices=["raw", "additive", "sigma_scaled"],
+        default="sigma_scaled",
+        help="How to interpret and process the model prediction: "
+        "raw (use as is), additive (add to noisy input), sigma_scaled (apply sigma scaling)."
+        " / モデル予測の解釈と処理方法："
+        "raw（そのまま使用）、additive（ノイズ入力に加算）、sigma_scaled（シグマスケーリングを適用）。",
+    )
+    parser.add_argument(
+        "--discrete_flow_shift",
+        type=float,
+        default=3.0,
+        help="Discrete flow shift for the Euler Discrete Scheduler, default is 3.0. / Euler Discrete Schedulerの離散フローシフト、デフォルトは3.0。",
+    )
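Review note: `get_schedule` and the `flux_shift` sampling branch both rely on `get_lin_function` and `time_shift`. The sketch below is a scalar, torch-free version of that calculation for a 512x512 image (32x32 packed tokens = 1024). The helper name `lin_mu` and the explicit `t == 0` guard are assumptions made for this illustration; the tensor code in the diff needs no guard because `1 / 0` evaluates to `inf` there and the ratio becomes 0.

```python
import math


def lin_mu(seq_len, x1=256, y1=0.5, x2=4096, y2=1.15):
    # Linear interpolation of mu through (x1, y1) and (x2, y2), as in get_lin_function.
    m = (y2 - y1) / (x2 - x1)
    return m * seq_len + (y1 - m * x1)


def time_shift(mu, sigma, t):
    # Scalar form of time_shift. Plain Python raises ZeroDivisionError on 1 / 0,
    # so t == 0 is special-cased (assumption of this sketch only).
    if t == 0.0:
        return 0.0
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0) ** sigma)


def get_schedule(num_steps, image_seq_len, shift=True):
    # num_steps + 1 points from 1 down to 0; the extra point is the final zero.
    ts = [1.0 - i / num_steps for i in range(num_steps + 1)]
    if shift:
        mu = lin_mu(image_seq_len)
        ts = [time_shift(mu, 1.0, t) for t in ts]
    return ts


mu_1024 = lin_mu(1024)           # mu for 1024 packed tokens
schedule = get_schedule(4, 1024)  # five timesteps for a four-step sampler
```

With `sigma = 1.0` the shift maps `t = 0.5` to `sigmoid(mu)`, which is above 0.5 whenever `mu > 0`, so the schedule spends more of its steps at high noise levels as the token count grows.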
--- a/library/flux_utils.py
+++ b/library/flux_utils.py
@@ -0,0 +1,488 @@
+import json
+import os
+from dataclasses import replace
+from typing import List, Optional, Tuple, Union
+
+import einops
+import torch
+from accelerate import init_empty_weights
+from safetensors import safe_open
+from safetensors.torch import load_file
+from transformers import CLIPConfig, CLIPTextModel, T5Config, T5EncoderModel
+
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+from library import flux_models
+from library.utils import load_safetensors
+
+MODEL_VERSION_FLUX_V1 = "flux1"
+MODEL_NAME_DEV = "dev"
+MODEL_NAME_SCHNELL = "schnell"
+
+
+def analyze_checkpoint_state(ckpt_path: str) -> Tuple[bool, bool, Tuple[int, int], List[str]]:
+    """
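+    Analyze the checkpoint and determine whether it is Diffusers or BFL format, dev or schnell, and how many blocks it has.
+
+    Args:
+        ckpt_path (str): Path to the checkpoint file or directory.
+
+    Returns:
+        Tuple[bool, bool, Tuple[int, int], List[str]]:
+            - bool: Whether the checkpoint is in Diffusers format.
+            - bool: Whether the checkpoint is schnell.
+            - Tuple[int, int]: Number of double blocks and single blocks.
+            - List[str]: Paths of the checkpoint file(s) to load (three shards for a Diffusers checkpoint).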
+    """
+    # check the state dict: Diffusers or BFL, dev or schnell, number of blocks
+    logger.info(f"Checking the state dict: Diffusers or BFL, dev or schnell")
+
+    if os.path.isdir(ckpt_path):  # if ckpt_path is a directory, it is Diffusers
+        ckpt_path = os.path.join(ckpt_path, "transformer", "diffusion_pytorch_model-00001-of-00003.safetensors")
+    if "00001-of-00003" in ckpt_path:
+        ckpt_paths = [ckpt_path.replace("00001-of-00003", f"0000{i}-of-00003") for i in range(1, 4)]
+    else:
+        ckpt_paths = [ckpt_path]
+
+    keys = []
+    for ckpt_path in ckpt_paths:
+        with safe_open(ckpt_path, framework="pt") as f:
+            keys.extend(f.keys())
+
+    # if the key has annoying prefix, remove it
+    if keys[0].startswith("model.diffusion_model."):
+        keys = [key.replace("model.diffusion_model.", "") for key in keys]
+
+    is_diffusers = "transformer_blocks.0.attn.add_k_proj.bias" in keys
+    is_schnell = not ("guidance_in.in_layer.bias" in keys or "time_text_embed.guidance_embedder.linear_1.bias" in keys)
+
+    # check number of double and single blocks
+    if not is_diffusers:
+        max_double_block_index = max(
+            [int(key.split(".")[1]) for key in keys if key.startswith("double_blocks.") and key.endswith(".img_attn.proj.bias")]
+        )
+        max_single_block_index = max(
+            [int(key.split(".")[1]) for key in keys if key.startswith("single_blocks.") and key.endswith(".modulation.lin.bias")]
+        )
+    else:
+        max_double_block_index = max(
+            [
+                int(key.split(".")[1])
+                for key in keys
+                if key.startswith("transformer_blocks.") and key.endswith(".attn.add_k_proj.bias")
+            ]
+        )
+        max_single_block_index = max(
+            [
+                int(key.split(".")[1])
+                for key in keys
+                if key.startswith("single_transformer_blocks.") and key.endswith(".attn.to_k.bias")
+            ]
+        )
+
+    num_double_blocks = max_double_block_index + 1
+    num_single_blocks = max_single_block_index + 1
+
+    return is_diffusers, is_schnell, (num_double_blocks, num_single_blocks), ckpt_paths
+
+
+def load_flow_model(
+    ckpt_path: str, dtype: Optional[torch.dtype], device: Union[str, torch.device], disable_mmap: bool = False
+) -> Tuple[bool, flux_models.Flux]:
+    is_diffusers, is_schnell, (num_double_blocks, num_single_blocks), ckpt_paths = analyze_checkpoint_state(ckpt_path)
+    name = MODEL_NAME_DEV if not is_schnell else MODEL_NAME_SCHNELL
+
+    # build model
+    logger.info(f"Building Flux model {name} from {'Diffusers' if is_diffusers else 'BFL'} checkpoint")
+    with torch.device("meta"):
+        params = flux_models.configs[name].params
+
+        # set the number of blocks
+        if params.depth != num_double_blocks:
+            logger.info(f"Setting the number of double blocks from {params.depth} to {num_double_blocks}")
+            params = replace(params, depth=num_double_blocks)
+        if params.depth_single_blocks != num_single_blocks:
+            logger.info(f"Setting the number of single blocks from {params.depth_single_blocks} to {num_single_blocks}")
+            params = replace(params, depth_single_blocks=num_single_blocks)
+
+        model = flux_models.Flux(params)
+        if dtype is not None:
+            model = model.to(dtype)
+
+    # load_sft doesn't support torch.device
+    logger.info(f"Loading state dict from {ckpt_path}")
+    sd = {}
+    for ckpt_path in ckpt_paths:
+        sd.update(load_safetensors(ckpt_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype))
+
+    # convert Diffusers to BFL
+    if is_diffusers:
+        logger.info("Converting Diffusers to BFL")
+        sd = convert_diffusers_sd_to_bfl(sd, num_double_blocks, num_single_blocks)
+        logger.info("Converted Diffusers to BFL")
+
+    # if the key has annoying prefix, remove it
+    for key in list(sd.keys()):
+        new_key = key.replace("model.diffusion_model.", "")
+        if new_key == key:
+            break  # the model doesn't have annoying prefix
+        sd[new_key] = sd.pop(key)
+
+    info = model.load_state_dict(sd, strict=False, assign=True)
+    logger.info(f"Loaded Flux: {info}")
+    return is_schnell, model
+
+
+def load_ae(
+    ckpt_path: str, dtype: torch.dtype, device: Union[str, torch.device], disable_mmap: bool = False
+) -> flux_models.AutoEncoder:
+    logger.info("Building AutoEncoder")
+    with torch.device("meta"):
+        # dev and schnell have the same AE params
+        ae = flux_models.AutoEncoder(flux_models.configs[MODEL_NAME_DEV].ae_params).to(dtype)
+
+    logger.info(f"Loading state dict from {ckpt_path}")
+    sd = load_safetensors(ckpt_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype)
+    info = ae.load_state_dict(sd, strict=False, assign=True)
+    logger.info(f"Loaded AE: {info}")
+    return ae
+
+
+def load_controlnet(
+    ckpt_path: Optional[str], is_schnell: bool, dtype: torch.dtype, device: Union[str, torch.device], disable_mmap: bool = False
+):
+    logger.info("Building ControlNet")
+    name = MODEL_NAME_DEV if not is_schnell else MODEL_NAME_SCHNELL
+    with torch.device(device):
+        controlnet = flux_models.ControlNetFlux(flux_models.configs[name].params).to(dtype)
+
+    if ckpt_path is not None:
+        logger.info(f"Loading state dict from {ckpt_path}")
+        sd = load_safetensors(ckpt_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype)
+        info = controlnet.load_state_dict(sd, strict=False, assign=True)
+        logger.info(f"Loaded ControlNet: {info}")
+    return controlnet
+
+
+def load_clip_l(
+    ckpt_path: Optional[str],
+    dtype: torch.dtype,
+    device: Union[str, torch.device],
+    disable_mmap: bool = False,
+    state_dict: Optional[dict] = None,
+) -> CLIPTextModel:
+    logger.info("Building CLIP-L")
+    CLIPL_CONFIG = {
+        "_name_or_path": "clip-vit-large-patch14/",
+        "architectures": ["CLIPModel"],
+        "initializer_factor": 1.0,
+        "logit_scale_init_value": 2.6592,
+        "model_type": "clip",
+        "projection_dim": 768,
+        # "text_config": {
+        "_name_or_path": "",
+        "add_cross_attention": False,
+        "architectures": None,
+        "attention_dropout": 0.0,
+        "bad_words_ids": None,
+        "bos_token_id": 0,
+        "chunk_size_feed_forward": 0,
+        "cross_attention_hidden_size": None,
+        "decoder_start_token_id": None,
+        "diversity_penalty": 0.0,
+        "do_sample": False,
+        "dropout": 0.0,
+        "early_stopping": False,
+        "encoder_no_repeat_ngram_size": 0,
+        "eos_token_id": 2,
+        "finetuning_task": None,
+        "forced_bos_token_id": None,
+        "forced_eos_token_id": None,
+        "hidden_act": "quick_gelu",
+        "hidden_size": 768,
+        "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
+        "initializer_factor": 1.0,
+        "initializer_range": 0.02,
+        "intermediate_size": 3072,
+        "is_decoder": False,
+        "is_encoder_decoder": False,
+        "label2id": {"LABEL_0": 0, "LABEL_1": 1},
+        "layer_norm_eps": 1e-05,
+        "length_penalty": 1.0,
+        "max_length": 20,
+        "max_position_embeddings": 77,
+        "min_length": 0,
+        "model_type": "clip_text_model",
+        "no_repeat_ngram_size": 0,
+        "num_attention_heads": 12,
+        "num_beam_groups": 1,
+        "num_beams": 1,
+        "num_hidden_layers": 12,
+        "num_return_sequences": 1,
+        "output_attentions": False,
+        "output_hidden_states": False,
+        "output_scores": False,
+        "pad_token_id": 1,
+        "prefix": None,
+        "problem_type": None,
+        "projection_dim": 768,
+        "pruned_heads": {},
+        "remove_invalid_values": False,
+        "repetition_penalty": 1.0,
+        "return_dict": True,
+        "return_dict_in_generate": False,
+        "sep_token_id": None,
+        "task_specific_params": None,
+        "temperature": 1.0,
+        "tie_encoder_decoder": False,
+        "tie_word_embeddings": True,
+        "tokenizer_class": None,
+        "top_k": 50,
+        "top_p": 1.0,
+        "torch_dtype": None,
+        "torchscript": False,
+        "transformers_version": "4.16.0.dev0",
+        "use_bfloat16": False,
+        "vocab_size": 49408,
+        "hidden_act": "gelu",
+        "hidden_size": 1280,
+        "intermediate_size": 5120,
+        "num_attention_heads": 20,
+        "num_hidden_layers": 32,
+        # },
+        # "text_config_dict": {
+        "hidden_size": 768,
+        "intermediate_size": 3072,
+        "num_attention_heads": 12,
+        "num_hidden_layers": 12,
+        "projection_dim": 768,
+        # },
+        # "torch_dtype": "float32",
+        # "transformers_version": None,
+    }
+    config = CLIPConfig(**CLIPL_CONFIG)
+    with init_empty_weights():
+        clip = CLIPTextModel._from_config(config)
+
+    if state_dict is not None:
+        sd = state_dict
+    else:
+        logger.info(f"Loading state dict from {ckpt_path}")
+        sd = load_safetensors(ckpt_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype)
+    info = clip.load_state_dict(sd, strict=False, assign=True)
+    logger.info(f"Loaded CLIP-L: {info}")
+    return clip
+
+
+def load_t5xxl(
+    ckpt_path: str,
+    dtype: Optional[torch.dtype],
+    device: Union[str, torch.device],
+    disable_mmap: bool = False,
+    state_dict: Optional[dict] = None,
+) -> T5EncoderModel:
+    T5_CONFIG_JSON = """
+{
+  "architectures": [
+    "T5EncoderModel"
+  ],
+  "classifier_dropout": 0.0,
+  "d_ff": 10240,
+  "d_kv": 64,
+  "d_model": 4096,
+  "decoder_start_token_id": 0,
+  "dense_act_fn": "gelu_new",
+  "dropout_rate": 0.1,
+  "eos_token_id": 1,
+  "feed_forward_proj": "gated-gelu",
+  "initializer_factor": 1.0,
+  "is_encoder_decoder": true,
+  "is_gated_act": true,
+  "layer_norm_epsilon": 1e-06,
+  "model_type": "t5",
+  "num_decoder_layers": 24,
+  "num_heads": 64,
+  "num_layers": 24,
+  "output_past": true,
+  "pad_token_id": 0,
+  "relative_attention_max_distance": 128,
+  "relative_attention_num_buckets": 32,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float16",
+  "transformers_version": "4.41.2",
+  "use_cache": true,
+  "vocab_size": 32128
+}
+"""
+    config = json.loads(T5_CONFIG_JSON)
+    config = T5Config(**config)
+    with init_empty_weights():
+        t5xxl = T5EncoderModel._from_config(config)
+
+    if state_dict is not None:
+        sd = state_dict
+    else:
+        logger.info(f"Loading state dict from {ckpt_path}")
+        sd = load_safetensors(ckpt_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype)
+    info = t5xxl.load_state_dict(sd, strict=False, assign=True)
+    logger.info(f"Loaded T5xxl: {info}")
+    return t5xxl
+
+
+def get_t5xxl_actual_dtype(t5xxl: T5EncoderModel) -> torch.dtype:
+    # nn.Embedding is the first layer, but it could be cast to bfloat16 or float32, so check an attention weight instead
+    return t5xxl.encoder.block[0].layer[0].SelfAttention.q.weight.dtype
+
+
+def prepare_img_ids(batch_size: int, packed_latent_height: int, packed_latent_width: int):
+    img_ids = torch.zeros(packed_latent_height, packed_latent_width, 3)
+    img_ids[..., 1] = img_ids[..., 1] + torch.arange(packed_latent_height)[:, None]
+    img_ids[..., 2] = img_ids[..., 2] + torch.arange(packed_latent_width)[None, :]
+    img_ids = einops.repeat(img_ids, "h w c -> b (h w) c", b=batch_size)
+    return img_ids
+
+
+def unpack_latents(x: torch.Tensor, packed_latent_height: int, packed_latent_width: int) -> torch.Tensor:
+    """
+    x: [b (h w) (c ph pw)] -> [b c (h ph) (w pw)], ph=2, pw=2
+    """
+    x = einops.rearrange(x, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=packed_latent_height, w=packed_latent_width, ph=2, pw=2)
+    return x
+
+
+def pack_latents(x: torch.Tensor) -> torch.Tensor:
+    """
+    x: [b c (h ph) (w pw)] -> [b (h w) (c ph pw)], ph=2, pw=2
+    """
+    x = einops.rearrange(x, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=2, pw=2)
+    return x
+
+
+# region Diffusers
+
+NUM_DOUBLE_BLOCKS = 19
+NUM_SINGLE_BLOCKS = 38
+
+BFL_TO_DIFFUSERS_MAP = {
+    "time_in.in_layer.weight": ["time_text_embed.timestep_embedder.linear_1.weight"],
+    "time_in.in_layer.bias": ["time_text_embed.timestep_embedder.linear_1.bias"],
+    "time_in.out_layer.weight": ["time_text_embed.timestep_embedder.linear_2.weight"],
+    "time_in.out_layer.bias": ["time_text_embed.timestep_embedder.linear_2.bias"],
+    "vector_in.in_layer.weight": ["time_text_embed.text_embedder.linear_1.weight"],
+    "vector_in.in_layer.bias": ["time_text_embed.text_embedder.linear_1.bias"],
+    "vector_in.out_layer.weight": ["time_text_embed.text_embedder.linear_2.weight"],
+    "vector_in.out_layer.bias": ["time_text_embed.text_embedder.linear_2.bias"],
+    "guidance_in.in_layer.weight": ["time_text_embed.guidance_embedder.linear_1.weight"],
+    "guidance_in.in_layer.bias": ["time_text_embed.guidance_embedder.linear_1.bias"],
+    "guidance_in.out_layer.weight": ["time_text_embed.guidance_embedder.linear_2.weight"],
+    "guidance_in.out_layer.bias": ["time_text_embed.guidance_embedder.linear_2.bias"],
+    "txt_in.weight": ["context_embedder.weight"],
+    "txt_in.bias": ["context_embedder.bias"],
+    "img_in.weight": ["x_embedder.weight"],
+    "img_in.bias": ["x_embedder.bias"],
+    "double_blocks.().img_mod.lin.weight": ["norm1.linear.weight"],
+    "double_blocks.().img_mod.lin.bias": ["norm1.linear.bias"],
+    "double_blocks.().txt_mod.lin.weight": ["norm1_context.linear.weight"],
+    "double_blocks.().txt_mod.lin.bias": ["norm1_context.linear.bias"],
+    "double_blocks.().img_attn.qkv.weight": ["attn.to_q.weight", "attn.to_k.weight", "attn.to_v.weight"],
+    "double_blocks.().img_attn.qkv.bias": ["attn.to_q.bias", "attn.to_k.bias", "attn.to_v.bias"],
+    "double_blocks.().txt_attn.qkv.weight": ["attn.add_q_proj.weight", "attn.add_k_proj.weight", "attn.add_v_proj.weight"],
+    "double_blocks.().txt_attn.qkv.bias": ["attn.add_q_proj.bias", "attn.add_k_proj.bias", "attn.add_v_proj.bias"],
+    "double_blocks.().img_attn.norm.query_norm.scale": ["attn.norm_q.weight"],
+    "double_blocks.().img_attn.norm.key_norm.scale": ["attn.norm_k.weight"],
+    "double_blocks.().txt_attn.norm.query_norm.scale": ["attn.norm_added_q.weight"],
+    "double_blocks.().txt_attn.norm.key_norm.scale": ["attn.norm_added_k.weight"],
+    "double_blocks.().img_mlp.0.weight": ["ff.net.0.proj.weight"],
+    "double_blocks.().img_mlp.0.bias": ["ff.net.0.proj.bias"],
+    "double_blocks.().img_mlp.2.weight": ["ff.net.2.weight"],
+    "double_blocks.().img_mlp.2.bias": ["ff.net.2.bias"],
+    "double_blocks.().txt_mlp.0.weight": ["ff_context.net.0.proj.weight"],
+    "double_blocks.().txt_mlp.0.bias": ["ff_context.net.0.proj.bias"],
+    "double_blocks.().txt_mlp.2.weight": ["ff_context.net.2.weight"],
+    "double_blocks.().txt_mlp.2.bias": ["ff_context.net.2.bias"],
+    "double_blocks.().img_attn.proj.weight": ["attn.to_out.0.weight"],
+    "double_blocks.().img_attn.proj.bias": ["attn.to_out.0.bias"],
+    "double_blocks.().txt_attn.proj.weight": ["attn.to_add_out.weight"],
+    "double_blocks.().txt_attn.proj.bias": ["attn.to_add_out.bias"],
+    "single_blocks.().modulation.lin.weight": ["norm.linear.weight"],
+    "single_blocks.().modulation.lin.bias": ["norm.linear.bias"],
+    "single_blocks.().linear1.weight": ["attn.to_q.weight", "attn.to_k.weight", "attn.to_v.weight", "proj_mlp.weight"],
+    "single_blocks.().linear1.bias": ["attn.to_q.bias", "attn.to_k.bias", "attn.to_v.bias", "proj_mlp.bias"],
+    "single_blocks.().linear2.weight": ["proj_out.weight"],
+    "single_blocks.().norm.query_norm.scale": ["attn.norm_q.weight"],
+    "single_blocks.().norm.key_norm.scale": ["attn.norm_k.weight"],
+    "single_blocks.().linear2.weight": ["proj_out.weight"],
+    "single_blocks.().linear2.bias": ["proj_out.bias"],
+    "final_layer.linear.weight": ["proj_out.weight"],
+    "final_layer.linear.bias": ["proj_out.bias"],
+    "final_layer.adaLN_modulation.1.weight": ["norm_out.linear.weight"],
+    "final_layer.adaLN_modulation.1.bias": ["norm_out.linear.bias"],
+}
+
+
+def make_diffusers_to_bfl_map(num_double_blocks: int, num_single_blocks: int) -> dict[str, tuple[int, str]]:
+    # make reverse map from diffusers map
+    diffusers_to_bfl_map = {}  # key: diffusers_key, value: (index, bfl_key)
+    for b in range(num_double_blocks):
+        for key, weights in BFL_TO_DIFFUSERS_MAP.items():
+            if key.startswith("double_blocks."):
+                block_prefix = f"transformer_blocks.{b}."
+                for i, weight in enumerate(weights):
+                    diffusers_to_bfl_map[f"{block_prefix}{weight}"] = (i, key.replace("()", f"{b}"))
+    for b in range(num_single_blocks):
+        for key, weights in BFL_TO_DIFFUSERS_MAP.items():
+            if key.startswith("single_blocks."):
+                block_prefix = f"single_transformer_blocks.{b}."
+                for i, weight in enumerate(weights):
+                    diffusers_to_bfl_map[f"{block_prefix}{weight}"] = (i, key.replace("()", f"{b}"))
+    for key, weights in BFL_TO_DIFFUSERS_MAP.items():
+        if not (key.startswith("double_blocks.") or key.startswith("single_blocks.")):
+            for i, weight in enumerate(weights):
+                diffusers_to_bfl_map[weight] = (i, key)
+    return diffusers_to_bfl_map
+
+
+def convert_diffusers_sd_to_bfl(
+    diffusers_sd: dict[str, torch.Tensor], num_double_blocks: int = NUM_DOUBLE_BLOCKS, num_single_blocks: int = NUM_SINGLE_BLOCKS
+) -> dict[str, torch.Tensor]:
+    diffusers_to_bfl_map = make_diffusers_to_bfl_map(num_double_blocks, num_single_blocks)
+
+    # group tensors by target BFL key, keeping each tensor's position for the later concat
+    flux_sd = {}
+    for diffusers_key, tensor in diffusers_sd.items():
+        if diffusers_key in diffusers_to_bfl_map:
+            index, bfl_key = diffusers_to_bfl_map[diffusers_key]
+            if bfl_key not in flux_sd:
+                flux_sd[bfl_key] = []
+            flux_sd[bfl_key].append((index, tensor))
+        else:
+            logger.error(f"Error: Key not found in diffusers_to_bfl_map: {diffusers_key}")
+            raise KeyError(f"Key not found in diffusers_to_bfl_map: {diffusers_key}")
+
+    # concat tensors if multiple tensors are mapped to a single key, sort by index
+    for key, values in flux_sd.items():
+        if len(values) == 1:
+            flux_sd[key] = values[0][1]
+        else:
+            flux_sd[key] = torch.cat([value[1] for value in sorted(values, key=lambda x: x[0])])
+
+    # special case for final_layer.adaLN_modulation.1.weight and final_layer.adaLN_modulation.1.bias
+    def swap_scale_shift(weight):
+        shift, scale = weight.chunk(2, dim=0)
+        new_weight = torch.cat([scale, shift], dim=0)
+        return new_weight
+
+    if "final_layer.adaLN_modulation.1.weight" in flux_sd:
+        flux_sd["final_layer.adaLN_modulation.1.weight"] = swap_scale_shift(flux_sd["final_layer.adaLN_modulation.1.weight"])
+    if "final_layer.adaLN_modulation.1.bias" in flux_sd:
+        flux_sd["final_layer.adaLN_modulation.1.bias"] = swap_scale_shift(flux_sd["final_layer.adaLN_modulation.1.bias"])
+
+    return flux_sd
+
+
+# endregion
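The reverse-map construction in `make_diffusers_to_bfl_map` and the index-sorted concatenation in `convert_diffusers_sd_to_bfl` can be sketched without torch. This is a minimal illustration, not the repository's implementation: `TOY_BFL_TO_DIFFUSERS`, `make_reverse_map`, and `convert` are hypothetical names, the map holds only two entries, and plain lists stand in for tensors (list concatenation plays the role of `torch.cat`).

```python
# Toy stand-in for BFL_TO_DIFFUSERS_MAP (hypothetical, two entries only).
TOY_BFL_TO_DIFFUSERS = {
    "txt_in.weight": ["context_embedder.weight"],
    "double_blocks.().img_attn.qkv.weight": ["attn.to_q.weight", "attn.to_k.weight", "attn.to_v.weight"],
}


def make_reverse_map(num_double_blocks):
    """Map each Diffusers key to (position, BFL key), expanding "()" per block."""
    reverse = {}
    for bfl_key, diffusers_keys in TOY_BFL_TO_DIFFUSERS.items():
        if bfl_key.startswith("double_blocks."):
            for b in range(num_double_blocks):
                for i, name in enumerate(diffusers_keys):
                    reverse[f"transformer_blocks.{b}.{name}"] = (i, bfl_key.replace("()", str(b)))
        else:
            for i, name in enumerate(diffusers_keys):
                reverse[name] = (i, bfl_key)
    return reverse


def convert(diffusers_sd, num_double_blocks):
    """Group values by BFL key, then join them in position order."""
    reverse = make_reverse_map(num_double_blocks)
    grouped = {}
    for key, value in diffusers_sd.items():
        index, bfl_key = reverse[key]  # KeyError for unmapped keys, as in the original
        grouped.setdefault(bfl_key, []).append((index, value))
    merged = {}
    for bfl_key, parts in grouped.items():
        # Sort by position so q, k, v are joined in that order regardless of input order.
        merged[bfl_key] = [x for _, part in sorted(parts, key=lambda p: p[0]) for x in part]
    return merged


# Keys arrive in arbitrary order, as they might across sharded files.
toy_sd = {
    "transformer_blocks.0.attn.to_v.weight": ["v"],
    "transformer_blocks.0.attn.to_q.weight": ["q"],
    "context_embedder.weight": ["ctx"],
    "transformer_blocks.0.attn.to_k.weight": ["k"],
}
result = convert(toy_sd, num_double_blocks=1)
print(result)
```

Storing the position next to the target key is what lets the fused `qkv` tensor be rebuilt correctly even when the Diffusers keys are read in a different order.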
--- a/library/huggingface_util.py
+++ b/library/huggingface_util.py
@@ -0,0 +1,84 @@
+from typing import Union, BinaryIO
+from huggingface_hub import HfApi
+from pathlib import Path
+import argparse
+import os
+from library.utils import fire_in_thread
+from library.utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)
+
+def exists_repo(repo_id: str, repo_type: str, revision: str = "main", token: str = None):
+    api = HfApi(
+        token=token,
+    )
+    try:
+        api.repo_info(repo_id=repo_id, revision=revision, repo_type=repo_type)
+        return True
+    except Exception:
+        return False
+
+
+def upload(
+    args: argparse.Namespace,
+    src: Union[str, Path, bytes, BinaryIO],
+    dest_suffix: str = "",
+    force_sync_upload: bool = False,
+):
+    repo_id = args.huggingface_repo_id
+    repo_type = args.huggingface_repo_type
+    token = args.huggingface_token
+    path_in_repo = args.huggingface_path_in_repo + dest_suffix if args.huggingface_path_in_repo is not None else None
+    private = args.huggingface_repo_visibility is None or args.huggingface_repo_visibility != "public"
+    api = HfApi(token=token)
+    if not exists_repo(repo_id=repo_id, repo_type=repo_type, token=token):
+        try:
+            api.create_repo(repo_id=repo_id, repo_type=repo_type, private=private)
+        except Exception as e:  # RepositoryNotFoundError has been observed; catch everything in case other errors occur
+            logger.error("===========================================")
+            logger.error(f"failed to create HuggingFace repo: {e}")
+            logger.error("===========================================")
+
+    is_folder = (isinstance(src, str) and os.path.isdir(src)) or (isinstance(src, Path) and src.is_dir())
+
+    def uploader():
+        try:
+            if is_folder:
+                api.upload_folder(
+                    repo_id=repo_id,
+                    repo_type=repo_type,
+                    folder_path=src,
+                    path_in_repo=path_in_repo,
+                )
+            else:
+                api.upload_file(
+                    repo_id=repo_id,
+                    repo_type=repo_type,
+                    path_or_fileobj=src,
+                    path_in_repo=path_in_repo,
+                )
+        except Exception as e:  # RuntimeError has been observed; catch everything in case other errors occur
+            logger.error("===========================================")
+            logger.error(f"failed to upload to HuggingFace: {e}")
+            logger.error("===========================================")
+
+    if args.async_upload and not force_sync_upload:
+        fire_in_thread(uploader)
+    else:
+        uploader()
+
+
+def list_dir(
+    repo_id: str,
+    subfolder: str,
+    repo_type: str,
+    revision: str = "main",
+    token: str = None,
+):
+    api = HfApi(
+        token=token,
+    )
+    repo_info = api.repo_info(repo_id=repo_id, revision=revision, repo_type=repo_type)
+    file_list = [file for file in repo_info.siblings if file.rfilename.startswith(subfolder)]
+    return file_list
--- a/library/hypernetwork.py
+++ b/library/hypernetwork.py
@@ -0,0 +1,223 @@
+import torch
+import torch.nn.functional as F
+from diffusers.models.attention_processor import (
+    Attention,
+    AttnProcessor2_0,
+    SlicedAttnProcessor,
+    XFormersAttnProcessor
+)
+
+try:
+    import xformers.ops
+except Exception:
+    xformers = None
+
+
+loaded_networks = []
+
+
+def apply_single_hypernetwork(
+    hypernetwork, hidden_states, encoder_hidden_states
+):
+    context_k, context_v = hypernetwork.forward(hidden_states, encoder_hidden_states)
+    return context_k, context_v
+
+
+def apply_hypernetworks(context_k, context_v, layer=None):
+    if len(loaded_networks) == 0:
+        return context_k, context_v
+    for hypernetwork in loaded_networks:
+        context_k, context_v = hypernetwork.forward(context_k, context_v)
+
+    context_k = context_k.to(dtype=context_k.dtype)
+    context_v = context_v.to(dtype=context_k.dtype)
+
+    return context_k, context_v
+
+
+
+def xformers_forward(
+    self: XFormersAttnProcessor,
+    attn: Attention,
+    hidden_states: torch.Tensor,
+    encoder_hidden_states: torch.Tensor = None,
+    attention_mask: torch.Tensor = None,
+):
+    batch_size, sequence_length, _ = (
+        hidden_states.shape
+        if encoder_hidden_states is None
+        else encoder_hidden_states.shape
+    )
+
+    attention_mask = attn.prepare_attention_mask(
+        attention_mask, sequence_length, batch_size
+    )
+
+    query = attn.to_q(hidden_states)
+
+    if encoder_hidden_states is None:
+        encoder_hidden_states = hidden_states
+    elif attn.norm_cross:
+        encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
+
+    context_k, context_v = apply_hypernetworks(hidden_states, encoder_hidden_states)
+
+    key = attn.to_k(context_k)
+    value = attn.to_v(context_v)
+
+    query = attn.head_to_batch_dim(query).contiguous()
+    key = attn.head_to_batch_dim(key).contiguous()
+    value = attn.head_to_batch_dim(value).contiguous()
+
+    hidden_states = xformers.ops.memory_efficient_attention(
+        query,
+        key,
+        value,
+        attn_bias=attention_mask,
+        op=self.attention_op,
+        scale=attn.scale,
+    )
+    hidden_states = hidden_states.to(query.dtype)
+    hidden_states = attn.batch_to_head_dim(hidden_states)
+
+    # linear proj
+    hidden_states = attn.to_out[0](hidden_states)
+    # dropout
+    hidden_states = attn.to_out[1](hidden_states)
+    return hidden_states
+
+
+def sliced_attn_forward(
+    self: SlicedAttnProcessor,
+    attn: Attention,
+    hidden_states: torch.Tensor,
+    encoder_hidden_states: torch.Tensor = None,
+    attention_mask: torch.Tensor = None,
+):
+    batch_size, sequence_length, _ = (
+        hidden_states.shape
+        if encoder_hidden_states is None
+        else encoder_hidden_states.shape
+    )
+    attention_mask = attn.prepare_attention_mask(
+        attention_mask, sequence_length, batch_size
+    )
+
+    query = attn.to_q(hidden_states)
+    dim = query.shape[-1]
+    query = attn.head_to_batch_dim(query)
+
+    if encoder_hidden_states is None:
+        encoder_hidden_states = hidden_states
+    elif attn.norm_cross:
+        encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
+
+    context_k, context_v = apply_hypernetworks(hidden_states, encoder_hidden_states)
+
+    key = attn.to_k(context_k)
+    value = attn.to_v(context_v)
+    key = attn.head_to_batch_dim(key)
+    value = attn.head_to_batch_dim(value)
+
+    batch_size_attention, query_tokens, _ = query.shape
+    hidden_states = torch.zeros(
+        (batch_size_attention, query_tokens, dim // attn.heads),
+        device=query.device,
+        dtype=query.dtype,
+    )
+
+    for i in range(batch_size_attention // self.slice_size):
+        start_idx = i * self.slice_size
+        end_idx = (i + 1) * self.slice_size
+
+        query_slice = query[start_idx:end_idx]
+        key_slice = key[start_idx:end_idx]
+        attn_mask_slice = (
+            attention_mask[start_idx:end_idx] if attention_mask is not None else None
+        )
+
+        attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
+
+        attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx])
+
+        hidden_states[start_idx:end_idx] = attn_slice
+
+    hidden_states = attn.batch_to_head_dim(hidden_states)
+
+    # linear proj
+    hidden_states = attn.to_out[0](hidden_states)
+    # dropout
+    hidden_states = attn.to_out[1](hidden_states)
+
+    return hidden_states
+
+
+def v2_0_forward(
+    self: AttnProcessor2_0,
+    attn: Attention,
+    hidden_states,
+    encoder_hidden_states=None,
+    attention_mask=None,
+):
+    batch_size, sequence_length, _ = (
+        hidden_states.shape
+        if encoder_hidden_states is None
+        else encoder_hidden_states.shape
+    )
+    inner_dim = hidden_states.shape[-1]
+
+    if attention_mask is not None:
+        attention_mask = attn.prepare_attention_mask(
+            attention_mask, sequence_length, batch_size
+        )
+        # scaled_dot_product_attention expects attention_mask shape to be
+        # (batch, heads, source_length, target_length)
+        attention_mask = attention_mask.view(
+            batch_size, attn.heads, -1, attention_mask.shape[-1]
+        )
+
+    query = attn.to_q(hidden_states)
+
+    if encoder_hidden_states is None:
+        encoder_hidden_states = hidden_states
+    elif attn.norm_cross:
+        encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
+
+    context_k, context_v = apply_hypernetworks(hidden_states, encoder_hidden_states)
+
+    key = attn.to_k(context_k)
+    value = attn.to_v(context_v)
+
+    head_dim = inner_dim // attn.heads
+    query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+    key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+    value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+
+    # the output of sdp = (batch, num_heads, seq_len, head_dim)
+    # TODO: add support for attn.scale when we move to Torch 2.1
+    hidden_states = F.scaled_dot_product_attention(
+        query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
+    )
+
+    hidden_states = hidden_states.transpose(1, 2).reshape(
+        batch_size, -1, attn.heads * head_dim
+    )
+    hidden_states = hidden_states.to(query.dtype)
+
+    # linear proj
+    hidden_states = attn.to_out[0](hidden_states)
+    # dropout
+    hidden_states = attn.to_out[1](hidden_states)
+    return hidden_states
+
+
+def replace_attentions_for_hypernetwork():
+    import diffusers.models.attention_processor
+
+    diffusers.models.attention_processor.XFormersAttnProcessor.__call__ = (
+        xformers_forward
+    )
+    diffusers.models.attention_processor.SlicedAttnProcessor.__call__ = (
+        sliced_attn_forward
+    )
+    diffusers.models.attention_processor.AttnProcessor2_0.__call__ = v2_0_forward
--- a/library/ipex/init.py
+++ b/library/ipex/init.py
@@ -0,0 +1,204 @@
+import os
+import sys
+import torch
+try:
+    import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
+    has_ipex = True
+except Exception:
+    has_ipex = False
+from .hijacks import ipex_hijacks
+
+torch_version = float(torch.__version__[:3])
+
+# pylint: disable=protected-access, missing-function-docstring, line-too-long
+
+def ipex_init(): # pylint: disable=too-many-statements
+    try:
+        if hasattr(torch, "cuda") and hasattr(torch.cuda, "is_xpu_hijacked") and torch.cuda.is_xpu_hijacked:
+            return True, "Skipping IPEX hijack"
+        else:
+            try:
+                # force xpu device on torch compile and triton
+                # import inductor utils to get around lazy import
+                from torch._inductor import utils as torch_inductor_utils # pylint: disable=import-error, unused-import # noqa: F401
+                torch._inductor.utils.GPU_TYPES = ["xpu"]
+                torch._inductor.utils.get_gpu_type = lambda *args, **kwargs: "xpu"
+                from triton import backends as triton_backends # pylint: disable=import-error
+                triton_backends.backends["nvidia"].driver.is_active = lambda *args, **kwargs: False
+            except Exception:
+                pass
+            # Replace cuda with xpu:
+            torch.cuda.current_device = torch.xpu.current_device
+            torch.cuda.current_stream = torch.xpu.current_stream
+            torch.cuda.device = torch.xpu.device
+            torch.cuda.device_count = torch.xpu.device_count
+            torch.cuda.device_of = torch.xpu.device_of
+            torch.cuda.get_device_name = torch.xpu.get_device_name
+            torch.cuda.get_device_properties = torch.xpu.get_device_properties
+            torch.cuda.init = torch.xpu.init
+            torch.cuda.is_available = torch.xpu.is_available
+            torch.cuda.is_initialized = torch.xpu.is_initialized
+            torch.cuda.is_current_stream_capturing = lambda: False
+            torch.cuda.stream = torch.xpu.stream
+            torch.cuda.Event = torch.xpu.Event
+            torch.cuda.Stream = torch.xpu.Stream
+            torch.Tensor.cuda = torch.Tensor.xpu
+            torch.Tensor.is_cuda = torch.Tensor.is_xpu
+            torch.nn.Module.cuda = torch.nn.Module.xpu
+            torch.cuda.Optional = torch.xpu.Optional
+            torch.cuda.__cached__ = torch.xpu.__cached__
+            torch.cuda.__loader__ = torch.xpu.__loader__
+            torch.cuda.streams = torch.xpu.streams
+            torch.cuda.Any = torch.xpu.Any
+            torch.cuda.__doc__ = torch.xpu.__doc__
+            torch.cuda.default_generators = torch.xpu.default_generators
+            torch.cuda._get_device_index = torch.xpu._get_device_index
+            torch.cuda.__path__ = torch.xpu.__path__
+            torch.cuda.set_stream = torch.xpu.set_stream
+            torch.cuda.torch = torch.xpu.torch
+            torch.cuda.Union = torch.xpu.Union
+            torch.cuda.__annotations__ = torch.xpu.__annotations__
+            torch.cuda.__package__ = torch.xpu.__package__
+            torch.cuda.__builtins__ = torch.xpu.__builtins__
+            torch.cuda._lazy_init = torch.xpu._lazy_init
+            torch.cuda.StreamContext = torch.xpu.StreamContext
+            torch.cuda._lazy_call = torch.xpu._lazy_call
+            torch.cuda.random = torch.xpu.random
+            torch.cuda._device = torch.xpu._device
+            torch.cuda.__name__ = torch.xpu.__name__
+            torch.cuda._device_t = torch.xpu._device_t
+            torch.cuda.__spec__ = torch.xpu.__spec__
+            torch.cuda.__file__ = torch.xpu.__file__
+            # torch.cuda.is_current_stream_capturing = torch.xpu.is_current_stream_capturing
+
+            if torch_version < 2.3:
+                torch.cuda._initialization_lock = torch.xpu.lazy_init._initialization_lock
+                torch.cuda._initialized = torch.xpu.lazy_init._initialized
+                torch.cuda._is_in_bad_fork = torch.xpu.lazy_init._is_in_bad_fork
+                torch.cuda._lazy_seed_tracker = torch.xpu.lazy_init._lazy_seed_tracker
+                torch.cuda._queued_calls = torch.xpu.lazy_init._queued_calls
+                torch.cuda._tls = torch.xpu.lazy_init._tls
+                torch.cuda.threading = torch.xpu.lazy_init.threading
+                torch.cuda.traceback = torch.xpu.lazy_init.traceback
+                torch.cuda._lazy_new = torch.xpu._lazy_new
+
+                torch.cuda.FloatTensor = torch.xpu.FloatTensor
+                torch.cuda.FloatStorage = torch.xpu.FloatStorage
+                torch.cuda.BFloat16Tensor = torch.xpu.BFloat16Tensor
+                torch.cuda.BFloat16Storage = torch.xpu.BFloat16Storage
+                torch.cuda.HalfTensor = torch.xpu.HalfTensor
+                torch.cuda.HalfStorage = torch.xpu.HalfStorage
+                torch.cuda.ByteTensor = torch.xpu.ByteTensor
+                torch.cuda.ByteStorage = torch.xpu.ByteStorage
+                torch.cuda.DoubleTensor = torch.xpu.DoubleTensor
+                torch.cuda.DoubleStorage = torch.xpu.DoubleStorage
+                torch.cuda.ShortTensor = torch.xpu.ShortTensor
+                torch.cuda.ShortStorage = torch.xpu.ShortStorage
+                torch.cuda.LongTensor = torch.xpu.LongTensor
+                torch.cuda.LongStorage = torch.xpu.LongStorage
+                torch.cuda.IntTensor = torch.xpu.IntTensor
+                torch.cuda.IntStorage = torch.xpu.IntStorage
+                torch.cuda.CharTensor = torch.xpu.CharTensor
+                torch.cuda.CharStorage = torch.xpu.CharStorage
+                torch.cuda.BoolTensor = torch.xpu.BoolTensor
+                torch.cuda.BoolStorage = torch.xpu.BoolStorage
+                torch.cuda.ComplexFloatStorage = torch.xpu.ComplexFloatStorage
+                torch.cuda.ComplexDoubleStorage = torch.xpu.ComplexDoubleStorage
+            else:
+                torch.cuda._initialization_lock = torch.xpu._initialization_lock
+                torch.cuda._initialized = torch.xpu._initialized
+                torch.cuda._is_in_bad_fork = torch.xpu._is_in_bad_fork
+                torch.cuda._lazy_seed_tracker = torch.xpu._lazy_seed_tracker
+                torch.cuda._queued_calls = torch.xpu._queued_calls
+                torch.cuda._tls = torch.xpu._tls
+                torch.cuda.threading = torch.xpu.threading
+                torch.cuda.traceback = torch.xpu.traceback
+
+            if torch_version < 2.5:
+                torch.cuda.os = torch.xpu.os
+                torch.cuda.Device = torch.xpu.Device
+                torch.cuda.warnings = torch.xpu.warnings
+                torch.cuda.classproperty = torch.xpu.classproperty
+                torch.UntypedStorage.cuda = torch.UntypedStorage.xpu
+
+            if torch_version < 2.7:
+                torch.cuda.Tuple = torch.xpu.Tuple
+                torch.cuda.List = torch.xpu.List
+
+
+            # Memory:
+            if 'linux' in sys.platform and "WSL2" in os.popen("uname -a").read():
+                torch.xpu.empty_cache = lambda: None
+            torch.cuda.empty_cache = torch.xpu.empty_cache
+
+            if has_ipex:
+                torch.cuda.memory_summary = torch.xpu.memory_summary
+                torch.cuda.memory_snapshot = torch.xpu.memory_snapshot
+            torch.cuda.memory = torch.xpu.memory
+            torch.cuda.memory_stats = torch.xpu.memory_stats
+            torch.cuda.memory_allocated = torch.xpu.memory_allocated
+            torch.cuda.max_memory_allocated = torch.xpu.max_memory_allocated
+            torch.cuda.memory_reserved = torch.xpu.memory_reserved
+            torch.cuda.memory_cached = torch.xpu.memory_reserved
+            torch.cuda.max_memory_reserved = torch.xpu.max_memory_reserved
+            torch.cuda.max_memory_cached = torch.xpu.max_memory_reserved
+            torch.cuda.reset_peak_memory_stats = torch.xpu.reset_peak_memory_stats
+            torch.cuda.reset_max_memory_cached = torch.xpu.reset_peak_memory_stats
+            torch.cuda.reset_max_memory_allocated = torch.xpu.reset_peak_memory_stats
+            torch.cuda.memory_stats_as_nested_dict = torch.xpu.memory_stats_as_nested_dict
+            torch.cuda.reset_accumulated_memory_stats = torch.xpu.reset_accumulated_memory_stats
+
+            # RNG:
+            torch.cuda.get_rng_state = torch.xpu.get_rng_state
+            torch.cuda.get_rng_state_all = torch.xpu.get_rng_state_all
+            torch.cuda.set_rng_state = torch.xpu.set_rng_state
+            torch.cuda.set_rng_state_all = torch.xpu.set_rng_state_all
+            torch.cuda.manual_seed = torch.xpu.manual_seed
+            torch.cuda.manual_seed_all = torch.xpu.manual_seed_all
+            torch.cuda.seed = torch.xpu.seed
+            torch.cuda.seed_all = torch.xpu.seed_all
+            torch.cuda.initial_seed = torch.xpu.initial_seed
+
+            # C bindings:
+            if torch_version < 2.3:
+                torch._C._cuda_getCurrentRawStream = ipex._C._getCurrentRawStream
+                ipex._C._DeviceProperties.multi_processor_count = ipex._C._DeviceProperties.gpu_subslice_count
+                ipex._C._DeviceProperties.major = 12
+                ipex._C._DeviceProperties.minor = 1
+                ipex._C._DeviceProperties.L2_cache_size = 16*1024*1024 # A770 and A750
+            else:
+                torch._C._cuda_getCurrentRawStream = torch._C._xpu_getCurrentRawStream
+                torch._C._XpuDeviceProperties.multi_processor_count = torch._C._XpuDeviceProperties.gpu_subslice_count
+                torch._C._XpuDeviceProperties.major = 12
+                torch._C._XpuDeviceProperties.minor = 1
+                torch._C._XpuDeviceProperties.L2_cache_size = 16*1024*1024 # A770 and A750
+
+            # Fix functions with ipex:
+            # torch.xpu.mem_get_info always returns the total memory as free memory
+            torch.xpu.mem_get_info = lambda device=None: [(torch.xpu.get_device_properties(device).total_memory - torch.xpu.memory_reserved(device)), torch.xpu.get_device_properties(device).total_memory]
+            torch.cuda.mem_get_info = torch.xpu.mem_get_info
+            torch._utils._get_available_device_type = lambda: "xpu"
+            torch.has_cuda = True
+            torch.cuda.has_half = True
+            torch.cuda.is_bf16_supported = getattr(torch.xpu, "is_bf16_supported", lambda *args, **kwargs: True)
+            torch.cuda.is_fp16_supported = lambda *args, **kwargs: True
+            torch.backends.cuda.is_built = lambda *args, **kwargs: True
+            torch.version.cuda = "12.1"
+            torch.cuda.get_arch_list = getattr(torch.xpu, "get_arch_list", lambda: ["pvc", "dg2", "ats-m150"])
+            torch.cuda.get_device_capability = lambda *args, **kwargs: (12,1)
+            torch.cuda.get_device_properties.major = 12
+            torch.cuda.get_device_properties.minor = 1
+            torch.cuda.get_device_properties.L2_cache_size = 16*1024*1024 # A770 and A750
+            torch.cuda.ipc_collect = lambda *args, **kwargs: None
+            torch.cuda.utilization = lambda *args, **kwargs: 0
+
+            device_supports_fp64 = ipex_hijacks()
+            try:
+                from .diffusers import ipex_diffusers
+                ipex_diffusers(device_supports_fp64=device_supports_fp64)
+            except Exception: # pylint: disable=broad-exception-caught
+                pass
+            torch.cuda.is_xpu_hijacked = True
+    except Exception as e:
+        return False, e
+    return True, None
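The version gates above (`torch_version < 2.3`, `< 2.5`, `< 2.7`) compare a float. If `torch_version` is built the way `library/ipex/hijacks.py` builds it, `float(torch.__version__[:3])`, a "2.10" release would parse as 2.1 and take the pre-2.3 branches. Below is a minimal sketch of a tuple-based alternative. `parse_major_minor` is a hypothetical helper, not part of this change, and it works on plain version strings so it runs without torch.

```python
import re


def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '2.10.0+xpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


# Float parsing of the first three characters loses the trailing zero of "2.10".
legacy = float("2.10.0+xpu"[:3])
# Tuple comparison orders 2.10 after 2.3, as intended.
parsed = parse_major_minor("2.10.0+xpu")

print(legacy < 2.3)     # the float gate treats 2.10 as older than 2.3
print(parsed < (2, 3))  # the tuple gate does not
```

Switching the gates to tuples would mean rewriting every `torch_version < X.Y` comparison, so this is shown only as an illustration of the failure mode.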
--- a/library/ipex/attention.py
+++ b/library/ipex/attention.py
@@ -0,0 +1,119 @@
+import os
+import torch
+from functools import cache, wraps
+
+# pylint: disable=protected-access, missing-function-docstring, line-too-long
+
+# ARC GPUs can't allocate more than 4GB to a single block so we slice the attention layers
+
+sdpa_slice_trigger_rate = float(os.environ.get('IPEX_SDPA_SLICE_TRIGGER_RATE', 1))
+attention_slice_rate = float(os.environ.get('IPEX_ATTENTION_SLICE_RATE', 0.5))
+
+# Find the largest split size that divides original_size evenly and keeps each slice within slice_rate (GiB)
+@cache
+def find_split_size(original_size, slice_block_size, slice_rate=2):
+    split_size = original_size
+    while True:
+        if (split_size * slice_block_size) <= slice_rate and original_size % split_size == 0:
+            return split_size
+        split_size = split_size - 1
+        if split_size <= 1:
+            return 1
+    return split_size
+
+
+# Find slice sizes for SDPA
+@cache
+def find_sdpa_slice_sizes(query_shape, key_shape, query_element_size, slice_rate=2, trigger_rate=3):
+    batch_size, attn_heads, query_len, _ = query_shape
+    _, _, key_len, _ = key_shape
+
+    slice_batch_size = attn_heads * (query_len * key_len) * query_element_size / 1024 / 1024 / 1024
+
+    split_batch_size = batch_size
+    split_head_size = attn_heads
+    split_query_size = query_len
+
+    do_batch_split = False
+    do_head_split = False
+    do_query_split = False
+
+    if batch_size * slice_batch_size >= trigger_rate:
+        do_batch_split = True
+        split_batch_size = find_split_size(batch_size, slice_batch_size, slice_rate=slice_rate)
+
+        if split_batch_size * slice_batch_size > slice_rate:
+            slice_head_size = split_batch_size * (query_len * key_len) * query_element_size / 1024 / 1024 / 1024
+            do_head_split = True
+            split_head_size = find_split_size(attn_heads, slice_head_size, slice_rate=slice_rate)
+
+            if split_head_size * slice_head_size > slice_rate:
+                slice_query_size = split_batch_size * split_head_size * (key_len) * query_element_size / 1024 / 1024 / 1024
+                do_query_split = True
+                split_query_size = find_split_size(query_len, slice_query_size, slice_rate=slice_rate)
+
+    return do_batch_split, do_head_split, do_query_split, split_batch_size, split_head_size, split_query_size
+
+
+original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention
+@wraps(torch.nn.functional.scaled_dot_product_attention)
+def dynamic_scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
+    if query.device.type != "xpu":
+        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
+    is_unsqueezed = False
+    if query.dim() == 3:
+        query = query.unsqueeze(0)
+        is_unsqueezed = True
+        if key.dim() == 3:
+            key = key.unsqueeze(0)
+        if value.dim() == 3:
+            value = value.unsqueeze(0)
+    do_batch_split, do_head_split, do_query_split, split_batch_size, split_head_size, split_query_size = find_sdpa_slice_sizes(query.shape, key.shape, query.element_size(), slice_rate=attention_slice_rate, trigger_rate=sdpa_slice_trigger_rate)
+
+    # Slice SDPA
+    if do_batch_split:
+        batch_size, attn_heads, query_len, _ = query.shape
+        _, _, _, head_dim = value.shape
+        hidden_states = torch.zeros((batch_size, attn_heads, query_len, head_dim), device=query.device, dtype=query.dtype)
+        if attn_mask is not None:
+            attn_mask = attn_mask.expand((query.shape[0], query.shape[1], query.shape[2], key.shape[-2]))
+        for ib in range(batch_size // split_batch_size):
+            start_idx = ib * split_batch_size
+            end_idx = (ib + 1) * split_batch_size
+            if do_head_split:
+                for ih in range(attn_heads // split_head_size): # pylint: disable=invalid-name
+                    start_idx_h = ih * split_head_size
+                    end_idx_h = (ih + 1) * split_head_size
+                    if do_query_split:
+                        for iq in range(query_len // split_query_size): # pylint: disable=invalid-name
+                            start_idx_q = iq * split_query_size
+                            end_idx_q = (iq + 1) * split_query_size
+                            hidden_states[start_idx:end_idx, start_idx_h:end_idx_h, start_idx_q:end_idx_q, :] = original_scaled_dot_product_attention(
+                                query[start_idx:end_idx, start_idx_h:end_idx_h, start_idx_q:end_idx_q, :],
+                                key[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                                value[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                                attn_mask=attn_mask[start_idx:end_idx, start_idx_h:end_idx_h, start_idx_q:end_idx_q, :] if attn_mask is not None else attn_mask,
+                                dropout_p=dropout_p, is_causal=is_causal, **kwargs
+                            )
+                    else:
+                        hidden_states[start_idx:end_idx, start_idx_h:end_idx_h, :, :] = original_scaled_dot_product_attention(
+                            query[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                            key[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                            value[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                            attn_mask=attn_mask[start_idx:end_idx, start_idx_h:end_idx_h, :, :] if attn_mask is not None else attn_mask,
+                            dropout_p=dropout_p, is_causal=is_causal, **kwargs
+                        )
+            else:
+                hidden_states[start_idx:end_idx, :, :, :] = original_scaled_dot_product_attention(
+                    query[start_idx:end_idx, :, :, :],
+                    key[start_idx:end_idx, :, :, :],
+                    value[start_idx:end_idx, :, :, :],
+                    attn_mask=attn_mask[start_idx:end_idx, :, :, :] if attn_mask is not None else attn_mask,
+                    dropout_p=dropout_p, is_causal=is_causal, **kwargs
+                )
+        torch.xpu.synchronize(query.device)
+    else:
+        hidden_states = original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
+    if is_unsqueezed:
+        hidden_states = hidden_states.squeeze(0)
+    return hidden_states
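To make the slicing arithmetic concrete, here is a small torch-free sketch. `find_split_size` is copied from the file above (minus the unreachable trailing `return`). `attention_block_gib` is a hypothetical helper that mirrors the `slice_batch_size` expression, and the shapes are made-up example values.

```python
from functools import cache


def attention_block_gib(attn_heads, query_len, key_len, element_size):
    """Estimated size in GiB of one batch item's attention score block."""
    return attn_heads * (query_len * key_len) * element_size / 1024 / 1024 / 1024


@cache
def find_split_size(original_size, slice_block_size, slice_rate=2):
    # Walk down from the full size until a divisor fits within slice_rate.
    split_size = original_size
    while True:
        if (split_size * slice_block_size) <= slice_rate and original_size % split_size == 0:
            return split_size
        split_size = split_size - 1
        if split_size <= 1:
            return 1


# Example values (assumptions): 16 heads, 4096 query and key tokens, fp16 (2 bytes).
per_item = attention_block_gib(16, 4096, 4096, 2)
print(per_item)                                      # GiB per batch item
print(find_split_size(8, per_item, slice_rate=2))    # batch of 8, 2 GiB budget
print(find_split_size(8, per_item, slice_rate=0.5))  # batch of 8, 0.5 GiB budget
```

With these numbers each batch item needs 0.5 GiB, so a 2 GiB budget allows slices of 4 items, while a 0.5 GiB budget (the default `IPEX_ATTENTION_SLICE_RATE`) falls back to one item per slice.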
--- a/library/ipex/diffusers.py
+++ b/library/ipex/diffusers.py
@@ -0,0 +1,126 @@
+from functools import wraps
+import torch
+import diffusers # pylint: disable=import-error
+from diffusers.utils import torch_utils # pylint: disable=import-error, unused-import # noqa: F401
+
+# pylint: disable=protected-access, missing-function-docstring, line-too-long
+
+
+# Diffusers FreeU
+# Diffusers is imported before ipex hijacks so fourier_filter needs hijacking too
+original_fourier_filter = diffusers.utils.torch_utils.fourier_filter
+@wraps(diffusers.utils.torch_utils.fourier_filter)
+def fourier_filter(x_in, threshold, scale):
+    return_dtype = x_in.dtype
+    return original_fourier_filter(x_in.to(dtype=torch.float32), threshold, scale).to(dtype=return_dtype)
+
+
+# fp64 workaround: compute the rotary embeddings in float32
+class FluxPosEmbed(torch.nn.Module):
+    def __init__(self, theta: int, axes_dim):
+        super().__init__()
+        self.theta = theta
+        self.axes_dim = axes_dim
+
+    def forward(self, ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+        n_axes = ids.shape[-1]
+        cos_out = []
+        sin_out = []
+        pos = ids.float()
+        for i in range(n_axes):
+            cos, sin = diffusers.models.embeddings.get_1d_rotary_pos_embed(
+                self.axes_dim[i],
+                pos[:, i],
+                theta=self.theta,
+                repeat_interleave_real=True,
+                use_real=True,
+                freqs_dtype=torch.float32,
+            )
+            cos_out.append(cos)
+            sin_out.append(sin)
+        freqs_cos = torch.cat(cos_out, dim=-1).to(ids.device)
+        freqs_sin = torch.cat(sin_out, dim=-1).to(ids.device)
+        return freqs_cos, freqs_sin
+
+
+def hidream_rope(pos: torch.Tensor, dim: int, theta: int) -> torch.Tensor:
+    assert dim % 2 == 0, "The dimension must be even."
+    return_device = pos.device
+    pos = pos.to("cpu")
+
+    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
+    omega = 1.0 / (theta**scale)
+
+    batch_size, seq_length = pos.shape
+    out = torch.einsum("...n,d->...nd", pos, omega)
+    cos_out = torch.cos(out)
+    sin_out = torch.sin(out)
+
+    stacked_out = torch.stack([cos_out, -sin_out, sin_out, cos_out], dim=-1)
+    out = stacked_out.view(batch_size, -1, dim // 2, 2, 2)
+    return out.to(return_device, dtype=torch.float32)
+
+
+def get_1d_sincos_pos_embed_from_grid(embed_dim, pos, output_type="np"):
+    if output_type == "np":
+        return diffusers.models.embeddings.get_1d_sincos_pos_embed_from_grid_np(embed_dim=embed_dim, pos=pos)
+    if embed_dim % 2 != 0:
+        raise ValueError("embed_dim must be divisible by 2")
+
+    omega = torch.arange(embed_dim // 2, device=pos.device, dtype=torch.float32)
+    omega /= embed_dim / 2.0
+    omega = 1.0 / 10000**omega  # (D/2,)
+
+    pos = pos.reshape(-1)  # (M,)
+    out = torch.outer(pos, omega)  # (M, D/2), outer product
+
+    emb_sin = torch.sin(out)  # (M, D/2)
+    emb_cos = torch.cos(out)  # (M, D/2)
+
+    emb = torch.concat([emb_sin, emb_cos], dim=1)  # (M, D)
+    return emb
+
+
+def apply_rotary_emb(x, freqs_cis, use_real: bool = True, use_real_unbind_dim: int = -1):
+    if use_real:
+        cos, sin = freqs_cis  # [S, D]
+        cos = cos[None, None]
+        sin = sin[None, None]
+        cos, sin = cos.to(x.device), sin.to(x.device)
+
+        if use_real_unbind_dim == -1:
+            # Used for flux, cogvideox, hunyuan-dit
+            x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)  # [B, S, H, D//2]
+            x_rotated = torch.stack([-x_imag, x_real], dim=-1).flatten(3)
+        elif use_real_unbind_dim == -2:
+            # Used for Stable Audio, OmniGen, CogView4 and Cosmos
+            x_real, x_imag = x.reshape(*x.shape[:-1], 2, -1).unbind(-2)  # [B, S, H, D//2]
+            x_rotated = torch.cat([-x_imag, x_real], dim=-1)
+        else:
+            raise ValueError(f"`use_real_unbind_dim={use_real_unbind_dim}` but should be -1 or -2.")
+
+        out = (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
+        return out
+    else:
+        # used for lumina
+        # force cpu with Alchemist
+        x_rotated = torch.view_as_complex(x.to("cpu").float().reshape(*x.shape[:-1], -1, 2))
+        freqs_cis = freqs_cis.to("cpu").unsqueeze(2)
+        x_out = torch.view_as_real(x_rotated * freqs_cis).flatten(3)
+        return x_out.type_as(x).to(x.device)
+
+
+def ipex_diffusers(device_supports_fp64=False):
+    diffusers.utils.torch_utils.fourier_filter = fourier_filter
+    if not device_supports_fp64:
+        # get around lazy imports
+        from diffusers.models import embeddings as diffusers_embeddings # pylint: disable=import-error, unused-import # noqa: F401
+        from diffusers.models import transformers as diffusers_transformers # pylint: disable=import-error, unused-import # noqa: F401
+        from diffusers.models import controlnets as diffusers_controlnets # pylint: disable=import-error, unused-import # noqa: F401
+        diffusers.models.embeddings.get_1d_sincos_pos_embed_from_grid = get_1d_sincos_pos_embed_from_grid
+        diffusers.models.embeddings.FluxPosEmbed = FluxPosEmbed
+        diffusers.models.embeddings.apply_rotary_emb = apply_rotary_emb
+        diffusers.models.transformers.transformer_flux.FluxPosEmbed = FluxPosEmbed
+        diffusers.models.transformers.transformer_lumina2.apply_rotary_emb = apply_rotary_emb
+        diffusers.models.controlnets.controlnet_flux.FluxPosEmbed = FluxPosEmbed
+        diffusers.models.transformers.transformer_hidream_image.rope = hidream_rope
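`fourier_filter` above follows a pattern that recurs in these hijacks: keep a reference to the original callable, wrap it with `functools.wraps`, convert the input, call the original, convert the result back, then assign the wrapper over the original attribute. The sketch below shows that pattern on a stand-in namespace so it runs without torch or diffusers. `fake_lib` and `halve` are hypothetical names, and `float`/`int` stand in for the float32 and original-dtype round trip.

```python
from functools import wraps
from types import SimpleNamespace


def _halve(x):
    """Stand-in for a library function that only accepts floats."""
    if not isinstance(x, float):
        raise TypeError("halve expects a float")
    return x / 2


fake_lib = SimpleNamespace(halve=_halve)

# 1. Keep a handle on the original before replacing it.
original_halve = fake_lib.halve


# 2. Wrap it: convert the input, call the original, convert the result back.
@wraps(fake_lib.halve)
def halve(x):
    return_type = type(x)
    return return_type(original_halve(float(x)))


# 3. Install the wrapper where callers look the function up.
fake_lib.halve = halve

print(fake_lib.halve(8))        # int in, int out
print(fake_lib.halve.__name__)  # wraps() keeps the original metadata
```

Code that imported the function by name before the patch keeps the old reference, which is why `ipex_diffusers` assigns `FluxPosEmbed` on each module that holds its own copy.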
--- a/library/ipex/hijacks.py
+++ b/library/ipex/hijacks.py
@@ -0,0 +1,466 @@
+import os
+from functools import wraps
+from contextlib import nullcontext
+import torch
+import numpy as np
+
+torch_version = float(torch.__version__[:3]) # NOTE: reads only the first three characters, so "2.10" would parse as 2.1
+current_xpu_device = f"xpu:{torch.xpu.current_device()}"
+device_supports_fp64 = torch.xpu.has_fp64_dtype() if hasattr(torch.xpu, "has_fp64_dtype") else torch.xpu.get_device_properties(current_xpu_device).has_fp64
+
+if os.environ.get('IPEX_FORCE_ATTENTION_SLICE', '0') == '0':
+    if (torch.xpu.get_device_properties(current_xpu_device).total_memory / 1024 / 1024 / 1024) > 4.1:
+        try:
+            x = torch.ones((33000,33000), dtype=torch.float32, device=current_xpu_device)
+            del x
+            torch.xpu.empty_cache()
+            use_dynamic_attention = False
+        except Exception:
+            use_dynamic_attention = True
+    else:
+        use_dynamic_attention = True
+else:
+    use_dynamic_attention = os.environ.get('IPEX_FORCE_ATTENTION_SLICE', '0') == '1'
+
+# pylint: disable=protected-access, missing-function-docstring, line-too-long, unnecessary-lambda, no-else-return
+
+class DummyDataParallel(torch.nn.Module): # pylint: disable=missing-class-docstring, unused-argument, too-few-public-methods
+    def __new__(cls, module, device_ids=None, output_device=None, dim=0): # pylint: disable=unused-argument
+        if isinstance(device_ids, list) and len(device_ids) > 1:
+            print("IPEX backend doesn't support DataParallel on multiple XPU devices")
+        return module.to(f"xpu:{torch.xpu.current_device()}")
+
+def return_null_context(*args, **kwargs): # pylint: disable=unused-argument
+    return nullcontext()
+
+@property
+def is_cuda(self):
+    return self.device.type in {"xpu", "cuda"}
+
+def check_device_type(device, device_type: str) -> bool:
+    if device is None or type(device) not in {str, int, torch.device}:
+        return False
+    else:
+        return bool(torch.device(device).type == device_type)
+
+def check_cuda(device) -> bool:
+    return bool(isinstance(device, int) or check_device_type(device, "cuda"))
+
+def return_xpu(device): # keep the device instance type, i.e. return a string if the input is a string
+    return f"xpu:{torch.xpu.current_device()}" if device is None else f"xpu:{device.split(':')[-1]}" if isinstance(device, str) and ":" in device else f"xpu:{device}" if isinstance(device, int) else torch.device(f"xpu:{device.index}" if device.index is not None else "xpu") if isinstance(device, torch.device) else "xpu"
+
+
+# Autocast
+original_autocast_init = torch.amp.autocast_mode.autocast.__init__
+@wraps(torch.amp.autocast_mode.autocast.__init__)
+def autocast_init(self, device_type=None, dtype=None, enabled=True, cache_enabled=None):
+    if device_type is None or check_cuda(device_type):
+        return original_autocast_init(self, device_type="xpu", dtype=dtype, enabled=enabled, cache_enabled=cache_enabled)
+    else:
+        return original_autocast_init(self, device_type=device_type, dtype=dtype, enabled=enabled, cache_enabled=cache_enabled)
+
+
+original_grad_scaler_init = torch.amp.grad_scaler.GradScaler.__init__
+@wraps(torch.amp.grad_scaler.GradScaler.__init__)
+def GradScaler_init(self, device: str = None, init_scale: float = 2.0**16, growth_factor: float = 2.0, backoff_factor: float = 0.5, growth_interval: int = 2000, enabled: bool = True):
+    if device is None or check_cuda(device):
+        return original_grad_scaler_init(self, device=return_xpu(device), init_scale=init_scale, growth_factor=growth_factor, backoff_factor=backoff_factor, growth_interval=growth_interval, enabled=enabled)
+    else:
+        return original_grad_scaler_init(self, device=device, init_scale=init_scale, growth_factor=growth_factor, backoff_factor=backoff_factor, growth_interval=growth_interval, enabled=enabled)
+
+
+original_is_autocast_enabled = torch.is_autocast_enabled
+@wraps(torch.is_autocast_enabled)
+def torch_is_autocast_enabled(device_type=None):
+    if device_type is None or check_cuda(device_type):
+        return original_is_autocast_enabled(return_xpu(device_type))
+    else:
+        return original_is_autocast_enabled(device_type)
+
+
+original_get_autocast_dtype = torch.get_autocast_dtype
+@wraps(torch.get_autocast_dtype)
+def torch_get_autocast_dtype(device_type=None):
+    if device_type is None or check_cuda(device_type) or check_device_type(device_type, "xpu"):
+        return torch.bfloat16
+    else:
+        return original_get_autocast_dtype(device_type)
+
+
+# Latent Antialias CPU Offload:
+# IPEX 2.5 and above have partial support, but it does not work most of the time.
+original_interpolate = torch.nn.functional.interpolate
+@wraps(torch.nn.functional.interpolate)
+def interpolate(tensor, size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None, antialias=False): # pylint: disable=too-many-arguments
+    if mode in {'bicubic', 'bilinear'}:
+        return_device = tensor.device
+        return_dtype = tensor.dtype
+        return original_interpolate(tensor.to("cpu", dtype=torch.float32), size=size, scale_factor=scale_factor, mode=mode,
+        align_corners=align_corners, recompute_scale_factor=recompute_scale_factor, antialias=antialias).to(return_device, dtype=return_dtype)
+    else:
+        return original_interpolate(tensor, size=size, scale_factor=scale_factor, mode=mode,
+        align_corners=align_corners, recompute_scale_factor=recompute_scale_factor, antialias=antialias)
+
+
+# Diffusers Float64 (Alchemist GPUs don't support 64 bit):
+original_from_numpy = torch.from_numpy
+@wraps(torch.from_numpy)
+def from_numpy(ndarray):
+    if ndarray.dtype == float:
+        return original_from_numpy(ndarray.astype("float32"))
+    else:
+        return original_from_numpy(ndarray)
+
+original_as_tensor = torch.as_tensor
+@wraps(torch.as_tensor)
+def as_tensor(data, dtype=None, device=None):
+    if check_cuda(device):
+        device = return_xpu(device)
+    if isinstance(data, np.ndarray) and data.dtype == float and not check_device_type(device, "cpu"):
+        return original_as_tensor(data, dtype=torch.float32, device=device)
+    else:
+        return original_as_tensor(data, dtype=dtype, device=device)
+
+
+if not use_dynamic_attention:
+    original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention
+else:
+    # 32 bit attention workarounds for Alchemist:
+    try:
+        from .attention import dynamic_scaled_dot_product_attention as original_scaled_dot_product_attention
+    except Exception: # pylint: disable=broad-exception-caught
+        original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention
+
+@wraps(torch.nn.functional.scaled_dot_product_attention)
+def scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
+    if query.dtype != key.dtype:
+        key = key.to(dtype=query.dtype)
+    if query.dtype != value.dtype:
+        value = value.to(dtype=query.dtype)
+    if attn_mask is not None and query.dtype != attn_mask.dtype:
+        attn_mask = attn_mask.to(dtype=query.dtype)
+    return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
+
+# Data Type Errors:
+original_torch_bmm = torch.bmm
+@wraps(torch.bmm)
+def torch_bmm(input, mat2, *, out=None):
+    if input.dtype != mat2.dtype:
+        mat2 = mat2.to(dtype=input.dtype)
+    return original_torch_bmm(input, mat2, out=out)
+
+# Diffusers FreeU
+original_fft_fftn = torch.fft.fftn
+@wraps(torch.fft.fftn)
+def fft_fftn(input, s=None, dim=None, norm=None, *, out=None):
+    return_dtype = input.dtype
+    return original_fft_fftn(input.to(dtype=torch.float32), s=s, dim=dim, norm=norm, out=out).to(dtype=return_dtype)
+
+# Diffusers FreeU
+original_fft_ifftn = torch.fft.ifftn
+@wraps(torch.fft.ifftn)
+def fft_ifftn(input, s=None, dim=None, norm=None, *, out=None):
+    return_dtype = input.dtype
+    return original_fft_ifftn(input.to(dtype=torch.float32), s=s, dim=dim, norm=norm, out=out).to(dtype=return_dtype)
+
+# A1111 FP16
+original_functional_group_norm = torch.nn.functional.group_norm
+@wraps(torch.nn.functional.group_norm)
+def functional_group_norm(input, num_groups, weight=None, bias=None, eps=1e-05):
+    if weight is not None and input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and weight is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_group_norm(input, num_groups, weight=weight, bias=bias, eps=eps)
+
+# A1111 BF16
+original_functional_layer_norm = torch.nn.functional.layer_norm
+@wraps(torch.nn.functional.layer_norm)
+def functional_layer_norm(input, normalized_shape, weight=None, bias=None, eps=1e-05):
+    if weight is not None and input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and weight is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_layer_norm(input, normalized_shape, weight=weight, bias=bias, eps=eps)
+
+# Training
+original_functional_linear = torch.nn.functional.linear
+@wraps(torch.nn.functional.linear)
+def functional_linear(input, weight, bias=None):
+    if input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_linear(input, weight, bias=bias)
+
+original_functional_conv1d = torch.nn.functional.conv1d
+@wraps(torch.nn.functional.conv1d)
+def functional_conv1d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
+    if input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_conv1d(input, weight, bias=bias, stride=stride, padding=padding, dilation=dilation, groups=groups)
+
+original_functional_conv2d = torch.nn.functional.conv2d
+@wraps(torch.nn.functional.conv2d)
+def functional_conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
+    if input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_conv2d(input, weight, bias=bias, stride=stride, padding=padding, dilation=dilation, groups=groups)
+
+# LTX Video
+original_functional_conv3d = torch.nn.functional.conv3d
+@wraps(torch.nn.functional.conv3d)
+def functional_conv3d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
+    if input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_conv3d(input, weight, bias=bias, stride=stride, padding=padding, dilation=dilation, groups=groups)
+
+# SwinIR BF16:
+original_functional_pad = torch.nn.functional.pad
+@wraps(torch.nn.functional.pad)
+def functional_pad(input, pad, mode='constant', value=None):
+    if mode == 'reflect' and input.dtype == torch.bfloat16:
+        return original_functional_pad(input.to(torch.float32), pad, mode=mode, value=value).to(dtype=torch.bfloat16)
+    else:
+        return original_functional_pad(input, pad, mode=mode, value=value)
+
+
+original_torch_tensor = torch.tensor
+@wraps(torch.tensor)
+def torch_tensor(data, *args, dtype=None, device=None, **kwargs):
+    global device_supports_fp64
+    if check_cuda(device):
+        device = return_xpu(device)
+    if not device_supports_fp64:
+        if check_device_type(device, "xpu"):
+            if dtype == torch.float64:
+                dtype = torch.float32
+            elif dtype is None and (hasattr(data, "dtype") and (data.dtype == torch.float64 or data.dtype == float)):
+                dtype = torch.float32
+    return original_torch_tensor(data, *args, dtype=dtype, device=device, **kwargs)
+
+torch.Tensor.original_Tensor_to = torch.Tensor.to
+@wraps(torch.Tensor.to)
+def Tensor_to(self, device=None, *args, **kwargs):
+    if check_cuda(device):
+        return self.original_Tensor_to(return_xpu(device), *args, **kwargs)
+    else:
+        return self.original_Tensor_to(device, *args, **kwargs)
+
+original_Tensor_cuda = torch.Tensor.cuda
+@wraps(torch.Tensor.cuda)
+def Tensor_cuda(self, device=None, *args, **kwargs):
+    if device is None or check_cuda(device):
+        return self.to(return_xpu(device), *args, **kwargs)
+    else:
+        return original_Tensor_cuda(self, device, *args, **kwargs)
+
+original_Tensor_pin_memory = torch.Tensor.pin_memory
+@wraps(torch.Tensor.pin_memory)
+def Tensor_pin_memory(self, device=None, *args, **kwargs):
+    if device is None or check_cuda(device):
+        return original_Tensor_pin_memory(self, return_xpu(device), *args, **kwargs)
+    else:
+        return original_Tensor_pin_memory(self, device, *args, **kwargs)
+
+original_UntypedStorage_init = torch.UntypedStorage.__init__
+@wraps(torch.UntypedStorage.__init__)
+def UntypedStorage_init(*args, device=None, **kwargs):
+    if check_cuda(device):
+        return original_UntypedStorage_init(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_UntypedStorage_init(*args, device=device, **kwargs)
+
+if torch_version >= 2.4:
+    original_UntypedStorage_to = torch.UntypedStorage.to
+    @wraps(torch.UntypedStorage.to)
+    def UntypedStorage_to(self, *args, device=None, **kwargs):
+        if check_cuda(device):
+            return original_UntypedStorage_to(self, *args, device=return_xpu(device), **kwargs)
+        else:
+            return original_UntypedStorage_to(self, *args, device=device, **kwargs)
+
+    original_UntypedStorage_cuda = torch.UntypedStorage.cuda
+    @wraps(torch.UntypedStorage.cuda)
+    def UntypedStorage_cuda(self, device=None, non_blocking=False, **kwargs):
+        if device is None or check_cuda(device):
+            return self.to(device=return_xpu(device), non_blocking=non_blocking, **kwargs)
+        else:
+            return original_UntypedStorage_cuda(self, device=device, non_blocking=non_blocking, **kwargs)
+
+original_torch_empty = torch.empty
+@wraps(torch.empty)
+def torch_empty(*args, device=None, **kwargs):
+    if check_cuda(device):
+        return original_torch_empty(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_empty(*args, device=device, **kwargs)
+
+original_torch_randn = torch.randn
+@wraps(torch.randn)
+def torch_randn(*args, device=None, dtype=None, **kwargs):
+    if dtype is bytes:
+        dtype = None
+    if check_cuda(device):
+        return original_torch_randn(*args, device=return_xpu(device), dtype=dtype, **kwargs)
+    else:
+        return original_torch_randn(*args, device=device, dtype=dtype, **kwargs)
+
+original_torch_ones = torch.ones
+@wraps(torch.ones)
+def torch_ones(*args, device=None, **kwargs):
+    if check_cuda(device):
+        return original_torch_ones(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_ones(*args, device=device, **kwargs)
+
+original_torch_zeros = torch.zeros
+@wraps(torch.zeros)
+def torch_zeros(*args, device=None, **kwargs):
+    if check_cuda(device):
+        return original_torch_zeros(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_zeros(*args, device=device, **kwargs)
+
+original_torch_full = torch.full
+@wraps(torch.full)
+def torch_full(*args, device=None, **kwargs):
+    if check_cuda(device):
+        return original_torch_full(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_full(*args, device=device, **kwargs)
+
+original_torch_linspace = torch.linspace
+@wraps(torch.linspace)
+def torch_linspace(*args, device=None, **kwargs):
+    if check_cuda(device):
+        return original_torch_linspace(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_linspace(*args, device=device, **kwargs)
+
+original_torch_eye = torch.eye
+@wraps(torch.eye)
+def torch_eye(*args, device=None, **kwargs):
+    if check_cuda(device):
+        return original_torch_eye(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_eye(*args, device=device, **kwargs)
+
+original_torch_load = torch.load
+@wraps(torch.load)
+def torch_load(f, map_location=None, *args, **kwargs):
+    if map_location is None or check_cuda(map_location):
+        return original_torch_load(f, *args, map_location=return_xpu(map_location), **kwargs)
+    else:
+        return original_torch_load(f, *args, map_location=map_location, **kwargs)
+
+@wraps(torch.cuda.synchronize)
+def torch_cuda_synchronize(device=None):
+    if check_cuda(device):
+        return torch.xpu.synchronize(return_xpu(device))
+    else:
+        return torch.xpu.synchronize(device)
+
+@wraps(torch.cuda.device)
+def torch_cuda_device(device):
+    if check_cuda(device):
+        return torch.xpu.device(return_xpu(device))
+    else:
+        return torch.xpu.device(device)
+
+@wraps(torch.cuda.set_device)
+def torch_cuda_set_device(device):
+    if check_cuda(device):
+        torch.xpu.set_device(return_xpu(device))
+    else:
+        torch.xpu.set_device(device)
+
+# torch.Generator has to be a class for isinstance checks
+original_torch_Generator = torch.Generator
+class torch_Generator(original_torch_Generator):
+    def __new__(self, device=None):
+        # can't hijack __init__ because of C override so use return super().__new__
+        if check_cuda(device):
+            return super().__new__(self, return_xpu(device))
+        else:
+            return super().__new__(self, device)
+
+
+# Hijack Functions:
+def ipex_hijacks():
+    global device_supports_fp64
+    if torch_version >= 2.4:
+        torch.UntypedStorage.cuda = UntypedStorage_cuda
+        torch.UntypedStorage.to = UntypedStorage_to
+    torch.tensor = torch_tensor
+    torch.Tensor.to = Tensor_to
+    torch.Tensor.cuda = Tensor_cuda
+    torch.Tensor.pin_memory = Tensor_pin_memory
+    torch.UntypedStorage.__init__ = UntypedStorage_init
+    torch.empty = torch_empty
+    torch.randn = torch_randn
+    torch.ones = torch_ones
+    torch.zeros = torch_zeros
+    torch.full = torch_full
+    torch.linspace = torch_linspace
+    torch.eye = torch_eye
+    torch.load = torch_load
+    torch.cuda.synchronize = torch_cuda_synchronize
+    torch.cuda.device = torch_cuda_device
+    torch.cuda.set_device = torch_cuda_set_device
+
+    torch.Generator = torch_Generator
+    torch._C.Generator = torch_Generator
+
+    torch.backends.cuda.sdp_kernel = return_null_context
+    torch.nn.DataParallel = DummyDataParallel
+    torch.UntypedStorage.is_cuda = is_cuda
+    torch.amp.autocast_mode.autocast.__init__ = autocast_init
+
+    torch.nn.functional.interpolate = interpolate
+    torch.nn.functional.scaled_dot_product_attention = scaled_dot_product_attention
+    torch.nn.functional.group_norm = functional_group_norm
+    torch.nn.functional.layer_norm = functional_layer_norm
+    torch.nn.functional.linear = functional_linear
+    torch.nn.functional.conv1d = functional_conv1d
+    torch.nn.functional.conv2d = functional_conv2d
+    torch.nn.functional.conv3d = functional_conv3d
+    torch.nn.functional.pad = functional_pad
+
+    torch.bmm = torch_bmm
+    torch.fft.fftn = fft_fftn
+    torch.fft.ifftn = fft_ifftn
+    if not device_supports_fp64:
+        torch.from_numpy = from_numpy
+        torch.as_tensor = as_tensor
+
+    # AMP:
+    torch.amp.grad_scaler.GradScaler.__init__ = GradScaler_init
+    torch.is_autocast_enabled = torch_is_autocast_enabled
+    torch.get_autocast_gpu_dtype = torch_get_autocast_dtype
+    torch.get_autocast_dtype = torch_get_autocast_dtype
+
+    if hasattr(torch.xpu, "amp"):
+        if not hasattr(torch.xpu.amp, "custom_fwd"):
+            torch.xpu.amp.custom_fwd = torch.cuda.amp.custom_fwd
+            torch.xpu.amp.custom_bwd = torch.cuda.amp.custom_bwd
+        if not hasattr(torch.xpu.amp, "GradScaler"):
+            torch.xpu.amp.GradScaler = torch.amp.grad_scaler.GradScaler
+        torch.cuda.amp = torch.xpu.amp
+    else:
+        if not hasattr(torch.amp, "custom_fwd"):
+            torch.amp.custom_fwd = torch.cuda.amp.custom_fwd
+            torch.amp.custom_bwd = torch.cuda.amp.custom_bwd
+        torch.cuda.amp = torch.amp
+
+    if not hasattr(torch.cuda.amp, "common"):
+        torch.cuda.amp.common = nullcontext()
+    torch.cuda.amp.common.amp_definitely_not_available = lambda: False
+
+    return device_supports_fp64
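The device-redirecting wrappers in this file all follow one pattern: keep a reference to the original callable, wrap it with `functools.wraps`, rewrite any CUDA device argument, then reassign the attribute. Below is a minimal sketch of that pattern using only the standard library. `fake_lib` is a hypothetical stand-in for the `torch` namespace, and the `check_cuda` and `return_xpu` shown here are simplified assumptions, not the helpers defined earlier in this file.

```python
from functools import wraps
import types

# Hypothetical stand-in for a library namespace such as torch.
fake_lib = types.SimpleNamespace()


def _empty(*size, device=None):
    """Pretend factory function: records the size and device it was given."""
    return {"size": size, "device": device}


fake_lib.empty = _empty


def check_cuda(device):
    # Simplified assumption: only string devices are inspected here.
    return isinstance(device, str) and "cuda" in device


def return_xpu(device):
    # Simplified assumption: rewrite "cuda" / "cuda:N" to "xpu" / "xpu:N".
    return device.replace("cuda", "xpu")


# Keep the original so the wrapper can delegate to it.
original_empty = fake_lib.empty


@wraps(fake_lib.empty)
def hijacked_empty(*args, device=None, **kwargs):
    if check_cuda(device):
        return original_empty(*args, device=return_xpu(device), **kwargs)
    return original_empty(*args, device=device, **kwargs)


# Install the wrapper in place of the original attribute.
fake_lib.empty = hijacked_empty

print(fake_lib.empty(2, 3, device="cuda:0"))
print(fake_lib.empty(1, device="cpu"))
```

Because `wraps` copies the name and docstring of the original, callers that introspect the patched function still see the original metadata.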
--- a/library/jpeg_xl_util.py
+++ b/library/jpeg_xl_util.py
@@ -0,0 +1,186 @@
+# Modified from https://github.com/Fraetor/jxl_decode Original license: MIT
+# Added partial read support for up to 200x speedup
+
+import os
+from typing import List, Tuple
+
+class JXLBitstream:
+    """
+    A stream of bits with methods for easy handling.
+    """
+
+    def __init__(self, file, offset: int = 0, offsets: List[List[int]] = None):
+        self.shift = 0
+        self.bitstream = bytearray()
+        self.file = file
+        self.offset = offset
+        self.offsets = offsets
+        if self.offsets:
+            self.offset = self.offsets[0][1]
+            self.previous_data_len = 0
+            self.index = 0
+        self.file.seek(self.offset)
+
+    def get_bits(self, length: int = 1) -> int:
+        if self.offsets and self.shift + length > self.previous_data_len + self.offsets[self.index][2]:
+            self.partial_to_read_length = length
+            if self.shift < self.previous_data_len + self.offsets[self.index][2]:
+                self.partial_read(0, length)
+            self.bitstream.extend(self.file.read(self.partial_to_read_length))
+        else:
+            self.bitstream.extend(self.file.read(length))
+        bitmask = 2**length - 1
+        bits = (int.from_bytes(self.bitstream, "little") >> self.shift) & bitmask
+        self.shift += length
+        return bits
+
+    def partial_read(self, current_length: int, length: int) -> None:
+        self.previous_data_len += self.offsets[self.index][2]
+        to_read_length = self.previous_data_len - (self.shift + current_length)
+        self.bitstream.extend(self.file.read(to_read_length))
+        current_length += to_read_length
+        self.partial_to_read_length -= to_read_length
+        self.index += 1
+        self.file.seek(self.offsets[self.index][1])
+        if self.shift + length > self.previous_data_len + self.offsets[self.index][2]:
+            self.partial_read(current_length, length)
+
+
+def decode_codestream(file, offset: int = 0, offsets: List[List[int]] = None) -> Tuple[int,int]:
+    """
+    Decodes the actual codestream.
+    JXL codestream specification: http://www-internal/2022/18181-1
+    """
+
+    # Convert codestream to int within an object to get some handy methods.
+    codestream = JXLBitstream(file, offset=offset, offsets=offsets)
+
+    # Skip signature
+    codestream.get_bits(16)
+
+    # SizeHeader
+    div8 = codestream.get_bits(1)
+    if div8:
+        height = 8 * (1 + codestream.get_bits(5))
+    else:
+        distribution = codestream.get_bits(2)
+        match distribution:
+            case 0:
+                height = 1 + codestream.get_bits(9)
+            case 1:
+                height = 1 + codestream.get_bits(13)
+            case 2:
+                height = 1 + codestream.get_bits(18)
+            case 3:
+                height = 1 + codestream.get_bits(30)
+    ratio = codestream.get_bits(3)
+    if div8 and not ratio:
+        width = 8 * (1 + codestream.get_bits(5))
+    elif not ratio:
+        distribution = codestream.get_bits(2)
+        match distribution:
+            case 0:
+                width = 1 + codestream.get_bits(9)
+            case 1:
+                width = 1 + codestream.get_bits(13)
+            case 2:
+                width = 1 + codestream.get_bits(18)
+            case 3:
+                width = 1 + codestream.get_bits(30)
+    else:
+        match ratio:
+            case 1:
+                width = height
+            case 2:
+                width = (height * 12) // 10
+            case 3:
+                width = (height * 4) // 3
+            case 4:
+                width = (height * 3) // 2
+            case 5:
+                width = (height * 16) // 9
+            case 6:
+                width = (height * 5) // 4
+            case 7:
+                width = (height * 2) // 1
+    return width, height
+
+
+def decode_container(file) -> Tuple[int,int]:
+    """
+    Parses the ISOBMFF container, extracts the codestream, and decodes it.
+    JXL container specification: http://www-internal/2022/18181-2
+    """
+
+    def parse_box(file, file_start: int) -> dict:
+        file.seek(file_start)
+        LBox = int.from_bytes(file.read(4), "big")
+        XLBox = None
+        if 1 < LBox <= 8:
+            raise ValueError(f"Invalid LBox at byte {file_start}.")
+        if LBox == 1:
+            file.seek(file_start + 8)
+            XLBox = int.from_bytes(file.read(8), "big")
+            if XLBox <= 16:
+                raise ValueError(f"Invalid XLBox at byte {file_start}.")
+        if XLBox:
+            header_length = 16
+            box_length = XLBox
+        else:
+            header_length = 8
+            if LBox == 0:
+                box_length = os.fstat(file.fileno()).st_size - file_start
+            else:
+                box_length = LBox
+        file.seek(file_start + 4)
+        box_type = file.read(4)
+        file.seek(file_start)
+        return {
+            "length": box_length,
+            "type": box_type,
+            "offset": header_length,
+        }
+
+    file.seek(0)
+    # Reject files missing required boxes. These two boxes are required to be at
+    # the start and contain no values, so we can manually check their presence.
+    # Signature box. (Redundant as has already been checked.)
+    if file.read(12) != bytes.fromhex("0000000C 4A584C20 0D0A870A"):
+        raise ValueError("Invalid signature box.")
+    # File Type box.
+    if file.read(20) != bytes.fromhex(
+        "00000014 66747970 6A786C20 00000000 6A786C20"
+    ):
+        raise ValueError("Invalid file type box.")
+
+    offset = 0
+    offsets = []
+    data_offset_not_found = True
+    container_pointer = 32
+    file_size = os.fstat(file.fileno()).st_size
+    while data_offset_not_found:
+        box = parse_box(file, container_pointer)
+        match box["type"]:
+            case b"jxlc":
+                offset = container_pointer + box["offset"]
+                data_offset_not_found = False
+            case b"jxlp":
+                file.seek(container_pointer + box["offset"])
+                index = int.from_bytes(file.read(4), "big")
+                offsets.append([index, container_pointer + box["offset"] + 4, box["length"] - box["offset"] - 4])
+        container_pointer += box["length"]
+        if container_pointer >= file_size:
+            data_offset_not_found = False
+
+    if offsets:
+        offsets.sort(key=lambda i: i[0])
+    file.seek(0)
+
+    return decode_codestream(file, offset=offset, offsets=offsets)
+
+
+def get_jxl_size(path: str) -> Tuple[int,int]:
+    with open(path, "rb") as file:
+        if file.read(2) == bytes.fromhex("FF0A"):
+            return decode_codestream(file)
+        return decode_container(file)
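The SizeHeader parsing in `decode_codestream` can be checked by hand on a four-byte header. The sketch below is a reduced illustration, not the code in this file: `read_bits` and `small_size_header` are hypothetical helpers, the input is an in-memory `bytes` object rather than a file, and only the `div8 == 1` branch is covered. The ratio table is copied from the `match ratio` statement above.

```python
from typing import Tuple

# Aspect ratios selected by the 3-bit ratio field (width = height * num // den),
# copied from the match statement in decode_codestream.
RATIOS = {1: (1, 1), 2: (12, 10), 3: (4, 3), 4: (3, 2), 5: (16, 9), 6: (5, 4), 7: (2, 1)}


def read_bits(data: bytes, pos: int, length: int) -> int:
    """Return `length` bits starting at bit `pos`, least significant bit first."""
    return (int.from_bytes(data, "little") >> pos) & ((1 << length) - 1)


def small_size_header(data: bytes) -> Tuple[int, int]:
    """Hypothetical helper: decode (width, height) for the div8 case only."""
    pos = 16  # skip the 0xFF 0x0A codestream signature
    if not read_bits(data, pos, 1):
        raise NotImplementedError("this sketch only covers div8 == 1")
    height = 8 * (1 + read_bits(data, pos + 1, 5))
    ratio = read_bits(data, pos + 6, 3)
    if ratio == 0:
        # No fixed aspect ratio: the width is stored explicitly, also divided by 8.
        width = 8 * (1 + read_bits(data, pos + 9, 5))
    else:
        num, den = RATIOS[ratio]
        width = height * num // den
    return width, height


# div8=1, height field=31, ratio=1 (square)
print(small_size_header(bytes.fromhex("FF0A7F00")))
# div8=1, height field=31, ratio=0, width field=15
print(small_size_header(bytes.fromhex("FF0A3F1E")))
```

The first header decodes to 256x256 and the second to 128x256, both returned as `(width, height)` to match `decode_codestream`.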
--- a/library/lpw_stable_diffusion.py
+++ b/library/lpw_stable_diffusion.py
--- a/library/lumina_models.py
+++ b/library/lumina_models.py
--- a/library/lumina_train_util.py
+++ b/library/lumina_train_util.py
--- a/library/lumina_util.py
+++ b/library/lumina_util.py
@@ -0,0 +1,233 @@
+import json
+import os
+from dataclasses import replace
+from typing import List, Optional, Tuple, Union
+
+import einops
+import torch
+from accelerate import init_empty_weights
+from safetensors import safe_open
+from safetensors.torch import load_file
+from transformers import Gemma2Config, Gemma2Model
+
+from library.utils import setup_logging
+from library import lumina_models, flux_models
+from library.utils import load_safetensors
+import logging
+
+setup_logging()
+logger = logging.getLogger(__name__)
+
+MODEL_VERSION_LUMINA_V2 = "lumina2"
+
+
+def load_lumina_model(
+    ckpt_path: str,
+    dtype: Optional[torch.dtype],
+    device: torch.device,
+    disable_mmap: bool = False,
+    use_flash_attn: bool = False,
+    use_sage_attn: bool = False,
+):
+    """
+    Load the Lumina model from the checkpoint path.
+
+    Args:
+        ckpt_path (str): Path to the checkpoint.
+        dtype (torch.dtype): The data type for the model.
+        device (torch.device): The device to load the model on.
+        disable_mmap (bool, optional): Whether to disable mmap. Defaults to False.
+        use_flash_attn, use_sage_attn (bool, optional): Whether to use flash attention or SageAttention. Both default to False.
+
+    Returns:
+        model (lumina_models.NextDiT): The loaded model.
+    """
+    logger.info("Building Lumina")
+    with torch.device("meta"):
+        model = lumina_models.NextDiT_2B_GQA_patch2_Adaln_Refiner(use_flash_attn=use_flash_attn, use_sage_attn=use_sage_attn).to(dtype)
+
+    logger.info(f"Loading state dict from {ckpt_path}")
+    state_dict = load_safetensors(ckpt_path, device=device, disable_mmap=disable_mmap, dtype=dtype)
+    info = model.load_state_dict(state_dict, strict=False, assign=True)
+    logger.info(f"Loaded Lumina: {info}")
+    return model
+
+
+def load_ae(
+    ckpt_path: str,
+    dtype: torch.dtype,
+    device: Union[str, torch.device],
+    disable_mmap: bool = False,
+) -> flux_models.AutoEncoder:
+    """
+    Load the AutoEncoder model from the checkpoint path.
+
+    Args:
+        ckpt_path (str): Path to the checkpoint.
+        dtype (torch.dtype): The data type for the model.
+        device (Union[str, torch.device]): The device to load the model on.
+        disable_mmap (bool, optional): Whether to disable mmap. Defaults to False.
+
+    Returns:
+        ae (flux_models.AutoEncoder): The loaded model.
+    """
+    logger.info("Building AutoEncoder")
+    with torch.device("meta"):
+        # dev and schnell have the same AE params
+        ae = flux_models.AutoEncoder(flux_models.configs["schnell"].ae_params).to(dtype)
+
+    logger.info(f"Loading state dict from {ckpt_path}")
+    sd = load_safetensors(ckpt_path, device=device, disable_mmap=disable_mmap, dtype=dtype)
+    info = ae.load_state_dict(sd, strict=False, assign=True)
+    logger.info(f"Loaded AE: {info}")
+    return ae
+
+
+def load_gemma2(
+    ckpt_path: Optional[str],
+    dtype: torch.dtype,
+    device: Union[str, torch.device],
+    disable_mmap: bool = False,
+    state_dict: Optional[dict] = None,
+) -> Gemma2Model:
+    """
+    Load the Gemma2 model from the checkpoint path.
+
+    Args:
+        ckpt_path (str): Path to the checkpoint.
+        dtype (torch.dtype): The data type for the model.
+        device (Union[str, torch.device]): The device to load the model on.
+        disable_mmap (bool, optional): Whether to disable mmap. Defaults to False.
+        state_dict (Optional[dict], optional): The state dict to load. Defaults to None.
+
+    Returns:
+        gemma2 (Gemma2Model): The loaded model
+    """
+    logger.info("Building Gemma2")
+    GEMMA2_CONFIG = {
+        "_name_or_path": "google/gemma-2-2b",
+        "architectures": ["Gemma2Model"],
+        "attention_bias": False,
+        "attention_dropout": 0.0,
+        "attn_logit_softcapping": 50.0,
+        "bos_token_id": 2,
+        "cache_implementation": "hybrid",
+        "eos_token_id": 1,
+        "final_logit_softcapping": 30.0,
+        "head_dim": 256,
+        "hidden_act": "gelu_pytorch_tanh",
+        "hidden_activation": "gelu_pytorch_tanh",
+        "hidden_size": 2304,
+        "initializer_range": 0.02,
+        "intermediate_size": 9216,
+        "max_position_embeddings": 8192,
+        "model_type": "gemma2",
+        "num_attention_heads": 8,
+        "num_hidden_layers": 26,
+        "num_key_value_heads": 4,
+        "pad_token_id": 0,
+        "query_pre_attn_scalar": 256,
+        "rms_norm_eps": 1e-06,
+        "rope_theta": 10000.0,
+        "sliding_window": 4096,
+        "torch_dtype": "float32",
+        "transformers_version": "4.44.2",
+        "use_cache": True,
+        "vocab_size": 256000,
+    }
+
+    config = Gemma2Config(**GEMMA2_CONFIG)
+    with init_empty_weights():
+        gemma2 = Gemma2Model._from_config(config)
+
+    if state_dict is not None:
+        sd = state_dict
+    else:
+        logger.info(f"Loading state dict from {ckpt_path}")
+        sd = load_safetensors(ckpt_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype)
+
+    for key in list(sd.keys()):
+        if not key.startswith("model."):
+            break  # the state dict doesn't have the "model." prefix
+        # strip only the leading prefix, not later occurrences of "model."
+        sd[key[len("model."):]] = sd.pop(key)
+
+    info = gemma2.load_state_dict(sd, strict=False, assign=True)
+    logger.info(f"Loaded Gemma2: {info}")
+    return gemma2
+
+
+def unpack_latents(x: torch.Tensor, packed_latent_height: int, packed_latent_width: int) -> torch.Tensor:
+    """
+    x: [b (h w) (c ph pw)] -> [b c (h ph) (w pw)], ph=2, pw=2
+    """
+    x = einops.rearrange(x, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=packed_latent_height, w=packed_latent_width, ph=2, pw=2)
+    return x
+
+
+def pack_latents(x: torch.Tensor) -> torch.Tensor:
+    """
+    x: [b c (h ph) (w pw)] -> [b (h w) (c ph pw)], ph=2, pw=2
+    """
+    x = einops.rearrange(x, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=2, pw=2)
+    return x
+
+
+
+DIFFUSERS_TO_ALPHA_VLLM_MAP: dict[str, str] = {
+    # Embedding layers
+    "time_caption_embed.caption_embedder.0.weight": "cap_embedder.0.weight",
+    "time_caption_embed.caption_embedder.1.weight": "cap_embedder.1.weight",
+    "text_embedder.1.bias": "cap_embedder.1.bias",
+    "patch_embedder.proj.weight": "x_embedder.weight",
+    "patch_embedder.proj.bias": "x_embedder.bias",
+    # Attention modulation
+    "transformer_blocks.().adaln_modulation.1.weight": "layers.().adaLN_modulation.1.weight",
+    "transformer_blocks.().adaln_modulation.1.bias": "layers.().adaLN_modulation.1.bias",
+    # Final layers
+    "final_adaln_modulation.1.weight": "final_layer.adaLN_modulation.1.weight",
+    "final_adaln_modulation.1.bias": "final_layer.adaLN_modulation.1.bias",
+    "final_linear.weight": "final_layer.linear.weight",
+    "final_linear.bias": "final_layer.linear.bias",
+    # Noise refiner
+    "single_transformer_blocks.().adaln_modulation.1.weight": "noise_refiner.().adaLN_modulation.1.weight",
+    "single_transformer_blocks.().adaln_modulation.1.bias": "noise_refiner.().adaLN_modulation.1.bias",
+    "single_transformer_blocks.().attn.to_qkv.weight": "noise_refiner.().attention.qkv.weight",
+    "single_transformer_blocks.().attn.to_out.0.weight": "noise_refiner.().attention.out.weight",
+    # Normalization
+    "transformer_blocks.().norm1.weight": "layers.().attention_norm1.weight",
+    "transformer_blocks.().norm2.weight": "layers.().attention_norm2.weight",
+    # FFN
+    "transformer_blocks.().ff.net.0.proj.weight": "layers.().feed_forward.w1.weight",
+    "transformer_blocks.().ff.net.2.weight": "layers.().feed_forward.w2.weight",
+    "transformer_blocks.().ff.net.4.weight": "layers.().feed_forward.w3.weight",
+}
+
+
+def convert_diffusers_sd_to_alpha_vllm(sd: dict, num_double_blocks: int) -> dict:
+    """Convert Diffusers checkpoint to Alpha-VLLM format"""
+    logger.info("Converting Diffusers checkpoint to Alpha-VLLM format")
+    new_sd = sd.copy()  # Preserve original keys
+
+    for diff_key, alpha_key in DIFFUSERS_TO_ALPHA_VLLM_MAP.items():
+        # Handle block-specific patterns
+        if '().' in diff_key:
+            for block_idx in range(num_double_blocks):
+                block_alpha_key = alpha_key.replace('().', f'{block_idx}.')
+                block_diff_key = diff_key.replace('().', f'{block_idx}.')
+
+                # Copy the block-specific tensor under its Alpha-VLLM name.
+                # A direct dict lookup replaces the linear scan over sd.items().
+                if block_diff_key in sd:
+                    new_sd[block_alpha_key] = sd[block_diff_key]
+        else:
+            # Handle static keys
+            if diff_key in sd:
+                logger.info(f"Replacing {diff_key} with {alpha_key}")
+                new_sd[alpha_key] = sd[diff_key]
+            else:
+                logger.warning(f"Not found: {diff_key}")
+
+
+    logger.info(f"State dict has {len(new_sd)} keys after conversion to Alpha-VLLM format")
+    return new_sd
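The `().` placeholder in `DIFFUSERS_TO_ALPHA_VLLM_MAP` stands for a block index that `convert_diffusers_sd_to_alpha_vllm` expands once per block. A minimal sketch of that expansion follows, using plain integers in place of tensors. `TOY_MAP` holds two entries taken from the map above, and `expand_and_rename` is a hypothetical name for illustration, not a function in this module.

```python
from typing import Dict

# Two entries copied from DIFFUSERS_TO_ALPHA_VLLM_MAP: one static, one per-block.
TOY_MAP: Dict[str, str] = {
    "final_linear.weight": "final_layer.linear.weight",
    "transformer_blocks.().norm1.weight": "layers.().attention_norm1.weight",
}


def expand_and_rename(sd: dict, key_map: Dict[str, str], num_blocks: int) -> dict:
    """Copy values under their new names; original keys are kept as well."""
    new_sd = dict(sd)
    for src, dst in key_map.items():
        if "()." in src:
            # Expand the placeholder into one (source, target) pair per block index.
            pairs = [(src.replace("().", f"{i}."), dst.replace("().", f"{i}.")) for i in range(num_blocks)]
        else:
            pairs = [(src, dst)]
        for src_key, dst_key in pairs:
            if src_key in sd:
                new_sd[dst_key] = sd[src_key]
    return new_sd


toy_sd = {
    "final_linear.weight": 1,
    "transformer_blocks.0.norm1.weight": 2,
    "transformer_blocks.1.norm1.weight": 3,
}
converted = expand_and_rename(toy_sd, TOY_MAP, num_blocks=2)
print(sorted(converted))
```

As in the function above, the result contains both the Diffusers keys and the renamed keys, so its length is the number of original keys plus the number of matches.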
--- a/library/model_util.py
+++ b/library/model_util.py
--- a/library/original_unet.py
+++ b/library/original_unet.py
--- a/library/sai_model_spec.py
+++ b/library/sai_model_spec.py
@@ -0,0 +1,346 @@
+# based on https://github.com/Stability-AI/ModelSpec
+import datetime
+import hashlib
+from io import BytesIO
+import os
+from typing import List, Optional, Tuple, Union
+import safetensors.torch  # binds `safetensors` and makes sure the `torch` submodule is loaded
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+r"""
+# Metadata Example
+metadata = {
+    # === Must ===
+    "modelspec.sai_model_spec": "1.0.0", # Required version ID for the spec
+    "modelspec.architecture": "stable-diffusion-xl-v1-base", # Architecture, reference the ID of the original model of the arch to match the ID
+    "modelspec.implementation": "sgm",
+    "modelspec.title": "Example Model Version 1.0", # Clean, human-readable title. May use your own phrasing/language/etc
+    # === Should ===
+    "modelspec.author": "Example Corp", # Your name or company name
+    "modelspec.description": "This is my example model to show you how to do it!", # Describe the model in your own words/language/etc. Focus on what users need to know
+    "modelspec.date": "2023-07-20", # ISO-8601 compliant date of when the model was created
+    # === Can ===
+    "modelspec.license": "ExampleLicense-1.0", # eg CreativeML Open RAIL, etc.
+    "modelspec.usage_hint": "Use keyword 'example'" # In your own language, very short hints about how the user should use the model
+}
+"""
+
+BASE_METADATA = {
+    # === Must ===
+    "modelspec.sai_model_spec": "1.0.0",  # Required version ID for the spec
+    "modelspec.architecture": None,
+    "modelspec.implementation": None,
+    "modelspec.title": None,
+    "modelspec.resolution": None,
+    # === Should ===
+    "modelspec.description": None,
+    "modelspec.author": None,
+    "modelspec.date": None,
+    # === Can ===
+    "modelspec.license": None,
+    "modelspec.tags": None,
+    "modelspec.merged_from": None,
+    "modelspec.prediction_type": None,
+    "modelspec.timestep_range": None,
+    "modelspec.encoder_layer": None,
+}
+
+# Define only the keys that are referenced elsewhere
+MODELSPEC_TITLE = "modelspec.title"
+
+ARCH_SD_V1 = "stable-diffusion-v1"
+ARCH_SD_V2_512 = "stable-diffusion-v2-512"
+ARCH_SD_V2_768_V = "stable-diffusion-v2-768-v"
+ARCH_SD_XL_V1_BASE = "stable-diffusion-xl-v1-base"
+ARCH_SD3_M = "stable-diffusion-3"  # may be followed by "-m" or "-5-large" etc.
+# ARCH_SD3_UNKNOWN = "stable-diffusion-3"
+ARCH_FLUX_1_DEV = "flux-1-dev"
+ARCH_FLUX_1_UNKNOWN = "flux-1"
+ARCH_LUMINA_2 = "lumina-2"
+ARCH_LUMINA_UNKNOWN = "lumina"
+
+ADAPTER_LORA = "lora"
+ADAPTER_TEXTUAL_INVERSION = "textual-inversion"
+
+IMPL_STABILITY_AI = "https://github.com/Stability-AI/generative-models"
+IMPL_COMFY_UI = "https://github.com/comfyanonymous/ComfyUI"
+IMPL_DIFFUSERS = "diffusers"
+IMPL_FLUX = "https://github.com/black-forest-labs/flux"
+IMPL_LUMINA = "https://github.com/Alpha-VLLM/Lumina-Image-2.0"
+
+PRED_TYPE_EPSILON = "epsilon"
+PRED_TYPE_V = "v"
+
+
+def load_bytes_in_safetensors(tensors):
+    serialized = safetensors.torch.save(tensors)  # avoid shadowing the built-in `bytes`
+    b = BytesIO(serialized)
+
+    b.seek(0)
+    header = b.read(8)
+    n = int.from_bytes(header, "little")
+
+    offset = n + 8
+    b.seek(offset)
+
+    return b.read()
+
+
+def precalculate_safetensors_hashes(state_dict):
+    # calculate each tensor one by one to reduce memory usage
+    hash_sha256 = hashlib.sha256()
+    for tensor in state_dict.values():
+        single_tensor_sd = {"tensor": tensor}
+        bytes_for_tensor = load_bytes_in_safetensors(single_tensor_sd)
+        hash_sha256.update(bytes_for_tensor)
+
+    return f"0x{hash_sha256.hexdigest()}"
+
+
+def update_hash_sha256(metadata: dict, state_dict: dict):
+    raise NotImplementedError
+
+
+def build_metadata(
+    state_dict: Optional[dict],
+    v2: bool,
+    v_parameterization: bool,
+    sdxl: bool,
+    lora: bool,
+    textual_inversion: bool,
+    timestamp: float,
+    title: Optional[str] = None,
+    reso: Optional[Union[int, Tuple[int, int]]] = None,
+    is_stable_diffusion_ckpt: Optional[bool] = None,
+    author: Optional[str] = None,
+    description: Optional[str] = None,
+    license: Optional[str] = None,
+    tags: Optional[str] = None,
+    merged_from: Optional[str] = None,
+    timesteps: Optional[Tuple[int, int]] = None,
+    clip_skip: Optional[int] = None,
+    sd3: Optional[str] = None,
+    flux: Optional[str] = None,
+    lumina: Optional[str] = None,
+):
+    """
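+    sd3: suffix appended to the SD3 architecture ID (e.g. "m"); flux: "dev", other values map to "flux-1"; lumina: "lumina2", other values map to "lumina"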
+    """
+    # if state_dict is None, hash is not calculated
+
+    metadata = {}
+    metadata.update(BASE_METADATA)
+
+    # TODO: implement once we find a way to compute the correct hash without consuming extra memory
+    # if state_dict is not None:
+    # hash = precalculate_safetensors_hashes(state_dict)
+    # metadata["modelspec.hash_sha256"] = hash
+
+    if sdxl:
+        arch = ARCH_SD_XL_V1_BASE
+    elif sd3 is not None:
+        arch = ARCH_SD3_M + "-" + sd3
+    elif flux is not None:
+        if flux == "dev":
+            arch = ARCH_FLUX_1_DEV
+        else:
+            arch = ARCH_FLUX_1_UNKNOWN
+    elif lumina is not None:
+        if lumina == "lumina2":
+            arch = ARCH_LUMINA_2
+        else:
+            arch = ARCH_LUMINA_UNKNOWN
+    elif v2:
+        if v_parameterization:
+            arch = ARCH_SD_V2_768_V
+        else:
+            arch = ARCH_SD_V2_512
+    else:
+        arch = ARCH_SD_V1
+
+    if lora:
+        arch += f"/{ADAPTER_LORA}"
+    elif textual_inversion:
+        arch += f"/{ADAPTER_TEXTUAL_INVERSION}"
+
+    metadata["modelspec.architecture"] = arch
+
+    if not lora and not textual_inversion and is_stable_diffusion_ckpt is None:
+        is_stable_diffusion_ckpt = True  # default is stable diffusion ckpt if not lora and not textual_inversion
+
+    if flux is not None:
+        # Flux
+        impl = IMPL_FLUX
+    elif lumina is not None:
+        # Lumina
+        impl = IMPL_LUMINA
+    elif (lora and sdxl) or textual_inversion or is_stable_diffusion_ckpt:
+        # Stable Diffusion ckpt, TI, SDXL LoRA
+        impl = IMPL_STABILITY_AI
+    else:
+        # v1/v2 LoRA or Diffusers
+        impl = IMPL_DIFFUSERS
+    metadata["modelspec.implementation"] = impl
+
+    if title is None:
+        if lora:
+            title = "LoRA"
+        elif textual_inversion:
+            title = "TextualInversion"
+        else:
+            title = "Checkpoint"
+        title += f"@{timestamp}"
+    metadata[MODELSPEC_TITLE] = title
+
+    if author is not None:
+        metadata["modelspec.author"] = author
+    else:
+        del metadata["modelspec.author"]
+
+    if description is not None:
+        metadata["modelspec.description"] = description
+    else:
+        del metadata["modelspec.description"]
+
+    if merged_from is not None:
+        metadata["modelspec.merged_from"] = merged_from
+    else:
+        del metadata["modelspec.merged_from"]
+
+    if license is not None:
+        metadata["modelspec.license"] = license
+    else:
+        del metadata["modelspec.license"]
+
+    if tags is not None:
+        metadata["modelspec.tags"] = tags
+    else:
+        del metadata["modelspec.tags"]
+
+    # remove microsecond from time
+    int_ts = int(timestamp)
+
+    # time to iso-8601 compliant date
+    date = datetime.datetime.fromtimestamp(int_ts).isoformat()
+    metadata["modelspec.date"] = date
+
+    if reso is not None:
+        # comma separated to tuple
+        if isinstance(reso, str):
+            reso = tuple(map(int, reso.split(",")))
+        if len(reso) == 1:
+            reso = (reso[0], reso[0])
+    else:
+        # resolution is defined in dataset, so use default
+        if sdxl or sd3 is not None or flux is not None or lumina is not None:
+            reso = 1024
+        elif v2 and v_parameterization:
+            reso = 768
+        else:
+            reso = 512
+    if isinstance(reso, int):
+        reso = (reso, reso)
+
+    metadata["modelspec.resolution"] = f"{reso[0]}x{reso[1]}"
+
+    if flux is not None:
+        del metadata["modelspec.prediction_type"]
+    elif v_parameterization:
+        metadata["modelspec.prediction_type"] = PRED_TYPE_V
+    else:
+        metadata["modelspec.prediction_type"] = PRED_TYPE_EPSILON
+
+    if timesteps is not None:
+        if isinstance(timesteps, (str, int)):
+            timesteps = (timesteps, timesteps)
+        if len(timesteps) == 1:
+            timesteps = (timesteps[0], timesteps[0])
+        metadata["modelspec.timestep_range"] = f"{timesteps[0]},{timesteps[1]}"
+    else:
+        del metadata["modelspec.timestep_range"]
+
+    if clip_skip is not None:
+        metadata["modelspec.encoder_layer"] = f"{clip_skip}"
+    else:
+        del metadata["modelspec.encoder_layer"]
+
+    # # assert all values are filled
+    # assert all([v is not None for v in metadata.values()]), metadata
+    if any(v is None for v in metadata.values()):
+        logger.error(f"Internal error: some metadata values are None: {metadata}")
+
+    return metadata
+
+
+# region utils
+
+
+def get_title(metadata: dict) -> Optional[str]:
+    return metadata.get(MODELSPEC_TITLE, None)
+
+
+def load_metadata_from_safetensors(model: str) -> dict:
+    if not model.endswith(".safetensors"):
+        return {}
+
+    with safetensors.safe_open(model, framework="pt") as f:
+        metadata = f.metadata()
+    if metadata is None:
+        metadata = {}
+    return metadata
+
+
+def build_merged_from(models: List[str]) -> str:
+    def get_title(model: str):
+        metadata = load_metadata_from_safetensors(model)
+        title = metadata.get(MODELSPEC_TITLE, None)
+        if title is None:
+            title = os.path.splitext(os.path.basename(model))[0]  # use filename
+        return title
+
+    titles = [get_title(model) for model in models]
+    return ", ".join(titles)
+
+
+# endregion
+
+
+r"""
+if __name__ == "__main__":
+    import argparse
+    import torch
+    from safetensors.torch import load_file
+    from library import train_util
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--ckpt", type=str, required=True)
+    args = parser.parse_args()
+
+    print(f"Loading {args.ckpt}")
+    state_dict = load_file(args.ckpt)
+
+    print(f"Calculating metadata")
+    metadata = get(state_dict, False, False, False, False, "sgm", False, False, "title", "date", 256, 1000, 0)
+    print(metadata)
+    del state_dict
+
+    # by reference implementation
+    with open(args.ckpt, mode="rb") as file_data:
+        file_hash = hashlib.sha256()
+        head_len = struct.unpack("Q", file_data.read(8))  # int64 header length prefix
+        header = json.loads(file_data.read(head_len[0]))  # header itself, json string
+        content = (
+            file_data.read()
+        )  # All other content is tightly packed tensors. Copy to RAM for simplicity, but you can avoid this read with a more careful FS-dependent impl.
+        file_hash.update(content)
+        # ===== Update the hash for modelspec =====
+        by_ref = f"0x{file_hash.hexdigest()}"
+    print(by_ref)
+    print("is same?", by_ref == metadata["modelspec.hash_sha256"])
+
+"""
--- a/library/sd3_models.py
+++ b/library/sd3_models.py
--- a/library/sd3_train_utils.py
+++ b/library/sd3_train_utils.py
@@ -0,0 +1,945 @@
+import argparse
+import math
+import os
+import toml
+import json
+import time
+from typing import Dict, List, Optional, Tuple, Union
+
+import torch
+from safetensors.torch import save_file
+from accelerate import Accelerator, PartialState
+from tqdm import tqdm
+from PIL import Image
+from transformers import CLIPTextModelWithProjection, T5EncoderModel
+
+from library.device_utils import init_ipex, clean_memory_on_device
+
+init_ipex()
+
+# from transformers import CLIPTokenizer
+# from library import model_util
+# , sdxl_model_util, train_util, sdxl_original_unet
+# from library.sdxl_lpw_stable_diffusion import SdxlStableDiffusionLongPromptWeightingPipeline
+from .utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+from library import sd3_models, sd3_utils, strategy_base, train_util
+
+
+def save_models(
+    ckpt_path: str,
+    mmdit: Optional[sd3_models.MMDiT],
+    vae: Optional[sd3_models.SDVAE],
+    clip_l: Optional[CLIPTextModelWithProjection],
+    clip_g: Optional[CLIPTextModelWithProjection],
+    t5xxl: Optional[T5EncoderModel],
+    sai_metadata: Optional[dict],
+    save_dtype: Optional[torch.dtype] = None,
+):
+    r"""
+    Save MMDiT and VAE to a single checkpoint file. Text encoders, if given, are saved to separate
+    `_clip_l`, `_clip_g`, and `_t5xxl` safetensors files next to it.
+    """
+
+    state_dict = {}
+
+    def update_sd(prefix, sd):
+        for k, v in sd.items():
+            key = prefix + k
+            if save_dtype is not None:
+                v = v.detach().clone().to("cpu").to(save_dtype)
+            state_dict[key] = v
+
+    update_sd("model.diffusion_model.", mmdit.state_dict())
+    update_sd("first_stage_model.", vae.state_dict())
+
+    # text encoders are not stored in the unified checkpoint for now; they are saved to separate files below
+    # if clip_l is not None:
+    #     update_sd("text_encoders.clip_l.", clip_l.state_dict())
+    # if clip_g is not None:
+    #     update_sd("text_encoders.clip_g.", clip_g.state_dict())
+    # if t5xxl is not None:
+    #     update_sd("text_encoders.t5xxl.", t5xxl.state_dict())
+
+    save_file(state_dict, ckpt_path, metadata=sai_metadata)
+
+    if clip_l is not None:
+        clip_l_path = ckpt_path.replace(".safetensors", "_clip_l.safetensors")
+        save_file(clip_l.state_dict(), clip_l_path)
+    if clip_g is not None:
+        clip_g_path = ckpt_path.replace(".safetensors", "_clip_g.safetensors")
+        save_file(clip_g.state_dict(), clip_g_path)
+    if t5xxl is not None:
+        t5xxl_path = ckpt_path.replace(".safetensors", "_t5xxl.safetensors")
+        t5xxl_state_dict = t5xxl.state_dict()
+
+        # replace "shared.weight" with a copy of itself to avoid the shared tensor error raised by safetensors save_file
+        shared_weight = t5xxl_state_dict["shared.weight"]
+        shared_weight_copy = shared_weight.detach().clone()
+        t5xxl_state_dict["shared.weight"] = shared_weight_copy
+
+        save_file(t5xxl_state_dict, t5xxl_path)
+
+
+def save_sd3_model_on_train_end(
+    args: argparse.Namespace,
+    save_dtype: torch.dtype,
+    epoch: int,
+    global_step: int,
+    clip_l: Optional[CLIPTextModelWithProjection],
+    clip_g: Optional[CLIPTextModelWithProjection],
+    t5xxl: Optional[T5EncoderModel],
+    mmdit: sd3_models.MMDiT,
+    vae: sd3_models.SDVAE,
+):
+    def sd_saver(ckpt_file, epoch_no, global_step):
+        sai_metadata = train_util.get_sai_model_spec(
+            None, args, False, False, False, is_stable_diffusion_ckpt=True, sd3=mmdit.model_type
+        )
+        save_models(ckpt_file, mmdit, vae, clip_l, clip_g, t5xxl, sai_metadata, save_dtype)
+
+    train_util.save_sd_model_on_train_end_common(args, True, True, epoch, global_step, sd_saver, None)
+
+
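+# Saving per epoch and per step is unified, because the metadata includes epoch/step and the arguments are the same
+# on_epoch_end: True when called at the end of an epoch, False when called after a number of steps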
+def save_sd3_model_on_epoch_end_or_stepwise(
+    args: argparse.Namespace,
+    on_epoch_end: bool,
+    accelerator,
+    save_dtype: torch.dtype,
+    epoch: int,
+    num_train_epochs: int,
+    global_step: int,
+    clip_l: Optional[CLIPTextModelWithProjection],
+    clip_g: Optional[CLIPTextModelWithProjection],
+    t5xxl: Optional[T5EncoderModel],
+    mmdit: sd3_models.MMDiT,
+    vae: sd3_models.SDVAE,
+):
+    def sd_saver(ckpt_file, epoch_no, global_step):
+        sai_metadata = train_util.get_sai_model_spec(
+            None, args, False, False, False, is_stable_diffusion_ckpt=True, sd3=mmdit.model_type
+        )
+        save_models(ckpt_file, mmdit, vae, clip_l, clip_g, t5xxl, sai_metadata, save_dtype)
+
+    train_util.save_sd_model_on_epoch_end_or_stepwise_common(
+        args,
+        on_epoch_end,
+        accelerator,
+        True,
+        True,
+        epoch,
+        num_train_epochs,
+        global_step,
+        sd_saver,
+        None,
+    )
+
+
+def add_sd3_training_arguments(parser: argparse.ArgumentParser):
+    parser.add_argument(
+        "--clip_l",
+        type=str,
+        required=False,
+        help="CLIP-L model path. if not specified, use ckpt's state_dict / CLIP-Lモデルのパス。指定しない場合はckptのstate_dictを使用",
+    )
+    parser.add_argument(
+        "--clip_g",
+        type=str,
+        required=False,
+        help="CLIP-G model path. if not specified, use ckpt's state_dict / CLIP-Gモデルのパス。指定しない場合はckptのstate_dictを使用",
+    )
+    parser.add_argument(
+        "--t5xxl",
+        type=str,
+        required=False,
+        help="T5-XXL model path. if not specified, use ckpt's state_dict / T5-XXLモデルのパス。指定しない場合はckptのstate_dictを使用",
+    )
+    parser.add_argument(
+        "--save_clip",
+        action="store_true",
+        help="[DOES NOT WORK] unified checkpoint is not supported / 統合チェックポイントはまだサポートされていません",
+    )
+    parser.add_argument(
+        "--save_t5xxl",
+        action="store_true",
+        help="[DOES NOT WORK] unified checkpoint is not supported / 統合チェックポイントはまだサポートされていません",
+    )
+
+    parser.add_argument(
+        "--t5xxl_device",
+        type=str,
+        default=None,
+        help="[DOES NOT WORK] not supported yet. T5-XXL device. if not specified, use accelerator's device / T5-XXLデバイス。指定しない場合はacceleratorのデバイスを使用",
+    )
+    parser.add_argument(
+        "--t5xxl_dtype",
+        type=str,
+        default=None,
+        help="[DOES NOT WORK] not supported yet. T5-XXL dtype. if not specified, use default dtype (from mixed precision) / T5-XXL dtype。指定しない場合はデフォルトのdtype（mixed precisionから）を使用",
+    )
+
+    parser.add_argument(
+        "--t5xxl_max_token_length",
+        type=int,
+        default=256,
+        help="maximum token length for T5-XXL. 256 is the default value / T5-XXLの最大トークン長。デフォルトは256",
+    )
+    parser.add_argument(
+        "--apply_lg_attn_mask",
+        action="store_true",
+        help="apply attention mask (zero embs) to CLIP-L and G / CLIP-LとGにアテンションマスク（ゼロ埋め）を適用する",
+    )
+    parser.add_argument(
+        "--apply_t5_attn_mask",
+        action="store_true",
+        help="apply attention mask (zero embs) to T5-XXL / T5-XXLにアテンションマスク（ゼロ埋め）を適用する",
+    )
+    parser.add_argument(
+        "--clip_l_dropout_rate",
+        type=float,
+        default=0.0,
+        help="Dropout rate for CLIP-L encoder, default is 0.0 / CLIP-Lエンコーダのドロップアウト率、デフォルトは0.0",
+    )
+    parser.add_argument(
+        "--clip_g_dropout_rate",
+        type=float,
+        default=0.0,
+        help="Dropout rate for CLIP-G encoder, default is 0.0 / CLIP-Gエンコーダのドロップアウト率、デフォルトは0.0",
+    )
+    parser.add_argument(
+        "--t5_dropout_rate",
+        type=float,
+        default=0.0,
+        help="Dropout rate for T5 encoder, default is 0.0 / T5エンコーダのドロップアウト率、デフォルトは0.0",
+    )
+    parser.add_argument(
+        "--pos_emb_random_crop_rate",
+        type=float,
+        default=0.0,
+        help="Random crop rate for positional embeddings, default is 0.0. Only for SD3.5M"
+        " / 位置埋め込みのランダムクロップ率、デフォルトは0.0。SD3.5M以外では予期しない動作になります",
+    )
+    parser.add_argument(
+        "--enable_scaled_pos_embed",
+        action="store_true",
+        help="Scale position embeddings for each resolution during multi-resolution training. Only for SD3.5M"
+        " / 複数解像度学習時に解像度ごとに位置埋め込みをスケーリングする。SD3.5M以外では予期しない動作になります",
+    )
+
+    # Dependencies on the Diffusers noise sampler have been removed for clarity in training
+
+    parser.add_argument(
+        "--training_shift",
+        type=float,
+        default=1.0,
+        help="Discrete flow shift for training timestep distribution adjustment, applied in addition to the weighting scheme, default is 1.0. /タイムステップ分布のための離散フローシフト、重み付けスキームの上に適用される、デフォルトは1.0。",
+    )
+
+
+def verify_sdxl_training_args(args: argparse.Namespace, supportTextEncoderCaching: bool = True):
+    assert not args.v2, "v2 cannot be enabled in SDXL training / SDXL学習ではv2を有効にすることはできません"
+    if args.v_parameterization:
+        logger.warning("v_parameterization will be unexpected / SDXL学習ではv_parameterizationは想定外の動作になります")
+
+    if args.clip_skip is not None:
+        logger.warning("clip_skip will be unexpected / SDXL学習ではclip_skipは動作しません")
+
+    # if args.multires_noise_iterations:
+    #     logger.info(
+    #         f"Warning: SDXL has been trained with noise_offset={DEFAULT_NOISE_OFFSET}, but noise_offset is disabled due to multires_noise_iterations / SDXLはnoise_offset={DEFAULT_NOISE_OFFSET}で学習されていますが、multires_noise_iterationsが有効になっているためnoise_offsetは無効になります"
+    #     )
+    # else:
+    #     if args.noise_offset is None:
+    #         args.noise_offset = DEFAULT_NOISE_OFFSET
+    #     elif args.noise_offset != DEFAULT_NOISE_OFFSET:
+    #         logger.info(
+    #             f"Warning: SDXL has been trained with noise_offset={DEFAULT_NOISE_OFFSET} / SDXLはnoise_offset={DEFAULT_NOISE_OFFSET}で学習されています"
+    #         )
+    #     logger.info(f"noise_offset is set to {args.noise_offset} / noise_offsetが{args.noise_offset}に設定されました")
+
+    assert (
+        not hasattr(args, "weighted_captions") or not args.weighted_captions
+    ), "weighted_captions cannot be enabled in SDXL training currently / SDXL学習では今のところweighted_captionsを有効にすることはできません"
+
+    if supportTextEncoderCaching:
+        if args.cache_text_encoder_outputs_to_disk and not args.cache_text_encoder_outputs:
+            args.cache_text_encoder_outputs = True
+            logger.warning(
+                "cache_text_encoder_outputs is enabled because cache_text_encoder_outputs_to_disk is enabled / "
+                + "cache_text_encoder_outputs_to_diskが有効になっているためcache_text_encoder_outputsが有効になりました"
+            )
+
+
+# temporarily copied from sd3_minimal_inference.py
+
+
+def get_all_sigmas(sampling: sd3_utils.ModelSamplingDiscreteFlow, steps):
+    start = sampling.timestep(sampling.sigma_max)
+    end = sampling.timestep(sampling.sigma_min)
+    timesteps = torch.linspace(start, end, steps)
+    sigs = [sampling.sigma(ts) for ts in timesteps]
+    sigs += [0.0]
+    return torch.FloatTensor(sigs)
+
+
+def max_denoise(model_sampling, sigmas):
+    max_sigma = float(model_sampling.sigma_max)
+    sigma = float(sigmas[0])
+    return math.isclose(max_sigma, sigma, rel_tol=1e-05) or sigma > max_sigma
+
+
+def do_sample(
+    height: int,
+    width: int,
+    seed: int,
+    cond: Tuple[torch.Tensor, torch.Tensor],
+    neg_cond: Tuple[torch.Tensor, torch.Tensor],
+    mmdit: sd3_models.MMDiT,
+    steps: int,
+    guidance_scale: float,
+    dtype: torch.dtype,
+    device: str,
+):
+    latent = torch.zeros(1, 16, height // 8, width // 8, device=device)
+    latent = latent.to(dtype).to(device)
+
+    # noise = get_noise(seed, latent).to(device)
+    if seed is not None:
+        generator = torch.manual_seed(seed)
+    else:
+        generator = None
+    noise = (
+        torch.randn(latent.size(), dtype=torch.float32, layout=latent.layout, generator=generator, device="cpu")
+        .to(latent.dtype)
+        .to(device)
+    )
+
+    model_sampling = sd3_utils.ModelSamplingDiscreteFlow(shift=3.0)  # 3.0 is for SD3
+
+    sigmas = get_all_sigmas(model_sampling, steps).to(device)
+
+    noise_scaled = model_sampling.noise_scaling(sigmas[0], noise, latent, max_denoise(model_sampling, sigmas))
+
+    c_crossattn = torch.cat([cond[0], neg_cond[0]]).to(device).to(dtype)
+    y = torch.cat([cond[1], neg_cond[1]]).to(device).to(dtype)
+
+    x = noise_scaled.to(device).to(dtype)
+    # print(x.shape)
+
+    # with torch.no_grad():
+    for i in tqdm(range(len(sigmas) - 1)):
+        sigma_hat = sigmas[i]
+
+        timestep = model_sampling.timestep(sigma_hat).float()
+        timestep = torch.FloatTensor([timestep, timestep]).to(device)
+
+        x_c_nc = torch.cat([x, x], dim=0)
+        # print(x_c_nc.shape, timestep.shape, c_crossattn.shape, y.shape)
+
+        mmdit.prepare_block_swap_before_forward()
+        model_output = mmdit(x_c_nc, timestep, context=c_crossattn, y=y)
+        model_output = model_output.float()
+        batched = model_sampling.calculate_denoised(sigma_hat, model_output, x)
+
+        pos_out, neg_out = batched.chunk(2)
+        denoised = neg_out + (pos_out - neg_out) * guidance_scale
+        # print(denoised.shape)
+
+        # d = to_d(x, sigma_hat, denoised)
+        dims_to_append = x.ndim - sigma_hat.ndim
+        sigma_hat_dims = sigma_hat[(...,) + (None,) * dims_to_append]
+        # print(dims_to_append, x.shape, sigma_hat.shape, denoised.shape, sigma_hat_dims.shape)
+        # Converts a denoiser output to a Karras ODE derivative (inlined to_d)
+        d = (x - denoised) / sigma_hat_dims
+
+        dt = sigmas[i + 1] - sigma_hat
+
+        # Euler method
+        x = x + d * dt
+        x = x.to(dtype)
+
+    mmdit.prepare_block_swap_before_forward()
+    return x
+
+
+def sample_images(
+    accelerator: Accelerator,
+    args: argparse.Namespace,
+    epoch,
+    steps,
+    mmdit,
+    vae,
+    text_encoders,
+    sample_prompts_te_outputs,
+    prompt_replacement=None,
+):
+    if steps == 0:
+        if not args.sample_at_first:
+            return
+    else:
+        if args.sample_every_n_steps is None and args.sample_every_n_epochs is None:
+            return
+        if args.sample_every_n_epochs is not None:
+            # sample_every_n_steps is ignored
+            if epoch is None or epoch % args.sample_every_n_epochs != 0:
+                return
+        else:
+            if steps % args.sample_every_n_steps != 0 or epoch is not None:  # steps is not divisible or end of epoch
+                return
+
+    logger.info("")
+    logger.info(f"generating sample images at step / サンプル画像生成 ステップ: {steps}")
+    if not os.path.isfile(args.sample_prompts) and sample_prompts_te_outputs is None:
+        logger.error(f"No prompt file / プロンプトファイルがありません: {args.sample_prompts}")
+        return
+
+    distributed_state = PartialState()  # for multi gpu distributed inference. this is a singleton, so it's safe to use it here
+
+    # unwrap mmdit and text_encoder(s)
+    mmdit = accelerator.unwrap_model(mmdit)
+    text_encoders = None if text_encoders is None else [accelerator.unwrap_model(te) for te in text_encoders]
+    # print([(te.parameters().__next__().device if te is not None else None) for te in text_encoders])
+
+    prompts = train_util.load_prompts(args.sample_prompts)
+
+    save_dir = args.output_dir + "/sample"
+    os.makedirs(save_dir, exist_ok=True)
+
+    # save random state to restore later
+    rng_state = torch.get_rng_state()
+    cuda_rng_state = None
+    try:
+        cuda_rng_state = torch.cuda.get_rng_state() if torch.cuda.is_available() else None
+    except Exception:
+        pass
+
+    if distributed_state.num_processes <= 1:
+        # If only one device is available, just use the original prompt list. We don't need to care about the distribution of prompts.
+        with torch.no_grad(), accelerator.autocast():
+            for prompt_dict in prompts:
+                sample_image_inference(
+                    accelerator,
+                    args,
+                    mmdit,
+                    text_encoders,
+                    vae,
+                    save_dir,
+                    prompt_dict,
+                    epoch,
+                    steps,
+                    sample_prompts_te_outputs,
+                    prompt_replacement,
+                )
+    else:
+        # Create a list with N elements, where each element is a list of prompt_dicts and N is the number of processes (devices) available.
+        # prompt_dicts are assigned round-robin in process order, so that images tend to finish in enum order. This probably only holds when steps and sampler are identical.
+        per_process_prompts = []  # list of lists
+        for i in range(distributed_state.num_processes):
+            per_process_prompts.append(prompts[i :: distributed_state.num_processes])
+
+        with torch.no_grad():
+            with distributed_state.split_between_processes(per_process_prompts) as prompt_dict_lists:
+                for prompt_dict in prompt_dict_lists[0]:
+                    sample_image_inference(
+                        accelerator,
+                        args,
+                        mmdit,
+                        text_encoders,
+                        vae,
+                        save_dir,
+                        prompt_dict,
+                        epoch,
+                        steps,
+                        sample_prompts_te_outputs,
+                        prompt_replacement,
+                    )
+
+    torch.set_rng_state(rng_state)
+    if cuda_rng_state is not None:
+        torch.cuda.set_rng_state(cuda_rng_state)
+
+    clean_memory_on_device(accelerator.device)
+
+
+def sample_image_inference(
+    accelerator: Accelerator,
+    args: argparse.Namespace,
+    mmdit: sd3_models.MMDiT,
+    text_encoders: List[Union[CLIPTextModelWithProjection, T5EncoderModel]],
+    vae: sd3_models.SDVAE,
+    save_dir,
+    prompt_dict,
+    epoch,
+    steps,
+    sample_prompts_te_outputs,
+    prompt_replacement,
+):
+    assert isinstance(prompt_dict, dict)
+    negative_prompt = prompt_dict.get("negative_prompt")
+    sample_steps = prompt_dict.get("sample_steps", 30)
+    width = prompt_dict.get("width", 512)
+    height = prompt_dict.get("height", 512)
+    scale = prompt_dict.get("scale", 7.5)
+    seed = prompt_dict.get("seed")
+    # controlnet_image = prompt_dict.get("controlnet_image")
+    prompt: str = prompt_dict.get("prompt", "")
+    # sampler_name: str = prompt_dict.get("sample_sampler", args.sample_sampler)
+
+    if prompt_replacement is not None:
+        prompt = prompt.replace(prompt_replacement[0], prompt_replacement[1])
+        if negative_prompt is not None:
+            negative_prompt = negative_prompt.replace(prompt_replacement[0], prompt_replacement[1])
+
+    if seed is not None:
+        torch.manual_seed(seed)
+        torch.cuda.manual_seed(seed)
+    else:
+        # True random sample image generation
+        torch.seed()
+        torch.cuda.seed()
+
+    if negative_prompt is None:
+        negative_prompt = ""
+
+    height = max(64, height - height % 8)  # round down to a multiple of 8, minimum 64
+    width = max(64, width - width % 8)  # round down to a multiple of 8, minimum 64
+    logger.info(f"prompt: {prompt}")
+    logger.info(f"negative_prompt: {negative_prompt}")
+    logger.info(f"height: {height}")
+    logger.info(f"width: {width}")
+    logger.info(f"sample_steps: {sample_steps}")
+    logger.info(f"scale: {scale}")
+    # logger.info(f"sample_sampler: {sampler_name}")
+    if seed is not None:
+        logger.info(f"seed: {seed}")
+
+    # encode prompts
+    tokenize_strategy = strategy_base.TokenizeStrategy.get_strategy()
+    encoding_strategy = strategy_base.TextEncodingStrategy.get_strategy()
+
+    def encode_prompt(prpt):
+        text_encoder_conds = []
+        if sample_prompts_te_outputs and prpt in sample_prompts_te_outputs:
+            text_encoder_conds = sample_prompts_te_outputs[prpt]
+            print(f"Using cached text encoder outputs for prompt: {prpt}")
+        if text_encoders is not None:
+            print(f"Encoding prompt: {prpt}")
+            tokens_and_masks = tokenize_strategy.tokenize(prpt)
+            # strategy has apply_t5_attn_mask option
+            encoded_text_encoder_conds = encoding_strategy.encode_tokens(tokenize_strategy, text_encoders, tokens_and_masks)
+
+            # if text_encoder_conds is not cached, use encoded_text_encoder_conds
+            if len(text_encoder_conds) == 0:
+                text_encoder_conds = encoded_text_encoder_conds
+            else:
+                # if encoded_text_encoder_conds is not None, update cached text_encoder_conds
+                for i in range(len(encoded_text_encoder_conds)):
+                    if encoded_text_encoder_conds[i] is not None:
+                        text_encoder_conds[i] = encoded_text_encoder_conds[i]
+        return text_encoder_conds
+
+    lg_out, t5_out, pooled, l_attn_mask, g_attn_mask, t5_attn_mask = encode_prompt(prompt)
+    cond = encoding_strategy.concat_encodings(lg_out, t5_out, pooled)
+
+    # encode negative prompts
+    lg_out, t5_out, pooled, l_attn_mask, g_attn_mask, t5_attn_mask = encode_prompt(negative_prompt)
+    neg_cond = encoding_strategy.concat_encodings(lg_out, t5_out, pooled)
+
+    # sample image
+    clean_memory_on_device(accelerator.device)
+    with accelerator.autocast(), torch.no_grad():
+        # mmdit may be fp8, so we need weight_dtype here. vae is always in that dtype.
+        latents = do_sample(height, width, seed, cond, neg_cond, mmdit, sample_steps, scale, vae.dtype, accelerator.device)
+
+    # latent to image
+    clean_memory_on_device(accelerator.device)
+    org_vae_device = vae.device  # will be on cpu
+    vae.to(accelerator.device)
+    latents = vae.process_out(latents.to(vae.device, dtype=vae.dtype))
+    image = vae.decode(latents)
+    vae.to(org_vae_device)
+    clean_memory_on_device(accelerator.device)
+
+    image = image.float()
+    image = torch.clamp((image + 1.0) / 2.0, min=0.0, max=1.0)[0]
+    decoded_np = 255.0 * np.moveaxis(image.cpu().numpy(), 0, 2)
+    decoded_np = decoded_np.astype(np.uint8)
+
+    image = Image.fromarray(decoded_np)
+    # adding accelerator.wait_for_everyone() here should sync up and ensure that sample images are saved in the same order as the original prompt list
+    # but adding 'enum' to the filename should be enough
+
+    ts_str = time.strftime("%Y%m%d%H%M%S", time.localtime())
+    num_suffix = f"e{epoch:06d}" if epoch is not None else f"{steps:06d}"
+    seed_suffix = "" if seed is None else f"_{seed}"
+    i: int = prompt_dict["enum"]
+    img_filename = f"{'' if args.output_name is None else args.output_name + '_'}{num_suffix}_{i:02d}_{ts_str}{seed_suffix}.png"
+    image.save(os.path.join(save_dir, img_filename))
+
+    # send images to wandb if enabled
+    if "wandb" in [tracker.name for tracker in accelerator.trackers]:
+        wandb_tracker = accelerator.get_tracker("wandb")
+
+        import wandb
+
+        # do not commit here, to avoid inconsistency between training and logging steps
+        wandb_tracker.log({f"sample_{i}": wandb.Image(image, caption=prompt)}, commit=False)  # positive prompt as a caption
+
+
+# region Diffusers
+
+
+from dataclasses import dataclass
+from typing import Optional, Tuple, Union
+
+import numpy as np
+import torch
+
+from diffusers.configuration_utils import ConfigMixin, register_to_config
+from diffusers.schedulers.scheduling_utils import SchedulerMixin
+from diffusers.utils.torch_utils import randn_tensor
+from diffusers.utils import BaseOutput
+
+
+@dataclass
+class FlowMatchEulerDiscreteSchedulerOutput(BaseOutput):
+    """
+    Output class for the scheduler's `step` function output.
+
+    Args:
+        prev_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images):
+            Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the
+            denoising loop.
+    """
+
+    prev_sample: torch.FloatTensor
+
+
+class FlowMatchEulerDiscreteScheduler(SchedulerMixin, ConfigMixin):
+    """
+    Euler scheduler.
+
+    This model inherits from [`SchedulerMixin`] and [`ConfigMixin`]. Check the superclass documentation for the generic
+    methods the library implements for all schedulers such as loading and saving.
+
+    Args:
+        num_train_timesteps (`int`, defaults to 1000):
+            The number of diffusion steps to train the model.
+        shift (`float`, defaults to 1.0):
+            The shift value for the timestep schedule.
+    """
+
+    _compatibles = []
+    order = 1
+
+    @register_to_config
+    def __init__(
+        self,
+        num_train_timesteps: int = 1000,
+        shift: float = 1.0,
+    ):
+        timesteps = np.linspace(1, num_train_timesteps, num_train_timesteps, dtype=np.float32)[::-1].copy()
+        timesteps = torch.from_numpy(timesteps).to(dtype=torch.float32)
+
+        sigmas = timesteps / num_train_timesteps
+        sigmas = shift * sigmas / (1 + (shift - 1) * sigmas)
+
+        self.timesteps = sigmas * num_train_timesteps
+
+        self._step_index = None
+        self._begin_index = None
+
+        self.sigmas = sigmas.to("cpu")  # to avoid too much CPU/GPU communication
+        self.sigma_min = self.sigmas[-1].item()
+        self.sigma_max = self.sigmas[0].item()
+
+    @property
+    def step_index(self):
+        """
+        The index counter for the current timestep. It increases by 1 after each scheduler step.
+        """
+        return self._step_index
+
+    @property
+    def begin_index(self):
+        """
+        The index for the first timestep. It should be set from pipeline with `set_begin_index` method.
+        """
+        return self._begin_index
+
+    # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.set_begin_index
+    def set_begin_index(self, begin_index: int = 0):
+        """
+        Sets the begin index for the scheduler. This function should be run from pipeline before the inference.
+
+        Args:
+            begin_index (`int`):
+                The begin index for the scheduler.
+        """
+        self._begin_index = begin_index
+
+    def scale_noise(
+        self,
+        sample: torch.FloatTensor,
+        timestep: Union[float, torch.FloatTensor],
+        noise: Optional[torch.FloatTensor] = None,
+    ) -> torch.FloatTensor:
+        """
+        Forward process in flow-matching
+
+        Args:
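+            sample (`torch.FloatTensor`):
+                The input sample.
+            timestep (`float` or `torch.FloatTensor`):
+                The current timestep in the diffusion chain.
+            noise (`torch.FloatTensor`):
+                The noise to mix into the sample. Required in practice despite the `None` default.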
+
+        Returns:
+            `torch.FloatTensor`:
+                A scaled input sample.
+        """
+        if self.step_index is None:
+            self._init_step_index(timestep)
+
+        sigma = self.sigmas[self.step_index]
+        sample = sigma * noise + (1.0 - sigma) * sample
+
+        return sample
+
+    def _sigma_to_t(self, sigma):
+        return sigma * self.config.num_train_timesteps
+
+    def set_timesteps(self, num_inference_steps: int, device: Union[str, torch.device] = None):
+        """
+        Sets the discrete timesteps used for the diffusion chain (to be run before inference).
+
+        Args:
+            num_inference_steps (`int`):
+                The number of diffusion steps used when generating samples with a pre-trained model.
+            device (`str` or `torch.device`, *optional*):
+                The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
+        """
+        self.num_inference_steps = num_inference_steps
+
+        timesteps = np.linspace(self._sigma_to_t(self.sigma_max), self._sigma_to_t(self.sigma_min), num_inference_steps)
+
+        sigmas = timesteps / self.config.num_train_timesteps
+        sigmas = self.config.shift * sigmas / (1 + (self.config.shift - 1) * sigmas)
+        sigmas = torch.from_numpy(sigmas).to(dtype=torch.float32, device=device)
+
+        timesteps = sigmas * self.config.num_train_timesteps
+        self.timesteps = timesteps.to(device=device)
+        self.sigmas = torch.cat([sigmas, torch.zeros(1, device=sigmas.device)])
+
+        self._step_index = None
+        self._begin_index = None
+
+    def index_for_timestep(self, timestep, schedule_timesteps=None):
+        if schedule_timesteps is None:
+            schedule_timesteps = self.timesteps
+
+        indices = (schedule_timesteps == timestep).nonzero()
+
+        # The sigma index that is taken for the **very** first `step`
+        # is always the second index (or the last index if there is only 1)
+        # This way we can ensure we don't accidentally skip a sigma in
+        # case we start in the middle of the denoising schedule (e.g. for image-to-image)
+        pos = 1 if len(indices) > 1 else 0
+
+        return indices[pos].item()
+
+    def _init_step_index(self, timestep):
+        if self.begin_index is None:
+            if isinstance(timestep, torch.Tensor):
+                timestep = timestep.to(self.timesteps.device)
+            self._step_index = self.index_for_timestep(timestep)
+        else:
+            self._step_index = self._begin_index
+
+    def step(
+        self,
+        model_output: torch.FloatTensor,
+        timestep: Union[float, torch.FloatTensor],
+        sample: torch.FloatTensor,
+        s_churn: float = 0.0,
+        s_tmin: float = 0.0,
+        s_tmax: float = float("inf"),
+        s_noise: float = 1.0,
+        generator: Optional[torch.Generator] = None,
+        return_dict: bool = True,
+    ) -> Union[FlowMatchEulerDiscreteSchedulerOutput, Tuple]:
+        """
+        Predict the sample at the previous timestep by taking one Euler step of the flow-matching ODE. The model
+        output is used as the predicted velocity (vector field); noise is re-injected only when `s_churn > 0`.
+
+        Args:
+            model_output (`torch.FloatTensor`):
+                The direct output from learned diffusion model.
+            timestep (`float`):
+                The current discrete timestep in the diffusion chain.
+            sample (`torch.FloatTensor`):
+                A current instance of a sample created by the diffusion process.
+            s_churn (`float`):
+            s_tmin  (`float`):
+            s_tmax  (`float`):
+            s_noise (`float`, defaults to 1.0):
+                Scaling factor for noise added to the sample.
+            generator (`torch.Generator`, *optional*):
+                A random number generator.
+            return_dict (`bool`):
+                Whether or not to return a [`FlowMatchEulerDiscreteSchedulerOutput`] or
+                tuple.
+
+        Returns:
+            [`FlowMatchEulerDiscreteSchedulerOutput`] or `tuple`:
+                If return_dict is `True`, [`FlowMatchEulerDiscreteSchedulerOutput`] is
+                returned, otherwise a tuple is returned where the first element is the sample tensor.
+        """
+
+        if isinstance(timestep, (int, torch.IntTensor, torch.LongTensor)):
+            raise ValueError(
+                (
+                    "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
+                    f" `{self.__class__.__name__}.step()` is not supported. Make sure to pass"
+                    " one of the `scheduler.timesteps` as a timestep."
+                ),
+            )
+
+        if self.step_index is None:
+            self._init_step_index(timestep)
+
+        # Upcast to avoid precision issues when computing prev_sample
+        sample = sample.to(torch.float32)
+
+        sigma = self.sigmas[self.step_index]
+
+        gamma = min(s_churn / (len(self.sigmas) - 1), 2**0.5 - 1) if s_tmin <= sigma <= s_tmax else 0.0
+
+        noise = randn_tensor(model_output.shape, dtype=model_output.dtype, device=model_output.device, generator=generator)
+
+        eps = noise * s_noise
+        sigma_hat = sigma * (gamma + 1)
+
+        if gamma > 0:
+            sample = sample + eps * (sigma_hat**2 - sigma**2) ** 0.5
+
+        # 1. Compute the predicted original sample (x_0). The model output is treated as the
+        # velocity v = noise - x_0, and x_t = sigma * noise + (1 - sigma) * x_0,
+        # so x_0 = x_t - sigma * v.
+
+        # if self.config.prediction_type == "vector_field":
+
+        denoised = sample - model_output * sigma
+        # 2. Convert to an ODE derivative
+        derivative = (sample - denoised) / sigma_hat
+
+        dt = self.sigmas[self.step_index + 1] - sigma_hat
+
+        prev_sample = sample + derivative * dt
+        # Cast sample back to model compatible dtype
+        prev_sample = prev_sample.to(model_output.dtype)
+
+        # upon completion increase step index by one
+        self._step_index += 1
+
+        if not return_dict:
+            return (prev_sample,)
+
+        return FlowMatchEulerDiscreteSchedulerOutput(prev_sample=prev_sample)
+
+    def __len__(self):
+        return self.config.num_train_timesteps
+
+
+def get_sigmas(noise_scheduler, timesteps, device, n_dim=4, dtype=torch.float32):
+    sigmas = noise_scheduler.sigmas.to(device=device, dtype=dtype)
+    schedule_timesteps = noise_scheduler.timesteps.to(device)
+    timesteps = timesteps.to(device)
+    step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
+
+    sigma = sigmas[step_indices].flatten()
+    while len(sigma.shape) < n_dim:
+        sigma = sigma.unsqueeze(-1)
+    return sigma
+
+
+def compute_density_for_timestep_sampling(
+    weighting_scheme: str, batch_size: int, logit_mean: float = None, logit_std: float = None, mode_scale: float = None
+):
+    """Compute the density for sampling the timesteps when doing SD3 training.
+
+    Courtesy: This was contributed by Rafie Walker in https://github.com/huggingface/diffusers/pull/8528.
+
+    SD3 paper reference: https://arxiv.org/abs/2403.03206v1.
+    """
+    if weighting_scheme == "logit_normal":
+        # See 3.1 in the SD3 paper ($rf/lognorm(0.00,1.00)$).
+        u = torch.normal(mean=logit_mean, std=logit_std, size=(batch_size,), device="cpu")
+        u = torch.nn.functional.sigmoid(u)
+    elif weighting_scheme == "mode":
+        u = torch.rand(size=(batch_size,), device="cpu")
+        u = 1 - u - mode_scale * (torch.cos(math.pi * u / 2) ** 2 - 1 + u)
+    else:
+        u = torch.rand(size=(batch_size,), device="cpu")
+    return u
+
+
+def compute_loss_weighting_for_sd3(weighting_scheme: str, sigmas=None):
+    """Computes loss weighting scheme for SD3 training.
+
+    Courtesy: This was contributed by Rafie Walker in https://github.com/huggingface/diffusers/pull/8528.
+
+    SD3 paper reference: https://arxiv.org/abs/2403.03206v1.
+    """
+    if weighting_scheme == "sigma_sqrt":
+        weighting = (sigmas**-2.0).float()
+    elif weighting_scheme == "cosmap":
+        bot = 1 - 2 * sigmas + 2 * sigmas**2
+        weighting = 2 / (math.pi * bot)
+    else:
+        weighting = torch.ones_like(sigmas)
+    return weighting
+
+
+# endregion
+
+
+def get_noisy_model_input_and_timesteps(args, latents, noise, device, dtype) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    bsz = latents.shape[0]
+
+    # Sample a random timestep for each image
+    # for weighting schemes where we sample timesteps non-uniformly
+    u = compute_density_for_timestep_sampling(
+        weighting_scheme=args.weighting_scheme,
+        batch_size=bsz,
+        logit_mean=args.logit_mean,
+        logit_std=args.logit_std,
+        mode_scale=args.mode_scale,
+    )
+    t_min = args.min_timestep if args.min_timestep is not None else 0
+    t_max = args.max_timestep if args.max_timestep is not None else 1000
+    shift = args.training_shift
+
+    # Timestep shift: shift > 1 moves the distribution toward the noisy side (overall structure); shift < 1 moves it toward the less-noisy side (fine details)
+    u = (u * shift) / (1 + (shift - 1) * u)
+
+    indices = (u * (t_max - t_min) + t_min).long()
+    timesteps = indices.to(device=device, dtype=dtype)
+
+    # sigmas according to flow matching
+    sigmas = timesteps / 1000
+    sigmas = sigmas.view(-1, 1, 1, 1)
+    noisy_model_input = sigmas * noise + (1.0 - sigmas) * latents
+
+    return noisy_model_input, timesteps, sigmas
--- a/library/sd3_utils.py
+++ b/library/sd3_utils.py
@@ -0,0 +1,302 @@
+from dataclasses import dataclass
+import math
+import re
+from typing import Dict, List, Optional, Union
+import torch
+import safetensors
+from safetensors.torch import load_file
+from accelerate import init_empty_weights
+from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPConfig, CLIPTextConfig
+
+from .utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+from library import sd3_models
+
+# TODO move some of functions to model_util.py
+from library import sdxl_model_util
+
+# region models
+
+# TODO remove dependency on flux_utils
+from library.utils import load_safetensors
+from library.flux_utils import load_t5xxl as flux_utils_load_t5xxl
+
+
+def analyze_state_dict_state(state_dict: Dict, prefix: str = ""):
+    logger.info("Analyzing state dict state...")
+
+    # analyze configs
+    patch_size = state_dict[f"{prefix}x_embedder.proj.weight"].shape[2]
+    depth = state_dict[f"{prefix}x_embedder.proj.weight"].shape[0] // 64
+    num_patches = state_dict[f"{prefix}pos_embed"].shape[1]
+    pos_embed_max_size = round(math.sqrt(num_patches))
+    adm_in_channels = state_dict[f"{prefix}y_embedder.mlp.0.weight"].shape[1]
+    context_shape = state_dict[f"{prefix}context_embedder.weight"].shape
+    qk_norm = "rms" if f"{prefix}joint_blocks.0.context_block.attn.ln_k.weight" in state_dict.keys() else None
+
+    #  x_block_self_attn_layers.append(int(key.split(".x_block.attn2.ln_k.weight")[0].split(".")[-1]))
+    x_block_self_attn_layers = []
+    re_attn = re.compile(r"\.(\d+)\.x_block\.attn2\.ln_k\.weight")
+    for key in list(state_dict.keys()):
+        m = re_attn.search(key)
+        if m:
+            x_block_self_attn_layers.append(int(m.group(1)))
+
+    context_embedder_in_features = context_shape[1]
+    context_embedder_out_features = context_shape[0]
+
+    # only supports 3-5-large, 3-5-medium or 3-medium
+    if qk_norm is not None:
+        if len(x_block_self_attn_layers) == 0:
+            model_type = "3-5-large"
+        else:
+            model_type = "3-5-medium"
+    else:
+        model_type = "3-medium"
+
+    params = sd3_models.SD3Params(
+        patch_size=patch_size,
+        depth=depth,
+        num_patches=num_patches,
+        pos_embed_max_size=pos_embed_max_size,
+        adm_in_channels=adm_in_channels,
+        qk_norm=qk_norm,
+        x_block_self_attn_layers=x_block_self_attn_layers,
+        context_embedder_in_features=context_embedder_in_features,
+        context_embedder_out_features=context_embedder_out_features,
+        model_type=model_type,
+    )
+    logger.info(f"Analyzed state dict state: {params}")
+    return params
+
+
+def load_mmdit(
+    state_dict: Dict, dtype: Optional[Union[str, torch.dtype]], device: Union[str, torch.device], attn_mode: str = "torch"
+) -> sd3_models.MMDiT:
+    mmdit_sd = {}
+
+    mmdit_prefix = "model.diffusion_model."
+    for k in list(state_dict.keys()):
+        if k.startswith(mmdit_prefix):
+            mmdit_sd[k[len(mmdit_prefix) :]] = state_dict.pop(k)
+
+    # load MMDiT
+    logger.info("Building MMDiT")
+    params = analyze_state_dict_state(mmdit_sd)
+    with init_empty_weights():
+        mmdit = sd3_models.create_sd3_mmdit(params, attn_mode)
+
+    logger.info("Loading state dict...")
+    info = mmdit.load_state_dict(mmdit_sd, strict=False, assign=True)
+    logger.info(f"Loaded MMDiT: {info}")
+    return mmdit
+
+
+def load_clip_l(
+    clip_l_path: Optional[str],
+    dtype: Optional[Union[str, torch.dtype]],
+    device: Union[str, torch.device],
+    disable_mmap: bool = False,
+    state_dict: Optional[Dict] = None,
+):
+    clip_l_sd = None
+    if clip_l_path is None:
+        if state_dict is not None and "text_encoders.clip_l.transformer.text_model.embeddings.position_embedding.weight" in state_dict:
+            # found clip_l: remove prefix "text_encoders.clip_l."
+            logger.info("clip_l is included in the checkpoint")
+            clip_l_sd = {}
+            prefix = "text_encoders.clip_l."
+            for k in list(state_dict.keys()):
+                if k.startswith(prefix):
+                    clip_l_sd[k[len(prefix) :]] = state_dict.pop(k)
+        else:
+            logger.info("clip_l is not included in the checkpoint and clip_l_path is not provided")
+            return None
+
+    # load clip_l
+    logger.info("Building CLIP-L")
+    config = CLIPTextConfig(
+        vocab_size=49408,
+        hidden_size=768,
+        intermediate_size=3072,
+        num_hidden_layers=12,
+        num_attention_heads=12,
+        max_position_embeddings=77,
+        hidden_act="quick_gelu",
+        layer_norm_eps=1e-05,
+        dropout=0.0,
+        attention_dropout=0.0,
+        initializer_range=0.02,
+        initializer_factor=1.0,
+        pad_token_id=1,
+        bos_token_id=0,
+        eos_token_id=2,
+        model_type="clip_text_model",
+        projection_dim=768,
+        # torch_dtype="float32",
+        # transformers_version="4.25.0.dev0",
+    )
+    with init_empty_weights():
+        clip = CLIPTextModelWithProjection(config)
+
+    if clip_l_sd is None:
+        logger.info(f"Loading state dict from {clip_l_path}")
+        clip_l_sd = load_safetensors(clip_l_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype)
+
+    if "text_projection.weight" not in clip_l_sd:
+        logger.info("Adding text_projection.weight to clip_l_sd")
+        clip_l_sd["text_projection.weight"] = torch.eye(768, dtype=dtype, device=device)
+
+    info = clip.load_state_dict(clip_l_sd, strict=False, assign=True)
+    logger.info(f"Loaded CLIP-L: {info}")
+    return clip
+
+
+def load_clip_g(
+    clip_g_path: Optional[str],
+    dtype: Optional[Union[str, torch.dtype]],
+    device: Union[str, torch.device],
+    disable_mmap: bool = False,
+    state_dict: Optional[Dict] = None,
+):
+    clip_g_sd = None
+    if state_dict is not None:
+        if "text_encoders.clip_g.transformer.text_model.embeddings.position_embedding.weight" in state_dict:
+            # found clip_g: remove prefix "text_encoders.clip_g."
+            logger.info("clip_g is included in the checkpoint")
+            clip_g_sd = {}
+            prefix = "text_encoders.clip_g."
+            for k in list(state_dict.keys()):
+                if k.startswith(prefix):
+                    clip_g_sd[k[len(prefix) :]] = state_dict.pop(k)
+        elif clip_g_path is None:
+            logger.info("clip_g is not included in the checkpoint and clip_g_path is not provided")
+            return None
+
+    # load clip_g
+    logger.info("Building CLIP-G")
+    config = CLIPTextConfig(
+        vocab_size=49408,
+        hidden_size=1280,
+        intermediate_size=5120,
+        num_hidden_layers=32,
+        num_attention_heads=20,
+        max_position_embeddings=77,
+        hidden_act="gelu",
+        layer_norm_eps=1e-05,
+        dropout=0.0,
+        attention_dropout=0.0,
+        initializer_range=0.02,
+        initializer_factor=1.0,
+        pad_token_id=1,
+        bos_token_id=0,
+        eos_token_id=2,
+        model_type="clip_text_model",
+        projection_dim=1280,
+        # torch_dtype="float32",
+        # transformers_version="4.25.0.dev0",
+    )
+    with init_empty_weights():
+        clip = CLIPTextModelWithProjection(config)
+
+    if clip_g_sd is None:
+        logger.info(f"Loading state dict from {clip_g_path}")
+        clip_g_sd = load_safetensors(clip_g_path, device=str(device), disable_mmap=disable_mmap, dtype=dtype)
+    info = clip.load_state_dict(clip_g_sd, strict=False, assign=True)
+    logger.info(f"Loaded CLIP-G: {info}")
+    return clip
+
+
+def load_t5xxl(
+    t5xxl_path: Optional[str],
+    dtype: Optional[Union[str, torch.dtype]],
+    device: Union[str, torch.device],
+    disable_mmap: bool = False,
+    state_dict: Optional[Dict] = None,
+):
+    t5xxl_sd = None
+    if state_dict is not None:
+        if "text_encoders.t5xxl.transformer.encoder.block.0.layer.0.SelfAttention.k.weight" in state_dict:
+            # found t5xxl: remove prefix "text_encoders.t5xxl."
+            logger.info("t5xxl is included in the checkpoint")
+            t5xxl_sd = {}
+            prefix = "text_encoders.t5xxl."
+            for k in list(state_dict.keys()):
+                if k.startswith(prefix):
+                    t5xxl_sd[k[len(prefix) :]] = state_dict.pop(k)
+        elif t5xxl_path is None:
+            logger.info("t5xxl is not included in the checkpoint and t5xxl_path is not provided")
+            return None
+
+    return flux_utils_load_t5xxl(t5xxl_path, dtype, device, disable_mmap, state_dict=t5xxl_sd)
+
+
+def load_vae(
+    vae_path: Optional[str],
+    vae_dtype: Optional[Union[str, torch.dtype]],
+    device: Optional[Union[str, torch.device]],
+    disable_mmap: bool = False,
+    state_dict: Optional[Dict] = None,
+):
+    vae_sd = {}
+    if vae_path:
+        logger.info(f"Loading VAE from {vae_path}...")
+        vae_sd = load_safetensors(vae_path, device, disable_mmap)
+    else:
+        # remove prefix "first_stage_model."
+        vae_sd = {}
+        vae_prefix = "first_stage_model."
+        for k in list(state_dict.keys()):
+            if k.startswith(vae_prefix):
+                vae_sd[k[len(vae_prefix) :]] = state_dict.pop(k)
+
+    logger.info("Building VAE")
+    vae = sd3_models.SDVAE(vae_dtype, device)
+    logger.info("Loading state dict...")
+    info = vae.load_state_dict(vae_sd)
+    logger.info(f"Loaded VAE: {info}")
+    vae.to(device=device, dtype=vae_dtype)  # make sure it's in the right device and dtype
+    return vae
+
+
+# endregion
+
+
+class ModelSamplingDiscreteFlow:
+    """Helper for sampler scheduling (i.e. timestep/sigma calculations) for Discrete Flow models"""
+
+    def __init__(self, shift=1.0):
+        self.shift = shift
+        timesteps = 1000
+        self.sigmas = self.sigma(torch.arange(1, timesteps + 1, 1))
+
+    @property
+    def sigma_min(self):
+        return self.sigmas[0]
+
+    @property
+    def sigma_max(self):
+        return self.sigmas[-1]
+
+    def timestep(self, sigma):
+        return sigma * 1000
+
+    def sigma(self, timestep: torch.Tensor):
+        timestep = timestep / 1000.0
+        if self.shift == 1.0:
+            return timestep
+        return self.shift * timestep / (1 + (self.shift - 1) * timestep)
+
+    def calculate_denoised(self, sigma, model_output, model_input):
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
+        return model_input - model_output * sigma
+
+    def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
+        # assert max_denoise is False, "max_denoise not implemented"
+        # max_denoise is always True, I'm not sure why it's there
+        return sigma * noise + (1.0 - sigma) * latent_image
--- a/library/sdxl_lpw_stable_diffusion.py
+++ b/library/sdxl_lpw_stable_diffusion.py
--- a/library/sdxl_model_util.py
+++ b/library/sdxl_model_util.py
@@ -0,0 +1,583 @@
+import torch
+import safetensors
+from accelerate import init_empty_weights
+from accelerate.utils.modeling import set_module_tensor_to_device
+from safetensors.torch import load_file, save_file
+from transformers import CLIPTextModel, CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer
+from typing import List
+from diffusers import AutoencoderKL, EulerDiscreteScheduler, UNet2DConditionModel
+from library import model_util
+from library import sdxl_original_unet
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+VAE_SCALE_FACTOR = 0.13025
+MODEL_VERSION_SDXL_BASE_V1_0 = "sdxl_base_v1-0"
+
+# Reference model used to load the Diffusers config
+DIFFUSERS_REF_MODEL_ID_SDXL = "stabilityai/stable-diffusion-xl-base-1.0"
+
+DIFFUSERS_SDXL_UNET_CONFIG = {
+    "act_fn": "silu",
+    "addition_embed_type": "text_time",
+    "addition_embed_type_num_heads": 64,
+    "addition_time_embed_dim": 256,
+    "attention_head_dim": [5, 10, 20],
+    "block_out_channels": [320, 640, 1280],
+    "center_input_sample": False,
+    "class_embed_type": None,
+    "class_embeddings_concat": False,
+    "conv_in_kernel": 3,
+    "conv_out_kernel": 3,
+    "cross_attention_dim": 2048,
+    "cross_attention_norm": None,
+    "down_block_types": ["DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"],
+    "downsample_padding": 1,
+    "dual_cross_attention": False,
+    "encoder_hid_dim": None,
+    "encoder_hid_dim_type": None,
+    "flip_sin_to_cos": True,
+    "freq_shift": 0,
+    "in_channels": 4,
+    "layers_per_block": 2,
+    "mid_block_only_cross_attention": None,
+    "mid_block_scale_factor": 1,
+    "mid_block_type": "UNetMidBlock2DCrossAttn",
+    "norm_eps": 1e-05,
+    "norm_num_groups": 32,
+    "num_attention_heads": None,
+    "num_class_embeds": None,
+    "only_cross_attention": False,
+    "out_channels": 4,
+    "projection_class_embeddings_input_dim": 2816,
+    "resnet_out_scale_factor": 1.0,
+    "resnet_skip_time_act": False,
+    "resnet_time_scale_shift": "default",
+    "sample_size": 128,
+    "time_cond_proj_dim": None,
+    "time_embedding_act_fn": None,
+    "time_embedding_dim": None,
+    "time_embedding_type": "positional",
+    "timestep_post_act": None,
+    "transformer_layers_per_block": [1, 2, 10],
+    "up_block_types": ["CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"],
+    "upcast_attention": False,
+    "use_linear_projection": True,
+}
+
+
+def convert_sdxl_text_encoder_2_checkpoint(checkpoint, max_length):
+    SDXL_KEY_PREFIX = "conditioner.embedders.1.model."
+
+    # Basically the same as the SD2 conversion. logit_scale is needed later, so it is returned as well
+    # logit_scale is used when saving the checkpoint
+    def convert_key(key):
+        # common conversion
+        key = key.replace(SDXL_KEY_PREFIX + "transformer.", "text_model.encoder.")
+        key = key.replace(SDXL_KEY_PREFIX, "text_model.")
+
+        if "resblocks" in key:
+            # resblocks conversion
+            key = key.replace(".resblocks.", ".layers.")
+            if ".ln_" in key:
+                key = key.replace(".ln_", ".layer_norm")
+            elif ".mlp." in key:
+                key = key.replace(".c_fc.", ".fc1.")
+                key = key.replace(".c_proj.", ".fc2.")
+            elif ".attn.out_proj" in key:
+                key = key.replace(".attn.out_proj.", ".self_attn.out_proj.")
+            elif ".attn.in_proj" in key:
+                key = None  # special case, handled later
+            else:
+                raise ValueError(f"unexpected key in SD: {key}")
+        elif ".positional_embedding" in key:
+            key = key.replace(".positional_embedding", ".embeddings.position_embedding.weight")
+        elif ".text_projection" in key:
+            key = key.replace("text_model.text_projection", "text_projection.weight")
+        elif ".logit_scale" in key:
+            key = None  # handled later
+        elif ".token_embedding" in key:
+            key = key.replace(".token_embedding.weight", ".embeddings.token_embedding.weight")
+        elif ".ln_final" in key:
+            key = key.replace(".ln_final", ".final_layer_norm")
+        # ckpt from comfy has this key: text_model.encoder.text_model.embeddings.position_ids
+        elif ".embeddings.position_ids" in key:
+            key = None  # remove this key: position_ids is not used in newer transformers
+        return key
+
+    keys = list(checkpoint.keys())
+    new_sd = {}
+    for key in keys:
+        new_key = convert_key(key)
+        if new_key is None:
+            continue
+        new_sd[new_key] = checkpoint[key]
+
+    # convert the attention weights
+    for key in keys:
+        if ".resblocks" in key and ".attn.in_proj_" in key:
+            # split into three chunks (q, k, v)
+            values = torch.chunk(checkpoint[key], 3)
+
+            key_suffix = ".weight" if "weight" in key else ".bias"
+            key_pfx = key.replace(SDXL_KEY_PREFIX + "transformer.resblocks.", "text_model.encoder.layers.")
+            key_pfx = key_pfx.replace("_weight", "")
+            key_pfx = key_pfx.replace("_bias", "")
+            key_pfx = key_pfx.replace(".attn.in_proj", ".self_attn.")
+            new_sd[key_pfx + "q_proj" + key_suffix] = values[0]
+            new_sd[key_pfx + "k_proj" + key_suffix] = values[1]
+            new_sd[key_pfx + "v_proj" + key_suffix] = values[2]
+
+    # logit_scale is not part of the Diffusers model, but it must be restored when saving, so return it separately
+    logit_scale = checkpoint.get(SDXL_KEY_PREFIX + "logit_scale", None)
+
+    # temporary workaround for text_projection.weight.weight for Playground-v2
+    if "text_projection.weight.weight" in new_sd:
+        logger.info("convert_sdxl_text_encoder_2_checkpoint: convert text_projection.weight.weight to text_projection.weight")
+        new_sd["text_projection.weight"] = new_sd["text_projection.weight.weight"]
+        del new_sd["text_projection.weight.weight"]
+
+    return new_sd, logit_scale
+
+
+# load state_dict without allocating new tensors
+def _load_state_dict_on_device(model, state_dict, device, dtype=None):
+    # dtype will use fp32 as default
+    missing_keys = list(model.state_dict().keys() - state_dict.keys())
+    unexpected_keys = list(state_dict.keys() - model.state_dict().keys())
+
+    # similar to model.load_state_dict()
+    if not missing_keys and not unexpected_keys:
+        for k in list(state_dict.keys()):
+            set_module_tensor_to_device(model, k, device, value=state_dict.pop(k), dtype=dtype)
+        return "<All keys matched successfully>"
+
+    # error_msgs
+    error_msgs: List[str] = []
+    if missing_keys:
+        error_msgs.insert(0, "Missing key(s) in state_dict: {}. ".format(", ".join('"{}"'.format(k) for k in missing_keys)))
+    if unexpected_keys:
+        error_msgs.insert(0, "Unexpected key(s) in state_dict: {}. ".format(", ".join('"{}"'.format(k) for k in unexpected_keys)))
+
+    raise RuntimeError("Error(s) in loading state_dict for {}:\n\t{}".format(model.__class__.__name__, "\n\t".join(error_msgs)))
+
+
+def load_models_from_sdxl_checkpoint(model_version, ckpt_path, map_location, dtype=None, disable_mmap=False):
+    # model_version is reserved for future use
+    # dtype is used for full_fp16/bf16 integration. Text Encoder will remain fp32, because it runs on CPU when caching
+
+    # Load the state dict
+    if model_util.is_safetensors(ckpt_path):
+        checkpoint = None
+        if disable_mmap:
+            state_dict = safetensors.torch.load(open(ckpt_path, "rb").read())
+        else:
+            try:
+                state_dict = load_file(ckpt_path, device=map_location)
+            except Exception:
+                state_dict = load_file(ckpt_path)  # fall back to the default device if map_location is rejected
+        epoch = None
+        global_step = None
+    else:
+        checkpoint = torch.load(ckpt_path, map_location=map_location)
+        if "state_dict" in checkpoint:
+            state_dict = checkpoint["state_dict"]
+            epoch = checkpoint.get("epoch", 0)
+            global_step = checkpoint.get("global_step", 0)
+        else:
+            state_dict = checkpoint
+            epoch = 0
+            global_step = 0
+        checkpoint = None
+
+    # U-Net
+    logger.info("building U-Net")
+    with init_empty_weights():
+        unet = sdxl_original_unet.SdxlUNet2DConditionModel()
+
+    logger.info("loading U-Net from checkpoint")
+    unet_sd = {}
+    for k in list(state_dict.keys()):
+        if k.startswith("model.diffusion_model."):
+            unet_sd[k.replace("model.diffusion_model.", "")] = state_dict.pop(k)
+    info = _load_state_dict_on_device(unet, unet_sd, device=map_location, dtype=dtype)
+    logger.info(f"U-Net: {info}")
+
+    # Text Encoders
+    logger.info("building text encoders")
+
+    # Text Encoder 1 is the same as Stability AI's SDXL
+    text_model1_cfg = CLIPTextConfig(
+        vocab_size=49408,
+        hidden_size=768,
+        intermediate_size=3072,
+        num_hidden_layers=12,
+        num_attention_heads=12,
+        max_position_embeddings=77,
+        hidden_act="quick_gelu",
+        layer_norm_eps=1e-05,
+        dropout=0.0,
+        attention_dropout=0.0,
+        initializer_range=0.02,
+        initializer_factor=1.0,
+        pad_token_id=1,
+        bos_token_id=0,
+        eos_token_id=2,
+        model_type="clip_text_model",
+        projection_dim=768,
+        # torch_dtype="float32",
+        # transformers_version="4.25.0.dev0",
+    )
+    with init_empty_weights():
+        text_model1 = CLIPTextModel._from_config(text_model1_cfg)
+
+    # Text Encoder 2 is different from Stability AI's SDXL. SDXL uses open clip, but we use the model from HuggingFace.
+    # Note: Tokenizer from HuggingFace is different from SDXL. We must use open clip's tokenizer.
+    text_model2_cfg = CLIPTextConfig(
+        vocab_size=49408,
+        hidden_size=1280,
+        intermediate_size=5120,
+        num_hidden_layers=32,
+        num_attention_heads=20,
+        max_position_embeddings=77,
+        hidden_act="gelu",
+        layer_norm_eps=1e-05,
+        dropout=0.0,
+        attention_dropout=0.0,
+        initializer_range=0.02,
+        initializer_factor=1.0,
+        pad_token_id=1,
+        bos_token_id=0,
+        eos_token_id=2,
+        model_type="clip_text_model",
+        projection_dim=1280,
+        # torch_dtype="float32",
+        # transformers_version="4.25.0.dev0",
+    )
+    with init_empty_weights():
+        text_model2 = CLIPTextModelWithProjection(text_model2_cfg)
+
+    logger.info("loading text encoders from checkpoint")
+    te1_sd = {}
+    te2_sd = {}
+    for k in list(state_dict.keys()):
+        if k.startswith("conditioner.embedders.0.transformer."):
+            te1_sd[k.replace("conditioner.embedders.0.transformer.", "")] = state_dict.pop(k)
+        elif k.startswith("conditioner.embedders.1.model."):
+            te2_sd[k] = state_dict.pop(k)
+
+    # Remove position_ids: the latest transformers raises an error if this key is present
+    if "text_model.embeddings.position_ids" in te1_sd:
+        te1_sd.pop("text_model.embeddings.position_ids")
+
+    info1 = _load_state_dict_on_device(text_model1, te1_sd, device=map_location)  # remain fp32
+    logger.info(f"text encoder 1: {info1}")
+
+    converted_sd, logit_scale = convert_sdxl_text_encoder_2_checkpoint(te2_sd, max_length=77)
+    info2 = _load_state_dict_on_device(text_model2, converted_sd, device=map_location)  # remain fp32
+    logger.info(f"text encoder 2: {info2}")
+
+    # prepare vae
+    logger.info("building VAE")
+    vae_config = model_util.create_vae_diffusers_config()
+    with init_empty_weights():
+        vae = AutoencoderKL(**vae_config)
+
+    logger.info("loading VAE from checkpoint")
+    converted_vae_checkpoint = model_util.convert_ldm_vae_checkpoint(state_dict, vae_config)
+    info = _load_state_dict_on_device(vae, converted_vae_checkpoint, device=map_location, dtype=dtype)
+    logger.info(f"VAE: {info}")
+
+    ckpt_info = (epoch, global_step) if epoch is not None else None
+    return text_model1, text_model2, vae, unet, logit_scale, ckpt_info
+
+
+def make_unet_conversion_map():
+    unet_conversion_map_layer = []
+
+    for i in range(3):  # num_blocks is 3 in sdxl
+        # loop over downblocks/upblocks
+        for j in range(2):
+            # loop over resnets/attentions for downblocks
+            hf_down_res_prefix = f"down_blocks.{i}.resnets.{j}."
+            sd_down_res_prefix = f"input_blocks.{3*i + j + 1}.0."
+            unet_conversion_map_layer.append((sd_down_res_prefix, hf_down_res_prefix))
+
+            if i < 3:
+                # no attention layers in down_blocks.3
+                hf_down_atn_prefix = f"down_blocks.{i}.attentions.{j}."
+                sd_down_atn_prefix = f"input_blocks.{3*i + j + 1}.1."
+                unet_conversion_map_layer.append((sd_down_atn_prefix, hf_down_atn_prefix))
+
+        for j in range(3):
+            # loop over resnets/attentions for upblocks
+            hf_up_res_prefix = f"up_blocks.{i}.resnets.{j}."
+            sd_up_res_prefix = f"output_blocks.{3*i + j}.0."
+            unet_conversion_map_layer.append((sd_up_res_prefix, hf_up_res_prefix))
+
+            # if i > 0: commentout for sdxl
+            # no attention layers in up_blocks.0
+            hf_up_atn_prefix = f"up_blocks.{i}.attentions.{j}."
+            sd_up_atn_prefix = f"output_blocks.{3*i + j}.1."
+            unet_conversion_map_layer.append((sd_up_atn_prefix, hf_up_atn_prefix))
+
+        if i < 3:
+            # no downsample in down_blocks.3
+            hf_downsample_prefix = f"down_blocks.{i}.downsamplers.0.conv."
+            sd_downsample_prefix = f"input_blocks.{3*(i+1)}.0.op."
+            unet_conversion_map_layer.append((sd_downsample_prefix, hf_downsample_prefix))
+
+            # no upsample in up_blocks.3
+            hf_upsample_prefix = f"up_blocks.{i}.upsamplers.0."
+            sd_upsample_prefix = f"output_blocks.{3*i + 2}.{2}."  # change for sdxl
+            unet_conversion_map_layer.append((sd_upsample_prefix, hf_upsample_prefix))
+
+    hf_mid_atn_prefix = "mid_block.attentions.0."
+    sd_mid_atn_prefix = "middle_block.1."
+    unet_conversion_map_layer.append((sd_mid_atn_prefix, hf_mid_atn_prefix))
+
+    for j in range(2):
+        hf_mid_res_prefix = f"mid_block.resnets.{j}."
+        sd_mid_res_prefix = f"middle_block.{2*j}."
+        unet_conversion_map_layer.append((sd_mid_res_prefix, hf_mid_res_prefix))
+
+    unet_conversion_map_resnet = [
+        # (stable-diffusion, HF Diffusers)
+        ("in_layers.0.", "norm1."),
+        ("in_layers.2.", "conv1."),
+        ("out_layers.0.", "norm2."),
+        ("out_layers.3.", "conv2."),
+        ("emb_layers.1.", "time_emb_proj."),
+        ("skip_connection.", "conv_shortcut."),
+    ]
+
+    unet_conversion_map = []
+    for sd, hf in unet_conversion_map_layer:
+        if "resnets" in hf:
+            for sd_res, hf_res in unet_conversion_map_resnet:
+                unet_conversion_map.append((sd + sd_res, hf + hf_res))
+        else:
+            unet_conversion_map.append((sd, hf))
+
+    for j in range(2):
+        hf_time_embed_prefix = f"time_embedding.linear_{j+1}."
+        sd_time_embed_prefix = f"time_embed.{j*2}."
+        unet_conversion_map.append((sd_time_embed_prefix, hf_time_embed_prefix))
+
+    for j in range(2):
+        hf_label_embed_prefix = f"add_embedding.linear_{j+1}."
+        sd_label_embed_prefix = f"label_emb.0.{j*2}."
+        unet_conversion_map.append((sd_label_embed_prefix, hf_label_embed_prefix))
+
+    unet_conversion_map.append(("input_blocks.0.0.", "conv_in."))
+    unet_conversion_map.append(("out.0.", "conv_norm_out."))
+    unet_conversion_map.append(("out.2.", "conv_out."))
+
+    return unet_conversion_map
+
+
+def convert_diffusers_unet_state_dict_to_sdxl(du_sd):
+    unet_conversion_map = make_unet_conversion_map()
+
+    conversion_map = {hf: sd for sd, hf in unet_conversion_map}
+    return convert_unet_state_dict(du_sd, conversion_map)
+
+
+def convert_unet_state_dict(src_sd, conversion_map):
+    converted_sd = {}
+    for src_key, value in src_sd.items():
+        # scanning the whole conversion map for every key would be slow, so strip fragments from the right to find a matching prefix
+        src_key_fragments = src_key.split(".")[:-1]  # remove weight/bias
+        while len(src_key_fragments) > 0:
+            src_key_prefix = ".".join(src_key_fragments) + "."
+            if src_key_prefix in conversion_map:
+                converted_prefix = conversion_map[src_key_prefix]
+                converted_key = converted_prefix + src_key[len(src_key_prefix) :]
+                converted_sd[converted_key] = value
+                break
+            src_key_fragments.pop(-1)
+        assert len(src_key_fragments) > 0, f"key {src_key} not found in conversion map"
+
+    return converted_sd
+
+
+def convert_sdxl_unet_state_dict_to_diffusers(sd):
+    unet_conversion_map = make_unet_conversion_map()
+
+    conversion_dict = {sd: hf for sd, hf in unet_conversion_map}
+    return convert_unet_state_dict(sd, conversion_dict)
+
+
+def convert_text_encoder_2_state_dict_to_sdxl(checkpoint, logit_scale):
+    def convert_key(key):
+        # remove position_ids
+        if ".position_ids" in key:
+            return None
+
+        # common
+        key = key.replace("text_model.encoder.", "transformer.")
+        key = key.replace("text_model.", "")
+        if "layers" in key:
+            # resblocks conversion
+            key = key.replace(".layers.", ".resblocks.")
+            if ".layer_norm" in key:
+                key = key.replace(".layer_norm", ".ln_")
+            elif ".mlp." in key:
+                key = key.replace(".fc1.", ".c_fc.")
+                key = key.replace(".fc2.", ".c_proj.")
+            elif ".self_attn.out_proj" in key:
+                key = key.replace(".self_attn.out_proj.", ".attn.out_proj.")
+            elif ".self_attn." in key:
+                key = None  # special case (q/k/v are merged), handled later
+            else:
+                raise ValueError(f"unexpected key in Diffusers model: {key}")
+        elif ".position_embedding" in key:
+            key = key.replace("embeddings.position_embedding.weight", "positional_embedding")
+        elif ".token_embedding" in key:
+            key = key.replace("embeddings.token_embedding.weight", "token_embedding.weight")
+        elif "text_projection" in key:  # no dot in key
+            key = key.replace("text_projection.weight", "text_projection")
+        elif "final_layer_norm" in key:
+            key = key.replace("final_layer_norm", "ln_final")
+        return key
+
+    keys = list(checkpoint.keys())
+    new_sd = {}
+    for key in keys:
+        new_key = convert_key(key)
+        if new_key is None:
+            continue
+        new_sd[new_key] = checkpoint[key]
+
+    # convert attention weights
+    for key in keys:
+        if "layers" in key and "q_proj" in key:
+            # concatenate q, k and v into a single in_proj tensor
+            key_q = key
+            key_k = key.replace("q_proj", "k_proj")
+            key_v = key.replace("q_proj", "v_proj")
+
+            value_q = checkpoint[key_q]
+            value_k = checkpoint[key_k]
+            value_v = checkpoint[key_v]
+            value = torch.cat([value_q, value_k, value_v])
+
+            new_key = key.replace("text_model.encoder.layers.", "transformer.resblocks.")
+            new_key = new_key.replace(".self_attn.q_proj.", ".attn.in_proj_")
+            new_sd[new_key] = value
+
+    if logit_scale is not None:
+        new_sd["logit_scale"] = logit_scale
+
+    return new_sd
+
+
+def save_stable_diffusion_checkpoint(
+    output_file,
+    text_encoder1,
+    text_encoder2,
+    unet,
+    epochs,
+    steps,
+    ckpt_info,
+    vae,
+    logit_scale,
+    metadata,
+    save_dtype=None,
+):
+    state_dict = {}
+
+    def update_sd(prefix, sd):
+        for k, v in sd.items():
+            key = prefix + k
+            if save_dtype is not None:
+                v = v.detach().clone().to("cpu").to(save_dtype)
+            state_dict[key] = v
+
+    # Convert the UNet model
+    update_sd("model.diffusion_model.", unet.state_dict())
+
+    # Convert the text encoders
+    update_sd("conditioner.embedders.0.transformer.", text_encoder1.state_dict())
+
+    text_enc2_dict = convert_text_encoder_2_state_dict_to_sdxl(text_encoder2.state_dict(), logit_scale)
+    update_sd("conditioner.embedders.1.model.", text_enc2_dict)
+
+    # Convert the VAE
+    vae_dict = model_util.convert_vae_state_dict(vae.state_dict())
+    update_sd("first_stage_model.", vae_dict)
+
+    # Put together new checkpoint
+    key_count = len(state_dict.keys())
+    new_ckpt = {"state_dict": state_dict}
+
+    # epoch and global_step are sometimes not int
+    if ckpt_info is not None:
+        epochs += ckpt_info[0]
+        steps += ckpt_info[1]
+
+    new_ckpt["epoch"] = epochs
+    new_ckpt["global_step"] = steps
+
+    if model_util.is_safetensors(output_file):
+        save_file(state_dict, output_file, metadata)
+    else:
+        torch.save(new_ckpt, output_file)
+
+    return key_count
+
+
+def save_diffusers_checkpoint(
+    output_dir, text_encoder1, text_encoder2, unet, pretrained_model_name_or_path, vae=None, use_safetensors=False, save_dtype=None
+):
+    from diffusers import StableDiffusionXLPipeline
+
+    # convert U-Net
+    unet_sd = unet.state_dict()
+    du_unet_sd = convert_sdxl_unet_state_dict_to_diffusers(unet_sd)
+
+    diffusers_unet = UNet2DConditionModel(**DIFFUSERS_SDXL_UNET_CONFIG)
+    if save_dtype is not None:
+        diffusers_unet.to(save_dtype)
+    diffusers_unet.load_state_dict(du_unet_sd)
+
+    # create pipeline to save
+    if pretrained_model_name_or_path is None:
+        pretrained_model_name_or_path = DIFFUSERS_REF_MODEL_ID_SDXL
+
+    scheduler = EulerDiscreteScheduler.from_pretrained(pretrained_model_name_or_path, subfolder="scheduler")
+    tokenizer1 = CLIPTokenizer.from_pretrained(pretrained_model_name_or_path, subfolder="tokenizer")
+    tokenizer2 = CLIPTokenizer.from_pretrained(pretrained_model_name_or_path, subfolder="tokenizer_2")
+    if vae is None:
+        vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
+
+    # prevent local path from being saved
+    def remove_name_or_path(model):
+        if hasattr(model, "config"):
+            model.config._name_or_path = None
+
+    remove_name_or_path(diffusers_unet)
+    remove_name_or_path(text_encoder1)
+    remove_name_or_path(text_encoder2)
+    remove_name_or_path(scheduler)
+    remove_name_or_path(tokenizer1)
+    remove_name_or_path(tokenizer2)
+    remove_name_or_path(vae)
+
+    pipeline = StableDiffusionXLPipeline(
+        unet=diffusers_unet,
+        text_encoder=text_encoder1,
+        text_encoder_2=text_encoder2,
+        vae=vae,
+        scheduler=scheduler,
+        tokenizer=tokenizer1,
+        tokenizer_2=tokenizer2,
+    )
+    if save_dtype is not None:
+        pipeline.to(None, save_dtype)
+    pipeline.save_pretrained(output_dir, safe_serialization=use_safetensors)
--- a/library/sdxl_original_control_net.py
+++ b/library/sdxl_original_control_net.py
@@ -0,0 +1,272 @@
+# some parts are modified from Diffusers library (Apache License 2.0)
+
+import math
+from types import SimpleNamespace
+from typing import Any, Optional
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import functional as F
+from einops import rearrange
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+from library import sdxl_original_unet
+from library.sdxl_model_util import convert_sdxl_unet_state_dict_to_diffusers, convert_diffusers_unet_state_dict_to_sdxl
+
+
+class ControlNetConditioningEmbedding(nn.Module):
+    def __init__(self):
+        super().__init__()
+
+        dims = [16, 32, 96, 256]
+
+        self.conv_in = nn.Conv2d(3, dims[0], kernel_size=3, padding=1)
+        self.blocks = nn.ModuleList([])
+
+        for i in range(len(dims) - 1):
+            channel_in = dims[i]
+            channel_out = dims[i + 1]
+            self.blocks.append(nn.Conv2d(channel_in, channel_in, kernel_size=3, padding=1))
+            self.blocks.append(nn.Conv2d(channel_in, channel_out, kernel_size=3, padding=1, stride=2))
+
+        self.conv_out = nn.Conv2d(dims[-1], 320, kernel_size=3, padding=1)
+        nn.init.zeros_(self.conv_out.weight)  # zero module weight
+        nn.init.zeros_(self.conv_out.bias)  # zero module bias
+
+    def forward(self, x):
+        x = self.conv_in(x)
+        x = F.silu(x)
+        for block in self.blocks:
+            x = block(x)
+            x = F.silu(x)
+        x = self.conv_out(x)
+        return x
+
+
+class SdxlControlNet(sdxl_original_unet.SdxlUNet2DConditionModel):
+    def __init__(self, multiplier: Optional[float] = None, **kwargs):
+        super().__init__(**kwargs)
+        self.multiplier = multiplier
+
+        # remove unet layers
+        self.output_blocks = nn.ModuleList([])
+        del self.out
+
+        self.controlnet_cond_embedding = ControlNetConditioningEmbedding()
+
+        dims = [320, 320, 320, 320, 640, 640, 640, 1280, 1280]
+        self.controlnet_down_blocks = nn.ModuleList([])
+        for dim in dims:
+            self.controlnet_down_blocks.append(nn.Conv2d(dim, dim, kernel_size=1))
+            nn.init.zeros_(self.controlnet_down_blocks[-1].weight)  # zero module weight
+            nn.init.zeros_(self.controlnet_down_blocks[-1].bias)  # zero module bias
+
+        self.controlnet_mid_block = nn.Conv2d(1280, 1280, kernel_size=1)
+        nn.init.zeros_(self.controlnet_mid_block.weight)  # zero module weight
+        nn.init.zeros_(self.controlnet_mid_block.bias)  # zero module bias
+
+    def init_from_unet(self, unet: sdxl_original_unet.SdxlUNet2DConditionModel):
+        unet_sd = unet.state_dict()
+        unet_sd = {k: v for k, v in unet_sd.items() if not k.startswith("out")}
+        sd = super().state_dict()
+        sd.update(unet_sd)
+        info = super().load_state_dict(sd, strict=True, assign=True)
+        return info
+
+    def load_state_dict(self, state_dict: dict, strict: bool = True, assign: bool = True) -> Any:
+        # convert state_dict to SAI format
+        unet_sd = {}
+        for k in list(state_dict.keys()):
+            if not k.startswith("controlnet_"):
+                unet_sd[k] = state_dict.pop(k)
+        unet_sd = convert_diffusers_unet_state_dict_to_sdxl(unet_sd)
+        state_dict.update(unet_sd)
+        return super().load_state_dict(state_dict, strict=strict, assign=assign)
+
+    def state_dict(self, destination=None, prefix="", keep_vars=False):
+        # convert state_dict to Diffusers format
+        state_dict = super().state_dict(destination, prefix, keep_vars)
+        control_net_sd = {}
+        for k in list(state_dict.keys()):
+            if k.startswith("controlnet_"):
+                control_net_sd[k] = state_dict.pop(k)
+        state_dict = convert_sdxl_unet_state_dict_to_diffusers(state_dict)
+        state_dict.update(control_net_sd)
+        return state_dict
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        timesteps: Optional[torch.Tensor] = None,
+        context: Optional[torch.Tensor] = None,
+        y: Optional[torch.Tensor] = None,
+        cond_image: Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> torch.Tensor:
+        # broadcast timesteps to batch dimension
+        timesteps = timesteps.expand(x.shape[0])
+
+        t_emb = sdxl_original_unet.get_timestep_embedding(timesteps, self.model_channels, downscale_freq_shift=0)
+        t_emb = t_emb.to(x.dtype)
+        emb = self.time_embed(t_emb)
+
+        assert x.shape[0] == y.shape[0], f"batch size mismatch: {x.shape[0]} != {y.shape[0]}"
+        assert x.dtype == y.dtype, f"dtype mismatch: {x.dtype} != {y.dtype}"
+        emb = emb + self.label_emb(y)
+
+        def call_module(module, h, emb, context):
+            x = h
+            for layer in module:
+                if isinstance(layer, sdxl_original_unet.ResnetBlock2D):
+                    x = layer(x, emb)
+                elif isinstance(layer, sdxl_original_unet.Transformer2DModel):
+                    x = layer(x, context)
+                else:
+                    x = layer(x)
+            return x
+
+        h = x
+        multiplier = self.multiplier if self.multiplier is not None else 1.0
+        hs = []
+        for i, module in enumerate(self.input_blocks):
+            h = call_module(module, h, emb, context)
+            if i == 0:
+                h = self.controlnet_cond_embedding(cond_image) + h
+            hs.append(self.controlnet_down_blocks[i](h) * multiplier)
+
+        h = call_module(self.middle_block, h, emb, context)
+        h = self.controlnet_mid_block(h) * multiplier
+
+        return hs, h
+
+
+class SdxlControlledUNet(sdxl_original_unet.SdxlUNet2DConditionModel):
+    """
+    This class is for training purposes only.
+    """
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+    def forward(self, x, timesteps=None, context=None, y=None, input_resi_add=None, mid_add=None, **kwargs):
+        # broadcast timesteps to batch dimension
+        timesteps = timesteps.expand(x.shape[0])
+
+        hs = []
+        t_emb = sdxl_original_unet.get_timestep_embedding(timesteps, self.model_channels, downscale_freq_shift=0)
+        t_emb = t_emb.to(x.dtype)
+        emb = self.time_embed(t_emb)
+
+        assert x.shape[0] == y.shape[0], f"batch size mismatch: {x.shape[0]} != {y.shape[0]}"
+        assert x.dtype == y.dtype, f"dtype mismatch: {x.dtype} != {y.dtype}"
+        emb = emb + self.label_emb(y)
+
+        def call_module(module, h, emb, context):
+            x = h
+            for layer in module:
+                if isinstance(layer, sdxl_original_unet.ResnetBlock2D):
+                    x = layer(x, emb)
+                elif isinstance(layer, sdxl_original_unet.Transformer2DModel):
+                    x = layer(x, context)
+                else:
+                    x = layer(x)
+            return x
+
+        h = x
+        for module in self.input_blocks:
+            h = call_module(module, h, emb, context)
+            hs.append(h)
+
+        h = call_module(self.middle_block, h, emb, context)
+        h = h + mid_add
+
+        for module in self.output_blocks:
+            resi = hs.pop() + input_resi_add.pop()
+            h = torch.cat([h, resi], dim=1)
+            h = call_module(module, h, emb, context)
+
+        h = h.type(x.dtype)
+        h = call_module(self.out, h, emb, context)
+
+        return h
+
+
+if __name__ == "__main__":
+    import time
+
+    logger.info("create unet")
+    unet = SdxlControlledUNet()
+    unet.to("cuda", torch.bfloat16)
+    unet.set_use_sdpa(True)
+    unet.set_gradient_checkpointing(True)
+    unet.train()
+
+    logger.info("create control_net")
+    control_net = SdxlControlNet()
+    control_net.to("cuda")
+    control_net.set_use_sdpa(True)
+    control_net.set_gradient_checkpointing(True)
+    control_net.train()
+
+    logger.info("Initialize control_net from unet")
+    control_net.init_from_unet(unet)
+
+    unet.requires_grad_(False)
+    control_net.requires_grad_(True)
+
+    # pseudo training loop for checking memory usage
+    logger.info("preparing optimizer")
+
+    # optimizer = torch.optim.SGD(unet.parameters(), lr=1e-3, nesterov=True, momentum=0.9) # not working
+
+    import bitsandbytes
+
+    optimizer = bitsandbytes.adam.Adam8bit(control_net.parameters(), lr=1e-3)  # not working
+    # optimizer = bitsandbytes.optim.RMSprop8bit(unet.parameters(), lr=1e-3)  # working at 23.5 GB with torch2
+    # optimizer=bitsandbytes.optim.Adagrad8bit(unet.parameters(), lr=1e-3)  # working at 23.5 GB with torch2
+
+    # import transformers
+    # optimizer = transformers.optimization.Adafactor(unet.parameters(), relative_step=True)  # working at 22.2GB with torch2
+
+    scaler = torch.cuda.amp.GradScaler(enabled=True)
+
+    logger.info("start training")
+    steps = 10
+    batch_size = 1
+
+    for step in range(steps):
+        logger.info(f"step {step}")
+        if step == 1:
+            time_start = time.perf_counter()
+
+        x = torch.randn(batch_size, 4, 128, 128).cuda()  # 1024x1024
+        t = torch.randint(low=0, high=1000, size=(batch_size,), device="cuda")
+        txt = torch.randn(batch_size, 77, 2048).cuda()
+        vector = torch.randn(batch_size, sdxl_original_unet.ADM_IN_CHANNELS).cuda()
+        cond_img = torch.rand(batch_size, 3, 1024, 1024).cuda()
+
+        with torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
+            input_resi_add, mid_add = control_net(x, t, txt, vector, cond_img)
+            output = unet(x, t, txt, vector, input_resi_add, mid_add)
+            target = torch.randn_like(output)
+            loss = torch.nn.functional.mse_loss(output, target)
+
+        scaler.scale(loss).backward()
+        scaler.step(optimizer)
+        scaler.update()
+        optimizer.zero_grad(set_to_none=True)
+
+    time_end = time.perf_counter()
+    logger.info(f"elapsed time: {time_end - time_start} [sec] for last {steps - 1} steps")
+
+    logger.info("finish training")
+    sd = control_net.state_dict()
+
+    from safetensors.torch import save_file
+
+    save_file(sd, r"E:\Work\SD\Tmp\sdxl\ctrl\control_net.safetensors")
--- a/library/sdxl_original_unet.py
+++ b/library/sdxl_original_unet.py
--- a/library/sdxl_train_util.py
+++ b/library/sdxl_train_util.py
@@ -0,0 +1,380 @@
+import argparse
+import math
+import os
+from typing import Optional
+
+import torch
+from library.device_utils import init_ipex, clean_memory_on_device
+
+init_ipex()
+
+from accelerate import init_empty_weights
+from tqdm import tqdm
+from transformers import CLIPTokenizer
+from library import model_util, sdxl_model_util, train_util, sdxl_original_unet
+from .utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+TOKENIZER1_PATH = "openai/clip-vit-large-patch14"
+TOKENIZER2_PATH = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
+
+# DEFAULT_NOISE_OFFSET = 0.0357
+
+
+def load_target_model(args, accelerator, model_version: str, weight_dtype):
+    model_dtype = match_mixed_precision(args, weight_dtype)  # prepare fp16/bf16
+    for pi in range(accelerator.state.num_processes):
+        if pi == accelerator.state.local_process_index:
+            logger.info(f"loading model for process {accelerator.state.local_process_index}/{accelerator.state.num_processes}")
+
+            (
+                load_stable_diffusion_format,
+                text_encoder1,
+                text_encoder2,
+                vae,
+                unet,
+                logit_scale,
+                ckpt_info,
+            ) = _load_target_model(
+                args.pretrained_model_name_or_path,
+                args.vae,
+                model_version,
+                weight_dtype,
+                accelerator.device if args.lowram else "cpu",
+                model_dtype,
+                args.disable_mmap_load_safetensors,
+            )
+
+            # work on low-ram device
+            if args.lowram:
+                text_encoder1.to(accelerator.device)
+                text_encoder2.to(accelerator.device)
+                unet.to(accelerator.device)
+                vae.to(accelerator.device)
+
+            clean_memory_on_device(accelerator.device)
+        accelerator.wait_for_everyone()
+
+    return load_stable_diffusion_format, text_encoder1, text_encoder2, vae, unet, logit_scale, ckpt_info
+
+
+def _load_target_model(
+    name_or_path: str, vae_path: Optional[str], model_version: str, weight_dtype, device="cpu", model_dtype=None, disable_mmap=False
+):
+    # model_dtype only works with full fp16/bf16
+    name_or_path = os.readlink(name_or_path) if os.path.islink(name_or_path) else name_or_path
+    load_stable_diffusion_format = os.path.isfile(name_or_path)  # determine SD or Diffusers
+
+    if load_stable_diffusion_format:
+        logger.info(f"load StableDiffusion checkpoint: {name_or_path}")
+        (
+            text_encoder1,
+            text_encoder2,
+            vae,
+            unet,
+            logit_scale,
+            ckpt_info,
+        ) = sdxl_model_util.load_models_from_sdxl_checkpoint(model_version, name_or_path, device, model_dtype, disable_mmap)
+    else:
+        # Diffusers model is loaded to CPU
+        from diffusers import StableDiffusionXLPipeline
+
+        variant = "fp16" if weight_dtype == torch.float16 else None
+        logger.info(f"load Diffusers pretrained models: {name_or_path}, variant={variant}")
+        try:
+            try:
+                pipe = StableDiffusionXLPipeline.from_pretrained(
+                    name_or_path, torch_dtype=model_dtype, variant=variant, tokenizer=None
+                )
+            except EnvironmentError as ex:
+                if variant is not None:
+                    logger.info("try to load fp32 model")
+                    pipe = StableDiffusionXLPipeline.from_pretrained(name_or_path, variant=None, tokenizer=None)
+                else:
+                    raise ex
+        except EnvironmentError as ex:
+            logger.error(
+                f"model is not found as a file or in Hugging Face, perhaps file name is wrong? / 指定したモデル名のファイル、またはHugging Faceのモデルが見つかりません。ファイル名が誤っているかもしれません: {name_or_path}"
+            )
+            raise ex
+
+        text_encoder1 = pipe.text_encoder
+        text_encoder2 = pipe.text_encoder_2
+
+        # convert to fp32 for caching text encoder outputs
+        if text_encoder1.dtype != torch.float32:
+            text_encoder1 = text_encoder1.to(dtype=torch.float32)
+        if text_encoder2.dtype != torch.float32:
+            text_encoder2 = text_encoder2.to(dtype=torch.float32)
+
+        vae = pipe.vae
+        unet = pipe.unet
+        del pipe
+
+        # Diffusers U-Net to original U-Net
+        state_dict = sdxl_model_util.convert_diffusers_unet_state_dict_to_sdxl(unet.state_dict())
+        with init_empty_weights():
+            unet = sdxl_original_unet.SdxlUNet2DConditionModel()  # overwrite unet
+        sdxl_model_util._load_state_dict_on_device(unet, state_dict, device=device, dtype=model_dtype)
+        logger.info("U-Net converted to original U-Net")
+
+        logit_scale = None
+        ckpt_info = None
+
+    # load the additional VAE if specified
+    if vae_path is not None:
+        vae = model_util.load_vae(vae_path, weight_dtype)
+        logger.info("additional VAE loaded")
+
+    return load_stable_diffusion_format, text_encoder1, text_encoder2, vae, unet, logit_scale, ckpt_info
+
+
+def load_tokenizers(args: argparse.Namespace):
+    logger.info("prepare tokenizers")
+
+    original_paths = [TOKENIZER1_PATH, TOKENIZER2_PATH]
+    tokenizers = []
+    for i, original_path in enumerate(original_paths):
+        tokenizer: CLIPTokenizer = None
+        if args.tokenizer_cache_dir:
+            local_tokenizer_path = os.path.join(args.tokenizer_cache_dir, original_path.replace("/", "_"))
+            if os.path.exists(local_tokenizer_path):
+                logger.info(f"load tokenizer from cache: {local_tokenizer_path}")
+                tokenizer = CLIPTokenizer.from_pretrained(local_tokenizer_path)
+
+        if tokenizer is None:
+            tokenizer = CLIPTokenizer.from_pretrained(original_path)
+
+        if args.tokenizer_cache_dir and not os.path.exists(local_tokenizer_path):
+            logger.info(f"save tokenizer to cache: {local_tokenizer_path}")
+            tokenizer.save_pretrained(local_tokenizer_path)
+
+        if i == 1:
+            tokenizer.pad_token_id = 0  # fix pad token id to make same as open clip tokenizer
+
+        tokenizers.append(tokenizer)
+
+    if hasattr(args, "max_token_length") and args.max_token_length is not None:
+        logger.info(f"update token length: {args.max_token_length}")
+
+    return tokenizers
+
+
+def match_mixed_precision(args, weight_dtype):
+    if args.full_fp16:
+        assert (
+            weight_dtype == torch.float16
+        ), "full_fp16 requires mixed precision='fp16' / full_fp16を使う場合はmixed_precision='fp16'を指定してください。"
+        return weight_dtype
+    elif args.full_bf16:
+        assert (
+            weight_dtype == torch.bfloat16
+        ), "full_bf16 requires mixed precision='bf16' / full_bf16を使う場合はmixed_precision='bf16'を指定してください。"
+        return weight_dtype
+    else:
+        return None
+
+
+def timestep_embedding(timesteps, dim, max_period=10000):
+    """
+    Create sinusoidal timestep embeddings.
+    :param timesteps: a 1-D Tensor of N indices, one per batch element.
+                      These may be fractional.
+    :param dim: the dimension of the output.
+    :param max_period: controls the minimum frequency of the embeddings.
+    :return: an [N x dim] Tensor of positional embeddings.
+    """
+    half = dim // 2
+    freqs = torch.exp(-math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half).to(
+        device=timesteps.device
+    )
+    args = timesteps[:, None].float() * freqs[None]
+    embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+    if dim % 2:
+        embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
+    return embedding
+
+
+def get_timestep_embedding(x, outdim):
+    assert len(x.shape) == 2
+    b, dims = x.shape[0], x.shape[1]
+    x = torch.flatten(x)
+    emb = timestep_embedding(x, outdim)
+    emb = torch.reshape(emb, (b, dims * outdim))
+    return emb
+
+
+def get_size_embeddings(orig_size, crop_size, target_size, device):
+    emb1 = get_timestep_embedding(orig_size, 256)
+    emb2 = get_timestep_embedding(crop_size, 256)
+    emb3 = get_timestep_embedding(target_size, 256)
+    vector = torch.cat([emb1, emb2, emb3], dim=1).to(device)
+    return vector
+
+
+def save_sd_model_on_train_end(
+    args: argparse.Namespace,
+    src_path: str,
+    save_stable_diffusion_format: bool,
+    use_safetensors: bool,
+    save_dtype: torch.dtype,
+    epoch: int,
+    global_step: int,
+    text_encoder1,
+    text_encoder2,
+    unet,
+    vae,
+    logit_scale,
+    ckpt_info,
+):
+    def sd_saver(ckpt_file, epoch_no, global_step):
+        sai_metadata = train_util.get_sai_model_spec(None, args, True, False, False, is_stable_diffusion_ckpt=True)
+        sdxl_model_util.save_stable_diffusion_checkpoint(
+            ckpt_file,
+            text_encoder1,
+            text_encoder2,
+            unet,
+            epoch_no,
+            global_step,
+            ckpt_info,
+            vae,
+            logit_scale,
+            sai_metadata,
+            save_dtype,
+        )
+
+    def diffusers_saver(out_dir):
+        sdxl_model_util.save_diffusers_checkpoint(
+            out_dir,
+            text_encoder1,
+            text_encoder2,
+            unet,
+            src_path,
+            vae,
+            use_safetensors=use_safetensors,
+            save_dtype=save_dtype,
+        )
+
+    train_util.save_sd_model_on_train_end_common(
+        args, save_stable_diffusion_format, use_safetensors, epoch, global_step, sd_saver, diffusers_saver
+    )
+
+
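+# Saving per epoch and per step is merged into one function because the metadata includes epoch/step and the arguments are the same
+# on_epoch_end: True when called at the end of an epoch, False when called after a number of steps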
+def save_sd_model_on_epoch_end_or_stepwise(
+    args: argparse.Namespace,
+    on_epoch_end: bool,
+    accelerator,
+    src_path,
+    save_stable_diffusion_format: bool,
+    use_safetensors: bool,
+    save_dtype: torch.dtype,
+    epoch: int,
+    num_train_epochs: int,
+    global_step: int,
+    text_encoder1,
+    text_encoder2,
+    unet,
+    vae,
+    logit_scale,
+    ckpt_info,
+):
+    def sd_saver(ckpt_file, epoch_no, global_step):
+        sai_metadata = train_util.get_sai_model_spec(None, args, True, False, False, is_stable_diffusion_ckpt=True)
+        sdxl_model_util.save_stable_diffusion_checkpoint(
+            ckpt_file,
+            text_encoder1,
+            text_encoder2,
+            unet,
+            epoch_no,
+            global_step,
+            ckpt_info,
+            vae,
+            logit_scale,
+            sai_metadata,
+            save_dtype,
+        )
+
+    def diffusers_saver(out_dir):
+        sdxl_model_util.save_diffusers_checkpoint(
+            out_dir,
+            text_encoder1,
+            text_encoder2,
+            unet,
+            src_path,
+            vae,
+            use_safetensors=use_safetensors,
+            save_dtype=save_dtype,
+        )
+
+    train_util.save_sd_model_on_epoch_end_or_stepwise_common(
+        args,
+        on_epoch_end,
+        accelerator,
+        save_stable_diffusion_format,
+        use_safetensors,
+        epoch,
+        num_train_epochs,
+        global_step,
+        sd_saver,
+        diffusers_saver,
+    )
+
+
+def add_sdxl_training_arguments(parser: argparse.ArgumentParser, support_text_encoder_caching: bool = True):
+    parser.add_argument(
+        "--cache_text_encoder_outputs", action="store_true", help="cache text encoder outputs / text encoderの出力をキャッシュする"
+    )
+    parser.add_argument(
+        "--cache_text_encoder_outputs_to_disk",
+        action="store_true",
+        help="cache text encoder outputs to disk / text encoderの出力をディスクにキャッシュする",
+    )
+    parser.add_argument(
+        "--disable_mmap_load_safetensors",
+        action="store_true",
+        help="disable mmap load for safetensors. Speed up model loading in WSL environment / safetensorsのmmapロードを無効にする。WSL環境等でモデル読み込みを高速化できる",
+    )
+
+
+def verify_sdxl_training_args(args: argparse.Namespace, supportTextEncoderCaching: bool = True):
+    assert not args.v2, "v2 cannot be enabled in SDXL training / SDXL学習ではv2を有効にすることはできません"
+
+    if args.clip_skip is not None:
+        logger.warning("clip_skip does not work in SDXL training / SDXL学習ではclip_skipは動作しません")
+
+    # if args.multires_noise_iterations:
+    #     logger.info(
+    #         f"Warning: SDXL has been trained with noise_offset={DEFAULT_NOISE_OFFSET}, but noise_offset is disabled due to multires_noise_iterations / SDXLはnoise_offset={DEFAULT_NOISE_OFFSET}で学習されていますが、multires_noise_iterationsが有効になっているためnoise_offsetは無効になります"
+    #     )
+    # else:
+    #     if args.noise_offset is None:
+    #         args.noise_offset = DEFAULT_NOISE_OFFSET
+    #     elif args.noise_offset != DEFAULT_NOISE_OFFSET:
+    #         logger.info(
+    #             f"Warning: SDXL has been trained with noise_offset={DEFAULT_NOISE_OFFSET} / SDXLはnoise_offset={DEFAULT_NOISE_OFFSET}で学習されています"
+    #         )
+    #     logger.info(f"noise_offset is set to {args.noise_offset} / noise_offsetが{args.noise_offset}に設定されました")
+
+    # assert (
+    #     not hasattr(args, "weighted_captions") or not args.weighted_captions
+    # ), "weighted_captions cannot be enabled in SDXL training currently / SDXL学習では今のところweighted_captionsを有効にすることはできません"
+
+    if supportTextEncoderCaching:
+        if args.cache_text_encoder_outputs_to_disk and not args.cache_text_encoder_outputs:
+            args.cache_text_encoder_outputs = True
+            logger.warning(
+                "cache_text_encoder_outputs is enabled because cache_text_encoder_outputs_to_disk is enabled / "
+                + "cache_text_encoder_outputs_to_diskが有効になっているためcache_text_encoder_outputsが有効になりました"
+            )
+
+
+def sample_images(*args, **kwargs):
+    from library.sdxl_lpw_stable_diffusion import SdxlStableDiffusionLongPromptWeightingPipeline
+
+    return train_util.sample_images_common(SdxlStableDiffusionLongPromptWeightingPipeline, *args, **kwargs)
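Note on `verify_sdxl_training_args` above: `--cache_text_encoder_outputs_to_disk` implies `--cache_text_encoder_outputs`, and the function turns the latter on when only the former is given. A minimal standalone sketch of that normalization, using only `argparse`. The names `build_parser` and `normalize_caching_flags` are hypothetical and exist only for this illustration; they are not functions in this repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that registers only the two caching flags.
    parser = argparse.ArgumentParser()
    parser.add_argument("--cache_text_encoder_outputs", action="store_true")
    parser.add_argument("--cache_text_encoder_outputs_to_disk", action="store_true")
    return parser


def normalize_caching_flags(args: argparse.Namespace) -> argparse.Namespace:
    # Caching to disk only makes sense if caching is enabled at all,
    # so the disk flag switches the base flag on.
    if args.cache_text_encoder_outputs_to_disk and not args.cache_text_encoder_outputs:
        args.cache_text_encoder_outputs = True
    return args


if __name__ == "__main__":
    parsed = build_parser().parse_args(["--cache_text_encoder_outputs_to_disk"])
    print(normalize_caching_flags(parsed))
```

Normalizing the namespace once, right after parsing, means later code only has to check `cache_text_encoder_outputs`.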
--- a/library/slicing_vae.py
+++ b/library/slicing_vae.py
@@ -0,0 +1,682 @@
+# Modified from Diffusers to reduce VRAM usage
+
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from dataclasses import dataclass
+from typing import Optional, Tuple, Union
+
+import numpy as np
+import torch
+import torch.nn as nn
+
+
+from diffusers.configuration_utils import ConfigMixin, register_to_config
+from diffusers.models.modeling_utils import ModelMixin
+from diffusers.models.unet_2d_blocks import UNetMidBlock2D, get_down_block, get_up_block
+from diffusers.models.vae import DecoderOutput, DiagonalGaussianDistribution
+from diffusers.models.autoencoder_kl import AutoencoderKLOutput
+from .utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)
+
+def slice_h(x, num_slices):
+    # Slice along H with 1 row of overlap on both sides,
+    # to eliminate the side effect of Conv2d padding at the slice boundaries.
+    # Works with either NCHW or NHWC.
+    size = (x.shape[2] + num_slices - 1) // num_slices
+    sliced = []
+    for i in range(num_slices):
+        if i == 0:
+            sliced.append(x[:, :, : size + 1, :])
+        else:
+            end = size * (i + 1) + 1
+            if x.shape[2] - end < 3:  # if the last slice would be too thin for conv2d, use the rest of the tensor
+                end = x.shape[2]
+            sliced.append(x[:, :, size * i - 1 : end, :])
+            if end >= x.shape[2]:
+                break
+    return sliced
+
+
+def cat_h(sliced):
+    # concatenate the slices, dropping the overlap (padding) rows
+    cat = []
+    for i, x in enumerate(sliced):
+        if i == 0:
+            cat.append(x[:, :, :-1, :])
+        elif i == len(sliced) - 1:
+            cat.append(x[:, :, 1:, :])
+        else:
+            cat.append(x[:, :, 1:-1, :])
+        del x
+    x = torch.cat(cat, dim=2)
+    return x
+
+
+def resblock_forward(_self, num_slices, input_tensor, temb, **kwargs):
+    assert _self.upsample is None and _self.downsample is None
+    assert _self.norm1.num_groups == _self.norm2.num_groups
+    assert temb is None
+
+    # make sure norms are on cpu
+    org_device = input_tensor.device
+    cpu_device = torch.device("cpu")
+    _self.norm1.to(cpu_device)
+    _self.norm2.to(cpu_device)
+
+    # workaround: GroupNorm does not run in fp16 on CPU
+    org_dtype = input_tensor.dtype
+    if org_dtype == torch.float16:
+        _self.norm1.to(torch.float32)
+        _self.norm2.to(torch.float32)
+
+    # move all tensors to the CPU
+    input_tensor = input_tensor.to(cpu_device)
+    hidden_states = input_tensor
+
+    # this variant seems to produce different results...
+    # def sliced_norm1(norm, x):
+    #     num_div = 4 if up_block_idx <= 2 else x.shape[1] // norm.num_groups
+    #     sliced_tensor = torch.chunk(x, num_div, dim=1)
+    #     sliced_weight = torch.chunk(norm.weight, num_div, dim=0)
+    #     sliced_bias = torch.chunk(norm.bias, num_div, dim=0)
+    #     logger.info(sliced_tensor[0].shape, num_div, sliced_weight[0].shape, sliced_bias[0].shape)
+    #     normed_tensor = []
+    #     for i in range(num_div):
+    #         n = torch.group_norm(sliced_tensor[i], norm.num_groups, sliced_weight[i], sliced_bias[i], norm.eps)
+    #         normed_tensor.append(n)
+    #         del n
+    #     x = torch.cat(normed_tensor, dim=1)
+    #     return num_div, x
+
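+    # Slicing the norm changes the result, so this is the one step that is not sliced. Running it on the GPU would exhaust VRAM, so it runs on the CPU; fortunately that is not too slow.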
+    if org_dtype == torch.float16:
+        hidden_states = hidden_states.to(torch.float32)
+    hidden_states = _self.norm1(hidden_states)  # run on cpu
+    if org_dtype == torch.float16:
+        hidden_states = hidden_states.to(torch.float16)
+
+    sliced = slice_h(hidden_states, num_slices)
+    del hidden_states
+
+    for i in range(len(sliced)):
+        x = sliced[i]
+        sliced[i] = None
+
+        # move only the slice being computed to the GPU; same below
+        x = x.to(org_device)
+        x = _self.nonlinearity(x)
+        x = _self.conv1(x)
+        x = x.to(cpu_device)
+        sliced[i] = x
+        del x
+
+    hidden_states = cat_h(sliced)
+    del sliced
+
+    if org_dtype == torch.float16:
+        hidden_states = hidden_states.to(torch.float32)
+    hidden_states = _self.norm2(hidden_states)  # run on cpu
+    if org_dtype == torch.float16:
+        hidden_states = hidden_states.to(torch.float16)
+
+    sliced = slice_h(hidden_states, num_slices)
+    del hidden_states
+
+    for i in range(len(sliced)):
+        x = sliced[i]
+        sliced[i] = None
+
+        x = x.to(org_device)
+        x = _self.nonlinearity(x)
+        x = _self.dropout(x)
+        x = _self.conv2(x)
+        x = x.to(cpu_device)
+        sliced[i] = x
+        del x
+
+    hidden_states = cat_h(sliced)
+    del sliced
+
+    # make shortcut
+    if _self.conv_shortcut is not None:
+        sliced = list(torch.chunk(input_tensor, num_slices, dim=2))  # conv_shortcut has no padding, so plain chunking is enough
+        del input_tensor
+
+        for i in range(len(sliced)):
+            x = sliced[i]
+            sliced[i] = None
+
+            x = x.to(org_device)
+            x = _self.conv_shortcut(x)
+            x = x.to(cpu_device)
+            sliced[i] = x
+            del x
+
+        input_tensor = torch.cat(sliced, dim=2)
+        del sliced
+
+    output_tensor = (input_tensor + hidden_states) / _self.output_scale_factor
+
+    output_tensor = output_tensor.to(org_device)  # the next layer computes on the GPU
+    return output_tensor
+
+
+class SlicingEncoder(nn.Module):
+    def __init__(
+        self,
+        in_channels=3,
+        out_channels=3,
+        down_block_types=("DownEncoderBlock2D",),
+        block_out_channels=(64,),
+        layers_per_block=2,
+        norm_num_groups=32,
+        act_fn="silu",
+        double_z=True,
+        num_slices=2,
+    ):
+        super().__init__()
+        self.layers_per_block = layers_per_block
+
+        self.conv_in = torch.nn.Conv2d(in_channels, block_out_channels[0], kernel_size=3, stride=1, padding=1)
+
+        self.mid_block = None
+        self.down_blocks = nn.ModuleList([])
+
+        # down
+        output_channel = block_out_channels[0]
+        for i, down_block_type in enumerate(down_block_types):
+            input_channel = output_channel
+            output_channel = block_out_channels[i]
+            is_final_block = i == len(block_out_channels) - 1
+
+            down_block = get_down_block(
+                down_block_type,
+                num_layers=self.layers_per_block,
+                in_channels=input_channel,
+                out_channels=output_channel,
+                add_downsample=not is_final_block,
+                resnet_eps=1e-6,
+                downsample_padding=0,
+                resnet_act_fn=act_fn,
+                resnet_groups=norm_num_groups,
+                attention_head_dim=output_channel,
+                temb_channels=None,
+            )
+            self.down_blocks.append(down_block)
+
+        # mid
+        self.mid_block = UNetMidBlock2D(
+            in_channels=block_out_channels[-1],
+            resnet_eps=1e-6,
+            resnet_act_fn=act_fn,
+            output_scale_factor=1,
+            resnet_time_scale_shift="default",
+            attention_head_dim=block_out_channels[-1],
+            resnet_groups=norm_num_groups,
+            temb_channels=None,
+        )
+        self.mid_block.attentions[0].set_use_memory_efficient_attention_xformers(True)  # use Diffusers' xformers attention for now
+
+        # out
+        self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[-1], num_groups=norm_num_groups, eps=1e-6)
+        self.conv_act = nn.SiLU()
+
+        conv_out_channels = 2 * out_channels if double_z else out_channels
+        self.conv_out = nn.Conv2d(block_out_channels[-1], conv_out_channels, 3, padding=1)
+
+        # replace forward of ResBlocks
+        def wrapper(func, module, num_slices):
+            def forward(*args, **kwargs):
+                return func(module, num_slices, *args, **kwargs)
+
+            return forward
+
+        self.num_slices = num_slices
+        div = num_slices / (2 ** (len(self.down_blocks) - 1))  # deeper layers need less slicing, so reduce the divisor accordingly
+        # logger.info(f"initial divisor: {div}")
+        if div >= 2:
+            div = int(div)
+            for resnet in self.mid_block.resnets:
+                resnet.forward = wrapper(resblock_forward, resnet, div)
+            # midblock doesn't have downsample
+
+        for i, down_block in enumerate(self.down_blocks[::-1]):
+            if div >= 2:
+                div = int(div)
+                # logger.info(f"down block: {i} divisor: {div}")
+                for resnet in down_block.resnets:
+                    resnet.forward = wrapper(resblock_forward, resnet, div)
+                if down_block.downsamplers is not None:
+                    # logger.info("has downsample")
+                    for downsample in down_block.downsamplers:
+                        downsample.forward = wrapper(self.downsample_forward, downsample, div * 2)
+            div *= 2
+
+    def forward(self, x):
+        sample = x
+        del x
+
+        org_device = sample.device
+        cpu_device = torch.device("cpu")
+
+        # sample = self.conv_in(sample)
+        sample = sample.to(cpu_device)
+        sliced = slice_h(sample, self.num_slices)
+        del sample
+
+        for i in range(len(sliced)):
+            x = sliced[i]
+            sliced[i] = None
+
+            x = x.to(org_device)
+            x = self.conv_in(x)
+            x = x.to(cpu_device)
+            sliced[i] = x
+            del x
+
+        sample = cat_h(sliced)
+        del sliced
+
+        sample = sample.to(org_device)
+
+        # down
+        for down_block in self.down_blocks:
+            sample = down_block(sample)
+
+        # middle
+        sample = self.mid_block(sample)
+
+        # post-process
+        # this could also be made memory-efficient, but it probably does not use much memory, so it is left as is
+        sample = self.conv_norm_out(sample)
+        sample = self.conv_act(sample)
+        sample = self.conv_out(sample)
+
+        return sample
+
+    def downsample_forward(self, _self, num_slices, hidden_states):
+        assert hidden_states.shape[1] == _self.channels
+        assert _self.use_conv and _self.padding == 0
+        logger.info(f"downsample forward {num_slices} {hidden_states.shape}")
+
+        org_device = hidden_states.device
+        cpu_device = torch.device("cpu")
+
+        hidden_states = hidden_states.to(cpu_device)
+        pad = (0, 1, 0, 1)
+        hidden_states = torch.nn.functional.pad(hidden_states, pad, mode="constant", value=0)
+
+        # Slice at an even size because the conv has stride 2.
+        # Slice with 1 row of overlap on both sides,
+        # to eliminate the side effect of Conv2d padding at the slice boundaries.
+        size = (hidden_states.shape[2] + num_slices - 1) // num_slices
+        size = size + 1 if size % 2 == 1 else size
+
+        sliced = []
+        for i in range(num_slices):
+            if i == 0:
+                sliced.append(hidden_states[:, :, : size + 1, :])
+            else:
+                end = size * (i + 1) + 1
+                if hidden_states.shape[2] - end < 4:  # if the last slice is too small, use the rest of the tensor
+                    end = hidden_states.shape[2]
+                sliced.append(hidden_states[:, :, size * i - 1 : end, :])
+                if end >= hidden_states.shape[2]:
+                    break
+        del hidden_states
+
+        for i in range(len(sliced)):
+            x = sliced[i]
+            sliced[i] = None
+
+            x = x.to(org_device)
+            x = _self.conv(x)
+            x = x.to(cpu_device)
+
+            # this part looks different in style because it was written by Copilot
+            if i == 0:
+                hidden_states = x
+            else:
+                hidden_states = torch.cat([hidden_states, x], dim=2)
+
+        hidden_states = hidden_states.to(org_device)
+        # logger.info(f"downsample forward done {hidden_states.shape}")
+        return hidden_states
+
+
+class SlicingDecoder(nn.Module):
+    def __init__(
+        self,
+        in_channels=3,
+        out_channels=3,
+        up_block_types=("UpDecoderBlock2D",),
+        block_out_channels=(64,),
+        layers_per_block=2,
+        norm_num_groups=32,
+        act_fn="silu",
+        num_slices=2,
+    ):
+        super().__init__()
+        self.layers_per_block = layers_per_block
+
+        self.conv_in = nn.Conv2d(in_channels, block_out_channels[-1], kernel_size=3, stride=1, padding=1)
+
+        self.mid_block = None
+        self.up_blocks = nn.ModuleList([])
+
+        # mid
+        self.mid_block = UNetMidBlock2D(
+            in_channels=block_out_channels[-1],
+            resnet_eps=1e-6,
+            resnet_act_fn=act_fn,
+            output_scale_factor=1,
+            resnet_time_scale_shift="default",
+            attention_head_dim=block_out_channels[-1],
+            resnet_groups=norm_num_groups,
+            temb_channels=None,
+        )
+        self.mid_block.attentions[0].set_use_memory_efficient_attention_xformers(True)  # use Diffusers' xformers attention for now
+
+        # up
+        reversed_block_out_channels = list(reversed(block_out_channels))
+        output_channel = reversed_block_out_channels[0]
+        for i, up_block_type in enumerate(up_block_types):
+            prev_output_channel = output_channel
+            output_channel = reversed_block_out_channels[i]
+
+            is_final_block = i == len(block_out_channels) - 1
+
+            up_block = get_up_block(
+                up_block_type,
+                num_layers=self.layers_per_block + 1,
+                in_channels=prev_output_channel,
+                out_channels=output_channel,
+                prev_output_channel=None,
+                add_upsample=not is_final_block,
+                resnet_eps=1e-6,
+                resnet_act_fn=act_fn,
+                resnet_groups=norm_num_groups,
+                attention_head_dim=output_channel,
+                temb_channels=None,
+            )
+            self.up_blocks.append(up_block)
+            prev_output_channel = output_channel
+
+        # out
+        self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[0], num_groups=norm_num_groups, eps=1e-6)
+        self.conv_act = nn.SiLU()
+        self.conv_out = nn.Conv2d(block_out_channels[0], out_channels, 3, padding=1)
+
+        # replace forward of ResBlocks
+        def wrapper(func, module, num_slices):
+            def forward(*args, **kwargs):
+                return func(module, num_slices, *args, **kwargs)
+
+            return forward
+
+        self.num_slices = num_slices
+        div = num_slices / (2 ** (len(self.up_blocks) - 1))
+        logger.info(f"initial divisor: {div}")
+        if div >= 2:
+            div = int(div)
+            for resnet in self.mid_block.resnets:
+                resnet.forward = wrapper(resblock_forward, resnet, div)
+            # midblock doesn't have upsample
+
+        for i, up_block in enumerate(self.up_blocks):
+            if div >= 2:
+                div = int(div)
+                # logger.info(f"up block: {i} divisor: {div}")
+                for resnet in up_block.resnets:
+                    resnet.forward = wrapper(resblock_forward, resnet, div)
+                if up_block.upsamplers is not None:
+                    # logger.info("has upsample")
+                    for upsample in up_block.upsamplers:
+                        upsample.forward = wrapper(self.upsample_forward, upsample, div * 2)
+            div *= 2
+
+    def forward(self, z):
+        sample = z
+        del z
+        sample = self.conv_in(sample)
+
+        # middle
+        sample = self.mid_block(sample)
+
+        # up
+        for i, up_block in enumerate(self.up_blocks):
+            sample = up_block(sample)
+
+        # post-process
+        sample = self.conv_norm_out(sample)
+        sample = self.conv_act(sample)
+
+        # conv_out uses a lot of VRAM,
+        # so it is run slice by slice
+        org_device = sample.device
+        cpu_device = torch.device("cpu")
+        sample = sample.to(cpu_device)
+
+        sliced = slice_h(sample, self.num_slices)
+        del sample
+        for i in range(len(sliced)):
+            x = sliced[i]
+            sliced[i] = None
+
+            x = x.to(org_device)
+            x = self.conv_out(x)
+            x = x.to(cpu_device)
+            sliced[i] = x
+        sample = cat_h(sliced)
+        del sliced
+
+        sample = sample.to(org_device)
+        return sample
+
+    def upsample_forward(self, _self, num_slices, hidden_states, output_size=None):
+        assert hidden_states.shape[1] == _self.channels
+        assert not _self.use_conv_transpose and _self.use_conv
+
+        org_dtype = hidden_states.dtype
+        org_device = hidden_states.device
+        cpu_device = torch.device("cpu")
+
+        hidden_states = hidden_states.to(cpu_device)
+        sliced = slice_h(hidden_states, num_slices)
+        del hidden_states
+
+        for i in range(len(sliced)):
+            x = sliced[i]
+            sliced[i] = None
+
+            x = x.to(org_device)
+
+            # Cast to float32, as the 'upsample_nearest2d_out_frame' op does not support bfloat16
+            # TODO(Suraj): Remove this cast once the issue is fixed in PyTorch
+            # https://github.com/pytorch/pytorch/issues/86679
+            # hopefully this gets fixed in PyTorch 2...
+            if org_dtype == torch.bfloat16:
+                x = x.to(torch.float32)
+
+            x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
+
+            if org_dtype == torch.bfloat16:
+                x = x.to(org_dtype)
+
+            x = _self.conv(x)
+
+            # the slice was upsampled 2x, so the overlap to trim is 2 rows
+            if i == 0:
+                x = x[:, :, :-2, :]
+            elif i == len(sliced) - 1:  # slice_h may return fewer than num_slices slices
+                x = x[:, :, 2:, :]
+            else:
+                x = x[:, :, 2:-2, :]
+
+            x = x.to(cpu_device)
+            sliced[i] = x
+            del x
+
+        hidden_states = torch.cat(sliced, dim=2)
+        # logger.info(f"us hidden_states {hidden_states.shape}")
+        del sliced
+
+        hidden_states = hidden_states.to(org_device)
+        return hidden_states
+
+
+class SlicingAutoencoderKL(ModelMixin, ConfigMixin):
+    r"""Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma
+    and Max Welling.
+
+    This model inherits from [`ModelMixin`]. Check the superclass documentation for the generic methods the library
+    implements for all models (such as downloading or saving).
+
+    Parameters:
+        in_channels (int, *optional*, defaults to 3): Number of channels in the input image.
+        out_channels (int,  *optional*, defaults to 3): Number of channels in the output.
+        down_block_types (`Tuple[str]`, *optional*, defaults to :
+            obj:`("DownEncoderBlock2D",)`): Tuple of downsample block types.
+        up_block_types (`Tuple[str]`, *optional*, defaults to :
+            obj:`("UpDecoderBlock2D",)`): Tuple of upsample block types.
+        block_out_channels (`Tuple[int]`, *optional*, defaults to :
+            obj:`(64,)`): Tuple of block output channels.
+        act_fn (`str`, *optional*, defaults to `"silu"`): The activation function to use.
+        latent_channels (`int`, *optional*, defaults to `4`): Number of channels in the latent space.
+        sample_size (`int`, *optional*, defaults to `32`): TODO
+    """
+
+    @register_to_config
+    def __init__(
+        self,
+        in_channels: int = 3,
+        out_channels: int = 3,
+        down_block_types: Tuple[str] = ("DownEncoderBlock2D",),
+        up_block_types: Tuple[str] = ("UpDecoderBlock2D",),
+        block_out_channels: Tuple[int] = (64,),
+        layers_per_block: int = 1,
+        act_fn: str = "silu",
+        latent_channels: int = 4,
+        norm_num_groups: int = 32,
+        sample_size: int = 32,
+        num_slices: int = 16,
+    ):
+        super().__init__()
+
+        # pass init params to Encoder
+        self.encoder = SlicingEncoder(
+            in_channels=in_channels,
+            out_channels=latent_channels,
+            down_block_types=down_block_types,
+            block_out_channels=block_out_channels,
+            layers_per_block=layers_per_block,
+            act_fn=act_fn,
+            norm_num_groups=norm_num_groups,
+            double_z=True,
+            num_slices=num_slices,
+        )
+
+        # pass init params to Decoder
+        self.decoder = SlicingDecoder(
+            in_channels=latent_channels,
+            out_channels=out_channels,
+            up_block_types=up_block_types,
+            block_out_channels=block_out_channels,
+            layers_per_block=layers_per_block,
+            norm_num_groups=norm_num_groups,
+            act_fn=act_fn,
+            num_slices=num_slices,
+        )
+
+        self.quant_conv = torch.nn.Conv2d(2 * latent_channels, 2 * latent_channels, 1)
+        self.post_quant_conv = torch.nn.Conv2d(latent_channels, latent_channels, 1)
+        self.use_slicing = False
+
+    def encode(self, x: torch.FloatTensor, return_dict: bool = True) -> AutoencoderKLOutput:
+        h = self.encoder(x)
+        moments = self.quant_conv(h)
+        posterior = DiagonalGaussianDistribution(moments)
+
+        if not return_dict:
+            return (posterior,)
+
+        return AutoencoderKLOutput(latent_dist=posterior)
+
+    def _decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
+        z = self.post_quant_conv(z)
+        dec = self.decoder(z)
+
+        if not return_dict:
+            return (dec,)
+
+        return DecoderOutput(sample=dec)
+
+    # note: this is slicing along the batch dimension, which is confusingly named given the H-slicing above
+    def enable_slicing(self):
+        r"""
+        Enable sliced VAE decoding.
+
+        When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
+        steps. This is useful to save some memory and allow larger batch sizes.
+        """
+        self.use_slicing = True
+
+    def disable_slicing(self):
+        r"""
+        Disable sliced VAE decoding. If `enable_slicing` was previously invoked, this method will go back to computing
+        decoding in one step.
+        """
+        self.use_slicing = False
+
+    def decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
+        if self.use_slicing and z.shape[0] > 1:
+            decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
+            decoded = torch.cat(decoded_slices)
+        else:
+            decoded = self._decode(z).sample
+
+        if not return_dict:
+            return (decoded,)
+
+        return DecoderOutput(sample=decoded)
+
+    def forward(
+        self,
+        sample: torch.FloatTensor,
+        sample_posterior: bool = False,
+        return_dict: bool = True,
+        generator: Optional[torch.Generator] = None,
+    ) -> Union[DecoderOutput, torch.FloatTensor]:
+        r"""
+        Args:
+            sample (`torch.FloatTensor`): Input sample.
+            sample_posterior (`bool`, *optional*, defaults to `False`):
+                Whether to sample from the posterior.
+            return_dict (`bool`, *optional*, defaults to `True`):
+                Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
+        """
+        x = sample
+        posterior = self.encode(x).latent_dist
+        if sample_posterior:
+            z = posterior.sample(generator=generator)
+        else:
+            z = posterior.mode()
+        dec = self.decode(z).sample
+
+        if not return_dict:
+            return (dec,)
+
+        return DecoderOutput(sample=dec)
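Note on `slice_h` / `cat_h` in `library/slicing_vae.py` above: each slice carries one extra row from its neighbour, the padded 3x3 convolution runs per slice, and the extra rows are dropped again when concatenating, so the rows that remain never see the artificial zero padding at an inner boundary. The following is a 1-D, pure-Python analogue of that idea, not the repository code. The names `conv3_same`, `slice_with_halo`, `concat_without_halo` and `sliced_conv` are hypothetical, and the sketch assumes `num_slices >= 2` and a 3-tap kernel.

```python
def conv3_same(row):
    """3-tap sum filter with zero padding of 1 on both ends (1-D stand-in for Conv2d(padding=1))."""
    padded = [0] + list(row) + [0]
    return [padded[i] + padded[i + 1] + padded[i + 2] for i in range(len(row))]


def slice_with_halo(row, num_slices):
    """Split `row` into chunks that overlap each inner boundary by one element per side."""
    size = (len(row) + num_slices - 1) // num_slices
    chunks = []
    for i in range(num_slices):
        if i == 0:
            chunks.append(row[: size + 1])
        else:
            end = size * (i + 1) + 1
            if len(row) - end < 3:  # fold a too-short tail into this chunk
                end = len(row)
            chunks.append(row[size * i - 1 : end])
            if end >= len(row):
                break
    return chunks


def concat_without_halo(chunks):
    """Join chunks, dropping the one-element overlap at every inner boundary."""
    out = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.extend(chunk[:-1])
        elif i == len(chunks) - 1:
            out.extend(chunk[1:])
        else:
            out.extend(chunk[1:-1])
    return out


def sliced_conv(row, num_slices):
    """Run the padded filter chunk by chunk; the result should match the unsliced filter."""
    return concat_without_halo([conv3_same(c) for c in slice_with_halo(row, num_slices)])
```

The number of chunks can be smaller than `num_slices` when the tail is folded into the previous chunk, which is why the last chunk is identified by `len(chunks) - 1` rather than by `num_slices - 1`.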
--- a/library/strategy_base.py
+++ b/library/strategy_base.py
@@ -0,0 +1,636 @@
+# base class for platform strategies. this file defines the interface for strategies
+
+import os
+import re
+from typing import Any, List, Optional, Tuple, Union, Callable
+
+import numpy as np
+import torch
+from transformers import CLIPTokenizer, CLIPTextModel, CLIPTextModelWithProjection
+
+
+# TODO remove circular import by moving ImageInfo to a separate file
+# from library.train_util import ImageInfo
+
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+class TokenizeStrategy:
+    _strategy = None  # strategy instance: actual strategy class
+
+    _re_attention = re.compile(
+        r"""\\\(|
+\\\)|
+\\\[|
+\\]|
+\\\\|
+\\|
+\(|
+\[|
+:([+-]?[.\d]+)\)|
+\)|
+]|
+[^\\()\[\]:]+|
+:
+""",
+        re.X,
+    )
+
+    @classmethod
+    def set_strategy(cls, strategy):
+        if cls._strategy is not None:
+            raise RuntimeError(f"Internal error. {cls.__name__} strategy is already set")
+        cls._strategy = strategy
+
+    @classmethod
+    def get_strategy(cls) -> Optional["TokenizeStrategy"]:
+        return cls._strategy
+
+    def _load_tokenizer(
+        self, model_class: Any, model_id: str, subfolder: Optional[str] = None, tokenizer_cache_dir: Optional[str] = None
+    ) -> Any:
+        tokenizer = None
+        if tokenizer_cache_dir:
+            local_tokenizer_path = os.path.join(tokenizer_cache_dir, model_id.replace("/", "_"))
+            if os.path.exists(local_tokenizer_path):
+                logger.info(f"load tokenizer from cache: {local_tokenizer_path}")
+                tokenizer = model_class.from_pretrained(local_tokenizer_path)  # same for v1 and v2
+
+        if tokenizer is None:
+            tokenizer = model_class.from_pretrained(model_id, subfolder=subfolder)
+
+        if tokenizer_cache_dir and not os.path.exists(local_tokenizer_path):
+            logger.info(f"save Tokenizer to cache: {local_tokenizer_path}")
+            tokenizer.save_pretrained(local_tokenizer_path)
+
+        return tokenizer
+
+    def tokenize(self, text: Union[str, List[str]]) -> List[torch.Tensor]:
+        raise NotImplementedError
+
+    def tokenize_with_weights(self, text: Union[str, List[str]]) -> Tuple[List[torch.Tensor], List[torch.Tensor]]:
+        """
+        returns: [tokens1, tokens2, ...], [weights1, weights2, ...]
+        """
+        raise NotImplementedError
+
+    def _get_weighted_input_ids(
+        self, tokenizer: CLIPTokenizer, text: str, max_length: Optional[int] = None
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        max_length includes starting and ending tokens.
+        """
+
+        def parse_prompt_attention(text):
+            r"""
+            Parses a string with attention tokens and returns a list of pairs: text and its associated weight.
+            Accepted tokens are:
+            (abc) - increases attention to abc by a multiplier of 1.1
+            (abc:3.12) - increases attention to abc by a multiplier of 3.12
+            [abc] - decreases attention to abc by dividing its weight by 1.1
+            \( - literal character '('
+            \[ - literal character '['
+            \) - literal character ')'
+            \] - literal character ']'
+            \\ - literal character '\'
+            anything else - just text
+            >>> parse_prompt_attention('normal text')
+            [['normal text', 1.0]]
+            >>> parse_prompt_attention('an (important) word')
+            [['an ', 1.0], ['important', 1.1], [' word', 1.0]]
+            >>> parse_prompt_attention('(unbalanced')
+            [['unbalanced', 1.1]]
+            >>> parse_prompt_attention('\(literal\]')
+            [['(literal]', 1.0]]
+            >>> parse_prompt_attention('(unnecessary)(parens)')
+            [['unnecessaryparens', 1.1]]
+            >>> parse_prompt_attention('a (((house:1.3)) [on] a (hill:0.5), sun, (((sky))).')
+            [['a ', 1.0],
+            ['house', 1.5730000000000004],
+            [' ', 1.1],
+            ['on', 1.0],
+            [' a ', 1.1],
+            ['hill', 0.55],
+            [', sun, ', 1.1],
+            ['sky', 1.4641000000000006],
+            ['.', 1.1]]
+            """
+
+            res = []
+            round_brackets = []
+            square_brackets = []
+
+            round_bracket_multiplier = 1.1
+            square_bracket_multiplier = 1 / 1.1
+
+            def multiply_range(start_position, multiplier):
+                for p in range(start_position, len(res)):
+                    res[p][1] *= multiplier
+
+            for m in TokenizeStrategy._re_attention.finditer(text):
+                text = m.group(0)
+                weight = m.group(1)
+
+                if text.startswith("\\"):
+                    res.append([text[1:], 1.0])
+                elif text == "(":
+                    round_brackets.append(len(res))
+                elif text == "[":
+                    square_brackets.append(len(res))
+                elif weight is not None and len(round_brackets) > 0:
+                    multiply_range(round_brackets.pop(), float(weight))
+                elif text == ")" and len(round_brackets) > 0:
+                    multiply_range(round_brackets.pop(), round_bracket_multiplier)
+                elif text == "]" and len(square_brackets) > 0:
+                    multiply_range(square_brackets.pop(), square_bracket_multiplier)
+                else:
+                    res.append([text, 1.0])
+
+            for pos in round_brackets:
+                multiply_range(pos, round_bracket_multiplier)
+
+            for pos in square_brackets:
+                multiply_range(pos, square_bracket_multiplier)
+
+            if len(res) == 0:
+                res = [["", 1.0]]
+
+            # merge runs of identical weights
+            i = 0
+            while i + 1 < len(res):
+                if res[i][1] == res[i + 1][1]:
+                    res[i][0] += res[i + 1][0]
+                    res.pop(i + 1)
+                else:
+                    i += 1
+
+            return res
+
+        def get_prompts_with_weights(text: str, max_length: int):
+            r"""
+            Tokenize a single prompt and return its tokens with a weight for each token. max_length does not include the starting and ending tokens.
+
+            No padding, starting or ending token is included.
+            """
+            truncated = False
+
+            texts_and_weights = parse_prompt_attention(text)
+            tokens = []
+            weights = []
+            for word, weight in texts_and_weights:
+                # tokenize and discard the starting and the ending token
+                token = tokenizer(word).input_ids[1:-1]
+                tokens += token
+                # copy the weight by length of token
+                weights += [weight] * len(token)
+                # stop if the text is too long (longer than truncation limit)
+                if len(tokens) > max_length:
+                    truncated = True
+                    break
+            # truncate
+            if len(tokens) > max_length:
+                truncated = True
+                tokens = tokens[:max_length]
+                weights = weights[:max_length]
+            if truncated:
+                logger.warning("Prompt was truncated. Try to shorten the prompt or increase max_embeddings_multiples")
+            return tokens, weights
+
+        def pad_tokens_and_weights(tokens, weights, max_length, bos, eos, pad):
+            r"""
+            Pad the tokens (with starting and ending tokens) and weights (with 1.0) to max_length.
+            """
+            tokens = [bos] + tokens + [eos] + [pad] * (max_length - 2 - len(tokens))
+            weights = [1.0] + weights + [1.0] * (max_length - 1 - len(weights))
+            return tokens, weights
+
+        if max_length is None:
+            max_length = tokenizer.model_max_length
+
+        tokens, weights = get_prompts_with_weights(text, max_length - 2)
+        tokens, weights = pad_tokens_and_weights(
+            tokens, weights, max_length, tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id
+        )
+        return torch.tensor(tokens).unsqueeze(0), torch.tensor(weights).unsqueeze(0)
+
+    def _get_input_ids(
+        self, tokenizer: CLIPTokenizer, text: str, max_length: Optional[int] = None, weighted: bool = False
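+    ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
+        """
+        for SD1.5/2.0/SDXL. Returns input_ids, or (input_ids, weights) when weighted is True.
+        TODO support batch input
+        """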
+        if max_length is None:
+            max_length = tokenizer.model_max_length - 2
+
+        if weighted:
+            input_ids, weights = self._get_weighted_input_ids(tokenizer, text, max_length)
+        else:
+            input_ids = tokenizer(text, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt").input_ids
+
+        if max_length > tokenizer.model_max_length:
+            input_ids = input_ids.squeeze(0)
+            iids_list = []
+            if tokenizer.pad_token_id == tokenizer.eos_token_id:
+                # v1
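+                # when longer than 77, the ids are "<BOS> .... <EOS> <EOS> <EOS>" with a total length of e.g. 227, so convert them into three "<BOS>...<EOS>" chunks
+                # AUTOMATIC1111's implementation seems to split at commas, but keep it simple for now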
+                for i in range(1, max_length - tokenizer.model_max_length + 2, tokenizer.model_max_length - 2):  # (1, 152, 75)
+                    ids_chunk = (
+                        input_ids[0].unsqueeze(0),
+                        input_ids[i : i + tokenizer.model_max_length - 2],
+                        input_ids[-1].unsqueeze(0),
+                    )
+                    ids_chunk = torch.cat(ids_chunk)
+                    iids_list.append(ids_chunk)
+            else:
+                # v2 or SDXL
+                # when longer than 77, the ids are "<BOS> .... <EOS> <PAD> <PAD>..." with a total length of e.g. 227, so convert them into three "<BOS>...<EOS> <PAD> <PAD> ..." chunks
+                for i in range(1, max_length - tokenizer.model_max_length + 2, tokenizer.model_max_length - 2):
+                    ids_chunk = (
+                        input_ids[0].unsqueeze(0),  # BOS
+                        input_ids[i : i + tokenizer.model_max_length - 2],
+                        input_ids[-1].unsqueeze(0),
+                    )  # PAD or EOS
+                    ids_chunk = torch.cat(ids_chunk)
+
+                    # if the chunk ends with <EOS> <PAD> or <PAD> <PAD>, nothing needs to be done
+                    # if the chunk ends with x <PAD/EOS>, change the last token to <EOS> (no effective change for x <EOS>)
+                    if ids_chunk[-2] != tokenizer.eos_token_id and ids_chunk[-2] != tokenizer.pad_token_id:
+                        ids_chunk[-1] = tokenizer.eos_token_id
+                    # if the chunk starts with <BOS> <PAD> ..., change it to <BOS> <EOS> <PAD> ...
+                    if ids_chunk[1] == tokenizer.pad_token_id:
+                        ids_chunk[1] = tokenizer.eos_token_id
+
+                    iids_list.append(ids_chunk)
+
+            input_ids = torch.stack(iids_list)  # 3,77
+
+            if weighted:
+                weights = weights.squeeze(0)
+                new_weights = torch.ones(input_ids.shape)
+                for i in range(1, max_length - tokenizer.model_max_length + 2, tokenizer.model_max_length - 2):
+                    b = i // (tokenizer.model_max_length - 2)
+                    new_weights[b, 1 : 1 + tokenizer.model_max_length - 2] = weights[i : i + tokenizer.model_max_length - 2]
+                weights = new_weights
+
+        if weighted:
+            return input_ids, weights
+        return input_ids
+
+
+class TextEncodingStrategy:
+    _strategy = None  # strategy instance: actual strategy class
+
+    @classmethod
+    def set_strategy(cls, strategy):
+        if cls._strategy is not None:
+            raise RuntimeError(f"Internal error. {cls.__name__} strategy is already set")
+        cls._strategy = strategy
+
+    @classmethod
+    def get_strategy(cls) -> Optional["TextEncodingStrategy"]:
+        return cls._strategy
+
+    def encode_tokens(
+        self, tokenize_strategy: TokenizeStrategy, models: List[Any], tokens: List[torch.Tensor]
+    ) -> List[torch.Tensor]:
+        """
+        Encode tokens into embeddings and outputs.
+        :param tokens: list of token tensors for each TextModel
+        :return: list of output embeddings for each architecture
+        """
+        raise NotImplementedError
+
+    def encode_tokens_with_weights(
+        self, tokenize_strategy: TokenizeStrategy, models: List[Any], tokens: List[torch.Tensor], weights: List[torch.Tensor]
+    ) -> List[torch.Tensor]:
+        """
+        Encode tokens into embeddings and outputs.
+        :param tokens: list of token tensors for each TextModel
+        :param weights: list of weight tensors for each TextModel
+        :return: list of output embeddings for each architecture
+        """
+        raise NotImplementedError
+
+
+class TextEncoderOutputsCachingStrategy:
+    _strategy = None  # strategy instance: actual strategy class
+
+    def __init__(
+        self,
+        cache_to_disk: bool,
+        batch_size: Optional[int],
+        skip_disk_cache_validity_check: bool,
+        is_partial: bool = False,
+        is_weighted: bool = False,
+    ) -> None:
+        self._cache_to_disk = cache_to_disk
+        self._batch_size = batch_size
+        self.skip_disk_cache_validity_check = skip_disk_cache_validity_check
+        self._is_partial = is_partial
+        self._is_weighted = is_weighted
+
+    @classmethod
+    def set_strategy(cls, strategy):
+        if cls._strategy is not None:
+            raise RuntimeError(f"Internal error. {cls.__name__} strategy is already set")
+        cls._strategy = strategy
+
+    @classmethod
+    def get_strategy(cls) -> Optional["TextEncoderOutputsCachingStrategy"]:
+        return cls._strategy
+
+    @property
+    def cache_to_disk(self):
+        return self._cache_to_disk
+
+    @property
+    def batch_size(self):
+        return self._batch_size
+
+    @property
+    def is_partial(self):
+        return self._is_partial
+
+    @property
+    def is_weighted(self):
+        return self._is_weighted
+
+    def get_outputs_npz_path(self, image_abs_path: str) -> str:
+        raise NotImplementedError
+
+    def load_outputs_npz(self, npz_path: str) -> List[np.ndarray]:
+        raise NotImplementedError
+
+    def is_disk_cached_outputs_expected(self, npz_path: str) -> bool:
+        raise NotImplementedError
+
+    def cache_batch_outputs(
+        self, tokenize_strategy: TokenizeStrategy, models: List[Any], text_encoding_strategy: TextEncodingStrategy, batch: List
+    ):
+        raise NotImplementedError
+
+
+class LatentsCachingStrategy:
+    # TODO move common utility functions into this class, such as npz handling etc.
+
+    _strategy = None  # strategy instance: actual strategy class
+
+    def __init__(self, cache_to_disk: bool, batch_size: int, skip_disk_cache_validity_check: bool) -> None:
+        self._cache_to_disk = cache_to_disk
+        self._batch_size = batch_size
+        self.skip_disk_cache_validity_check = skip_disk_cache_validity_check
+
+    @classmethod
+    def set_strategy(cls, strategy):
+        if cls._strategy is not None:
+            raise RuntimeError(f"Internal error. {cls.__name__} strategy is already set")
+        cls._strategy = strategy
+
+    @classmethod
+    def get_strategy(cls) -> Optional["LatentsCachingStrategy"]:
+        return cls._strategy
+
+    @property
+    def cache_to_disk(self):
+        return self._cache_to_disk
+
+    @property
+    def batch_size(self):
+        return self._batch_size
+
+    @property
+    def cache_suffix(self):
+        raise NotImplementedError
+
+    def get_image_size_from_disk_cache_path(self, absolute_path: str, npz_path: str) -> Tuple[Optional[int], Optional[int]]:
+        w, h = os.path.splitext(npz_path)[0].split("_")[-2].split("x")
+        return int(w), int(h)
+
+    def get_latents_npz_path(self, absolute_path: str, image_size: Tuple[int, int]) -> str:
+        raise NotImplementedError
+
+    def is_disk_cached_latents_expected(
+        self, bucket_reso: Tuple[int, int], npz_path: str, flip_aug: bool, alpha_mask: bool
+    ) -> bool:
+        raise NotImplementedError
+
+    def cache_batch_latents(self, model: Any, batch: List, flip_aug: bool, alpha_mask: bool, random_crop: bool):
+        raise NotImplementedError
+
+    def _default_is_disk_cached_latents_expected(
+        self,
+        latents_stride: int,
+        bucket_reso: Tuple[int, int],
+        npz_path: str,
+        flip_aug: bool,
+        apply_alpha_mask: bool,
+        multi_resolution: bool = False,
+    ) -> bool:
+        """
+        Args:
+            latents_stride: stride of latents
+            bucket_reso: resolution of the bucket
+            npz_path: path to the npz file
+            flip_aug: whether to flip images
+            apply_alpha_mask: whether to apply alpha mask
+            multi_resolution: whether to use multi-resolution latents
+
+        Returns:
+            bool
+        """
+        if not self.cache_to_disk:
+            return False
+        if not os.path.exists(npz_path):
+            return False
+        if self.skip_disk_cache_validity_check:
+            return True
+
+        expected_latents_size = (bucket_reso[1] // latents_stride, bucket_reso[0] // latents_stride)  # bucket_reso is (W, H)
+
+        # e.g. "_32x64", HxW
+        key_reso_suffix = f"_{expected_latents_size[0]}x{expected_latents_size[1]}" if multi_resolution else ""
+
+        try:
+            npz = np.load(npz_path)
+            if "latents" + key_reso_suffix not in npz:
+                return False
+            if flip_aug and "latents_flipped" + key_reso_suffix not in npz:
+                return False
+            if apply_alpha_mask and "alpha_mask" + key_reso_suffix not in npz:
+                return False
+        except Exception as e:
+            logger.error(f"Error loading file: {npz_path}")
+            raise e
+
+        return True
+
+    # TODO remove circular dependency for ImageInfo
+    def _default_cache_batch_latents(
+        self,
+        encode_by_vae: Callable,
+        vae_device: torch.device,
+        vae_dtype: torch.dtype,
+        image_infos: List,
+        flip_aug: bool,
+        apply_alpha_mask: bool,
+        random_crop: bool,
+        multi_resolution: bool = False,
+    ):
+        """
+        Default implementation for cache_batch_latents. Image loading, VAE, flipping, alpha mask handling are common.
+
+        Args:
+            encode_by_vae: function to encode images by VAE
+            vae_device: device to use for VAE
+            vae_dtype: dtype to use for VAE
+            image_infos: list of ImageInfo
+            flip_aug: whether to flip images
+            apply_alpha_mask: whether to apply alpha mask
+            random_crop: whether to random crop images
+            multi_resolution: whether to use multi-resolution latents
+
+        Returns:
+            None
+        """
+        from library import train_util  # import here to avoid circular import
+
+        img_tensor, alpha_masks, original_sizes, crop_ltrbs = train_util.load_images_and_masks_for_caching(
+            image_infos, apply_alpha_mask, random_crop
+        )
+        img_tensor = img_tensor.to(device=vae_device, dtype=vae_dtype)
+
+        with torch.no_grad():
+            latents_tensors = encode_by_vae(img_tensor).to("cpu")
+        if flip_aug:
+            img_tensor = torch.flip(img_tensor, dims=[3])
+            with torch.no_grad():
+                flipped_latents = encode_by_vae(img_tensor).to("cpu")
+        else:
+            flipped_latents = [None] * len(latents_tensors)
+
+        # for info, latents, flipped_latent, alpha_mask in zip(image_infos, latents_tensors, flipped_latents, alpha_masks):
+        for i in range(len(image_infos)):
+            info = image_infos[i]
+            latents = latents_tensors[i]
+            flipped_latent = flipped_latents[i]
+            alpha_mask = alpha_masks[i]
+            original_size = original_sizes[i]
+            crop_ltrb = crop_ltrbs[i]
+
+            latents_size = latents.shape[1:3]  # H, W
+            key_reso_suffix = f"_{latents_size[0]}x{latents_size[1]}" if multi_resolution else ""  # e.g. "_32x64", HxW
+
+            if self.cache_to_disk:
+                self.save_latents_to_disk(
+                    info.latents_npz, latents, original_size, crop_ltrb, flipped_latent, alpha_mask, key_reso_suffix
+                )
+            else:
+                info.latents_original_size = original_size
+                info.latents_crop_ltrb = crop_ltrb
+                info.latents = latents
+                if flip_aug:
+                    info.latents_flipped = flipped_latent
+                info.alpha_mask = alpha_mask
+
+    def load_latents_from_disk(
+        self, npz_path: str, bucket_reso: Tuple[int, int]
+    ) -> Tuple[Optional[np.ndarray], Optional[List[int]], Optional[List[int]], Optional[np.ndarray], Optional[np.ndarray]]:
+        """
+        for SD/SDXL
+
+        Args:
+            npz_path (str): Path to the npz file.
+            bucket_reso (Tuple[int, int]): The resolution of the bucket.
+
+        Returns:
+            Tuple[
+                Optional[np.ndarray], 
+                Optional[List[int]], 
+                Optional[List[int]], 
+                Optional[np.ndarray], 
+                Optional[np.ndarray]
+            ]: Latent np tensors, original size, crop (left top, right bottom), flipped latents, alpha mask
+        """
+        return self._default_load_latents_from_disk(None, npz_path, bucket_reso)
+
+    def _default_load_latents_from_disk(
+        self, latents_stride: Optional[int], npz_path: str, bucket_reso: Tuple[int, int]
+    ) -> Tuple[Optional[np.ndarray], Optional[List[int]], Optional[List[int]], Optional[np.ndarray], Optional[np.ndarray]]:
+        """
+        Args:
+            latents_stride (Optional[int]): Stride for latents. If None, load all latents.
+            npz_path (str): Path to the npz file.
+            bucket_reso (Tuple[int, int]): The resolution of the bucket.
+
+        Returns:
+            Tuple[
+                Optional[np.ndarray], 
+                Optional[List[int]], 
+                Optional[List[int]], 
+                Optional[np.ndarray], 
+                Optional[np.ndarray]
+            ]: Latent np tensors, original size, crop (left top, right bottom), flipped latents, alpha mask
+        """
+        if latents_stride is None:
+            key_reso_suffix = ""
+        else:
+            latents_size = (bucket_reso[1] // latents_stride, bucket_reso[0] // latents_stride)  # bucket_reso is (W, H)
+            key_reso_suffix = f"_{latents_size[0]}x{latents_size[1]}"  # e.g. "_32x64", HxW
+
+        npz = np.load(npz_path)
+        if "latents" + key_reso_suffix not in npz:
+            raise ValueError(f"latents{key_reso_suffix} not found in {npz_path}")
+
+        latents = npz["latents" + key_reso_suffix]
+        original_size = npz["original_size" + key_reso_suffix].tolist()
+        crop_ltrb = npz["crop_ltrb" + key_reso_suffix].tolist()
+        flipped_latents = npz["latents_flipped" + key_reso_suffix] if "latents_flipped" + key_reso_suffix in npz else None
+        alpha_mask = npz["alpha_mask" + key_reso_suffix] if "alpha_mask" + key_reso_suffix in npz else None
+        return latents, original_size, crop_ltrb, flipped_latents, alpha_mask
+
+    def save_latents_to_disk(
+        self,
+        npz_path,
+        latents_tensor,
+        original_size,
+        crop_ltrb,
+        flipped_latents_tensor=None,
+        alpha_mask=None,
+        key_reso_suffix="",
+    ):
+        """
+        Args:
+            npz_path (str): Path to the npz file.
+            latents_tensor (torch.Tensor): Latent tensor
+            original_size (List[int]): Original size of the image
+            crop_ltrb (List[int]): Crop left top right bottom
+            flipped_latents_tensor (Optional[torch.Tensor]): Flipped latent tensor
+            alpha_mask (Optional[torch.Tensor]): Alpha mask
+            key_reso_suffix (str): Key resolution suffix
+
+        Returns:
+            None
+        """
+        kwargs = {}
+
+        if os.path.exists(npz_path):
+            # load existing npz and update it
+            npz = np.load(npz_path)
+            for key in npz.files:
+                kwargs[key] = npz[key]
+
+        kwargs["latents" + key_reso_suffix] = latents_tensor.float().cpu().numpy()
+        kwargs["original_size" + key_reso_suffix] = np.array(original_size)
+        kwargs["crop_ltrb" + key_reso_suffix] = np.array(crop_ltrb)
+        if flipped_latents_tensor is not None:
+            kwargs["latents_flipped" + key_reso_suffix] = flipped_latents_tensor.float().cpu().numpy()
+        if alpha_mask is not None:
+            kwargs["alpha_mask" + key_reso_suffix] = alpha_mask.float().cpu().numpy()
+        np.savez(npz_path, **kwargs)
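The long-prompt handling in `_get_input_ids` above is easier to follow on plain lists than on tensors. The following is a minimal sketch of that chunking, not the repository's code: `split_into_chunks` and its parameters are hypothetical names introduced here for illustration, and the real method works on `torch.Tensor` ids and reads the special token ids from the tokenizer.

```python
from typing import List


def split_into_chunks(input_ids: List[int], model_max_length: int, eos_id: int, pad_id: int) -> List[List[int]]:
    """Split a padded "<BOS> ... <EOS> <PAD> ..." id list into model-sized chunks.

    Hypothetical helper: it mirrors the loop bounds and fix-ups of _get_input_ids,
    but on Python lists instead of tensors.
    """
    body = model_max_length - 2  # tokens per chunk, excluding the BOS/EOS slots
    max_length = len(input_ids)  # expected to be n * body + 2
    chunks = []
    for i in range(1, max_length - model_max_length + 2, body):
        # every chunk reuses the first id (BOS) and the last id (EOS or PAD)
        chunk = [input_ids[0]] + input_ids[i : i + body] + [input_ids[-1]]
        if pad_id != eos_id:  # the "v2 or SDXL" branch; v1 pads with EOS and needs no fix-up
            # the chunk ends in real text, so close it with EOS
            if chunk[-2] != eos_id and chunk[-2] != pad_id:
                chunk[-1] = eos_id
            # the chunk body is all padding, so make it "<BOS> <EOS> <PAD> ..."
            if chunk[1] == pad_id:
                chunk[1] = eos_id
        chunks.append(chunk)
    return chunks
```

With a toy `model_max_length` of 5, BOS=1, EOS=2 and PAD=0, the 11-id input `[1, 10, 11, 12, 13, 2, 0, 0, 0, 0, 0]` yields three chunks: `[1, 10, 11, 12, 2]`, `[1, 13, 2, 0, 0]` and `[1, 2, 0, 0, 0]`. The real tokenizers use 77, which gives 75 body tokens per chunk.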
--- a/library/strategy_flux.py
+++ b/library/strategy_flux.py
@@ -0,0 +1,271 @@
+import os
+import glob
+from typing import Any, List, Optional, Tuple, Union
+import torch
+import numpy as np
+from transformers import CLIPTokenizer, T5TokenizerFast
+
+from library import flux_utils, train_util
+from library.strategy_base import LatentsCachingStrategy, TextEncodingStrategy, TokenizeStrategy, TextEncoderOutputsCachingStrategy
+
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+CLIP_L_TOKENIZER_ID = "openai/clip-vit-large-patch14"
+T5_XXL_TOKENIZER_ID = "google/t5-v1_1-xxl"
+
+
+class FluxTokenizeStrategy(TokenizeStrategy):
+    def __init__(self, t5xxl_max_length: int = 512, tokenizer_cache_dir: Optional[str] = None) -> None:
+        self.t5xxl_max_length = t5xxl_max_length
+        self.clip_l = self._load_tokenizer(CLIPTokenizer, CLIP_L_TOKENIZER_ID, tokenizer_cache_dir=tokenizer_cache_dir)
+        self.t5xxl = self._load_tokenizer(T5TokenizerFast, T5_XXL_TOKENIZER_ID, tokenizer_cache_dir=tokenizer_cache_dir)
+
+    def tokenize(self, text: Union[str, List[str]]) -> List[torch.Tensor]:
+        text = [text] if isinstance(text, str) else text
+
+        l_tokens = self.clip_l(text, max_length=77, padding="max_length", truncation=True, return_tensors="pt")
+        t5_tokens = self.t5xxl(text, max_length=self.t5xxl_max_length, padding="max_length", truncation=True, return_tensors="pt")
+
+        t5_attn_mask = t5_tokens["attention_mask"]
+        l_tokens = l_tokens["input_ids"]
+        t5_tokens = t5_tokens["input_ids"]
+
+        return [l_tokens, t5_tokens, t5_attn_mask]
+
+
+class FluxTextEncodingStrategy(TextEncodingStrategy):
+    def __init__(self, apply_t5_attn_mask: Optional[bool] = None) -> None:
+        """
+        Args:
+            apply_t5_attn_mask: Default value for apply_t5_attn_mask.
+        """
+        self.apply_t5_attn_mask = apply_t5_attn_mask
+
+    def encode_tokens(
+        self,
+        tokenize_strategy: TokenizeStrategy,
+        models: List[Any],
+        tokens: List[torch.Tensor],
+        apply_t5_attn_mask: Optional[bool] = None,
+    ) -> List[torch.Tensor]:
+        # supports single model inference
+
+        if apply_t5_attn_mask is None:
+            apply_t5_attn_mask = self.apply_t5_attn_mask
+
+        clip_l, t5xxl = models if len(models) == 2 else (models[0], None)
+        l_tokens, t5_tokens = tokens[:2]
+        t5_attn_mask = tokens[2] if len(tokens) > 2 else None
+
+        # clip_l is None when using T5 only
+        if clip_l is not None and l_tokens is not None:
+            l_pooled = clip_l(l_tokens.to(clip_l.device))["pooler_output"]
+        else:
+            l_pooled = None
+
+        # t5xxl is None when using CLIP only
+        if t5xxl is not None and t5_tokens is not None:
+            # t5_out is [b, max length, 4096]
+            attention_mask = None if not apply_t5_attn_mask else t5_attn_mask.to(t5xxl.device)
+            t5_out, _ = t5xxl(t5_tokens.to(t5xxl.device), attention_mask, return_dict=False, output_hidden_states=True)
+            # if zero_pad_t5_output:
+            #     t5_out = t5_out * t5_attn_mask.to(t5_out.device).unsqueeze(-1)
+            txt_ids = torch.zeros(t5_out.shape[0], t5_out.shape[1], 3, device=t5_out.device)
+        else:
+            t5_out = None
+            txt_ids = None
+            t5_attn_mask = None  # caption may be dropped/shuffled, so t5_attn_mask should not be used to make sure the mask is same as the cached one
+
+        return [l_pooled, t5_out, txt_ids, t5_attn_mask]  # returns t5_attn_mask for attention mask in transformer
+
+
+class FluxTextEncoderOutputsCachingStrategy(TextEncoderOutputsCachingStrategy):
+    FLUX_TEXT_ENCODER_OUTPUTS_NPZ_SUFFIX = "_flux_te.npz"
+
+    def __init__(
+        self,
+        cache_to_disk: bool,
+        batch_size: int,
+        skip_disk_cache_validity_check: bool,
+        is_partial: bool = False,
+        apply_t5_attn_mask: bool = False,
+    ) -> None:
+        super().__init__(cache_to_disk, batch_size, skip_disk_cache_validity_check, is_partial)
+        self.apply_t5_attn_mask = apply_t5_attn_mask
+
+        self.warn_fp8_weights = False
+
+    def get_outputs_npz_path(self, image_abs_path: str) -> str:
+        return os.path.splitext(image_abs_path)[0] + FluxTextEncoderOutputsCachingStrategy.FLUX_TEXT_ENCODER_OUTPUTS_NPZ_SUFFIX
+
+    def is_disk_cached_outputs_expected(self, npz_path: str):
+        if not self.cache_to_disk:
+            return False
+        if not os.path.exists(npz_path):
+            return False
+        if self.skip_disk_cache_validity_check:
+            return True
+
+        try:
+            npz = np.load(npz_path)
+            if "l_pooled" not in npz:
+                return False
+            if "t5_out" not in npz:
+                return False
+            if "txt_ids" not in npz:
+                return False
+            if "t5_attn_mask" not in npz:
+                return False
+            if "apply_t5_attn_mask" not in npz:
+                return False
+            npz_apply_t5_attn_mask = npz["apply_t5_attn_mask"]
+            if npz_apply_t5_attn_mask != self.apply_t5_attn_mask:
+                return False
+        except Exception as e:
+            logger.error(f"Error loading file: {npz_path}")
+            raise e
+
+        return True
+
+    def load_outputs_npz(self, npz_path: str) -> List[np.ndarray]:
+        data = np.load(npz_path)
+        l_pooled = data["l_pooled"]
+        t5_out = data["t5_out"]
+        txt_ids = data["txt_ids"]
+        t5_attn_mask = data["t5_attn_mask"]
+        # apply_t5_attn_mask should be same as self.apply_t5_attn_mask
+        return [l_pooled, t5_out, txt_ids, t5_attn_mask]
+
+    def cache_batch_outputs(
+        self, tokenize_strategy: TokenizeStrategy, models: List[Any], text_encoding_strategy: TextEncodingStrategy, infos: List
+    ):
+        if not self.warn_fp8_weights:
+            if flux_utils.get_t5xxl_actual_dtype(models[1]) == torch.float8_e4m3fn:
+                logger.warning(
+                    "T5 model is using fp8 weights for caching. This may affect the quality of the cached outputs."
+                    " / T5モデルはfp8の重みを使用しています。これはキャッシュの品質に影響を与える可能性があります。"
+                )
+            self.warn_fp8_weights = True
+
+        flux_text_encoding_strategy: FluxTextEncodingStrategy = text_encoding_strategy
+        captions = [info.caption for info in infos]
+
+        tokens_and_masks = tokenize_strategy.tokenize(captions)
+        with torch.no_grad():
+            # attn_mask is applied in text_encoding_strategy.encode_tokens if apply_t5_attn_mask is True
+            l_pooled, t5_out, txt_ids, _ = flux_text_encoding_strategy.encode_tokens(tokenize_strategy, models, tokens_and_masks)
+
+        if l_pooled.dtype == torch.bfloat16:
+            l_pooled = l_pooled.float()
+        if t5_out.dtype == torch.bfloat16:
+            t5_out = t5_out.float()
+        if txt_ids.dtype == torch.bfloat16:
+            txt_ids = txt_ids.float()
+
+        l_pooled = l_pooled.cpu().numpy()
+        t5_out = t5_out.cpu().numpy()
+        txt_ids = txt_ids.cpu().numpy()
+        t5_attn_mask = tokens_and_masks[2].cpu().numpy()
+
+        for i, info in enumerate(infos):
+            l_pooled_i = l_pooled[i]
+            t5_out_i = t5_out[i]
+            txt_ids_i = txt_ids[i]
+            t5_attn_mask_i = t5_attn_mask[i]
+            apply_t5_attn_mask_i = self.apply_t5_attn_mask
+
+            if self.cache_to_disk:
+                np.savez(
+                    info.text_encoder_outputs_npz,
+                    l_pooled=l_pooled_i,
+                    t5_out=t5_out_i,
+                    txt_ids=txt_ids_i,
+                    t5_attn_mask=t5_attn_mask_i,
+                    apply_t5_attn_mask=apply_t5_attn_mask_i,
+                )
+            else:
+                # it's fine that attn mask is not None. it's overwritten before calling the model if necessary
+                info.text_encoder_outputs = (l_pooled_i, t5_out_i, txt_ids_i, t5_attn_mask_i)
+
+
+class FluxLatentsCachingStrategy(LatentsCachingStrategy):
+    FLUX_LATENTS_NPZ_SUFFIX = "_flux.npz"
+
+    def __init__(self, cache_to_disk: bool, batch_size: int, skip_disk_cache_validity_check: bool) -> None:
+        super().__init__(cache_to_disk, batch_size, skip_disk_cache_validity_check)
+
+    @property
+    def cache_suffix(self) -> str:
+        return FluxLatentsCachingStrategy.FLUX_LATENTS_NPZ_SUFFIX
+
+    def get_latents_npz_path(self, absolute_path: str, image_size: Tuple[int, int]) -> str:
+        return (
+            os.path.splitext(absolute_path)[0]
+            + f"_{image_size[0]:04d}x{image_size[1]:04d}"
+            + FluxLatentsCachingStrategy.FLUX_LATENTS_NPZ_SUFFIX
+        )
+
+    def is_disk_cached_latents_expected(self, bucket_reso: Tuple[int, int], npz_path: str, flip_aug: bool, alpha_mask: bool):
+        return self._default_is_disk_cached_latents_expected(8, bucket_reso, npz_path, flip_aug, alpha_mask, multi_resolution=True)
+
+    def load_latents_from_disk(
+        self, npz_path: str, bucket_reso: Tuple[int, int]
+    ) -> Tuple[Optional[np.ndarray], Optional[List[int]], Optional[List[int]], Optional[np.ndarray], Optional[np.ndarray]]:
+        return self._default_load_latents_from_disk(8, npz_path, bucket_reso)  # support multi-resolution
+
+    # TODO remove circular dependency for ImageInfo
+    def cache_batch_latents(self, vae, image_infos: List, flip_aug: bool, alpha_mask: bool, random_crop: bool):
+        encode_by_vae = lambda img_tensor: vae.encode(img_tensor).to("cpu")
+        vae_device = vae.device
+        vae_dtype = vae.dtype
+
+        self._default_cache_batch_latents(
+            encode_by_vae, vae_device, vae_dtype, image_infos, flip_aug, alpha_mask, random_crop, multi_resolution=True
+        )
+
+        if not train_util.HIGH_VRAM:
+            train_util.clean_memory_on_device(vae.device)
+
+
+if __name__ == "__main__":
+    # test code for FluxTokenizeStrategy
+    # tokenize() returns [clip_l input_ids, t5xxl input_ids, t5xxl attention mask]; this strategy has no CLIP-G tokenizer
+    strategy = FluxTokenizeStrategy(256)
+    text = "hello world"
+
+    l_tokens, t5_tokens, t5_attn_mask = strategy.tokenize(text)
+    # print(l_tokens.shape)
+    print(l_tokens)
+    print(t5_tokens)
+    print(t5_attn_mask)
+
+    texts = ["hello world", "the quick brown fox jumps over the lazy dog"]
+    l_tokens_2 = strategy.clip_l(texts, max_length=77, padding="max_length", truncation=True, return_tensors="pt")
+    t5_tokens_2 = strategy.t5xxl(
+        texts, max_length=strategy.t5xxl_max_length, padding="max_length", truncation=True, return_tensors="pt"
+    )
+    print(l_tokens_2)
+    print(t5_tokens_2)
+
+    # compare: the single-prompt result must match the first row of the batched result
+    print(torch.equal(l_tokens[0], l_tokens_2["input_ids"][0]))
+    print(torch.equal(t5_tokens[0], t5_tokens_2["input_ids"][0]))
+    print(torch.equal(t5_attn_mask[0], t5_tokens_2["attention_mask"][0]))
+
+    # long prompts are truncated to 77 tokens (CLIP-L) and t5xxl_max_length tokens (T5-XXL)
+    text = ",".join(["hello world! this is long text"] * 50)
+    l_tokens, t5_tokens, t5_attn_mask = strategy.tokenize(text)
+    print(l_tokens)
+    print(t5_tokens)
+    print(t5_attn_mask)
+    print(l_tokens.shape, t5_tokens.shape, t5_attn_mask.shape)
+
+    print(f"model max length l: {strategy.clip_l.model_max_length}")
+    print(f"model max length t5: {strategy.t5xxl.model_max_length}")
+    print(f"t5xxl max length used: {strategy.t5xxl_max_length}")
--- a/library/strategy_lumina.py
+++ b/library/strategy_lumina.py
@@ -0,0 +1,375 @@
+import glob
+import os
+from typing import Any, List, Optional, Tuple, Union
+
+import torch
+from transformers import AutoTokenizer, AutoModel, Gemma2Model, GemmaTokenizerFast
+from library import train_util
+from library.strategy_base import (
+    LatentsCachingStrategy,
+    TokenizeStrategy,
+    TextEncodingStrategy,
+    TextEncoderOutputsCachingStrategy,
+)
+import numpy as np
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+GEMMA_ID = "google/gemma-2-2b"
+
+
+class LuminaTokenizeStrategy(TokenizeStrategy):
+    def __init__(
+        self, system_prompt: Optional[str], max_length: Optional[int], tokenizer_cache_dir: Optional[str] = None
+    ) -> None:
+        self.tokenizer: GemmaTokenizerFast = AutoTokenizer.from_pretrained(
+            GEMMA_ID, cache_dir=tokenizer_cache_dir
+        )
+        self.tokenizer.padding_side = "right"
+
+        if system_prompt is None:
+            system_prompt = ""
+        system_prompt_special_token = "<Prompt Start>"
+        system_prompt = f"{system_prompt} {system_prompt_special_token} " if system_prompt else ""
+        self.system_prompt = system_prompt
+
+        if max_length is None:
+            self.max_length = 256
+        else:
+            self.max_length = max_length
+
+    def tokenize(
+        self, text: Union[str, List[str]], is_negative: bool = False
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        Args:
+            text (Union[str, List[str]]): Text to tokenize
+
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor]:
+                token input ids, attention_masks
+        """
+        text = [text] if isinstance(text, str) else text
+        
+        # In training, we always add system prompt (is_negative=False)
+        if not is_negative:
+            # Add system prompt to the beginning of each text
+            text = [self.system_prompt + t for t in text]
+
+        encodings = self.tokenizer(
+            text,
+            max_length=self.max_length,
+            return_tensors="pt",
+            padding="max_length",
+            truncation=True,
+            pad_to_multiple_of=8,
+        )
+        return (encodings.input_ids, encodings.attention_mask)
+
+    def tokenize_with_weights(
+        self, text: str | List[str]
+    ) -> Tuple[torch.Tensor, torch.Tensor, List[torch.Tensor]]:
+        """
+        Args:
+            text (Union[str, List[str]]): Text to tokenize
+
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor, List[torch.Tensor]]:
+                token input ids, attention_masks, weights
+        """
+        # Gemma doesn't support weighted prompts, return uniform weights
+        tokens, attention_masks = self.tokenize(text)
+        weights = [torch.ones_like(t) for t in tokens]
+        return tokens, attention_masks, weights
+
+
+class LuminaTextEncodingStrategy(TextEncodingStrategy):
+    def __init__(self) -> None:
+        super().__init__()
+
+    def encode_tokens(
+        self,
+        tokenize_strategy: TokenizeStrategy,
+        models: List[Any],
+        tokens: Tuple[torch.Tensor, torch.Tensor],
+    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """
+        Args:
+            tokenize_strategy (LuminaTokenizeStrategy): Tokenize strategy
+            models (List[Any]): Text encoders
+            tokens (Tuple[torch.Tensor, torch.Tensor]): tokens, attention_masks
+
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+                hidden_states, input_ids, attention_masks
+        """
+        text_encoder = models[0]
+        # Check model or torch dynamo OptimizedModule
+        assert isinstance(text_encoder, Gemma2Model) or isinstance(getattr(text_encoder, "_orig_mod", None), Gemma2Model), f"text encoder is not Gemma2Model: {text_encoder.__class__.__name__}"
+        input_ids, attention_masks = tokens
+
+        outputs = text_encoder(
+            input_ids=input_ids.to(text_encoder.device),
+            attention_mask=attention_masks.to(text_encoder.device),
+            output_hidden_states=True,
+            return_dict=True,
+        )
+
+        return outputs.hidden_states[-2], input_ids, attention_masks
+
+    def encode_tokens_with_weights(
+        self,
+        tokenize_strategy: TokenizeStrategy,
+        models: List[Any],
+        tokens: Tuple[torch.Tensor, torch.Tensor],
+        weights: List[torch.Tensor],
+    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """
+        Args:
+            tokenize_strategy (LuminaTokenizeStrategy): Tokenize strategy
+            models (List[Any]): Text encoders
+            tokens (Tuple[torch.Tensor, torch.Tensor]): tokens, attention_masks
+            weights (List[torch.Tensor]): Currently unused
+
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+                hidden_states, input_ids, attention_masks
+        """
+        # For simplicity, use uniform weighting
+        return self.encode_tokens(tokenize_strategy, models, tokens)
+
+
+class LuminaTextEncoderOutputsCachingStrategy(TextEncoderOutputsCachingStrategy):
+    LUMINA_TEXT_ENCODER_OUTPUTS_NPZ_SUFFIX = "_lumina_te.npz"
+
+    def __init__(
+        self,
+        cache_to_disk: bool,
+        batch_size: int,
+        skip_disk_cache_validity_check: bool,
+        is_partial: bool = False,
+    ) -> None:
+        super().__init__(
+            cache_to_disk,
+            batch_size,
+            skip_disk_cache_validity_check,
+            is_partial,
+        )
+
+    def get_outputs_npz_path(self, image_abs_path: str) -> str:
+        return (
+            os.path.splitext(image_abs_path)[0]
+            + LuminaTextEncoderOutputsCachingStrategy.LUMINA_TEXT_ENCODER_OUTPUTS_NPZ_SUFFIX
+        )
+
+    def is_disk_cached_outputs_expected(self, npz_path: str) -> bool:
+        """
+        Args:
+            npz_path (str): Path to the npz file.
+
+        Returns:
+            bool: True if the npz file is expected to be cached.
+        """
+        if not self.cache_to_disk:
+            return False
+        if not os.path.exists(npz_path):
+            return False
+        if self.skip_disk_cache_validity_check:
+            return True
+
+        try:
+            npz = np.load(npz_path)
+            if "hidden_state" not in npz:
+                return False
+            if "attention_mask" not in npz:
+                return False
+            if "input_ids" not in npz:
+                return False
+        except Exception as e:
+            logger.error(f"Error loading file: {npz_path}")
+            raise e
+
+        return True
+
+    def load_outputs_npz(self, npz_path: str) -> List[np.ndarray]:
+        """
+        Load outputs from a npz file
+
+        Returns:
+            List[np.ndarray]: hidden_state, input_ids, attention_mask
+        """
+        data = np.load(npz_path)
+        hidden_state = data["hidden_state"]
+        attention_mask = data["attention_mask"]
+        input_ids = data["input_ids"]
+        return [hidden_state, input_ids, attention_mask]
+
+    @torch.no_grad()
+    def cache_batch_outputs(
+        self,
+        tokenize_strategy: TokenizeStrategy,
+        models: List[Any],
+        text_encoding_strategy: TextEncodingStrategy,
+        batch: List[train_util.ImageInfo],
+    ) -> None:
+        """
+        Args:
+            tokenize_strategy (LuminaTokenizeStrategy): Tokenize strategy
+            models (List[Any]): Text encoders
+            text_encoding_strategy (LuminaTextEncodingStrategy): Text encoding strategy
+            batch (List[train_util.ImageInfo]): List of ImageInfo
+
+        Returns:
+            None
+        """
+        assert isinstance(text_encoding_strategy, LuminaTextEncodingStrategy)
+        assert isinstance(tokenize_strategy, LuminaTokenizeStrategy)
+
+        captions = [info.caption for info in batch]
+
+        if self.is_weighted:
+            tokens, attention_masks, weights_list = (
+                tokenize_strategy.tokenize_with_weights(captions)
+            )
+            hidden_state, input_ids, attention_masks = (
+                text_encoding_strategy.encode_tokens_with_weights(
+                    tokenize_strategy,
+                    models,
+                    (tokens, attention_masks),
+                    weights_list,
+                )
+            )
+        else:
+            tokens = tokenize_strategy.tokenize(captions)
+            hidden_state, input_ids, attention_masks = (
+                text_encoding_strategy.encode_tokens(
+                    tokenize_strategy, models, tokens
+                )
+            )
+
+        if hidden_state.dtype != torch.float32:
+            hidden_state = hidden_state.float()
+
+        hidden_state = hidden_state.cpu().numpy()
+        attention_mask = attention_masks.cpu().numpy()  # (B, S)
+        input_ids = input_ids.cpu().numpy()  # (B, S)
+
+
+        for i, info in enumerate(batch):
+            hidden_state_i = hidden_state[i]
+            attention_mask_i = attention_mask[i]
+            input_ids_i = input_ids[i]
+
+            if self.cache_to_disk:
+                assert info.text_encoder_outputs_npz is not None, f"Text encoder outputs npz path is not set for image {info.image_key}"
+                np.savez(
+                    info.text_encoder_outputs_npz,
+                    hidden_state=hidden_state_i,
+                    attention_mask=attention_mask_i,
+                    input_ids=input_ids_i,
+                )
+            else:
+                info.text_encoder_outputs = [
+                    hidden_state_i,
+                    input_ids_i,
+                    attention_mask_i,
+                ]
+
+
+class LuminaLatentsCachingStrategy(LatentsCachingStrategy):
+    LUMINA_LATENTS_NPZ_SUFFIX = "_lumina.npz"
+
+    def __init__(
+        self, cache_to_disk: bool, batch_size: int, skip_disk_cache_validity_check: bool
+    ) -> None:
+        super().__init__(cache_to_disk, batch_size, skip_disk_cache_validity_check)
+
+    @property
+    def cache_suffix(self) -> str:
+        return LuminaLatentsCachingStrategy.LUMINA_LATENTS_NPZ_SUFFIX
+
+    def get_latents_npz_path(
+        self, absolute_path: str, image_size: Tuple[int, int]
+    ) -> str:
+        return (
+            os.path.splitext(absolute_path)[0]
+            + f"_{image_size[0]:04d}x{image_size[1]:04d}"
+            + LuminaLatentsCachingStrategy.LUMINA_LATENTS_NPZ_SUFFIX
+        )
+
+    def is_disk_cached_latents_expected(
+        self,
+        bucket_reso: Tuple[int, int],
+        npz_path: str,
+        flip_aug: bool,
+        alpha_mask: bool,
+    ) -> bool:
+        """
+        Args:
+            bucket_reso (Tuple[int, int]): The resolution of the bucket.
+            npz_path (str): Path to the npz file.
+            flip_aug (bool): Whether to flip the image.
+            alpha_mask (bool): Whether the alpha mask is used.
+        """
+        return self._default_is_disk_cached_latents_expected(
+            8, bucket_reso, npz_path, flip_aug, alpha_mask, multi_resolution=True
+        )
+
+    def load_latents_from_disk(
+        self, npz_path: str, bucket_reso: Tuple[int, int]
+    ) -> Tuple[
+        Optional[np.ndarray],
+        Optional[List[int]],
+        Optional[List[int]],
+        Optional[np.ndarray],
+        Optional[np.ndarray],
+    ]:
+        """
+        Args:
+            npz_path (str): Path to the npz file.
+            bucket_reso (Tuple[int, int]): The resolution of the bucket.
+
+        Returns:
+            Tuple[
+                Optional[np.ndarray],
+                Optional[List[int]],
+                Optional[List[int]],
+                Optional[np.ndarray],
+                Optional[np.ndarray],
+            ]: Tuple of latents, original_size, crop_ltrb, flipped_latents, alpha_mask
+        """
+        return self._default_load_latents_from_disk(
+            8, npz_path, bucket_reso
+        )  # support multi-resolution
+
+    # TODO remove circular dependency for ImageInfo
+    def cache_batch_latents(
+        self,
+        model,
+        batch: List,
+        flip_aug: bool,
+        alpha_mask: bool,
+        random_crop: bool,
+    ):
+        encode_by_vae = lambda img_tensor: model.encode(img_tensor).to("cpu")
+        vae_device = model.device
+        vae_dtype = model.dtype
+
+        self._default_cache_batch_latents(
+            encode_by_vae,
+            vae_device,
+            vae_dtype,
+            batch,
+            flip_aug,
+            alpha_mask,
+            random_crop,
+            multi_resolution=True,
+        )
+
+        if not train_util.HIGH_VRAM:
+            train_util.clean_memory_on_device(model.device)
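Reviewer note, not part of the patch: a minimal sketch of the cache file naming that the Lumina strategies above appear to use. The helper names `latents_npz_path` and `text_encoder_npz_path` and the example path are hypothetical; only the two suffixes and the `splitext` plus zero-padded size pattern are taken from the code above.

```python
import os
from typing import Tuple

# Suffixes copied from the class constants in strategy_lumina.py.
LATENTS_SUFFIX = "_lumina.npz"
TEXT_ENCODER_SUFFIX = "_lumina_te.npz"


def latents_npz_path(absolute_path: str, image_size: Tuple[int, int]) -> str:
    """Hypothetical helper mirroring get_latents_npz_path: stem + size + suffix."""
    stem = os.path.splitext(absolute_path)[0]
    return f"{stem}_{image_size[0]:04d}x{image_size[1]:04d}{LATENTS_SUFFIX}"


def text_encoder_npz_path(image_abs_path: str) -> str:
    """Hypothetical helper mirroring get_outputs_npz_path: stem + suffix."""
    return os.path.splitext(image_abs_path)[0] + TEXT_ENCODER_SUFFIX


print(latents_npz_path("/data/img/cat.png", (1024, 768)))  # /data/img/cat_1024x0768_lumina.npz
print(text_encoder_npz_path("/data/img/cat.png"))  # /data/img/cat_lumina_te.npz
```

The image size is part of the latents file name, so one image can have several latents caches at different resolutions, while the text encoder cache is keyed by the image path alone.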
--- a/library/strategy_sd.py
+++ b/library/strategy_sd.py
@@ -0,0 +1,171 @@
+import glob
+import os
+from typing import Any, List, Optional, Tuple, Union
+
+import torch
+from transformers import CLIPTokenizer
+from library import train_util
+from library.strategy_base import LatentsCachingStrategy, TokenizeStrategy, TextEncodingStrategy
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+TOKENIZER_ID = "openai/clip-vit-large-patch14"
+V2_STABLE_DIFFUSION_ID = "stabilityai/stable-diffusion-2"  # only the tokenizer is used from this repo; v2 and v2.1 share the same tokenizer spec
+
+
+class SdTokenizeStrategy(TokenizeStrategy):
+    def __init__(self, v2: bool, max_length: Optional[int], tokenizer_cache_dir: Optional[str] = None) -> None:
+        """
+        max_length does not include <BOS> and <EOS> (None, 75, 150, 225)
+        """
+        logger.info(f"Using {'v2' if v2 else 'v1'} tokenizer")
+        if v2:
+            self.tokenizer = self._load_tokenizer(
+                CLIPTokenizer, V2_STABLE_DIFFUSION_ID, subfolder="tokenizer", tokenizer_cache_dir=tokenizer_cache_dir
+            )
+        else:
+            self.tokenizer = self._load_tokenizer(CLIPTokenizer, TOKENIZER_ID, tokenizer_cache_dir=tokenizer_cache_dir)
+
+        if max_length is None:
+            self.max_length = self.tokenizer.model_max_length
+        else:
+            self.max_length = max_length + 2
+
+    def tokenize(self, text: Union[str, List[str]]) -> List[torch.Tensor]:
+        text = [text] if isinstance(text, str) else text
+        return [torch.stack([self._get_input_ids(self.tokenizer, t, self.max_length) for t in text], dim=0)]
+
+    def tokenize_with_weights(self, text: str | List[str]) -> Tuple[List[torch.Tensor], List[torch.Tensor]]:
+        text = [text] if isinstance(text, str) else text
+        tokens_list = []
+        weights_list = []
+        for t in text:
+            tokens, weights = self._get_input_ids(self.tokenizer, t, self.max_length, weighted=True)
+            tokens_list.append(tokens)
+            weights_list.append(weights)
+        return [torch.stack(tokens_list, dim=0)], [torch.stack(weights_list, dim=0)]
+
+
+class SdTextEncodingStrategy(TextEncodingStrategy):
+    def __init__(self, clip_skip: Optional[int] = None) -> None:
+        self.clip_skip = clip_skip
+
+    def encode_tokens(
+        self, tokenize_strategy: TokenizeStrategy, models: List[Any], tokens: List[torch.Tensor]
+    ) -> List[torch.Tensor]:
+        text_encoder = models[0]
+        tokens = tokens[0]
+        sd_tokenize_strategy = tokenize_strategy  # type: SdTokenizeStrategy
+
+        # tokens: b,n,77
+        b_size = tokens.size()[0]
+        max_token_length = tokens.size()[1] * tokens.size()[2]
+        model_max_length = sd_tokenize_strategy.tokenizer.model_max_length
+        tokens = tokens.reshape((-1, model_max_length))  # batch_size*3, 77
+
+        tokens = tokens.to(text_encoder.device)
+
+        if self.clip_skip is None:
+            encoder_hidden_states = text_encoder(tokens)[0]
+        else:
+            enc_out = text_encoder(tokens, output_hidden_states=True, return_dict=True)
+            encoder_hidden_states = enc_out["hidden_states"][-self.clip_skip]
+            encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
+
+        # bs*3, 77, 768 or 1024
+        encoder_hidden_states = encoder_hidden_states.reshape((b_size, -1, encoder_hidden_states.shape[-1]))
+
+        if max_token_length != model_max_length:
+            v1 = sd_tokenize_strategy.tokenizer.pad_token_id == sd_tokenize_strategy.tokenizer.eos_token_id
+            if not v1:
+                # v2: merge the three chunks of <BOS>...<EOS> <PAD> ... back into a single <BOS>...<EOS> <PAD> ...; honestly not sure this implementation is right
+                states_list = [encoder_hidden_states[:, 0].unsqueeze(1)]  # <BOS>
+                for i in range(1, max_token_length, model_max_length):
+                    chunk = encoder_hidden_states[:, i : i + model_max_length - 2]  # from after <BOS> to before the last token
+                    if i > 0:
+                        for j in range(len(chunk)):
+                            if tokens[j, 1] == sd_tokenize_strategy.tokenizer.eos_token_id:
+                                # empty, i.e. the pattern <BOS> <EOS> <PAD> ...
+                                chunk[j, 0] = chunk[j, 1]  # copy the value of the next <PAD>
+                    states_list.append(chunk)  # from after <BOS> to before <EOS>
+                states_list.append(encoder_hidden_states[:, -1].unsqueeze(1))  # either <EOS> or <PAD>
+                encoder_hidden_states = torch.cat(states_list, dim=1)
+            else:
+                # v1: merge the three chunks of <BOS>...<EOS> back into a single <BOS>...<EOS>
+                states_list = [encoder_hidden_states[:, 0].unsqueeze(1)]  # <BOS>
+                for i in range(1, max_token_length, model_max_length):
+                    states_list.append(encoder_hidden_states[:, i : i + model_max_length - 2])  # from after <BOS> to before <EOS>
+                states_list.append(encoder_hidden_states[:, -1].unsqueeze(1))  # <EOS>
+                encoder_hidden_states = torch.cat(states_list, dim=1)
+
+        return [encoder_hidden_states]
+
+    def encode_tokens_with_weights(
+        self,
+        tokenize_strategy: TokenizeStrategy,
+        models: List[Any],
+        tokens_list: List[torch.Tensor],
+        weights_list: List[torch.Tensor],
+    ) -> List[torch.Tensor]:
+        encoder_hidden_states = self.encode_tokens(tokenize_strategy, models, tokens_list)[0]
+
+        weights = weights_list[0].to(encoder_hidden_states.device)
+
+        # apply weights
+        if weights.shape[1] == 1:  # no max_token_length
+            # weights: ((b, 1, 77), (b, 1, 77)), hidden_states: (b, 77, 768), (b, 77, 768)
+            encoder_hidden_states = encoder_hidden_states * weights.squeeze(1).unsqueeze(2)
+        else:
+            # weights: ((b, n, 77), (b, n, 77)), hidden_states: (b, n*75+2, 768), (b, n*75+2, 768)
+            for i in range(weights.shape[1]):
+                encoder_hidden_states[:, i * 75 + 1 : i * 75 + 76] = encoder_hidden_states[:, i * 75 + 1 : i * 75 + 76] * weights[
+                    :, i, 1:-1
+                ].unsqueeze(-1)
+
+        return [encoder_hidden_states]
+
+
+class SdSdxlLatentsCachingStrategy(LatentsCachingStrategy):
+    # sd and sdxl share the same strategy. we can make them separate, but the difference is only the suffix.
+    # and we keep the old npz for the backward compatibility.
+
+    SD_OLD_LATENTS_NPZ_SUFFIX = ".npz"
+    SD_LATENTS_NPZ_SUFFIX = "_sd.npz"
+    SDXL_LATENTS_NPZ_SUFFIX = "_sdxl.npz"
+
+    def __init__(self, sd: bool, cache_to_disk: bool, batch_size: int, skip_disk_cache_validity_check: bool) -> None:
+        super().__init__(cache_to_disk, batch_size, skip_disk_cache_validity_check)
+        self.sd = sd
+        self.suffix = (
+            SdSdxlLatentsCachingStrategy.SD_LATENTS_NPZ_SUFFIX if sd else SdSdxlLatentsCachingStrategy.SDXL_LATENTS_NPZ_SUFFIX
+        )
+    
+    @property
+    def cache_suffix(self) -> str:
+        return self.suffix
+
+    def get_latents_npz_path(self, absolute_path: str, image_size: Tuple[int, int]) -> str:
+        # support old .npz
+        old_npz_file = os.path.splitext(absolute_path)[0] + SdSdxlLatentsCachingStrategy.SD_OLD_LATENTS_NPZ_SUFFIX
+        if os.path.exists(old_npz_file):
+            return old_npz_file
+        return os.path.splitext(absolute_path)[0] + f"_{image_size[0]:04d}x{image_size[1]:04d}" + self.suffix
+
+    def is_disk_cached_latents_expected(self, bucket_reso: Tuple[int, int], npz_path: str, flip_aug: bool, alpha_mask: bool):
+        return self._default_is_disk_cached_latents_expected(8, bucket_reso, npz_path, flip_aug, alpha_mask)
+
+    # TODO remove circular dependency for ImageInfo
+    def cache_batch_latents(self, vae, image_infos: List, flip_aug: bool, alpha_mask: bool, random_crop: bool):
+        encode_by_vae = lambda img_tensor: vae.encode(img_tensor).latent_dist.sample()
+        vae_device = vae.device
+        vae_dtype = vae.dtype
+
+        self._default_cache_batch_latents(encode_by_vae, vae_device, vae_dtype, image_infos, flip_aug, alpha_mask, random_crop)
+
+        if not train_util.HIGH_VRAM:
+            train_util.clean_memory_on_device(vae.device)
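Reviewer note, not part of the patch: a small index-only sketch of how the v1 branch of `SdTextEncodingStrategy.encode_tokens` above seems to stitch several 77-token chunks into one sequence, and which slice `encode_tokens_with_weights` then scales per chunk. The function names `merged_positions` and `weight_slice` are hypothetical, and plain Python lists stand in for the tensor slicing.

```python
from typing import List, Tuple


def merged_positions(n_chunks: int, model_max_length: int = 77) -> List[int]:
    """Positions kept from the concatenated (n_chunks * 77) hidden states."""
    max_token_length = n_chunks * model_max_length
    positions = [0]  # the first <BOS>
    for i in range(1, max_token_length, model_max_length):
        # 75 content positions per chunk: skip each chunk's <BOS> and last token
        positions.extend(range(i, i + model_max_length - 2))
    positions.append(max_token_length - 1)  # the final <EOS>
    return positions


def weight_slice(chunk_index: int) -> Tuple[int, int]:
    """Half-open range in the merged sequence that chunk `chunk_index` occupies."""
    return chunk_index * 75 + 1, chunk_index * 75 + 76


print(len(merged_positions(3)))  # 227
print(weight_slice(0), weight_slice(2))  # (1, 76) (151, 226)
```

With three chunks the merged sequence has 3 * 75 + 2 = 227 positions, which matches the `(b, n*75+2, 768)` shape noted in the comment inside `encode_tokens_with_weights`.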
--- a/library/strategy_sd3.py
+++ b/library/strategy_sd3.py
@@ -0,0 +1,420 @@
+import os
+import glob
+import random
+from typing import Any, List, Optional, Tuple, Union
+import torch
+import numpy as np
+from transformers import CLIPTokenizer, T5TokenizerFast, CLIPTextModel, CLIPTextModelWithProjection, T5EncoderModel
+
+from library import sd3_utils, train_util
+from library import sd3_models
+from library.strategy_base import LatentsCachingStrategy, TextEncodingStrategy, TokenizeStrategy, TextEncoderOutputsCachingStrategy
+
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+CLIP_L_TOKENIZER_ID = "openai/clip-vit-large-patch14"
+CLIP_G_TOKENIZER_ID = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
+T5_XXL_TOKENIZER_ID = "google/t5-v1_1-xxl"
+
+
+class Sd3TokenizeStrategy(TokenizeStrategy):
+    def __init__(self, t5xxl_max_length: int = 256, tokenizer_cache_dir: Optional[str] = None) -> None:
+        self.t5xxl_max_length = t5xxl_max_length
+        self.clip_l = self._load_tokenizer(CLIPTokenizer, CLIP_L_TOKENIZER_ID, tokenizer_cache_dir=tokenizer_cache_dir)
+        self.clip_g = self._load_tokenizer(CLIPTokenizer, CLIP_G_TOKENIZER_ID, tokenizer_cache_dir=tokenizer_cache_dir)
+        self.t5xxl = self._load_tokenizer(T5TokenizerFast, T5_XXL_TOKENIZER_ID, tokenizer_cache_dir=tokenizer_cache_dir)
+        self.clip_g.pad_token_id = 0  # use 0 as pad token for clip_g
+
+    def tokenize(self, text: Union[str, List[str]]) -> List[torch.Tensor]:
+        text = [text] if isinstance(text, str) else text
+
+        l_tokens = self.clip_l(text, max_length=77, padding="max_length", truncation=True, return_tensors="pt")
+        g_tokens = self.clip_g(text, max_length=77, padding="max_length", truncation=True, return_tensors="pt")
+        t5_tokens = self.t5xxl(text, max_length=self.t5xxl_max_length, padding="max_length", truncation=True, return_tensors="pt")
+
+        l_attn_mask = l_tokens["attention_mask"]
+        g_attn_mask = g_tokens["attention_mask"]
+        t5_attn_mask = t5_tokens["attention_mask"]
+        l_tokens = l_tokens["input_ids"]
+        g_tokens = g_tokens["input_ids"]
+        t5_tokens = t5_tokens["input_ids"]
+
+        return [l_tokens, g_tokens, t5_tokens, l_attn_mask, g_attn_mask, t5_attn_mask]
+
+
+class Sd3TextEncodingStrategy(TextEncodingStrategy):
+    def __init__(
+        self,
+        apply_lg_attn_mask: Optional[bool] = None,
+        apply_t5_attn_mask: Optional[bool] = None,
+        l_dropout_rate: float = 0.0,
+        g_dropout_rate: float = 0.0,
+        t5_dropout_rate: float = 0.0,
+    ) -> None:
+        """
+        Args:
+            apply_t5_attn_mask: Default value for apply_t5_attn_mask.
+        """
+        self.apply_lg_attn_mask = apply_lg_attn_mask
+        self.apply_t5_attn_mask = apply_t5_attn_mask
+        self.l_dropout_rate = l_dropout_rate
+        self.g_dropout_rate = g_dropout_rate
+        self.t5_dropout_rate = t5_dropout_rate
+
+    def encode_tokens(
+        self,
+        tokenize_strategy: TokenizeStrategy,
+        models: List[Any],
+        tokens: List[torch.Tensor],
+        apply_lg_attn_mask: Optional[bool] = False,
+        apply_t5_attn_mask: Optional[bool] = False,
+        enable_dropout: bool = True,
+    ) -> List[torch.Tensor]:
+        """
+        returned embeddings are not masked
+        """
+        clip_l, clip_g, t5xxl = models
+        clip_l: Optional[CLIPTextModel]
+        clip_g: Optional[CLIPTextModelWithProjection]
+        t5xxl: Optional[T5EncoderModel]
+
+        if apply_lg_attn_mask is None:
+            apply_lg_attn_mask = self.apply_lg_attn_mask
+        if apply_t5_attn_mask is None:
+            apply_t5_attn_mask = self.apply_t5_attn_mask
+
+        l_tokens, g_tokens, t5_tokens, l_attn_mask, g_attn_mask, t5_attn_mask = tokens
+
+        # dropout: if enable_dropout is False, dropout is not applied. dropout means zeroing out embeddings
+
+        if l_tokens is None or clip_l is None:
+            assert g_tokens is None, "g_tokens must be None if l_tokens is None"
+            lg_out = None
+            lg_pooled = None
+            l_attn_mask = None
+            g_attn_mask = None
+        else:
+            assert g_tokens is not None, "g_tokens must not be None if l_tokens is not None"
+
+            # drop some members of the batch: we do not call clip_l and clip_g for dropped members
+            batch_size, l_seq_len = l_tokens.shape
+            g_seq_len = g_tokens.shape[1]
+
+            non_drop_l_indices = []
+            non_drop_g_indices = []
+            for i in range(l_tokens.shape[0]):
+                drop_l = enable_dropout and (self.l_dropout_rate > 0.0 and random.random() < self.l_dropout_rate)
+                drop_g = enable_dropout and (self.g_dropout_rate > 0.0 and random.random() < self.g_dropout_rate)
+                if not drop_l:
+                    non_drop_l_indices.append(i)
+                if not drop_g:
+                    non_drop_g_indices.append(i)
+
+            # filter out dropped members
+            if len(non_drop_l_indices) > 0 and len(non_drop_l_indices) < batch_size:
+                l_tokens = l_tokens[non_drop_l_indices]
+                l_attn_mask = l_attn_mask[non_drop_l_indices]
+            if len(non_drop_g_indices) > 0 and len(non_drop_g_indices) < batch_size:
+                g_tokens = g_tokens[non_drop_g_indices]
+                g_attn_mask = g_attn_mask[non_drop_g_indices]
+
+            # call clip_l for non-dropped members
+            if len(non_drop_l_indices) > 0:
+                nd_l_attn_mask = l_attn_mask.to(clip_l.device)
+                prompt_embeds = clip_l(
+                    l_tokens.to(clip_l.device), nd_l_attn_mask if apply_lg_attn_mask else None, output_hidden_states=True
+                )
+                nd_l_pooled = prompt_embeds[0]
+                nd_l_out = prompt_embeds.hidden_states[-2]
+            if len(non_drop_g_indices) > 0:
+                nd_g_attn_mask = g_attn_mask.to(clip_g.device)
+                prompt_embeds = clip_g(
+                    g_tokens.to(clip_g.device), nd_g_attn_mask if apply_lg_attn_mask else None, output_hidden_states=True
+                )
+                nd_g_pooled = prompt_embeds[0]
+                nd_g_out = prompt_embeds.hidden_states[-2]
+
+            # fill in the dropped members
+            if len(non_drop_l_indices) == batch_size:
+                l_pooled = nd_l_pooled
+                l_out = nd_l_out
+            else:
+                # model output is always float32 because the models are wrapped with Accelerator
+                l_pooled = torch.zeros((batch_size, 768), device=clip_l.device, dtype=torch.float32)
+                l_out = torch.zeros((batch_size, l_seq_len, 768), device=clip_l.device, dtype=torch.float32)
+                l_attn_mask = torch.zeros((batch_size, l_seq_len), device=clip_l.device, dtype=l_attn_mask.dtype)
+                if len(non_drop_l_indices) > 0:
+                    l_pooled[non_drop_l_indices] = nd_l_pooled
+                    l_out[non_drop_l_indices] = nd_l_out
+                    l_attn_mask[non_drop_l_indices] = nd_l_attn_mask
+
+            if len(non_drop_g_indices) == batch_size:
+                g_pooled = nd_g_pooled
+                g_out = nd_g_out
+            else:
+                g_pooled = torch.zeros((batch_size, 1280), device=clip_g.device, dtype=torch.float32)
+                g_out = torch.zeros((batch_size, g_seq_len, 1280), device=clip_g.device, dtype=torch.float32)
+                g_attn_mask = torch.zeros((batch_size, g_seq_len), device=clip_g.device, dtype=g_attn_mask.dtype)
+                if len(non_drop_g_indices) > 0:
+                    g_pooled[non_drop_g_indices] = nd_g_pooled
+                    g_out[non_drop_g_indices] = nd_g_out
+                    g_attn_mask[non_drop_g_indices] = nd_g_attn_mask
+
+            lg_pooled = torch.cat((l_pooled, g_pooled), dim=-1)
+            lg_out = torch.cat([l_out, g_out], dim=-1)
+
+        if t5xxl is None or t5_tokens is None:
+            t5_out = None
+            t5_attn_mask = None
+        else:
+            # drop some members of the batch: we do not call t5xxl for dropped members
+            batch_size, t5_seq_len = t5_tokens.shape
+            non_drop_t5_indices = []
+            for i in range(t5_tokens.shape[0]):
+                drop_t5 = enable_dropout and (self.t5_dropout_rate > 0.0 and random.random() < self.t5_dropout_rate)
+                if not drop_t5:
+                    non_drop_t5_indices.append(i)
+
+            # filter out dropped members
+            if len(non_drop_t5_indices) > 0 and len(non_drop_t5_indices) < batch_size:
+                t5_tokens = t5_tokens[non_drop_t5_indices]
+                t5_attn_mask = t5_attn_mask[non_drop_t5_indices]
+
+            # call t5xxl for non-dropped members
+            if len(non_drop_t5_indices) > 0:
+                nd_t5_attn_mask = t5_attn_mask.to(t5xxl.device)
+                nd_t5_out, _ = t5xxl(
+                    t5_tokens.to(t5xxl.device),
+                    nd_t5_attn_mask if apply_t5_attn_mask else None,
+                    return_dict=False,
+                    output_hidden_states=True,
+                )
+
+            # fill in the dropped members
+            if len(non_drop_t5_indices) == batch_size:
+                t5_out = nd_t5_out
+            else:
+                t5_out = torch.zeros((batch_size, t5_seq_len, 4096), device=t5xxl.device, dtype=torch.float32)
+                t5_attn_mask = torch.zeros((batch_size, t5_seq_len), device=t5xxl.device, dtype=t5_attn_mask.dtype)
+                if len(non_drop_t5_indices) > 0:
+                    t5_out[non_drop_t5_indices] = nd_t5_out
+                    t5_attn_mask[non_drop_t5_indices] = nd_t5_attn_mask
+
+        # masks are used for attention masking in transformer
+        return [lg_out, t5_out, lg_pooled, l_attn_mask, g_attn_mask, t5_attn_mask]
+
+    def drop_cached_text_encoder_outputs(
+        self,
+        lg_out: torch.Tensor,
+        t5_out: torch.Tensor,
+        lg_pooled: torch.Tensor,
+        l_attn_mask: torch.Tensor,
+        g_attn_mask: torch.Tensor,
+        t5_attn_mask: torch.Tensor,
+    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        # dropout is always applied to cached outputs here (there is no enable_dropout flag); dropout means zeroing out embeddings and masks
+        if lg_out is not None:
+            for i in range(lg_out.shape[0]):
+                drop_l = self.l_dropout_rate > 0.0 and random.random() < self.l_dropout_rate
+                if drop_l:
+                    lg_out[i, :, :768] = torch.zeros_like(lg_out[i, :, :768])
+                    lg_pooled[i, :768] = torch.zeros_like(lg_pooled[i, :768])
+                    if l_attn_mask is not None:
+                        l_attn_mask[i] = torch.zeros_like(l_attn_mask[i])
+                drop_g = self.g_dropout_rate > 0.0 and random.random() < self.g_dropout_rate
+                if drop_g:
+                    lg_out[i, :, 768:] = torch.zeros_like(lg_out[i, :, 768:])
+                    lg_pooled[i, 768:] = torch.zeros_like(lg_pooled[i, 768:])
+                    if g_attn_mask is not None:
+                        g_attn_mask[i] = torch.zeros_like(g_attn_mask[i])
+
+        if t5_out is not None:
+            for i in range(t5_out.shape[0]):
+                drop_t5 = self.t5_dropout_rate > 0.0 and random.random() < self.t5_dropout_rate
+                if drop_t5:
+                    t5_out[i] = torch.zeros_like(t5_out[i])
+                    if t5_attn_mask is not None:
+                        t5_attn_mask[i] = torch.zeros_like(t5_attn_mask[i])
+
+        return [lg_out, t5_out, lg_pooled, l_attn_mask, g_attn_mask, t5_attn_mask]
+
+    def concat_encodings(
+        self, lg_out: torch.Tensor, t5_out: Optional[torch.Tensor], lg_pooled: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        lg_out = torch.nn.functional.pad(lg_out, (0, 4096 - lg_out.shape[-1]))
+        if t5_out is None:
+            t5_out = torch.zeros((lg_out.shape[0], 77, 4096), device=lg_out.device, dtype=lg_out.dtype)
+        return torch.cat([lg_out, t5_out], dim=-2), lg_pooled
+
+
+class Sd3TextEncoderOutputsCachingStrategy(TextEncoderOutputsCachingStrategy):
+    SD3_TEXT_ENCODER_OUTPUTS_NPZ_SUFFIX = "_sd3_te.npz"
+
+    def __init__(
+        self,
+        cache_to_disk: bool,
+        batch_size: int,
+        skip_disk_cache_validity_check: bool,
+        is_partial: bool = False,
+        apply_lg_attn_mask: bool = False,
+        apply_t5_attn_mask: bool = False,
+    ) -> None:
+        super().__init__(cache_to_disk, batch_size, skip_disk_cache_validity_check, is_partial)
+        self.apply_lg_attn_mask = apply_lg_attn_mask
+        self.apply_t5_attn_mask = apply_t5_attn_mask
+
+    def get_outputs_npz_path(self, image_abs_path: str) -> str:
+        return os.path.splitext(image_abs_path)[0] + Sd3TextEncoderOutputsCachingStrategy.SD3_TEXT_ENCODER_OUTPUTS_NPZ_SUFFIX
+
+    def is_disk_cached_outputs_expected(self, npz_path: str):
+        if not self.cache_to_disk:
+            return False
+        if not os.path.exists(npz_path):
+            return False
+        if self.skip_disk_cache_validity_check:
+            return True
+
+        try:
+            npz = np.load(npz_path)
+            if "lg_out" not in npz:
+                return False
+            if "lg_pooled" not in npz:
+                return False
+            if "clip_l_attn_mask" not in npz or "clip_g_attn_mask" not in npz:  # necessary even if not used
+                return False
+            if "apply_lg_attn_mask" not in npz:
+                return False
+            if "t5_out" not in npz:
+                return False
+            if "t5_attn_mask" not in npz:
+                return False
+            npz_apply_lg_attn_mask = npz["apply_lg_attn_mask"]
+            if npz_apply_lg_attn_mask != self.apply_lg_attn_mask:
+                return False
+            if "apply_t5_attn_mask" not in npz:
+                return False
+            npz_apply_t5_attn_mask = npz["apply_t5_attn_mask"]
+            if npz_apply_t5_attn_mask != self.apply_t5_attn_mask:
+                return False
+        except Exception as e:
+            logger.error(f"Error loading file: {npz_path}")
+            raise e
+
+        return True
+
+    def load_outputs_npz(self, npz_path: str) -> List[np.ndarray]:
+        data = np.load(npz_path)
+        lg_out = data["lg_out"]
+        lg_pooled = data["lg_pooled"]
+        t5_out = data["t5_out"]
+
+        l_attn_mask = data["clip_l_attn_mask"]
+        g_attn_mask = data["clip_g_attn_mask"]
+        t5_attn_mask = data["t5_attn_mask"]
+
+        # apply_t5_attn_mask and apply_lg_attn_mask are the same as self.apply_t5_attn_mask and self.apply_lg_attn_mask
+        return [lg_out, t5_out, lg_pooled, l_attn_mask, g_attn_mask, t5_attn_mask]
+
+    def cache_batch_outputs(
+        self, tokenize_strategy: TokenizeStrategy, models: List[Any], text_encoding_strategy: TextEncodingStrategy, infos: List
+    ):
+        sd3_text_encoding_strategy: Sd3TextEncodingStrategy = text_encoding_strategy
+        captions = [info.caption for info in infos]
+
+        tokens_and_masks = tokenize_strategy.tokenize(captions)
+        with torch.no_grad():
+            # always disable dropout during caching
+            lg_out, t5_out, lg_pooled, l_attn_mask, g_attn_mask, t5_attn_mask = sd3_text_encoding_strategy.encode_tokens(
+                tokenize_strategy,
+                models,
+                tokens_and_masks,
+                apply_lg_attn_mask=self.apply_lg_attn_mask,
+                apply_t5_attn_mask=self.apply_t5_attn_mask,
+                enable_dropout=False,
+            )
+
+        if lg_out.dtype == torch.bfloat16:
+            lg_out = lg_out.float()
+        if lg_pooled.dtype == torch.bfloat16:
+            lg_pooled = lg_pooled.float()
+        if t5_out.dtype == torch.bfloat16:
+            t5_out = t5_out.float()
+
+        lg_out = lg_out.cpu().numpy()
+        lg_pooled = lg_pooled.cpu().numpy()
+        t5_out = t5_out.cpu().numpy()
+
+        l_attn_mask = tokens_and_masks[3].cpu().numpy()
+        g_attn_mask = tokens_and_masks[4].cpu().numpy()
+        t5_attn_mask = tokens_and_masks[5].cpu().numpy()
+
+        for i, info in enumerate(infos):
+            lg_out_i = lg_out[i]
+            t5_out_i = t5_out[i]
+            lg_pooled_i = lg_pooled[i]
+            l_attn_mask_i = l_attn_mask[i]
+            g_attn_mask_i = g_attn_mask[i]
+            t5_attn_mask_i = t5_attn_mask[i]
+            apply_lg_attn_mask = self.apply_lg_attn_mask
+            apply_t5_attn_mask = self.apply_t5_attn_mask
+
+            if self.cache_to_disk:
+                np.savez(
+                    info.text_encoder_outputs_npz,
+                    lg_out=lg_out_i,
+                    lg_pooled=lg_pooled_i,
+                    t5_out=t5_out_i,
+                    clip_l_attn_mask=l_attn_mask_i,
+                    clip_g_attn_mask=g_attn_mask_i,
+                    t5_attn_mask=t5_attn_mask_i,
+                    apply_lg_attn_mask=apply_lg_attn_mask,
+                    apply_t5_attn_mask=apply_t5_attn_mask,
+                )
+            else:
+                # it's fine that attn mask is not None. it's overwritten before calling the model if necessary
+                info.text_encoder_outputs = (lg_out_i, t5_out_i, lg_pooled_i, l_attn_mask_i, g_attn_mask_i, t5_attn_mask_i)
+
+
+class Sd3LatentsCachingStrategy(LatentsCachingStrategy):
+    SD3_LATENTS_NPZ_SUFFIX = "_sd3.npz"
+
+    def __init__(self, cache_to_disk: bool, batch_size: int, skip_disk_cache_validity_check: bool) -> None:
+        super().__init__(cache_to_disk, batch_size, skip_disk_cache_validity_check)
+
+    @property
+    def cache_suffix(self) -> str:
+        return Sd3LatentsCachingStrategy.SD3_LATENTS_NPZ_SUFFIX
+
+    def get_latents_npz_path(self, absolute_path: str, image_size: Tuple[int, int]) -> str:
+        return (
+            os.path.splitext(absolute_path)[0]
+            + f"_{image_size[0]:04d}x{image_size[1]:04d}"
+            + Sd3LatentsCachingStrategy.SD3_LATENTS_NPZ_SUFFIX
+        )
+
+    def is_disk_cached_latents_expected(self, bucket_reso: Tuple[int, int], npz_path: str, flip_aug: bool, alpha_mask: bool):
+        return self._default_is_disk_cached_latents_expected(8, bucket_reso, npz_path, flip_aug, alpha_mask, multi_resolution=True)
+
+    def load_latents_from_disk(
+        self, npz_path: str, bucket_reso: Tuple[int, int]
+    ) -> Tuple[Optional[np.ndarray], Optional[List[int]], Optional[List[int]], Optional[np.ndarray], Optional[np.ndarray]]:
+        return self._default_load_latents_from_disk(8, npz_path, bucket_reso)  # support multi-resolution
+
+    # TODO remove circular dependency for ImageInfo
+    def cache_batch_latents(self, vae, image_infos: List, flip_aug: bool, alpha_mask: bool, random_crop: bool):
+        encode_by_vae = lambda img_tensor: vae.encode(img_tensor).to("cpu")
+        vae_device = vae.device
+        vae_dtype = vae.dtype
+
+        self._default_cache_batch_latents(
+            encode_by_vae, vae_device, vae_dtype, image_infos, flip_aug, alpha_mask, random_crop, multi_resolution=True
+        )
+
+        if not train_util.HIGH_VRAM:
+            train_util.clean_memory_on_device(vae.device)
--- a/library/strategy_sdxl.py
+++ b/library/strategy_sdxl.py
@@ -0,0 +1,306 @@
+import os
+from typing import Any, List, Optional, Tuple, Union
+
+import numpy as np
+import torch
+from transformers import CLIPTokenizer, CLIPTextModel, CLIPTextModelWithProjection
+from library.strategy_base import TokenizeStrategy, TextEncodingStrategy, TextEncoderOutputsCachingStrategy
+
+
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+TOKENIZER1_PATH = "openai/clip-vit-large-patch14"
+TOKENIZER2_PATH = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
+
+
+class SdxlTokenizeStrategy(TokenizeStrategy):
+    def __init__(self, max_length: Optional[int], tokenizer_cache_dir: Optional[str] = None) -> None:
+        self.tokenizer1 = self._load_tokenizer(CLIPTokenizer, TOKENIZER1_PATH, tokenizer_cache_dir=tokenizer_cache_dir)
+        self.tokenizer2 = self._load_tokenizer(CLIPTokenizer, TOKENIZER2_PATH, tokenizer_cache_dir=tokenizer_cache_dir)
+        self.tokenizer2.pad_token_id = 0  # use 0 as pad token for tokenizer2
+
+        if max_length is None:
+            self.max_length = self.tokenizer1.model_max_length
+        else:
+            self.max_length = max_length + 2
+
+    def tokenize(self, text: Union[str, List[str]]) -> List[torch.Tensor]:
+        text = [text] if isinstance(text, str) else text
+        return (
+            torch.stack([self._get_input_ids(self.tokenizer1, t, self.max_length) for t in text], dim=0),
+            torch.stack([self._get_input_ids(self.tokenizer2, t, self.max_length) for t in text], dim=0),
+        )
+
+    def tokenize_with_weights(self, text: Union[str, List[str]]) -> Tuple[List[torch.Tensor], List[torch.Tensor]]:
+        text = [text] if isinstance(text, str) else text
+        tokens1_list, tokens2_list = [], []
+        weights1_list, weights2_list = [], []
+        for t in text:
+            tokens1, weights1 = self._get_input_ids(self.tokenizer1, t, self.max_length, weighted=True)
+            tokens2, weights2 = self._get_input_ids(self.tokenizer2, t, self.max_length, weighted=True)
+            tokens1_list.append(tokens1)
+            tokens2_list.append(tokens2)
+            weights1_list.append(weights1)
+            weights2_list.append(weights2)
+        return [torch.stack(tokens1_list, dim=0), torch.stack(tokens2_list, dim=0)], [
+            torch.stack(weights1_list, dim=0),
+            torch.stack(weights2_list, dim=0),
+        ]
+
+
+class SdxlTextEncodingStrategy(TextEncodingStrategy):
+    def __init__(self) -> None:
+        pass
+
+    def _pool_workaround(
+        self, text_encoder: CLIPTextModelWithProjection, last_hidden_state: torch.Tensor, input_ids: torch.Tensor, eos_token_id: int
+    ):
+        r"""
+        workaround for CLIP's pooling bug: it returns the hidden states for the max token id as the pooled output
+        instead of the hidden states for the EOS token
+        If we use Textual Inversion, we need to use the hidden states for the EOS token as the pooled output
+
+        Original code from CLIP's pooling function:
+
+        \# text_embeds.shape = [batch_size, sequence_length, transformer.width]
+        \# take features from the eot embedding (eot_token is the highest number in each sequence)
+        \# casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
+        pooled_output = last_hidden_state[
+            torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
+            input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1),
+        ]
+        """
+
+        # input_ids: b*n,77
+        # find index for EOS token
+
+        # The following code does not work if one of the input_ids has multiple EOS tokens (very odd case)
+        # eos_token_index = torch.where(input_ids == eos_token_id)[1]
+        # eos_token_index = eos_token_index.to(device=last_hidden_state.device)
+
+        # Create a mask where the EOS tokens are
+        eos_token_mask = (input_ids == eos_token_id).int()
+
+        # Use argmax to find the first index of the EOS token for each element in the batch
+        eos_token_index = torch.argmax(eos_token_mask, dim=1)  # this will be 0 if there is no EOS token, it's fine
+        eos_token_index = eos_token_index.to(device=last_hidden_state.device)
+
+        # get hidden states for EOS token
+        pooled_output = last_hidden_state[
+            torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), eos_token_index
+        ]
+
+        # apply projection: projection may be of different dtype than last_hidden_state
+        pooled_output = text_encoder.text_projection(pooled_output.to(text_encoder.text_projection.weight.dtype))
+        pooled_output = pooled_output.to(last_hidden_state.dtype)
+
+        return pooled_output
+
+    def _get_hidden_states_sdxl(
+        self,
+        input_ids1: torch.Tensor,
+        input_ids2: torch.Tensor,
+        tokenizer1: CLIPTokenizer,
+        tokenizer2: CLIPTokenizer,
+        text_encoder1: Union[CLIPTextModel, torch.nn.Module],
+        text_encoder2: Union[CLIPTextModelWithProjection, torch.nn.Module],
+        unwrapped_text_encoder2: Optional[CLIPTextModelWithProjection] = None,
+    ):
+        # input_ids: b,n,77 -> b*n, 77
+        b_size = input_ids1.size()[0]
+        if input_ids1.size()[1] == 1:
+            max_token_length = None
+        else:
+            max_token_length = input_ids1.size()[1] * input_ids1.size()[2]
+        input_ids1 = input_ids1.reshape((-1, tokenizer1.model_max_length))  # batch_size*n, 77
+        input_ids2 = input_ids2.reshape((-1, tokenizer2.model_max_length))  # batch_size*n, 77
+        input_ids1 = input_ids1.to(text_encoder1.device)
+        input_ids2 = input_ids2.to(text_encoder2.device)
+
+        # text_encoder1
+        enc_out = text_encoder1(input_ids1, output_hidden_states=True, return_dict=True)
+        hidden_states1 = enc_out["hidden_states"][11]
+
+        # text_encoder2
+        enc_out = text_encoder2(input_ids2, output_hidden_states=True, return_dict=True)
+        hidden_states2 = enc_out["hidden_states"][-2]  # penultimate layer
+
+        # pool2 = enc_out["text_embeds"]
+        unwrapped_text_encoder2 = unwrapped_text_encoder2 or text_encoder2
+        pool2 = self._pool_workaround(unwrapped_text_encoder2, enc_out["last_hidden_state"], input_ids2, tokenizer2.eos_token_id)
+
+        # b*n, 77, 768 or 1280 -> b, n*77, 768 or 1280
+        n_size = 1 if max_token_length is None else max_token_length // 75
+        hidden_states1 = hidden_states1.reshape((b_size, -1, hidden_states1.shape[-1]))
+        hidden_states2 = hidden_states2.reshape((b_size, -1, hidden_states2.shape[-1]))
+
+        if max_token_length is not None:
+            # bs*3, 77, 768 or 1280
+            # encoder1: merge the three consecutive <BOS>...<EOS> chunks back into a single <BOS>...<EOS>
+            states_list = [hidden_states1[:, 0].unsqueeze(1)]  # <BOS>
+            for i in range(1, max_token_length, tokenizer1.model_max_length):
+                states_list.append(hidden_states1[:, i : i + tokenizer1.model_max_length - 2])  # from after <BOS> to before <EOS>
+            states_list.append(hidden_states1[:, -1].unsqueeze(1))  # <EOS>
+            hidden_states1 = torch.cat(states_list, dim=1)
+
+            # v2: merge the three consecutive <BOS>...<EOS> <PAD> ... chunks back into a single <BOS>...<EOS> <PAD> ...; honestly not sure this implementation is right
+            states_list = [hidden_states2[:, 0].unsqueeze(1)]  # <BOS>
+            for i in range(1, max_token_length, tokenizer2.model_max_length):
+                chunk = hidden_states2[:, i : i + tokenizer2.model_max_length - 2]  # from after <BOS> to before the last token
+                # this causes an error:
+                # RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
+                # if i > 1:
+                #     for j in range(len(chunk)):  # batch_size
+                #         if input_ids2[n_index + j * n_size, 1] == tokenizer2.eos_token_id:  # empty, i.e. the <BOS> <EOS> <PAD> ... pattern
+                #             chunk[j, 0] = chunk[j, 1]  # copy the value of the next <PAD>
+                states_list.append(chunk)  # from after <BOS> to before <EOS>
+            states_list.append(hidden_states2[:, -1].unsqueeze(1))  # either <EOS> or <PAD>
+            hidden_states2 = torch.cat(states_list, dim=1)
+
+            # for pool, use the first of the n chunks
+            pool2 = pool2[::n_size]
+
+        return hidden_states1, hidden_states2, pool2
+
+    def encode_tokens(
+        self, tokenize_strategy: TokenizeStrategy, models: List[Any], tokens: List[torch.Tensor]
+    ) -> List[torch.Tensor]:
+        """
+        Args:
+            tokenize_strategy: TokenizeStrategy
+            models: List of models, [text_encoder1, text_encoder2, unwrapped text_encoder2 (optional)].
+                If text_encoder2 is wrapped by accelerate, unwrapped_text_encoder2 is required
+            tokens: List of tokens, for text_encoder1 and text_encoder2
+        """
+        if len(models) == 2:
+            text_encoder1, text_encoder2 = models
+            unwrapped_text_encoder2 = None
+        else:
+            text_encoder1, text_encoder2, unwrapped_text_encoder2 = models
+        tokens1, tokens2 = tokens
+        sdxl_tokenize_strategy = tokenize_strategy  # type: SdxlTokenizeStrategy
+        tokenizer1, tokenizer2 = sdxl_tokenize_strategy.tokenizer1, sdxl_tokenize_strategy.tokenizer2
+
+        hidden_states1, hidden_states2, pool2 = self._get_hidden_states_sdxl(
+            tokens1, tokens2, tokenizer1, tokenizer2, text_encoder1, text_encoder2, unwrapped_text_encoder2
+        )
+        return [hidden_states1, hidden_states2, pool2]
+
+    def encode_tokens_with_weights(
+        self,
+        tokenize_strategy: TokenizeStrategy,
+        models: List[Any],
+        tokens_list: List[torch.Tensor],
+        weights_list: List[torch.Tensor],
+    ) -> List[torch.Tensor]:
+        hidden_states1, hidden_states2, pool2 = self.encode_tokens(tokenize_strategy, models, tokens_list)
+
+        weights_list = [weights.to(hidden_states1.device) for weights in weights_list]
+
+        # apply weights
+        if weights_list[0].shape[1] == 1:  # no max_token_length
+            # weights: ((b, 1, 77), (b, 1, 77)), hidden_states: (b, 77, 768), (b, 77, 1280)
+            hidden_states1 = hidden_states1 * weights_list[0].squeeze(1).unsqueeze(2)
+            hidden_states2 = hidden_states2 * weights_list[1].squeeze(1).unsqueeze(2)
+        else:
+            # weights: ((b, n, 77), (b, n, 77)), hidden_states: (b, n*75+2, 768), (b, n*75+2, 1280)
+            for weight, hidden_states in zip(weights_list, [hidden_states1, hidden_states2]):
+                for i in range(weight.shape[1]):
+                    hidden_states[:, i * 75 + 1 : i * 75 + 76] = hidden_states[:, i * 75 + 1 : i * 75 + 76] * weight[
+                        :, i, 1:-1
+                    ].unsqueeze(-1)
+
+        return [hidden_states1, hidden_states2, pool2]
+
+
+class SdxlTextEncoderOutputsCachingStrategy(TextEncoderOutputsCachingStrategy):
+    SDXL_TEXT_ENCODER_OUTPUTS_NPZ_SUFFIX = "_te_outputs.npz"
+
+    def __init__(
+        self,
+        cache_to_disk: bool,
+        batch_size: int,
+        skip_disk_cache_validity_check: bool,
+        is_partial: bool = False,
+        is_weighted: bool = False,
+    ) -> None:
+        super().__init__(cache_to_disk, batch_size, skip_disk_cache_validity_check, is_partial, is_weighted)
+
+    def get_outputs_npz_path(self, image_abs_path: str) -> str:
+        return os.path.splitext(image_abs_path)[0] + SdxlTextEncoderOutputsCachingStrategy.SDXL_TEXT_ENCODER_OUTPUTS_NPZ_SUFFIX
+
+    def is_disk_cached_outputs_expected(self, npz_path: str):
+        if not self.cache_to_disk:
+            return False
+        if not os.path.exists(npz_path):
+            return False
+        if self.skip_disk_cache_validity_check:
+            return True
+
+        try:
+            npz = np.load(npz_path)
+            if "hidden_state1" not in npz or "hidden_state2" not in npz or "pool2" not in npz:
+                return False
+        except Exception as e:
+            logger.error(f"Error loading file: {npz_path}")
+            raise e
+
+        return True
+
+    def load_outputs_npz(self, npz_path: str) -> List[np.ndarray]:
+        data = np.load(npz_path)
+        hidden_state1 = data["hidden_state1"]
+        hidden_state2 = data["hidden_state2"]
+        pool2 = data["pool2"]
+        return [hidden_state1, hidden_state2, pool2]
+
+    def cache_batch_outputs(
+        self, tokenize_strategy: TokenizeStrategy, models: List[Any], text_encoding_strategy: TextEncodingStrategy, infos: List
+    ):
+        sdxl_text_encoding_strategy = text_encoding_strategy  # type: SdxlTextEncodingStrategy
+        captions = [info.caption for info in infos]
+
+        if self.is_weighted:
+            tokens_list, weights_list = tokenize_strategy.tokenize_with_weights(captions)
+            with torch.no_grad():
+                hidden_state1, hidden_state2, pool2 = sdxl_text_encoding_strategy.encode_tokens_with_weights(
+                    tokenize_strategy, models, tokens_list, weights_list
+                )
+        else:
+            tokens1, tokens2 = tokenize_strategy.tokenize(captions)
+            with torch.no_grad():
+                hidden_state1, hidden_state2, pool2 = sdxl_text_encoding_strategy.encode_tokens(
+                    tokenize_strategy, models, [tokens1, tokens2]
+                )
+
+        if hidden_state1.dtype == torch.bfloat16:
+            hidden_state1 = hidden_state1.float()
+        if hidden_state2.dtype == torch.bfloat16:
+            hidden_state2 = hidden_state2.float()
+        if pool2.dtype == torch.bfloat16:
+            pool2 = pool2.float()
+
+        hidden_state1 = hidden_state1.cpu().numpy()
+        hidden_state2 = hidden_state2.cpu().numpy()
+        pool2 = pool2.cpu().numpy()
+
+        for i, info in enumerate(infos):
+            hidden_state1_i = hidden_state1[i]
+            hidden_state2_i = hidden_state2[i]
+            pool2_i = pool2[i]
+
+            if self.cache_to_disk:
+                np.savez(
+                    info.text_encoder_outputs_npz,
+                    hidden_state1=hidden_state1_i,
+                    hidden_state2=hidden_state2_i,
+                    pool2=pool2_i,
+                )
+            else:
+                info.text_encoder_outputs = [hidden_state1_i, hidden_state2_i, pool2_i]
--- a/library/train_util.py
+++ b/library/train_util.py
--- a/library/utils.py
+++ b/library/utils.py
@@ -0,0 +1,695 @@
+import logging
+import sys
+import threading
+from typing import *
+import json
+import struct
+
+import torch
+import torch.nn as nn
+from torchvision import transforms
+from diffusers import EulerAncestralDiscreteScheduler
+import diffusers.schedulers.scheduling_euler_ancestral_discrete
+from diffusers.schedulers.scheduling_euler_ancestral_discrete import EulerAncestralDiscreteSchedulerOutput
+import cv2
+from PIL import Image
+import numpy as np
+from safetensors.torch import load_file
+
+def fire_in_thread(f, *args, **kwargs):
+    threading.Thread(target=f, args=args, kwargs=kwargs).start()
+
+
+# region Logging
+
+
+def add_logging_arguments(parser):
+    parser.add_argument(
+        "--console_log_level",
+        type=str,
+        default=None,
+        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
+        help="Set the logging level, default is INFO / ログレベルを設定する。デフォルトはINFO",
+    )
+    parser.add_argument(
+        "--console_log_file",
+        type=str,
+        default=None,
+        help="Log to a file instead of stderr / 標準エラー出力ではなくファイルにログを出力する",
+    )
+    parser.add_argument("--console_log_simple", action="store_true", help="Simple log output / シンプルなログ出力")
+
+
+def setup_logging(args=None, log_level=None, reset=False):
+    if logging.root.handlers:
+        if reset:
+            # remove all handlers
+            for handler in logging.root.handlers[:]:
+                logging.root.removeHandler(handler)
+        else:
+            return
+
+    # log_level can be set by the caller or by the args, the caller has priority. If not set, use INFO
+    if log_level is None and args is not None:
+        log_level = args.console_log_level
+    if log_level is None:
+        log_level = "INFO"
+    log_level = getattr(logging, log_level)
+
+    msg_init = None
+    if args is not None and args.console_log_file:
+        handler = logging.FileHandler(args.console_log_file, mode="w")
+    else:
+        handler = None
+        if not args or not args.console_log_simple:
+            try:
+                from rich.logging import RichHandler
+                from rich.console import Console
+                from rich.logging import RichHandler
+
+                handler = RichHandler(console=Console(stderr=True))
+            except ImportError:
+                # print("rich is not installed, using basic logging")
+                msg_init = "rich is not installed, using basic logging"
+
+        if handler is None:
+            handler = logging.StreamHandler(sys.stdout)  # same as print
+            handler.propagate = False
+
+    formatter = logging.Formatter(
+        fmt="%(message)s",
+        datefmt="%Y-%m-%d %H:%M:%S",
+    )
+    handler.setFormatter(formatter)
+    logging.root.setLevel(log_level)
+    logging.root.addHandler(handler)
+
+    if msg_init is not None:
+        logger = logging.getLogger(__name__)
+        logger.info(msg_init)
+
+setup_logging()
+logger = logging.getLogger(__name__)
+
+# endregion
+
+# region PyTorch utils
+
+
+def swap_weight_devices(layer_to_cpu: nn.Module, layer_to_cuda: nn.Module):
+    assert layer_to_cpu.__class__ == layer_to_cuda.__class__
+
+    weight_swap_jobs = []
+    for module_to_cpu, module_to_cuda in zip(layer_to_cpu.modules(), layer_to_cuda.modules()):
+        if hasattr(module_to_cpu, "weight") and module_to_cpu.weight is not None:
+            weight_swap_jobs.append((module_to_cpu, module_to_cuda, module_to_cpu.weight.data, module_to_cuda.weight.data))
+
+    torch.cuda.current_stream().synchronize()  # this prevents the illegal loss value
+
+    stream = torch.cuda.Stream()
+    with torch.cuda.stream(stream):
+        # cuda to cpu
+        for module_to_cpu, module_to_cuda, cuda_data_view, cpu_data_view in weight_swap_jobs:
+            cuda_data_view.record_stream(stream)
+            module_to_cpu.weight.data = cuda_data_view.data.to("cpu", non_blocking=True)
+
+        stream.synchronize()
+
+        # cpu to cuda
+        for module_to_cpu, module_to_cuda, cuda_data_view, cpu_data_view in weight_swap_jobs:
+            cuda_data_view.copy_(module_to_cuda.weight.data, non_blocking=True)
+            module_to_cuda.weight.data = cuda_data_view
+
+    stream.synchronize()
+    torch.cuda.current_stream().synchronize()  # this prevents the illegal loss value
+
+
+def weighs_to_device(layer: nn.Module, device: torch.device):
+    for module in layer.modules():
+        if hasattr(module, "weight") and module.weight is not None:
+            module.weight.data = module.weight.data.to(device, non_blocking=True)
+
+
+def str_to_dtype(s: Optional[str], default_dtype: Optional[torch.dtype] = None) -> torch.dtype:
+    """
+    Convert a string to a torch.dtype
+
+    Args:
+        s: string representation of the dtype
+        default_dtype: default dtype to return if s is None
+
+    Returns:
+        torch.dtype: the corresponding torch.dtype
+
+    Raises:
+        ValueError: if the dtype is not supported
+
+    Examples:
+        >>> str_to_dtype("float32")
+        torch.float32
+        >>> str_to_dtype("fp32")
+        torch.float32
+        >>> str_to_dtype("float16")
+        torch.float16
+        >>> str_to_dtype("fp16")
+        torch.float16
+        >>> str_to_dtype("bfloat16")
+        torch.bfloat16
+        >>> str_to_dtype("bf16")
+        torch.bfloat16
+        >>> str_to_dtype("fp8")
+        torch.float8_e4m3fn
+        >>> str_to_dtype("fp8_e4m3fn")
+        torch.float8_e4m3fn
+        >>> str_to_dtype("fp8_e4m3fnuz")
+        torch.float8_e4m3fnuz
+        >>> str_to_dtype("fp8_e5m2")
+        torch.float8_e5m2
+        >>> str_to_dtype("fp8_e5m2fnuz")
+        torch.float8_e5m2fnuz
+    """
+    if s is None:
+        return default_dtype
+    if s in ["bf16", "bfloat16"]:
+        return torch.bfloat16
+    elif s in ["fp16", "float16"]:
+        return torch.float16
+    elif s in ["fp32", "float32", "float"]:
+        return torch.float32
+    elif s in ["fp8_e4m3fn", "e4m3fn", "float8_e4m3fn"]:
+        return torch.float8_e4m3fn
+    elif s in ["fp8_e4m3fnuz", "e4m3fnuz", "float8_e4m3fnuz"]:
+        return torch.float8_e4m3fnuz
+    elif s in ["fp8_e5m2", "e5m2", "float8_e5m2"]:
+        return torch.float8_e5m2
+    elif s in ["fp8_e5m2fnuz", "e5m2fnuz", "float8_e5m2fnuz"]:
+        return torch.float8_e5m2fnuz
+    elif s in ["fp8", "float8"]:
+        return torch.float8_e4m3fn  # default fp8
+    else:
+        raise ValueError(f"Unsupported dtype: {s}")
+
+
+def mem_eff_save_file(tensors: Dict[str, torch.Tensor], filename: str, metadata: Optional[Dict[str, Any]] = None):
+    """
+    Save tensors in the safetensors file layout, writing one tensor at a time to keep peak memory low
+    """
+
+    _TYPES = {
+        torch.float64: "F64",
+        torch.float32: "F32",
+        torch.float16: "F16",
+        torch.bfloat16: "BF16",
+        torch.int64: "I64",
+        torch.int32: "I32",
+        torch.int16: "I16",
+        torch.int8: "I8",
+        torch.uint8: "U8",
+        torch.bool: "BOOL",
+        getattr(torch, "float8_e5m2", None): "F8_E5M2",
+        getattr(torch, "float8_e4m3fn", None): "F8_E4M3",
+    }
+    _ALIGN = 256
+
+    def validate_metadata(metadata: Dict[str, Any]) -> Dict[str, str]:
+        validated = {}
+        for key, value in metadata.items():
+            if not isinstance(key, str):
+                raise ValueError(f"Metadata key must be a string, got {type(key)}")
+            if not isinstance(value, str):
+                print(f"Warning: Metadata value for key '{key}' is not a string. Converting to string.")
+                validated[key] = str(value)
+            else:
+                validated[key] = value
+        return validated
+
+    print(f"Using memory efficient save file: {filename}")
+
+    header = {}
+    offset = 0
+    if metadata:
+        header["__metadata__"] = validate_metadata(metadata)
+    for k, v in tensors.items():
+        if v.numel() == 0:  # empty tensor
+            header[k] = {"dtype": _TYPES[v.dtype], "shape": list(v.shape), "data_offsets": [offset, offset]}
+        else:
+            size = v.numel() * v.element_size()
+            header[k] = {"dtype": _TYPES[v.dtype], "shape": list(v.shape), "data_offsets": [offset, offset + size]}
+            offset += size
+
+    hjson = json.dumps(header).encode("utf-8")
+    hjson += b" " * (-(len(hjson) + 8) % _ALIGN)
+
+    with open(filename, "wb") as f:
+        f.write(struct.pack("<Q", len(hjson)))
+        f.write(hjson)
+
+        for k, v in tensors.items():
+            if v.numel() == 0:
+                continue
+            if v.is_cuda:
+                # Direct GPU to disk save
+                with torch.cuda.device(v.device):
+                    if v.dim() == 0:  # if scalar, need to add a dimension to work with view
+                        v = v.unsqueeze(0)
+                    tensor_bytes = v.contiguous().view(torch.uint8)
+                    tensor_bytes.cpu().numpy().tofile(f)
+            else:
+                # CPU tensor save
+                if v.dim() == 0:  # if scalar, need to add a dimension to work with view
+                    v = v.unsqueeze(0)
+                v.contiguous().view(torch.uint8).numpy().tofile(f)
+
+
+class MemoryEfficientSafeOpen:
+    def __init__(self, filename):
+        self.filename = filename
+        self.file = open(filename, "rb")
+        self.header, self.header_size = self._read_header()
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.file.close()
+
+    def keys(self):
+        return [k for k in self.header.keys() if k != "__metadata__"]
+
+    def metadata(self) -> Dict[str, str]:
+        return self.header.get("__metadata__", {})
+
+    def get_tensor(self, key):
+        if key not in self.header:
+            raise KeyError(f"Tensor '{key}' not found in the file")
+
+        metadata = self.header[key]
+        offset_start, offset_end = metadata["data_offsets"]
+
+        if offset_start == offset_end:
+            tensor_bytes = None
+        else:
+            # adjust offset by header size
+            self.file.seek(self.header_size + 8 + offset_start)
+            tensor_bytes = self.file.read(offset_end - offset_start)
+
+        return self._deserialize_tensor(tensor_bytes, metadata)
+
+    def _read_header(self):
+        header_size = struct.unpack("<Q", self.file.read(8))[0]
+        header_json = self.file.read(header_size).decode("utf-8")
+        return json.loads(header_json), header_size
+
+    def _deserialize_tensor(self, tensor_bytes, metadata):
+        dtype = self._get_torch_dtype(metadata["dtype"])
+        shape = metadata["shape"]
+
+        if tensor_bytes is None:
+            byte_tensor = torch.empty(0, dtype=torch.uint8)
+        else:
+            tensor_bytes = bytearray(tensor_bytes)  # make it writable
+            byte_tensor = torch.frombuffer(tensor_bytes, dtype=torch.uint8)
+
+        # process float8 types
+        if metadata["dtype"] in ["F8_E5M2", "F8_E4M3"]:
+            return self._convert_float8(byte_tensor, metadata["dtype"], shape)
+
+        # convert to the target dtype and reshape
+        return byte_tensor.view(dtype).reshape(shape)
+
+    @staticmethod
+    def _get_torch_dtype(dtype_str):
+        dtype_map = {
+            "F64": torch.float64,
+            "F32": torch.float32,
+            "F16": torch.float16,
+            "BF16": torch.bfloat16,
+            "I64": torch.int64,
+            "I32": torch.int32,
+            "I16": torch.int16,
+            "I8": torch.int8,
+            "U8": torch.uint8,
+            "BOOL": torch.bool,
+        }
+        # add float8 types if available
+        if hasattr(torch, "float8_e5m2"):
+            dtype_map["F8_E5M2"] = torch.float8_e5m2
+        if hasattr(torch, "float8_e4m3fn"):
+            dtype_map["F8_E4M3"] = torch.float8_e4m3fn
+        return dtype_map.get(dtype_str)
+
+    @staticmethod
+    def _convert_float8(byte_tensor, dtype_str, shape):
+        if dtype_str == "F8_E5M2" and hasattr(torch, "float8_e5m2"):
+            return byte_tensor.view(torch.float8_e5m2).reshape(shape)
+        elif dtype_str == "F8_E4M3" and hasattr(torch, "float8_e4m3fn"):
+            return byte_tensor.view(torch.float8_e4m3fn).reshape(shape)
+        else:
+            # # convert to float16 if float8 is not supported
+            # print(f"Warning: {dtype_str} is not supported in this PyTorch version. Converting to float16.")
+            # return byte_tensor.view(torch.uint8).to(torch.float16).reshape(shape)
+            raise ValueError(f"Unsupported float8 type: {dtype_str} (upgrade PyTorch to support float8 types)")
+
+
+def load_safetensors(
+    path: str, device: Union[str, torch.device], disable_mmap: bool = False, dtype: Optional[torch.dtype] = torch.float32
+) -> dict[str, torch.Tensor]:
+    if disable_mmap:
+        # return safetensors.torch.load(open(path, "rb").read())
+        # use experimental loader
+        # logger.info(f"Loading without mmap (experimental)")
+        state_dict = {}
+        with MemoryEfficientSafeOpen(path) as f:
+            for key in f.keys():
+                state_dict[key] = f.get_tensor(key).to(device, dtype=dtype)
+        return state_dict
+    else:
+        try:
+            state_dict = load_file(path, device=device)
+        except Exception:
+            state_dict = load_file(path)  # prevent device invalid Error
+        if dtype is not None:
+            for key in state_dict.keys():
+                state_dict[key] = state_dict[key].to(dtype=dtype)
+        return state_dict
+
+
+# endregion
+
+# region Image utils
+
+
+def pil_resize(image, size, interpolation):
+    has_alpha = image.shape[2] == 4 if len(image.shape) == 3 else False
+
+    if has_alpha:
+        pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGRA2RGBA))
+    else:
+        pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
+
+    resized_pil = pil_image.resize(size, resample=interpolation)
+
+    # Convert back to cv2 format
+    if has_alpha:
+        resized_cv2 = cv2.cvtColor(np.array(resized_pil), cv2.COLOR_RGBA2BGRA)
+    else:
+        resized_cv2 = cv2.cvtColor(np.array(resized_pil), cv2.COLOR_RGB2BGR)
+
+    return resized_cv2
+
+
+def resize_image(image: np.ndarray, width: int, height: int, resized_width: int, resized_height: int, resize_interpolation: Optional[str] = None):
+    """
+    Resize image with resize interpolation. Defaults to AREA when downscaling (the target is no larger than the original in both dimensions), otherwise LANCZOS.
+
+    Args:
+        image: numpy.ndarray
+        width: int Original image width
+        height: int Original image height
+        resized_width: int Resized image width
+        resized_height: int Resized image height
+        resize_interpolation: Optional[str] Resize interpolation method "lanczos", "area", "bilinear", "bicubic", "nearest", "box"
+
+    Returns:
+        image
+    """
+
+    # Ensure all size parameters are actual integers
+    width = int(width)
+    height = int(height)
+    resized_width = int(resized_width)
+    resized_height = int(resized_height)
+
+    if resize_interpolation is None:
+        if width >= resized_width and height >= resized_height:
+            resize_interpolation = "area"
+        else:
+            resize_interpolation = "lanczos"
+
+    # we use PIL for lanczos (for backward compatibility) and box, cv2 for others
+    use_pil = resize_interpolation in ["lanczos", "lanczos4", "box"]
+
+    resized_size = (resized_width, resized_height)
+    if use_pil:
+        interpolation = get_pil_interpolation(resize_interpolation)
+        image = pil_resize(image, resized_size, interpolation=interpolation)
+        logger.debug(f"resize image using {resize_interpolation} (PIL)")
+    else:
+        interpolation = get_cv2_interpolation(resize_interpolation)
+        image = cv2.resize(image, resized_size, interpolation=interpolation)
+        logger.debug(f"resize image using {resize_interpolation} (cv2)")
+
+    return image
+
+
+def get_cv2_interpolation(interpolation: Optional[str]) -> Optional[int]:
+    """
+    Convert interpolation value to cv2 interpolation integer
+
+    https://docs.opencv.org/3.4/da/d54/group__imgproc__transform.html#ga5bb5a1fea74ea38e1a5445ca803ff121
+    """
+    if interpolation is None:
+        return None
+
+    if interpolation == "lanczos" or interpolation == "lanczos4":
+        # Lanczos interpolation over 8x8 neighborhood 
+        return cv2.INTER_LANCZOS4
+    elif interpolation == "nearest":
+        # Bit exact nearest neighbor interpolation. This will produce same results as the nearest neighbor method in PIL, scikit-image or Matlab. 
+        return cv2.INTER_NEAREST_EXACT
+    elif interpolation == "bilinear" or interpolation == "linear":
+        # bilinear interpolation
+        return cv2.INTER_LINEAR
+    elif interpolation == "bicubic" or interpolation == "cubic":
+        # bicubic interpolation 
+        return cv2.INTER_CUBIC
+    elif interpolation == "area":
+        # resampling using pixel area relation. It may be a preferred method for image decimation, as it gives moire'-free results. But when the image is zoomed, it is similar to the INTER_NEAREST method. 
+        return cv2.INTER_AREA
+    elif interpolation == "box":
+        # cv2 has no dedicated box filter, so INTER_AREA is used as the closest equivalent (resize_image routes "box" to PIL instead).
+        return cv2.INTER_AREA
+    else:
+        return None
+
+def get_pil_interpolation(interpolation: Optional[str]) -> Optional[Image.Resampling]:
+    """
+    Convert interpolation value to PIL interpolation
+
+    https://pillow.readthedocs.io/en/stable/handbook/concepts.html#concept-filters
+    """
+    if interpolation is None:
+        return None
+
+    if interpolation == "lanczos" or interpolation == "lanczos4":  # resize_image routes both names to PIL
+        return Image.Resampling.LANCZOS
+    elif interpolation == "nearest":
+        # Pick one nearest pixel from the input image. Ignore all other input pixels.
+        return Image.Resampling.NEAREST
+    elif interpolation == "bilinear" or interpolation == "linear":
+        # For resize calculate the output pixel value using linear interpolation on all pixels that may contribute to the output value. For other transformations linear interpolation over a 2x2 environment in the input image is used.
+        return Image.Resampling.BILINEAR
+    elif interpolation == "bicubic" or interpolation == "cubic":
+        # For resize calculate the output pixel value using cubic interpolation on all pixels that may contribute to the output value. For other transformations cubic interpolation over a 4x4 environment in the input image is used.
+        return Image.Resampling.BICUBIC
+    elif interpolation == "area":
+        # Image.Resampling.BOX may be more appropriate if upscaling 
+        # Area interpolation is related to cv2.INTER_AREA
+        # Produces a sharper image than Resampling.BILINEAR, doesn’t have dislocations on local level like with Resampling.BOX.
+        return Image.Resampling.HAMMING
+    elif interpolation == "box":
+        # Each pixel of source image contributes to one pixel of the destination image with identical weights. For upscaling is equivalent of Resampling.NEAREST.
+        return Image.Resampling.BOX
+    else:
+        return None
+
+def validate_interpolation_fn(interpolation_str: str) -> bool:
+    """
+    Check if an interpolation function is supported
+    """
+    return interpolation_str in ["lanczos", "nearest", "bilinear", "linear", "bicubic", "cubic", "area", "box"]
+
+# endregion
+
+# TODO make inf_utils.py
+# region Gradual Latent hires fix
+
+
+class GradualLatent:
+    def __init__(
+        self,
+        ratio,
+        start_timesteps,
+        every_n_steps,
+        ratio_step,
+        s_noise=1.0,
+        gaussian_blur_ksize=None,
+        gaussian_blur_sigma=0.5,
+        gaussian_blur_strength=0.5,
+        unsharp_target_x=True,
+    ):
+        self.ratio = ratio
+        self.start_timesteps = start_timesteps
+        self.every_n_steps = every_n_steps
+        self.ratio_step = ratio_step
+        self.s_noise = s_noise
+        self.gaussian_blur_ksize = gaussian_blur_ksize
+        self.gaussian_blur_sigma = gaussian_blur_sigma
+        self.gaussian_blur_strength = gaussian_blur_strength
+        self.unsharp_target_x = unsharp_target_x
+
+    def __str__(self) -> str:
+        return (
+            f"GradualLatent(ratio={self.ratio}, start_timesteps={self.start_timesteps}, "
+            + f"every_n_steps={self.every_n_steps}, ratio_step={self.ratio_step}, s_noise={self.s_noise}, "
+            + f"gaussian_blur_ksize={self.gaussian_blur_ksize}, gaussian_blur_sigma={self.gaussian_blur_sigma}, gaussian_blur_strength={self.gaussian_blur_strength}, "
+            + f"unsharp_target_x={self.unsharp_target_x})"
+        )
+
+    def apply_unshark_mask(self, x: torch.Tensor):
+        if self.gaussian_blur_ksize is None:
+            return x
+        blurred = transforms.functional.gaussian_blur(x, self.gaussian_blur_ksize, self.gaussian_blur_sigma)
+        # mask = torch.sigmoid((x - blurred) * self.gaussian_blur_strength)
+        mask = (x - blurred) * self.gaussian_blur_strength
+        sharpened = x + mask
+        return sharpened
+
+    def interpolate(self, x: torch.Tensor, resized_size, unsharp=True):
+        org_dtype = x.dtype
+        if org_dtype == torch.bfloat16:
+            x = x.float()
+
+        x = torch.nn.functional.interpolate(x, size=resized_size, mode="bicubic", align_corners=False).to(dtype=org_dtype)
+
+        # apply unsharp mask / アンシャープマスクを適用する
+        if unsharp and self.gaussian_blur_ksize:
+            x = self.apply_unshark_mask(x)
+
+        return x
+
+
+class EulerAncestralDiscreteSchedulerGL(EulerAncestralDiscreteScheduler):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.resized_size = None
+        self.gradual_latent = None
+
+    def set_gradual_latent_params(self, size, gradual_latent: GradualLatent):
+        self.resized_size = size
+        self.gradual_latent = gradual_latent
+
+    def step(
+        self,
+        model_output: torch.FloatTensor,
+        timestep: Union[float, torch.FloatTensor],
+        sample: torch.FloatTensor,
+        generator: Optional[torch.Generator] = None,
+        return_dict: bool = True,
+    ) -> Union[EulerAncestralDiscreteSchedulerOutput, Tuple]:
+        """
+        Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
+        process from the learned model outputs (most often the predicted noise).
+
+        Args:
+            model_output (`torch.FloatTensor`):
+                The direct output from learned diffusion model.
+            timestep (`float`):
+                The current discrete timestep in the diffusion chain.
+            sample (`torch.FloatTensor`):
+                A current instance of a sample created by the diffusion process.
+            generator (`torch.Generator`, *optional*):
+                A random number generator.
+            return_dict (`bool`):
+                Whether or not to return a
+                [`~schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput`] or tuple.
+
+        Returns:
+            [`~schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput`] or `tuple`:
+                If return_dict is `True`,
+                [`~schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput`] is returned,
+                otherwise a tuple is returned where the first element is the sample tensor.
+
+        """
+
+        if isinstance(timestep, int) or isinstance(timestep, torch.IntTensor) or isinstance(timestep, torch.LongTensor):
+            raise ValueError(
+                (
+                    "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
+                    " `EulerDiscreteScheduler.step()` is not supported. Make sure to pass"
+                    " one of the `scheduler.timesteps` as a timestep."
+                ),
+            )
+
+        if not self.is_scale_input_called:
+            # logger.warning(
+            print(
+                "The `scale_model_input` function should be called before `step` to ensure correct denoising. "
+                "See `StableDiffusionPipeline` for a usage example."
+            )
+
+        if self.step_index is None:
+            self._init_step_index(timestep)
+
+        sigma = self.sigmas[self.step_index]
+
+        # 1. compute predicted original sample (x_0) from sigma-scaled predicted noise
+        if self.config.prediction_type == "epsilon":
+            pred_original_sample = sample - sigma * model_output
+        elif self.config.prediction_type == "v_prediction":
+            # * c_out + input * c_skip
+            pred_original_sample = model_output * (-sigma / (sigma**2 + 1) ** 0.5) + (sample / (sigma**2 + 1))
+        elif self.config.prediction_type == "sample":
+            raise NotImplementedError("prediction_type not implemented yet: sample")
+        else:
+            raise ValueError(f"prediction_type given as {self.config.prediction_type} must be one of `epsilon`, or `v_prediction`")
+
+        sigma_from = self.sigmas[self.step_index]
+        sigma_to = self.sigmas[self.step_index + 1]
+        sigma_up = (sigma_to**2 * (sigma_from**2 - sigma_to**2) / sigma_from**2) ** 0.5
+        sigma_down = (sigma_to**2 - sigma_up**2) ** 0.5
+
+        # 2. Convert to an ODE derivative
+        derivative = (sample - pred_original_sample) / sigma
+
+        dt = sigma_down - sigma
+
+        device = model_output.device
+        if self.resized_size is None:
+            prev_sample = sample + derivative * dt
+
+            noise = diffusers.schedulers.scheduling_euler_ancestral_discrete.randn_tensor(
+                model_output.shape, dtype=model_output.dtype, device=device, generator=generator
+            )
+            s_noise = 1.0
+        else:
+            print("resized_size", self.resized_size, "model_output.shape", model_output.shape, "sample.shape", sample.shape)
+            s_noise = self.gradual_latent.s_noise
+
+            if self.gradual_latent.unsharp_target_x:
+                prev_sample = sample + derivative * dt
+                prev_sample = self.gradual_latent.interpolate(prev_sample, self.resized_size)
+            else:
+                sample = self.gradual_latent.interpolate(sample, self.resized_size)
+                derivative = self.gradual_latent.interpolate(derivative, self.resized_size, unsharp=False)
+                prev_sample = sample + derivative * dt
+
+            noise = diffusers.schedulers.scheduling_euler_ancestral_discrete.randn_tensor(
+                (model_output.shape[0], model_output.shape[1], self.resized_size[0], self.resized_size[1]),
+                dtype=model_output.dtype,
+                device=device,
+                generator=generator,
+            )
+
+        prev_sample = prev_sample + noise * sigma_up * s_noise
+
+        # upon completion increase step index by one
+        self._step_index += 1
+
+        if not return_dict:
+            return (prev_sample,)
+
+        return EulerAncestralDiscreteSchedulerOutput(prev_sample=prev_sample, pred_original_sample=pred_original_sample)
+
+
+# endregion
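Note on the file layout used by `mem_eff_save_file` and `MemoryEfficientSafeOpen` above: as far as the code shows, a file is an 8-byte little-endian header length, a JSON header padded with spaces, and then the raw tensor bytes addressed by `data_offsets` relative to the end of the header. The sketch below round-trips that layout with the standard library only. The helper names `write_blob` and `read_entry` are hypothetical, and plain `bytes` stand in for tensor storage; it is an illustration of the layout, not the project's implementation.

```python
import json
import struct

ALIGN = 256  # same padding granularity as _ALIGN in mem_eff_save_file


def write_blob(entries: dict[str, tuple[str, list[int], bytes]]) -> bytes:
    """Serialize {name: (dtype_tag, shape, raw_bytes)} into the layout described above."""
    header, offset = {}, 0
    for name, (dtype_tag, shape, raw) in entries.items():
        # offsets are relative to the start of the data section, not the file
        header[name] = {"dtype": dtype_tag, "shape": shape, "data_offsets": [offset, offset + len(raw)]}
        offset += len(raw)
    hjson = json.dumps(header).encode("utf-8")
    # pad with spaces so that (8-byte length prefix + header) is a multiple of ALIGN
    hjson += b" " * (-(len(hjson) + 8) % ALIGN)
    data = b"".join(raw for _, _, raw in entries.values())
    return struct.pack("<Q", len(hjson)) + hjson + data


def read_entry(blob: bytes, name: str) -> bytes:
    """Return the raw bytes stored under `name`."""
    (header_size,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_size].decode("utf-8"))
    start, end = header[name]["data_offsets"]
    base = 8 + header_size  # data section begins right after the padded header
    return blob[base + start : base + end]


blob = write_blob({
    "a": ("U8", [3], b"\x01\x02\x03"),
    "b": ("F32", [1], struct.pack("<f", 1.5)),
})
```

Reading seeks to `8 + header_size + offset_start`, which is the same arithmetic `get_tensor` performs; the trailing spaces in the header are harmless because `json.loads` ignores surrounding whitespace.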
--- a/lumina_minimal_inference.py
+++ b/lumina_minimal_inference.py
@@ -0,0 +1,415 @@
+# Minimum Inference Code for Lumina
+# Based on flux_minimal_inference.py
+
+import logging
+import argparse
+import math
+import os
+import random
+import time
+from typing import Optional
+
+import einops
+import numpy as np
+import torch
+from accelerate import Accelerator
+from PIL import Image
+from safetensors.torch import load_file
+from tqdm import tqdm
+from transformers import Gemma2Model
+from library.flux_models import AutoEncoder
+
+from library import (
+    device_utils,
+    lumina_models,
+    lumina_train_util,
+    lumina_util,
+    sd3_train_utils,
+    strategy_lumina,
+)
+import networks.lora_lumina as lora_lumina
+from library.device_utils import get_preferred_device, init_ipex
+from library.utils import setup_logging, str_to_dtype
+
+init_ipex()
+setup_logging()
+logger = logging.getLogger(__name__)
+
+
+def generate_image(
+    model: lumina_models.NextDiT,
+    gemma2: Gemma2Model,
+    ae: AutoEncoder,
+    prompt: str,
+    system_prompt: str,
+    seed: Optional[int],
+    image_width: int,
+    image_height: int,
+    steps: int,
+    guidance_scale: float,
+    negative_prompt: Optional[str],
+    args,
+    cfg_trunc_ratio: float = 0.25,
+    renorm_cfg: float = 1.0,
+):
+    #
+    # 0. Prepare arguments
+    #
+    device = get_preferred_device()
+    if args.device:
+        device = torch.device(args.device)
+
+    dtype = str_to_dtype(args.dtype)
+    ae_dtype = str_to_dtype(args.ae_dtype)
+    gemma2_dtype = str_to_dtype(args.gemma2_dtype)
+
+    #
+    # 1. Prepare models
+    #
+    # model.to(device, dtype=dtype)
+    model.to(dtype)
+    model.eval()
+
+    gemma2.to(device, dtype=gemma2_dtype)
+    gemma2.eval()
+
+    ae.to(ae_dtype)
+    ae.eval()
+
+    #
+    # 2. Encode prompts
+    #
+    logger.info("Encoding prompts...")
+
+    tokenize_strategy = strategy_lumina.LuminaTokenizeStrategy(system_prompt, args.gemma2_max_token_length)
+    encoding_strategy = strategy_lumina.LuminaTextEncodingStrategy()
+
+    tokens_and_masks = tokenize_strategy.tokenize(prompt)
+    with torch.no_grad():
+        gemma2_conds = encoding_strategy.encode_tokens(tokenize_strategy, [gemma2], tokens_and_masks)
+
+    tokens_and_masks = tokenize_strategy.tokenize(negative_prompt, is_negative=True)
+    with torch.no_grad():
+        neg_gemma2_conds = encoding_strategy.encode_tokens(tokenize_strategy, [gemma2], tokens_and_masks)
+
+    # Unpack Gemma2 outputs
+    prompt_hidden_states, _, prompt_attention_mask = gemma2_conds
+    uncond_hidden_states, _, uncond_attention_mask = neg_gemma2_conds
+
+    if args.offload:
+        print("Offloading models to CPU to save VRAM...")
+        gemma2.to("cpu")
+        device_utils.clean_memory()
+
+    model.to(device)
+
+    #
+    # 3. Prepare latents
+    #
+    seed = seed if seed is not None else random.randint(0, 2**32 - 1)
+    logger.info(f"Seed: {seed}")
+    torch.manual_seed(seed)
+
+    latent_height = image_height // 8
+    latent_width = image_width // 8
+    latent_channels = 16
+
+    latents = torch.randn(
+        (1, latent_channels, latent_height, latent_width),
+        device=device,
+        dtype=dtype,
+        generator=torch.Generator(device=device).manual_seed(seed),
+    )
+
+    #
+    # 4. Denoise
+    #
+    logger.info("Denoising...")
+    scheduler = sd3_train_utils.FlowMatchEulerDiscreteScheduler(num_train_timesteps=1000, shift=args.discrete_flow_shift)
+    scheduler.set_timesteps(steps, device=device)
+    timesteps = scheduler.timesteps
+
+    # # compare with lumina_train_util.retrieve_timesteps
+    # lumina_timestep = lumina_train_util.retrieve_timesteps(scheduler, num_inference_steps=steps)
+    # print(f"Using timesteps: {timesteps}")
+    # print(f"vs Lumina timesteps: {lumina_timestep}")  # should be the same
+
+    with torch.autocast(device_type=device.type, dtype=dtype), torch.no_grad():
+        latents = lumina_train_util.denoise(
+            scheduler,
+            model,
+            latents.to(device),
+            prompt_hidden_states.to(device),
+            prompt_attention_mask.to(device),
+            uncond_hidden_states.to(device),
+            uncond_attention_mask.to(device),
+            timesteps,
+            guidance_scale,
+            cfg_trunc_ratio,
+            renorm_cfg,
+        )
+
+    if args.offload:
+        model.to("cpu")
+        device_utils.clean_memory()
+    ae.to(device)  # the AE is loaded on the CPU, so move it regardless of --offload
+
+    #
+    # 5. Decode latents
+    #
+    logger.info("Decoding image...")
+    latents = latents / ae.scale_factor + ae.shift_factor
+    with torch.no_grad():
+        image = ae.decode(latents.to(ae_dtype))
+    image = (image / 2 + 0.5).clamp(0, 1)
+    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
+    image = (image * 255).round().astype("uint8")
+
+    #
+    # 6. Save image
+    #
+    pil_image = Image.fromarray(image[0])
+    output_dir = args.output_dir
+    os.makedirs(output_dir, exist_ok=True)
+    ts_str = time.strftime("%Y%m%d%H%M%S", time.localtime())
+    seed_suffix = f"_{seed}"
+    output_path = os.path.join(output_dir, f"image_{ts_str}{seed_suffix}.png")
+    pil_image.save(output_path)
+    logger.info(f"Image saved to {output_path}")
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--pretrained_model_name_or_path",
+        type=str,
+        default=None,
+        required=True,
+        help="Lumina DiT model path / Lumina DiTモデルのパス",
+    )
+    parser.add_argument(
+        "--gemma2_path",
+        type=str,
+        default=None,
+        required=True,
+        help="Gemma2 model path / Gemma2モデルのパス",
+    )
+    parser.add_argument(
+        "--ae_path",
+        type=str,
+        default=None,
+        required=True,
+        help="Autoencoder model path / Autoencoderモデルのパス",
+    )
+    parser.add_argument("--prompt", type=str, default="A beautiful sunset over the mountains", help="Prompt for image generation")
+    parser.add_argument("--negative_prompt", type=str, default="", help="Negative prompt for image generation, default is empty")
+    parser.add_argument("--output_dir", type=str, default="outputs", help="Output directory for generated images")
+    parser.add_argument("--seed", type=int, default=None, help="Random seed")
+    parser.add_argument("--steps", type=int, default=36, help="Number of inference steps")
+    parser.add_argument("--guidance_scale", type=float, default=3.5, help="Guidance scale for classifier-free guidance")
+    parser.add_argument("--image_width", type=int, default=1024, help="Image width")
+    parser.add_argument("--image_height", type=int, default=1024, help="Image height")
+    parser.add_argument("--dtype", type=str, default="bf16", help="Data type for model (bf16, fp16, float)")
+    parser.add_argument("--gemma2_dtype", type=str, default="bf16", help="Data type for Gemma2 (bf16, fp16, float)")
+    parser.add_argument("--ae_dtype", type=str, default="bf16", help="Data type for Autoencoder (bf16, fp16, float)")
+    parser.add_argument("--device", type=str, default=None, help="Device to use (e.g., 'cuda:0')")
+    parser.add_argument("--offload", action="store_true", help="Offload models to CPU to save VRAM")
+    parser.add_argument("--system_prompt", type=str, default="", help="System prompt for Gemma2 model")
+    parser.add_argument(
+        "--gemma2_max_token_length",
+        type=int,
+        default=256,
+        help="Max token length for Gemma2 tokenizer",
+    )
+    parser.add_argument(
+        "--discrete_flow_shift",
+        type=float,
+        default=6.0,
+        help="Shift value for FlowMatchEulerDiscreteScheduler",
+    )
+    parser.add_argument(
+        "--cfg_trunc_ratio",
+        type=float,
+        default=0.25,
+        help="CFG truncation ratio, forwarded to lumina_train_util.denoise",
+    )
+    parser.add_argument(
+        "--renorm_cfg",
+        type=float,
+        default=1.0,
+        help="Renorm CFG factor, forwarded to lumina_train_util.denoise",
+    )
+    parser.add_argument(
+        "--use_flash_attn",
+        action="store_true",
+        help="Use flash attention for Lumina model",
+    )
+    parser.add_argument(
+        "--use_sage_attn",
+        action="store_true",
+        help="Use sage attention for Lumina model",
+    )
+    parser.add_argument(
+        "--lora_weights",
+        type=str,
+        nargs="*",
+        default=[],
+        help="LoRA weights, each argument is a `path;multiplier` (semi-colon separated)",
+    )
+    parser.add_argument("--merge_lora_weights", action="store_true", help="Merge LoRA weights to model")
+    parser.add_argument(
+        "--interactive",
+        action="store_true",
+        help="Enable interactive mode for generating multiple images / 対話モードで複数の画像を生成する",
+    )
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+    args = parser.parse_args()
+
+    logger.info("Loading models...")
+    device = get_preferred_device()
+    if args.device:
+        device = torch.device(args.device)
+
+    # Load Lumina DiT model
+    model = lumina_util.load_lumina_model(
+        args.pretrained_model_name_or_path,
+        dtype=None,  # Load in fp32 and then convert
+        device="cpu",
+        use_flash_attn=args.use_flash_attn,
+        use_sage_attn=args.use_sage_attn,
+    )
+
+    # Load Gemma2
+    gemma2 = lumina_util.load_gemma2(args.gemma2_path, dtype=None, device="cpu")
+
+    # Load Autoencoder
+    ae = lumina_util.load_ae(args.ae_path, dtype=None, device="cpu")
+
+    # LoRA
+    lora_models = []
+    for weights_file in args.lora_weights:
+        if ";" in weights_file:
+            weights_file, multiplier = weights_file.split(";")
+            multiplier = float(multiplier)
+        else:
+            multiplier = 1.0
+
+        weights_sd = load_file(weights_file)
+        lora_model, _ = lora_lumina.create_network_from_weights(multiplier, None, ae, [gemma2], model, weights_sd, True)
+
+        if args.merge_lora_weights:
+            lora_model.merge_to([gemma2], model, weights_sd)
+        else:
+            lora_model.apply_to([gemma2], model)
+            info = lora_model.load_state_dict(weights_sd, strict=True)
+            logger.info(f"Loaded LoRA weights from {weights_file}: {info}")
+            lora_model.to(device)
+            lora_model.set_multiplier(multiplier)
+            lora_model.eval()
+
+        lora_models.append(lora_model)
+
+    if not args.interactive:
+        generate_image(
+            model,
+            gemma2,
+            ae,
+            args.prompt,
+            args.system_prompt,
+            args.seed,
+            args.image_width,
+            args.image_height,
+            args.steps,
+            args.guidance_scale,
+            args.negative_prompt,
+            args,
+            args.cfg_trunc_ratio,
+            args.renorm_cfg,
+        )
+    else:
+        # Interactive mode loop
+        image_width = args.image_width
+        image_height = args.image_height
+        steps = args.steps
+        guidance_scale = args.guidance_scale
+        cfg_trunc_ratio = args.cfg_trunc_ratio
+        renorm_cfg = args.renorm_cfg
+
+        print("Entering interactive mode.")
+        while True:
+            print(
+                "\nEnter prompt (or 'exit'). Options: --w <int> --h <int> --s <int> --d <int> --g <float> --n <str> --ctr <float> --rcfg <float> --m <m1,m2...>"
+            )
+            user_input = input()
+            if user_input.lower() == "exit":
+                break
+            if not user_input:
+                continue
+
+            # Parse options
+            options = user_input.split("--")
+            prompt = options[0].strip()
+
+            # Set defaults for each generation
+            seed = None  # New random seed each time unless specified
+            negative_prompt = args.negative_prompt  # Reset to default
+
+            for opt in options[1:]:
+                try:
+                    opt = opt.strip()
+                    if not opt:
+                        continue
+
+                    key, value = (opt.split(None, 1) + [""])[:2]
+
+                    if key == "w":
+                        image_width = int(value)
+                    elif key == "h":
+                        image_height = int(value)
+                    elif key == "s":
+                        steps = int(value)
+                    elif key == "d":
+                        seed = int(value)
+                    elif key == "g":
+                        guidance_scale = float(value)
+                    elif key == "n":
+                        negative_prompt = value if value != "-" else ""
+                    elif key == "ctr":
+                        cfg_trunc_ratio = float(value)
+                    elif key == "rcfg":
+                        renorm_cfg = float(value)
+                    elif key == "m":
+                        multipliers = value.split(",")
+                        if len(multipliers) != len(lora_models):
+                            logger.error(f"Invalid number of multipliers, expected {len(lora_models)}")
+                            continue
+                        for i, lora_model in enumerate(lora_models):
+                            lora_model.set_multiplier(float(multipliers[i].strip()))
+                    else:
+                        logger.warning(f"Unknown option: --{key}")
+
+                except (ValueError, IndexError) as e:
+                    logger.error(f"Invalid value for option --{key}: '{value}'. Error: {e}")
+
+            generate_image(
+                model,
+                gemma2,
+                ae,
+                prompt,
+                args.system_prompt,
+                seed,
+                image_width,
+                image_height,
+                steps,
+                guidance_scale,
+                negative_prompt,
+                args,
+                cfg_trunc_ratio,
+                renorm_cfg,
+            )
+
+    logger.info("Done.")
--- a/lumina_train.py
+++ b/lumina_train.py
@@ -0,0 +1,953 @@
+# training with captions
+
+# Swap blocks between CPU and GPU:
+# This implementation is inspired by and based on the work of 2kpr.
+# Many thanks to 2kpr for the original concept and implementation of memory-efficient offloading.
+# The original idea has been adapted and extended to fit the current project's needs.
+
+# Key features:
+# - CPU offloading during forward and backward passes
+# - Use of fused optimizer and grad_hook for efficient gradient processing
+# - Per-block fused optimizer instances
+
+import argparse
+import copy
+import math
+import os
+from multiprocessing import Value
+import toml
+
+from tqdm import tqdm
+
+import torch
+from library.device_utils import init_ipex, clean_memory_on_device
+
+init_ipex()
+
+from accelerate.utils import set_seed
+from library import (
+    deepspeed_utils,
+    lumina_train_util,
+    lumina_util,
+    strategy_base,
+    strategy_lumina,
+)
+from library.sd3_train_utils import FlowMatchEulerDiscreteScheduler
+
+import library.train_util as train_util
+
+from library.utils import setup_logging, add_logging_arguments
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+import library.config_util as config_util
+
+# import library.sdxl_train_util as sdxl_train_util
+from library.config_util import (
+    ConfigSanitizer,
+    BlueprintGenerator,
+)
+from library.custom_train_functions import apply_masked_loss, add_custom_train_arguments
+
+
+def train(args):
+    train_util.verify_training_args(args)
+    train_util.prepare_dataset_args(args, True)
+    # sdxl_train_util.verify_sdxl_training_args(args)
+    deepspeed_utils.prepare_deepspeed_args(args)
+    setup_logging(args, reset=True)
+
+    # temporary: backward compatibility for deprecated options. remove in the future
+    if not args.skip_cache_check:
+        args.skip_cache_check = args.skip_latents_validity_check
+
+    # assert (
+    #     not args.weighted_captions
+    # ), "weighted_captions is not supported currently / weighted_captionsは現在サポートされていません"
+    if args.cache_text_encoder_outputs_to_disk and not args.cache_text_encoder_outputs:
+        logger.warning(
+            "cache_text_encoder_outputs_to_disk is enabled, so cache_text_encoder_outputs is also enabled / cache_text_encoder_outputs_to_diskが有効になっているため、cache_text_encoder_outputsも有効になります"
+        )
+        args.cache_text_encoder_outputs = True
+
+    if args.cpu_offload_checkpointing and not args.gradient_checkpointing:
+        logger.warning(
+            "cpu_offload_checkpointing is enabled, so gradient_checkpointing is also enabled / cpu_offload_checkpointingが有効になっているため、gradient_checkpointingも有効になります"
+        )
+        args.gradient_checkpointing = True
+
+    # assert (
+    #     args.blocks_to_swap is None or args.blocks_to_swap == 0
+    # ) or not args.cpu_offload_checkpointing, "blocks_to_swap is not supported with cpu_offload_checkpointing / blocks_to_swapはcpu_offload_checkpointingと併用できません"
+
+    cache_latents = args.cache_latents
+    use_dreambooth_method = args.in_json is None
+
+    if args.seed is not None:
+        set_seed(args.seed)  # initialize the random number sequence
+
+    # prepare caching strategy: this must be set before preparing the dataset, because the dataset may use this strategy for initialization.
+    if args.cache_latents:
+        latents_caching_strategy = strategy_lumina.LuminaLatentsCachingStrategy(
+            args.cache_latents_to_disk, args.vae_batch_size, args.skip_cache_check
+        )
+        strategy_base.LatentsCachingStrategy.set_strategy(latents_caching_strategy)
+
+    # prepare the dataset
+    if args.dataset_class is None:
+        blueprint_generator = BlueprintGenerator(
+            ConfigSanitizer(True, True, args.masked_loss, True)
+        )
+        if args.dataset_config is not None:
+            logger.info(f"Load dataset config from {args.dataset_config}")
+            user_config = config_util.load_user_config(args.dataset_config)
+            ignored = ["train_data_dir", "in_json"]
+            if any(getattr(args, attr) is not None for attr in ignored):
+                logger.warning(
+                    "ignore following options because config file is found: {0} / 設定ファイルが利用されるため以下のオプションは無視されます: {0}".format(
+                        ", ".join(ignored)
+                    )
+                )
+        else:
+            if use_dreambooth_method:
+                logger.info("Using DreamBooth method.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": config_util.generate_dreambooth_subsets_config_by_subdirs(
+                                args.train_data_dir, args.reg_data_dir
+                            )
+                        }
+                    ]
+                }
+            else:
+                logger.info("Training with captions.")
+                user_config = {
+                    "datasets": [
+                        {
+                            "subsets": [
+                                {
+                                    "image_dir": args.train_data_dir,
+                                    "metadata_file": args.in_json,
+                                }
+                            ]
+                        }
+                    ]
+                }
+
+        blueprint = blueprint_generator.generate(user_config, args)
+        train_dataset_group, val_dataset_group = (
+            config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group)
+        )
+    else:
+        train_dataset_group = train_util.load_arbitrary_dataset(args)
+        val_dataset_group = None
+
+    current_epoch = Value("i", 0)
+    current_step = Value("i", 0)
+    ds_for_collator = (
+        train_dataset_group if args.max_data_loader_n_workers == 0 else None
+    )
+    collator = train_util.collator_class(current_epoch, current_step, ds_for_collator)
+
+    train_dataset_group.verify_bucket_reso_steps(16)  # TODO: confirm this value is correct
+
+    if args.debug_dataset:
+        if args.cache_text_encoder_outputs:
+            strategy_base.TextEncoderOutputsCachingStrategy.set_strategy(
+                strategy_lumina.LuminaTextEncoderOutputsCachingStrategy(
+                    args.cache_text_encoder_outputs_to_disk,
+                    args.text_encoder_batch_size,
+                    args.skip_cache_check,
+                    False,
+                )
+            )
+        strategy_base.TokenizeStrategy.set_strategy(
+            strategy_lumina.LuminaTokenizeStrategy(args.system_prompt)
+        )
+
+        train_dataset_group.set_current_strategies()
+        train_util.debug_dataset(train_dataset_group, True)
+        return
+    if len(train_dataset_group) == 0:
+        logger.error(
+            "No data found. Please verify the metadata file and train_data_dir option. / 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。"
+        )
+        return
+
+    if cache_latents:
+        assert (
+            train_dataset_group.is_latent_cacheable()
+        ), "when caching latents, neither color_aug nor random_crop can be used / latentをキャッシュするときはcolor_augとrandom_cropは使えません"
+
+    if args.cache_text_encoder_outputs:
+        assert (
+            train_dataset_group.is_text_encoder_output_cacheable()
+        ), "when caching text encoder outputs, caption_dropout_rate, shuffle_caption, token_warmup_step and caption_tag_dropout_rate cannot be used / text encoderの出力をキャッシュするときはcaption_dropout_rate, shuffle_caption, token_warmup_step, caption_tag_dropout_rateは使えません"
+
+    # prepare the accelerator
+    logger.info("prepare accelerator")
+    accelerator = train_util.prepare_accelerator(args)
+
+    # prepare the dtypes that match the mixed precision setting and cast as needed
+    weight_dtype, save_dtype = train_util.prepare_dtype(args)
+
+    # load the models
+
+    # load VAE for caching latents
+    ae = None
+    if cache_latents:
+        ae = lumina_util.load_ae(
+            args.ae, weight_dtype, "cpu", args.disable_mmap_load_safetensors
+        )
+        ae.to(accelerator.device, dtype=weight_dtype)
+        ae.requires_grad_(False)
+        ae.eval()
+
+        train_dataset_group.new_cache_latents(ae, accelerator)
+
+        ae.to("cpu")  # if no sampling, vae can be deleted
+        clean_memory_on_device(accelerator.device)
+
+        accelerator.wait_for_everyone()
+
+    # prepare tokenize strategy
+    if args.gemma2_max_token_length is None:
+        gemma2_max_token_length = 256
+    else:
+        gemma2_max_token_length = args.gemma2_max_token_length
+
+    lumina_tokenize_strategy = strategy_lumina.LuminaTokenizeStrategy(
+        args.system_prompt, gemma2_max_token_length
+    )
+    strategy_base.TokenizeStrategy.set_strategy(lumina_tokenize_strategy)
+
+    # load gemma2 for caching text encoder outputs
+    gemma2 = lumina_util.load_gemma2(
+        args.gemma2, weight_dtype, "cpu", args.disable_mmap_load_safetensors
+    )
+    gemma2.eval()
+    gemma2.requires_grad_(False)
+
+    text_encoding_strategy = strategy_lumina.LuminaTextEncodingStrategy()
+    strategy_base.TextEncodingStrategy.set_strategy(text_encoding_strategy)
+
+    # cache text encoder outputs
+    sample_prompts_te_outputs = None
+    if args.cache_text_encoder_outputs:
+        # The text encoder is already in eval mode with gradients disabled here
+        gemma2.to(accelerator.device)
+
+        text_encoder_caching_strategy = (
+            strategy_lumina.LuminaTextEncoderOutputsCachingStrategy(
+                args.cache_text_encoder_outputs_to_disk,
+                args.text_encoder_batch_size,
+                False,
+                False,
+            )
+        )
+        strategy_base.TextEncoderOutputsCachingStrategy.set_strategy(
+            text_encoder_caching_strategy
+        )
+
+        with accelerator.autocast():
+            train_dataset_group.new_cache_text_encoder_outputs([gemma2], accelerator)
+
+        # cache sample prompt's embeddings to free text encoder's memory
+        if args.sample_prompts is not None:
+            logger.info(
+                f"cache Text Encoder outputs for sample prompt: {args.sample_prompts}"
+            )
+
+            text_encoding_strategy: strategy_lumina.LuminaTextEncodingStrategy = (
+                strategy_base.TextEncodingStrategy.get_strategy()
+            )
+
+            prompts = train_util.load_prompts(args.sample_prompts)
+            sample_prompts_te_outputs = {}  # key: prompt, value: text encoder outputs
+            with accelerator.autocast(), torch.no_grad():
+                for prompt_dict in prompts:
+                    for i, p in enumerate([
+                        prompt_dict.get("prompt", ""),
+                        prompt_dict.get("negative_prompt", ""),
+                    ]):
+                        if p not in sample_prompts_te_outputs:
+                            logger.info(f"cache Text Encoder outputs for prompt: {p}")
+                            tokens_and_masks = lumina_tokenize_strategy.tokenize(p, i == 1)  # i == 1 means negative prompt
+                            sample_prompts_te_outputs[p] = (
+                                text_encoding_strategy.encode_tokens(
+                                    lumina_tokenize_strategy,
+                                    [gemma2],
+                                    tokens_and_masks,
+                                )
+                            )
+
+        accelerator.wait_for_everyone()
+
+        # now we can delete Text Encoders to free memory
+        gemma2 = None
+        clean_memory_on_device(accelerator.device)
+
+    # load lumina
+    nextdit = lumina_util.load_lumina_model(
+        args.pretrained_model_name_or_path,
+        weight_dtype,  # loading_dtype was never defined in this function; load in the training dtype
+        torch.device("cpu"),
+        disable_mmap=args.disable_mmap_load_safetensors,
+        use_flash_attn=args.use_flash_attn,
+    )
+
+    if args.gradient_checkpointing:
+        nextdit.enable_gradient_checkpointing(
+            cpu_offload=args.cpu_offload_checkpointing
+        )
+
+    nextdit.requires_grad_(True)
+
+    # block swap
+
+    # backward compatibility
+    # if args.blocks_to_swap is None:
+    #     blocks_to_swap = args.double_blocks_to_swap or 0
+    #     if args.single_blocks_to_swap is not None:
+    #         blocks_to_swap += args.single_blocks_to_swap // 2
+    #     if blocks_to_swap > 0:
+    #         logger.warning(
+    #             "double_blocks_to_swap and single_blocks_to_swap are deprecated. Use blocks_to_swap instead."
+    #             " / double_blocks_to_swapとsingle_blocks_to_swapは非推奨です。blocks_to_swapを使ってください。"
+    #         )
+    #         logger.info(
+    #             f"double_blocks_to_swap={args.double_blocks_to_swap} and single_blocks_to_swap={args.single_blocks_to_swap} are converted to blocks_to_swap={blocks_to_swap}."
+    #         )
+    #         args.blocks_to_swap = blocks_to_swap
+    #     del blocks_to_swap
+
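+    # Block swap is not enabled in this script yet (the code below is still commented out),
+    # but later code reads is_swapping_blocks, so define it here to avoid a NameError.
+    is_swapping_blocks = False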
+    # if is_swapping_blocks:
+    #     # Swap blocks between CPU and GPU to reduce memory usage, in forward and backward passes.
+    #     # This idea is based on 2kpr's great work. Thank you!
+    #     logger.info(f"enable block swap: blocks_to_swap={args.blocks_to_swap}")
+    #     flux.enable_block_swap(args.blocks_to_swap, accelerator.device)
+
+    if not cache_latents:
+        # load VAE here if not cached
+        ae = lumina_util.load_ae(
+            args.ae, weight_dtype, "cpu", args.disable_mmap_load_safetensors
+        )
+        ae.requires_grad_(False)
+        ae.eval()
+        ae.to(accelerator.device, dtype=weight_dtype)
+
+    training_models = []
+    params_to_optimize = []
+    training_models.append(nextdit)
+    name_and_params = list(nextdit.named_parameters())
+    # single param group for now
+    params_to_optimize.append(
+        {"params": [p for _, p in name_and_params], "lr": args.learning_rate}
+    )
+    param_names = [[n for n, _ in name_and_params]]
+
+    # calculate number of trainable parameters
+    n_params = 0
+    for group in params_to_optimize:
+        for p in group["params"]:
+            n_params += p.numel()
+
+    accelerator.print(f"number of trainable parameters: {n_params}")
+
+    # prepare the classes needed for training
+    accelerator.print("prepare optimizer, data loader etc.")
+
+    if args.blockwise_fused_optimizers:
+        # fused backward pass: https://pytorch.org/tutorials/intermediate/optimizer_step_in_backward_tutorial.html
+        # Instead of creating an optimizer for all parameters as in the tutorial, we create an optimizer for each block of parameters.
+        # This balances memory usage and management complexity.
+
+        # split params into groups. currently different learning rates are not supported
+        grouped_params = []
+        param_group = {}
+        for group in params_to_optimize:
+            named_parameters = list(nextdit.named_parameters())
+            assert len(named_parameters) == len(
+                group["params"]
+            ), "number of parameters does not match"
+            for p, np in zip(group["params"], named_parameters):
+                # determine target layer and block index for each parameter
+                block_type = "other"  # double, single or other
+                if np[0].startswith("double_blocks"):
+                    block_index = int(np[0].split(".")[1])
+                    block_type = "double"
+                elif np[0].startswith("single_blocks"):
+                    block_index = int(np[0].split(".")[1])
+                    block_type = "single"
+                else:
+                    block_index = -1
+
+                param_group_key = (block_type, block_index)
+                if param_group_key not in param_group:
+                    param_group[param_group_key] = []
+                param_group[param_group_key].append(p)
+
+        block_types_and_indices = []
+        for param_group_key, param_group in param_group.items():
+            block_types_and_indices.append(param_group_key)
+            grouped_params.append({"params": param_group, "lr": args.learning_rate})
+
+            num_params = 0
+            for p in param_group:
+                num_params += p.numel()
+            accelerator.print(f"block {param_group_key}: {num_params} parameters")
+
+        # prepare optimizers for each group
+        optimizers = []
+        for group in grouped_params:
+            _, _, optimizer = train_util.get_optimizer(args, trainable_params=[group])
+            optimizers.append(optimizer)
+        optimizer = optimizers[0]  # avoid error in the following code
+
+        logger.info(
+            f"using {len(optimizers)} optimizers for blockwise fused optimizers"
+        )
+
+        if train_util.is_schedulefree_optimizer(optimizers[0], args):
+            raise ValueError(
+                "Schedule-free optimizer is not supported with blockwise fused optimizers"
+            )
+        optimizer_train_fn = lambda: None  # dummy function
+        optimizer_eval_fn = lambda: None  # dummy function
+    else:
+        _, _, optimizer = train_util.get_optimizer(
+            args, trainable_params=params_to_optimize
+        )
+        optimizer_train_fn, optimizer_eval_fn = train_util.get_optimizer_train_eval_fn(
+            optimizer, args
+        )
+
+    # prepare dataloader
+    # strategies are set here because they cannot be referenced in another process. Copy them with the dataset
+    # some strategies can be None
+    train_dataset_group.set_current_strategies()
+
+    # number of DataLoader worker processes: note that persistent_workers cannot be used when this is 0
+    n_workers = min(
+        args.max_data_loader_n_workers, os.cpu_count()
+    )  # cpu_count or max_data_loader_n_workers
+    train_dataloader = torch.utils.data.DataLoader(
+        train_dataset_group,
+        batch_size=1,
+        shuffle=True,
+        collate_fn=collator,
+        num_workers=n_workers,
+        persistent_workers=args.persistent_data_loader_workers,
+    )
+
+    # calculate the number of training steps
+    if args.max_train_epochs is not None:
+        args.max_train_steps = args.max_train_epochs * math.ceil(
+            len(train_dataloader)
+            / accelerator.num_processes
+            / args.gradient_accumulation_steps
+        )
+        accelerator.print(
+            f"override steps. steps for {args.max_train_epochs} epochs is / 指定エポックまでのステップ数: {args.max_train_steps}"
+        )
+
+    # also pass the number of training steps to the dataset
+    train_dataset_group.set_max_train_steps(args.max_train_steps)
+
+    # prepare the lr scheduler
+    if args.blockwise_fused_optimizers:
+        # prepare lr schedulers for each optimizer
+        lr_schedulers = [
+            train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
+            for optimizer in optimizers
+        ]
+        lr_scheduler = lr_schedulers[0]  # avoid error in the following code
+    else:
+        lr_scheduler = train_util.get_scheduler_fix(
+            args, optimizer, accelerator.num_processes
+        )
+
+    # experimental feature: train in fp16/bf16 including gradients, casting the whole model to fp16/bf16
+    if args.full_fp16:
+        assert (
+            args.mixed_precision == "fp16"
+        ), "full_fp16 requires mixed precision='fp16' / full_fp16を使う場合はmixed_precision='fp16'を指定してください。"
+        accelerator.print("enable full fp16 training.")
+        nextdit.to(weight_dtype)
+        if gemma2 is not None:
+            gemma2.to(weight_dtype)
+    elif args.full_bf16:
+        assert (
+            args.mixed_precision == "bf16"
+        ), "full_bf16 requires mixed precision='bf16' / full_bf16を使う場合はmixed_precision='bf16'を指定してください。"
+        accelerator.print("enable full bf16 training.")
+        nextdit.to(weight_dtype)
+        if gemma2 is not None:
+            gemma2.to(weight_dtype)
+
+    # if we don't cache text encoder outputs, move them to device
+    if not args.cache_text_encoder_outputs:
+        gemma2.to(accelerator.device)
+
+    clean_memory_on_device(accelerator.device)
+
+    if args.deepspeed:
+        ds_model = deepspeed_utils.prepare_deepspeed_model(args, nextdit=nextdit)
+        # most ZeRO stages use optimizer partitioning, so we have to prepare optimizer and ds_model at the same time. # pull/1139#issuecomment-1986790007
+        ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+            ds_model, optimizer, train_dataloader, lr_scheduler
+        )
+        training_models = [ds_model]
+
+    else:
+        # accelerator does some magic
+        # if we don't swap blocks, we can move the model to the device
+        nextdit = accelerator.prepare(
+            nextdit, device_placement=[not is_swapping_blocks]
+        )
+        if is_swapping_blocks:
+            accelerator.unwrap_model(nextdit).move_to_device_except_swap_blocks(
+                accelerator.device
+            )  # reduce peak memory usage
+        optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+            optimizer, train_dataloader, lr_scheduler
+        )
+
+    # experimental feature: train in fp16 including gradients, patching PyTorch to enable grad scaling in fp16
+    if args.full_fp16:
+        # During DeepSpeed training, accelerate does not handle fp16/bf16 mixed precision directly via the scaler; the DeepSpeed engine does.
+        # -> But we think it's ok to patch the accelerator even if DeepSpeed is enabled.
+        train_util.patch_accelerator_for_fp16_training(accelerator)
+
+    # resume from a saved state if specified
+    train_util.resume_from_local_or_hf_if_specified(accelerator, args)
+
+    if args.fused_backward_pass:
+        # use fused optimizer for backward pass: other optimizers will be supported in the future
+        import library.adafactor_fused
+
+        library.adafactor_fused.patch_adafactor_fused(optimizer)
+
+        for param_group, param_name_group in zip(optimizer.param_groups, param_names):
+            for parameter, param_name in zip(param_group["params"], param_name_group):
+                if parameter.requires_grad:
+
+                    def create_grad_hook(p_name, p_group):
+                        def grad_hook(tensor: torch.Tensor):
+                            if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                                accelerator.clip_grad_norm_(tensor, args.max_grad_norm)
+                            optimizer.step_param(tensor, p_group)
+                            tensor.grad = None
+
+                        return grad_hook
+
+                    parameter.register_post_accumulate_grad_hook(
+                        create_grad_hook(param_name, param_group)
+                    )
+
+    elif args.blockwise_fused_optimizers:
+        # prepare for additional optimizers and lr schedulers
+        for i in range(1, len(optimizers)):
+            optimizers[i] = accelerator.prepare(optimizers[i])
+            lr_schedulers[i] = accelerator.prepare(lr_schedulers[i])
+
+        # counters are used to determine when to step the optimizer
+        global optimizer_hooked_count
+        global num_parameters_per_group
+        global parameter_optimizer_map
+
+        optimizer_hooked_count = {}
+        num_parameters_per_group = [0] * len(optimizers)
+        parameter_optimizer_map = {}
+
+        for opt_idx, optimizer in enumerate(optimizers):
+            for param_group in optimizer.param_groups:
+                for parameter in param_group["params"]:
+                    if parameter.requires_grad:
+
+                        def grad_hook(parameter: torch.Tensor):
+                            if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                                accelerator.clip_grad_norm_(
+                                    parameter, args.max_grad_norm
+                                )
+
+                            i = parameter_optimizer_map[parameter]
+                            optimizer_hooked_count[i] += 1
+                            if optimizer_hooked_count[i] == num_parameters_per_group[i]:
+                                optimizers[i].step()
+                                optimizers[i].zero_grad(set_to_none=True)
+
+                        parameter.register_post_accumulate_grad_hook(grad_hook)
+                        parameter_optimizer_map[parameter] = opt_idx
+                        num_parameters_per_group[opt_idx] += 1
+
+    # calculate the number of epochs
+    num_update_steps_per_epoch = math.ceil(
+        len(train_dataloader) / args.gradient_accumulation_steps
+    )
+    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
+    if (args.save_n_epoch_ratio is not None) and (args.save_n_epoch_ratio > 0):
+        args.save_every_n_epochs = (
+            math.floor(num_train_epochs / args.save_n_epoch_ratio) or 1
+        )
+
+    # run training
+    # total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    accelerator.print("running training / 学習開始")
+    accelerator.print(
+        f"  num examples / サンプル数: {train_dataset_group.num_train_images}"
+    )
+    accelerator.print(
+        f"  num batches per epoch / 1epochのバッチ数: {len(train_dataloader)}"
+    )
+    accelerator.print(f"  num epochs / epoch数: {num_train_epochs}")
+    accelerator.print(
+        f"  batch size per device / バッチサイズ: {', '.join([str(d.batch_size) for d in train_dataset_group.datasets])}"
+    )
+    # accelerator.print(
+    #     f"  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: {total_batch_size}"
+    # )
+    accelerator.print(
+        f"  gradient accumulation steps / 勾配を合計するステップ数 = {args.gradient_accumulation_steps}"
+    )
+    accelerator.print(
+        f"  total optimization steps / 学習ステップ数: {args.max_train_steps}"
+    )
+
+    progress_bar = tqdm(
+        range(args.max_train_steps),
+        smoothing=0,
+        disable=not accelerator.is_local_main_process,
+        desc="steps",
+    )
+    global_step = 0
+
+    noise_scheduler = FlowMatchEulerDiscreteScheduler(
+        num_train_timesteps=1000, shift=args.discrete_flow_shift
+    )
+    noise_scheduler_copy = copy.deepcopy(noise_scheduler)
+
+    if accelerator.is_main_process:
+        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs["wandb"] = {"name": args.wandb_run_name}
+        if args.log_tracker_config is not None:
+            init_kwargs = toml.load(args.log_tracker_config)
+        accelerator.init_trackers(
+            "finetuning" if args.log_tracker_name is None else args.log_tracker_name,
+            config=train_util.get_sanitized_config_or_none(args),
+            init_kwargs=init_kwargs,
+        )
+
+    if is_swapping_blocks:
+        accelerator.unwrap_model(nextdit).prepare_block_swap_before_forward()
+
+    # For --sample_at_first
+    optimizer_eval_fn()
+    lumina_train_util.sample_images(
+        accelerator,
+        args,
+        0,
+        global_step,
+        nextdit,
+        ae,
+        gemma2,
+        sample_prompts_te_outputs,
+    )
+    optimizer_train_fn()
+    if len(accelerator.trackers) > 0:
+        # log empty object to commit the sample images to wandb
+        accelerator.log({}, step=0)
+
+    loss_recorder = train_util.LossRecorder()
+    epoch = 0  # avoid error when max_train_steps is 0
+    for epoch in range(num_train_epochs):
+        accelerator.print(f"\nepoch {epoch+1}/{num_train_epochs}")
+        current_epoch.value = epoch + 1
+
+        for m in training_models:
+            m.train()
+
+        for step, batch in enumerate(train_dataloader):
+            current_step.value = global_step
+
+            if args.blockwise_fused_optimizers:
+                optimizer_hooked_count = {
+                    i: 0 for i in range(len(optimizers))
+                }  # reset counter for each step
+
+            with accelerator.accumulate(*training_models):
+                if "latents" in batch and batch["latents"] is not None:
+                    latents = batch["latents"].to(
+                        accelerator.device, dtype=weight_dtype
+                    )
+                else:
+                    with torch.no_grad():
+                        # encode images to latents. images are [-1, 1]
+                        latents = ae.encode(batch["images"].to(ae.dtype)).to(
+                            accelerator.device, dtype=weight_dtype
+                        )
+
+                    # if the latents contain NaN, print a warning and replace them with zeros
+                    if torch.any(torch.isnan(latents)):
+                        accelerator.print("NaN found in latents, replacing with zeros")
+                        latents = torch.nan_to_num(latents, 0, out=latents)
+
+                text_encoder_outputs_list = batch.get("text_encoder_outputs_list", None)
+                if text_encoder_outputs_list is not None:
+                    text_encoder_conds = text_encoder_outputs_list
+                else:
+                    # not cached or training, so get from text encoders
+                    tokens_and_masks = batch["input_ids_list"]
+                    with torch.no_grad():
+                        input_ids = [
+                            ids.to(accelerator.device)
+                            for ids in batch["input_ids_list"]
+                        ]
+                        text_encoder_conds = text_encoding_strategy.encode_tokens(
+                            lumina_tokenize_strategy,
+                            [gemma2],
+                            input_ids,
+                        )
+                        if args.full_fp16:
+                            text_encoder_conds = [
+                                c.to(weight_dtype) for c in text_encoder_conds
+                            ]
+
+                # TODO support some features for noise implemented in get_noise_noisy_latents_and_timesteps
+
+                # Sample noise that we'll add to the latents
+                noise = torch.randn_like(latents)
+
+                # get noisy model input and timesteps
+                noisy_model_input, timesteps, sigmas = (
+                    lumina_train_util.get_noisy_model_input_and_timesteps(
+                        args,
+                        noise_scheduler_copy,
+                        latents,
+                        noise,
+                        accelerator.device,
+                        weight_dtype,
+                    )
+                )
+                # call model
+                gemma2_hidden_states, input_ids, gemma2_attn_mask = text_encoder_conds
+
+                with accelerator.autocast():
+                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
+                    model_pred = nextdit(
+                        x=noisy_model_input,  # noisy image latents (B, C, H, W)
+                        t=timesteps / 1000,  # divide timesteps by 1000 to match what the model expects
+                        cap_feats=gemma2_hidden_states,  # Gemma2 hidden states as caption features
+                        cap_mask=gemma2_attn_mask.to(
+                            dtype=torch.int32
+                        ),  # Gemma2 attention mask
+                    )
+                # apply model prediction type
+                model_pred, weighting = lumina_train_util.apply_model_prediction_type(
+                    args, model_pred, noisy_model_input, sigmas
+                )
+
+                # flow matching loss: this is different from SD3
+                target = latents - noise
+
+                # calculate loss
+                huber_c = train_util.get_huber_threshold_if_needed(
+                    args, timesteps, noise_scheduler
+                )
+                loss = train_util.conditional_loss(
+                    model_pred.float(), target.float(), args.loss_type, "none", huber_c
+                )
+                if weighting is not None:
+                    loss = loss * weighting
+                if args.masked_loss or (
+                    "alpha_masks" in batch and batch["alpha_masks"] is not None
+                ):
+                    loss = apply_masked_loss(loss, batch)
+                loss = loss.mean([1, 2, 3])
+
+                loss_weights = batch["loss_weights"]  # per-sample weights
+                loss = loss * loss_weights
+                loss = loss.mean()
+
+                # backward
+                accelerator.backward(loss)
+
+                if not (args.fused_backward_pass or args.blockwise_fused_optimizers):
+                    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
+                        params_to_clip = []
+                        for m in training_models:
+                            params_to_clip.extend(m.parameters())
+                        accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
+
+                    optimizer.step()
+                    lr_scheduler.step()
+                    optimizer.zero_grad(set_to_none=True)
+                else:
+                    # optimizer.step() and optimizer.zero_grad() are called in the optimizer hook
+                    lr_scheduler.step()
+                    if args.blockwise_fused_optimizers:
+                        for i in range(1, len(optimizers)):
+                            lr_schedulers[i].step()
+
+            # Checks if the accelerator has performed an optimization step behind the scenes
+            if accelerator.sync_gradients:
+                progress_bar.update(1)
+                global_step += 1
+
+                optimizer_eval_fn()
+                lumina_train_util.sample_images(
+                    accelerator,
+                    args,
+                    None,
+                    global_step,
+                    nextdit,
+                    ae,
+                    gemma2,
+                    sample_prompts_te_outputs,
+                )
+
+                # save the model every N steps
+                if (
+                    args.save_every_n_steps is not None
+                    and global_step % args.save_every_n_steps == 0
+                ):
+                    accelerator.wait_for_everyone()
+                    if accelerator.is_main_process:
+                        lumina_train_util.save_lumina_model_on_epoch_end_or_stepwise(
+                            args,
+                            False,
+                            accelerator,
+                            save_dtype,
+                            epoch,
+                            num_train_epochs,
+                            global_step,
+                            accelerator.unwrap_model(nextdit),
+                        )
+                optimizer_train_fn()
+
+            current_loss = loss.detach().item()  # this is a mean, so batch size should not matter
+            if len(accelerator.trackers) > 0:
+                logs = {"loss": current_loss}
+                train_util.append_lr_to_logs(
+                    logs, lr_scheduler, args.optimizer_type, including_unet=True
+                )
+
+                accelerator.log(logs, step=global_step)
+
+            loss_recorder.add(epoch=epoch, step=step, loss=current_loss)
+            avr_loss: float = loss_recorder.moving_average
+            logs = {"avr_loss": avr_loss}  # , "lr": lr_scheduler.get_last_lr()[0]}
+            progress_bar.set_postfix(**logs)
+
+            if global_step >= args.max_train_steps:
+                break
+
+        if len(accelerator.trackers) > 0:
+            logs = {"loss/epoch": loss_recorder.moving_average}
+            accelerator.log(logs, step=epoch + 1)
+
+        accelerator.wait_for_everyone()
+
+        optimizer_eval_fn()
+        if args.save_every_n_epochs is not None:
+            if accelerator.is_main_process:
+                lumina_train_util.save_lumina_model_on_epoch_end_or_stepwise(
+                    args,
+                    True,
+                    accelerator,
+                    save_dtype,
+                    epoch,
+                    num_train_epochs,
+                    global_step,
+                    accelerator.unwrap_model(nextdit),
+                )
+
+        lumina_train_util.sample_images(
+            accelerator,
+            args,
+            epoch + 1,
+            global_step,
+            nextdit,
+            ae,
+            gemma2,
+            sample_prompts_te_outputs,
+        )
+        optimizer_train_fn()
+
+    is_main_process = accelerator.is_main_process
+    # if is_main_process:
+    nextdit = accelerator.unwrap_model(nextdit)
+
+    accelerator.end_training()
+    optimizer_eval_fn()
+
+    if args.save_state or args.save_state_on_train_end:
+        train_util.save_state_on_train_end(args, accelerator)
+
+    del accelerator  # free it here because memory is needed after this point
+
+    if is_main_process:
+        lumina_train_util.save_lumina_model_on_train_end(
+            args, save_dtype, epoch, global_step, nextdit
+        )
+        logger.info("model saved.")
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+
+    add_logging_arguments(parser)
+    train_util.add_sd_models_arguments(parser)  # TODO split this
+    train_util.add_dataset_arguments(parser, True, True, True)
+    train_util.add_training_arguments(parser, False)
+    train_util.add_masked_loss_arguments(parser)
+    deepspeed_utils.add_deepspeed_arguments(parser)
+    train_util.add_sd_saving_arguments(parser)
+    train_util.add_optimizer_arguments(parser)
+    config_util.add_config_arguments(parser)
+    add_custom_train_arguments(parser)  # TODO remove this from here
+    train_util.add_dit_training_arguments(parser)
+    lumina_train_util.add_lumina_train_arguments(parser)
+
+    parser.add_argument(
+        "--mem_eff_save",
+        action="store_true",
+        help="[EXPERIMENTAL] use memory efficient custom model saving method / メモリ効率の良い独自のモデル保存方法を使う",
+    )
+
+    parser.add_argument(
+        "--fused_optimizer_groups",
+        type=int,
+        default=None,
+        help="**this option is not working** will be removed in the future / このオプションは動作しません。将来削除されます",
+    )
+    parser.add_argument(
+        "--blockwise_fused_optimizers",
+        action="store_true",
+        help="enable blockwise optimizers for fused backward pass and optimizer step / fused backward passとoptimizer step のためブロック単位のoptimizerを有効にする",
+    )
+    parser.add_argument(
+        "--skip_latents_validity_check",
+        action="store_true",
+        help="[Deprecated] use 'skip_cache_check' instead / 代わりに 'skip_cache_check' を使用してください",
+    )
+    parser.add_argument(
+        "--cpu_offload_checkpointing",
+        action="store_true",
+        help="[EXPERIMENTAL] enable offloading of tensors to CPU during checkpointing / チェックポイント時にテンソルをCPUにオフロードする",
+    )
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    train_util.verify_command_line_training_args(args)
+    args = train_util.read_config_from_file(args, parser)
+
+    train(args)
--- a/lumina_train_network.py
+++ b/lumina_train_network.py
@@ -0,0 +1,383 @@
+import argparse
+import copy
+from typing import Any, Tuple
+
+import torch
+
+from library.device_utils import clean_memory_on_device, init_ipex
+
+init_ipex()
+
+from torch import Tensor
+from accelerate import Accelerator
+
+
+import train_network
+from library import (
+    lumina_models,
+    lumina_util,
+    lumina_train_util,
+    sd3_train_utils,
+    strategy_base,
+    strategy_lumina,
+    train_util,
+)
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+class LuminaNetworkTrainer(train_network.NetworkTrainer):
+    def __init__(self):
+        super().__init__()
+        self.sample_prompts_te_outputs = None
+        self.is_swapping_blocks: bool = False
+
+    def assert_extra_args(self, args, train_dataset_group, val_dataset_group):
+        super().assert_extra_args(args, train_dataset_group, val_dataset_group)
+
+        if args.cache_text_encoder_outputs_to_disk and not args.cache_text_encoder_outputs:
+            logger.warning("Enabling cache_text_encoder_outputs due to disk caching")
+            args.cache_text_encoder_outputs = True
+
+        train_dataset_group.verify_bucket_reso_steps(32)
+        if val_dataset_group is not None:
+            val_dataset_group.verify_bucket_reso_steps(32)
+
+        self.train_gemma2 = not args.network_train_unet_only
+
+    def load_target_model(self, args, weight_dtype, accelerator):
+        loading_dtype = None if args.fp8_base else weight_dtype
+
+        model = lumina_util.load_lumina_model(
+            args.pretrained_model_name_or_path,
+            loading_dtype,
+            torch.device("cpu"),
+            disable_mmap=args.disable_mmap_load_safetensors,
+            use_flash_attn=args.use_flash_attn,
+            use_sage_attn=args.use_sage_attn,
+        )
+
+        if args.fp8_base:
+            # check dtype of model
+            if model.dtype == torch.float8_e4m3fnuz or model.dtype == torch.float8_e5m2 or model.dtype == torch.float8_e5m2fnuz:
+                raise ValueError(f"Unsupported fp8 model dtype: {model.dtype}")
+            elif model.dtype == torch.float8_e4m3fn:
+                logger.info("Loaded fp8 Lumina 2 model")
+            else:
+                logger.info(
+                    "Cast Lumina 2 model to fp8. This may take a while. You can reduce the time by using fp8 checkpoint."
+                    " / Lumina 2モデルをfp8に変換しています。これには時間がかかる場合があります。fp8チェックポイントを使用することで時間を短縮できます。"
+                )
+                model.to(torch.float8_e4m3fn)
+
+        if args.blocks_to_swap:
+            logger.info(f"Lumina 2: Enabling block swap: {args.blocks_to_swap}")
+            model.enable_block_swap(args.blocks_to_swap, accelerator.device)
+            self.is_swapping_blocks = True
+
+        gemma2 = lumina_util.load_gemma2(args.gemma2, weight_dtype, "cpu")
+        gemma2.eval()
+        ae = lumina_util.load_ae(args.ae, weight_dtype, "cpu")
+
+        return lumina_util.MODEL_VERSION_LUMINA_V2, [gemma2], ae, model
+
+    def get_tokenize_strategy(self, args):
+        return strategy_lumina.LuminaTokenizeStrategy(args.system_prompt, args.gemma2_max_token_length, args.tokenizer_cache_dir)
+
+    def get_tokenizers(self, tokenize_strategy: strategy_lumina.LuminaTokenizeStrategy):
+        return [tokenize_strategy.tokenizer]
+
+    def get_latents_caching_strategy(self, args):
+        return strategy_lumina.LuminaLatentsCachingStrategy(args.cache_latents_to_disk, args.vae_batch_size, False)
+
+    def get_text_encoding_strategy(self, args):
+        return strategy_lumina.LuminaTextEncodingStrategy()
+
+    def get_text_encoders_train_flags(self, args, text_encoders):
+        return [self.train_gemma2]
+
+    def get_text_encoder_outputs_caching_strategy(self, args):
+        if args.cache_text_encoder_outputs:
+            # if the text encoder is trained, we still need tokenization, so is_partial is True
+            return strategy_lumina.LuminaTextEncoderOutputsCachingStrategy(
+                args.cache_text_encoder_outputs_to_disk,
+                args.text_encoder_batch_size,
+                args.skip_cache_check,
+                is_partial=self.train_gemma2,
+            )
+        else:
+            return None
+
+    def cache_text_encoder_outputs_if_needed(
+        self,
+        args,
+        accelerator: Accelerator,
+        unet,
+        vae,
+        text_encoders,
+        dataset,
+        weight_dtype,
+    ):
+        if args.cache_text_encoder_outputs:
+            if not args.lowram:
+                # reduce memory consumption
+                logger.info("move vae and unet to cpu to save memory")
+                org_vae_device = vae.device
+                org_unet_device = unet.device
+                vae.to("cpu")
+                unet.to("cpu")
+                clean_memory_on_device(accelerator.device)
+
+            # When the TE is not trained, it is not prepared by accelerate, so we need explicit autocast
+            logger.info("move text encoders to gpu")
+            text_encoders[0].to(accelerator.device, dtype=weight_dtype)  # always not fp8
+
+            if text_encoders[0].dtype == torch.float8_e4m3fn:
+                # if we load fp8 weights, the model is already fp8, so we use it as is
+                self.prepare_text_encoder_fp8(0, text_encoders[0], text_encoders[0].dtype, weight_dtype)
+            else:
+                # otherwise, we need to convert it to target dtype
+                text_encoders[0].to(weight_dtype)
+
+            with accelerator.autocast():
+                dataset.new_cache_text_encoder_outputs(text_encoders, accelerator)
+
+            # cache sample prompts
+            if args.sample_prompts is not None:
+                logger.info(f"cache Text Encoder outputs for sample prompts: {args.sample_prompts}")
+
+                tokenize_strategy = strategy_base.TokenizeStrategy.get_strategy()
+                text_encoding_strategy = strategy_base.TextEncodingStrategy.get_strategy()
+
+                assert isinstance(tokenize_strategy, strategy_lumina.LuminaTokenizeStrategy)
+                assert isinstance(text_encoding_strategy, strategy_lumina.LuminaTextEncodingStrategy)
+
+                sample_prompts = train_util.load_prompts(args.sample_prompts)
+                sample_prompts_te_outputs = {}  # key: prompt, value: text encoder outputs
+                with accelerator.autocast(), torch.no_grad():
+                    for prompt_dict in sample_prompts:
+                        prompts = [
+                            prompt_dict.get("prompt", ""),
+                            prompt_dict.get("negative_prompt", ""),
+                        ]
+                        for i, prompt in enumerate(prompts):
+                            if prompt in sample_prompts_te_outputs:
+                                continue
+
+                            logger.info(f"cache Text Encoder outputs for prompt: {prompt}")
+                            tokens_and_masks = tokenize_strategy.tokenize(prompt, i == 1)  # i == 1 means negative prompt
+                            sample_prompts_te_outputs[prompt] = text_encoding_strategy.encode_tokens(
+                                tokenize_strategy,
+                                text_encoders,
+                                tokens_and_masks,
+                            )
+
+                self.sample_prompts_te_outputs = sample_prompts_te_outputs
+
+            accelerator.wait_for_everyone()
+
+            # move back to cpu
+            if not self.is_train_text_encoder(args):
+                logger.info("move Gemma 2 back to cpu")
+                text_encoders[0].to("cpu")
+            clean_memory_on_device(accelerator.device)
+
+            if not args.lowram:
+                logger.info("move vae and unet back to original device")
+                vae.to(org_vae_device)
+                unet.to(org_unet_device)
+        else:
+            # outputs are fetched from the Text Encoder on every step, so keep it on the GPU
+            text_encoders[0].to(accelerator.device, dtype=weight_dtype)
+
+    def sample_images(
+        self,
+        accelerator,
+        args,
+        epoch,
+        global_step,
+        device,
+        vae,
+        tokenizer,
+        text_encoder,
+        lumina,
+    ):
+        lumina_train_util.sample_images(
+            accelerator,
+            args,
+            epoch,
+            global_step,
+            lumina,
+            vae,
+            self.get_models_for_text_encoding(args, accelerator, text_encoder),
+            self.sample_prompts_te_outputs,
+        )
+
+    # Remaining methods maintain similar structure to flux implementation
+    # with Lumina-specific model calls and strategies
+
+    def get_noise_scheduler(self, args: argparse.Namespace, device: torch.device) -> Any:
+        noise_scheduler = sd3_train_utils.FlowMatchEulerDiscreteScheduler(num_train_timesteps=1000, shift=args.discrete_flow_shift)
+        self.noise_scheduler_copy = copy.deepcopy(noise_scheduler)
+        return noise_scheduler
+
+    def encode_images_to_latents(self, args, vae, images):
+        return vae.encode(images)
+
+    # not sure, they use same flux vae
+    def shift_scale_latents(self, args, latents):
+        return latents
+
+    def get_noise_pred_and_target(
+        self,
+        args,
+        accelerator: Accelerator,
+        noise_scheduler,
+        latents,
+        batch,
+        text_encoder_conds: Tuple[Tensor, Tensor, Tensor],  # (hidden_states, input_ids, attention_masks)
+        dit: lumina_models.NextDiT,
+        network,
+        weight_dtype,
+        train_unet,
+        is_train=True,
+    ):
+        assert isinstance(noise_scheduler, sd3_train_utils.FlowMatchEulerDiscreteScheduler)
+        noise = torch.randn_like(latents)
+        # get noisy model input and timesteps
+        noisy_model_input, timesteps, sigmas = lumina_train_util.get_noisy_model_input_and_timesteps(
+            args, noise_scheduler, latents, noise, accelerator.device, weight_dtype
+        )
+
+        # ensure the hidden state will require grad
+        if args.gradient_checkpointing:
+            noisy_model_input.requires_grad_(True)
+            for t in text_encoder_conds:
+                if t is not None and t.dtype.is_floating_point:
+                    t.requires_grad_(True)
+
+        # Unpack Gemma2 outputs
+        gemma2_hidden_states, input_ids, gemma2_attn_mask = text_encoder_conds
+
+        def call_dit(img, gemma2_hidden_states, gemma2_attn_mask, timesteps):
+            with torch.set_grad_enabled(is_train), accelerator.autocast():
+                # NextDiT forward expects (x, t, cap_feats, cap_mask)
+                model_pred = dit(
+                    x=img,  # image latents (B, C, H, W)
+                    t=timesteps / 1000,  # divide timesteps by 1000 to match what the model expects
+                    cap_feats=gemma2_hidden_states,  # Gemma2 hidden states as caption features
+                    cap_mask=gemma2_attn_mask.to(dtype=torch.int32),  # Gemma2 attention mask
+                )
+            return model_pred
+
+        model_pred = call_dit(
+            img=noisy_model_input,
+            gemma2_hidden_states=gemma2_hidden_states,
+            gemma2_attn_mask=gemma2_attn_mask,
+            timesteps=timesteps,
+        )
+
+        # apply model prediction type
+        model_pred, weighting = lumina_train_util.apply_model_prediction_type(args, model_pred, noisy_model_input, sigmas)
+
+        # flow matching loss
+        target = latents - noise
+
+        # differential output preservation
+        if "custom_attributes" in batch:
+            diff_output_pr_indices = []
+            for i, custom_attributes in enumerate(batch["custom_attributes"]):
+                if "diff_output_preservation" in custom_attributes and custom_attributes["diff_output_preservation"]:
+                    diff_output_pr_indices.append(i)
+
+            if len(diff_output_pr_indices) > 0:
+                network.set_multiplier(0.0)
+                with torch.no_grad():
+                    model_pred_prior = call_dit(
+                        img=noisy_model_input[diff_output_pr_indices],
+                        gemma2_hidden_states=gemma2_hidden_states[diff_output_pr_indices],
+                        timesteps=timesteps[diff_output_pr_indices],
+                        gemma2_attn_mask=(gemma2_attn_mask[diff_output_pr_indices]),
+                    )
+                network.set_multiplier(1.0)
+
+                # model_pred_prior = lumina_util.unpack_latents(
+                #     model_pred_prior, packed_latent_height, packed_latent_width
+                # )
+                model_pred_prior, _ = lumina_train_util.apply_model_prediction_type(
+                    args,
+                    model_pred_prior,
+                    noisy_model_input[diff_output_pr_indices],
+                    sigmas[diff_output_pr_indices] if sigmas is not None else None,
+                )
+                target[diff_output_pr_indices] = model_pred_prior.to(target.dtype)
+
+        return model_pred, target, timesteps, weighting
+
+    def post_process_loss(self, loss, args, timesteps, noise_scheduler):
+        return loss
+
+    def get_sai_model_spec(self, args):
+        return train_util.get_sai_model_spec(None, args, False, True, False, lumina="lumina2")
+
+    def update_metadata(self, metadata, args):
+        metadata["ss_weighting_scheme"] = args.weighting_scheme
+        metadata["ss_logit_mean"] = args.logit_mean
+        metadata["ss_logit_std"] = args.logit_std
+        metadata["ss_mode_scale"] = args.mode_scale
+        metadata["ss_timestep_sampling"] = args.timestep_sampling
+        metadata["ss_sigmoid_scale"] = args.sigmoid_scale
+        metadata["ss_model_prediction_type"] = args.model_prediction_type
+        metadata["ss_discrete_flow_shift"] = args.discrete_flow_shift
+
+    def is_text_encoder_not_needed_for_training(self, args):
+        return args.cache_text_encoder_outputs and not self.is_train_text_encoder(args)
+
+    def prepare_text_encoder_grad_ckpt_workaround(self, index, text_encoder):
+        text_encoder.embed_tokens.requires_grad_(True)
+
+    def prepare_text_encoder_fp8(self, index, text_encoder, te_weight_dtype, weight_dtype):
+        logger.info(f"prepare Gemma2 for fp8: set to {te_weight_dtype}, set embeddings to {weight_dtype}")
+        text_encoder.to(te_weight_dtype)  # fp8
+        text_encoder.embed_tokens.to(dtype=weight_dtype)
+
+    def prepare_unet_with_accelerator(
+        self, args: argparse.Namespace, accelerator: Accelerator, unet: torch.nn.Module
+    ) -> torch.nn.Module:
+        if not self.is_swapping_blocks:
+            return super().prepare_unet_with_accelerator(args, accelerator, unet)
+
+        # when swapping blocks, skip accelerate's device placement and move only the non-swapped blocks
+        nextdit = unet
+        assert isinstance(nextdit, lumina_models.NextDiT)
+        nextdit = accelerator.prepare(nextdit, device_placement=[not self.is_swapping_blocks])
+        accelerator.unwrap_model(nextdit).move_to_device_except_swap_blocks(accelerator.device)  # reduce peak memory usage
+        accelerator.unwrap_model(nextdit).prepare_block_swap_before_forward()
+
+        return nextdit
+
+    def on_validation_step_end(self, args, accelerator, network, text_encoders, unet, batch, weight_dtype):
+        if self.is_swapping_blocks:
+            # prepare for next forward: because backward pass is not called, we need to prepare it here
+            accelerator.unwrap_model(unet).prepare_block_swap_before_forward()
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = train_network.setup_parser()
+    train_util.add_dit_training_arguments(parser)
+    lumina_train_util.add_lumina_train_arguments(parser)
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+    args = parser.parse_args()
+    train_util.verify_command_line_training_args(args)
+    args = train_util.read_config_from_file(args, parser)
+
+    trainer = LuminaNetworkTrainer()
+    trainer.train(args)
--- a/networks/check_lora_weights.py
+++ b/networks/check_lora_weights.py
@@ -2,31 +2,47 @@ import argparse
 import os
 import torch
 from safetensors.torch import load_file
-
+from library.utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)

 def main(file):
-  print(f"loading: {file}")
-  if os.path.splitext(file)[1] == '.safetensors':
-    sd = load_file(file)
-  else:
-    sd = torch.load(file, map_location='cpu')
+    logger.info(f"loading: {file}")
+    if os.path.splitext(file)[1] == ".safetensors":
+        sd = load_file(file)
+    else:
+        sd = torch.load(file, map_location="cpu")

-  values = []
+    values = []

-  keys = list(sd.keys())
-  for key in keys:
-    if 'lora_up' in key or 'lora_down' in key:
-      values.append((key, sd[key]))
-  print(f"number of LoRA modules: {len(values)}")
+    keys = list(sd.keys())
+    for key in keys:
+        if "lora_up" in key or "lora_down" in key or "lora_A" in key or "lora_B" in key or "oft_" in key:
+            values.append((key, sd[key]))
+    print(f"number of LoRA modules: {len(values)}")

-  for key, value in values:
-    value = value.to(torch.float32)
-    print(f"{key},{torch.mean(torch.abs(value))},{torch.min(torch.abs(value))}")
+    if args.show_all_keys:
+        lora_keys = {key for key, _ in values}  # values holds (key, tensor) tuples, so compare on keys
+        values.extend((key, sd[key]) for key in keys if key not in lora_keys)
+        print(f"number of all modules: {len(values)}")
+
+    for key, value in values:
+        value = value.to(torch.float32)
+        print(f"{key},{str(tuple(value.size())).replace(', ', '-')},{torch.mean(torch.abs(value))},{torch.min(torch.abs(value))}")


-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument("file", type=str, help="model file to check / 重みを確認するモデルファイル")
-  args = parser.parse_args()
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("file", type=str, help="model file to check / 重みを確認するモデルファイル")
+    parser.add_argument("-s", "--show_all_keys", action="store_true", help="show all keys / 全てのキーを表示する")

-  main(args.file)
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+
+    main(args.file)
--- a/networks/control_net_lllite.py
+++ b/networks/control_net_lllite.py
@@ -0,0 +1,449 @@
+import os
+from typing import Optional, List, Type
+import torch
+from library import sdxl_original_unet
+from library.utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)
+
+# input_blocksに適用するかどうか / if True, input_blocks are not applied
+SKIP_INPUT_BLOCKS = False
+
+# output_blocksに適用するかどうか / if True, output_blocks are not applied
+SKIP_OUTPUT_BLOCKS = True
+
+# conv2dに適用するかどうか / if True, conv2d are not applied
+SKIP_CONV2D = False
+
+# transformer_blocksのみに適用するかどうか。Trueの場合、ResBlockには適用されない
+# if True, only transformer_blocks are applied, and ResBlocks are not applied
+TRANSFORMER_ONLY = True  # if True, SKIP_CONV2D is ignored because conv2d is not used in transformer_blocks
+
+# Trueならattn1とattn2にのみ適用し、ffなどには適用しない / if True, apply only to attn1 and attn2, not to ff etc.
+ATTN1_2_ONLY = True
+
+# Trueならattn1のQKV、attn2のQにのみ適用する、ATTN1_2_ONLY指定時のみ有効 / if True, apply only to attn1 QKV and attn2 Q, only valid when ATTN1_2_ONLY is specified
+ATTN_QKV_ONLY = True
+
+# Trueならattn1やffなどにのみ適用し、attn2などには適用しない / if True, apply only to attn1 and ff, not to attn2
+# ATTN1_2_ONLYと同時にTrueにできない / cannot be True at the same time as ATTN1_2_ONLY
+ATTN1_ETC_ONLY = False  # True
+
+# transformer_blocksの最大インデックス。Noneなら全てのtransformer_blocksに適用
+# max index of transformer_blocks. if None, apply to all transformer_blocks
+TRANSFORMER_MAX_BLOCK_INDEX = None
+
+
+class LLLiteModule(torch.nn.Module):
+    def __init__(self, depth, cond_emb_dim, name, org_module, mlp_dim, dropout=None, multiplier=1.0):
+        super().__init__()
+
+        self.is_conv2d = org_module.__class__.__name__ == "Conv2d"
+        self.lllite_name = name
+        self.cond_emb_dim = cond_emb_dim
+        self.org_module = [org_module]
+        self.dropout = dropout
+        self.multiplier = multiplier
+
+        if self.is_conv2d:
+            in_dim = org_module.in_channels
+        else:
+            in_dim = org_module.in_features
+
+        # conditioning1はconditioning imageを embedding する。timestepごとに呼ばれない
+        # conditioning1 embeds conditioning image. it is not called for each timestep
+        modules = []
+        modules.append(torch.nn.Conv2d(3, cond_emb_dim // 2, kernel_size=4, stride=4, padding=0))  # to latent (from VAE) size
+        if depth == 1:
+            modules.append(torch.nn.ReLU(inplace=True))
+            modules.append(torch.nn.Conv2d(cond_emb_dim // 2, cond_emb_dim, kernel_size=2, stride=2, padding=0))
+        elif depth == 2:
+            modules.append(torch.nn.ReLU(inplace=True))
+            modules.append(torch.nn.Conv2d(cond_emb_dim // 2, cond_emb_dim, kernel_size=4, stride=4, padding=0))
+        elif depth == 3:
+            # kernel size 8は大きすぎるので、4にする / kernel size 8 is too large, so set it to 4
+            modules.append(torch.nn.ReLU(inplace=True))
+            modules.append(torch.nn.Conv2d(cond_emb_dim // 2, cond_emb_dim // 2, kernel_size=4, stride=4, padding=0))
+            modules.append(torch.nn.ReLU(inplace=True))
+            modules.append(torch.nn.Conv2d(cond_emb_dim // 2, cond_emb_dim, kernel_size=2, stride=2, padding=0))
+
+        self.conditioning1 = torch.nn.Sequential(*modules)
+
+        # downで入力の次元数を削減する。LoRAにヒントを得ていることにする
+        # midでconditioning image embeddingと入力を結合する
+        # upで元の次元数に戻す
+        # これらはtimestepごとに呼ばれる
+        # reduce the number of input dimensions with down. inspired by LoRA
+        # combine conditioning image embedding and input with mid
+        # restore to the original dimension with up
+        # these are called for each timestep
+
+        if self.is_conv2d:
+            self.down = torch.nn.Sequential(
+                torch.nn.Conv2d(in_dim, mlp_dim, kernel_size=1, stride=1, padding=0),
+                torch.nn.ReLU(inplace=True),
+            )
+            self.mid = torch.nn.Sequential(
+                torch.nn.Conv2d(mlp_dim + cond_emb_dim, mlp_dim, kernel_size=1, stride=1, padding=0),
+                torch.nn.ReLU(inplace=True),
+            )
+            self.up = torch.nn.Sequential(
+                torch.nn.Conv2d(mlp_dim, in_dim, kernel_size=1, stride=1, padding=0),
+            )
+        else:
+            # midの前にconditioningをreshapeすること / reshape conditioning before mid
+            self.down = torch.nn.Sequential(
+                torch.nn.Linear(in_dim, mlp_dim),
+                torch.nn.ReLU(inplace=True),
+            )
+            self.mid = torch.nn.Sequential(
+                torch.nn.Linear(mlp_dim + cond_emb_dim, mlp_dim),
+                torch.nn.ReLU(inplace=True),
+            )
+            self.up = torch.nn.Sequential(
+                torch.nn.Linear(mlp_dim, in_dim),
+            )
+
+        # Zero-Convにする / set to Zero-Conv
+        torch.nn.init.zeros_(self.up[0].weight)  # zero conv
+
+        self.depth = depth  # 1~3
+        self.cond_emb = None
+        self.batch_cond_only = False  # Trueなら推論時のcondにのみ適用する / if True, apply only to cond at inference
+        self.use_zeros_for_batch_uncond = False  # Trueならuncondのconditioningを0にする / if True, set uncond conditioning to 0
+
+        # batch_cond_onlyとuse_zeros_for_batch_uncondはどちらも適用すると生成画像の色味がおかしくなるので実際には使えそうにない
+        # Controlの種類によっては使えるかも
+        # enabling either batch_cond_only or use_zeros_for_batch_uncond distorts the colors of the generated image, so neither seems usable in practice
+        # they may still be usable depending on the type of Control
+
+    def set_cond_image(self, cond_image):
+        r"""
+        中でモデルを呼び出すので必要ならwith torch.no_grad()で囲む
+        / this calls the model internally; wrap the call in torch.no_grad() if necessary
+        """
+        if cond_image is None:
+            self.cond_emb = None
+            return
+
+        # timestepごとに呼ばれないので、あらかじめ計算しておく / it is not called for each timestep, so calculate it in advance
+        # logger.info(f"C {self.lllite_name}, cond_image.shape={cond_image.shape}")
+        cx = self.conditioning1(cond_image)
+        if not self.is_conv2d:
+            # reshape / b,c,h,w -> b,h*w,c
+            n, c, h, w = cx.shape
+            cx = cx.view(n, c, h * w).permute(0, 2, 1)
+        self.cond_emb = cx
+
+    def set_batch_cond_only(self, cond_only, zeros):
+        self.batch_cond_only = cond_only
+        self.use_zeros_for_batch_uncond = zeros
+
+    def apply_to(self):
+        self.org_forward = self.org_module[0].forward
+        self.org_module[0].forward = self.forward
+
+    def forward(self, x):
+        r"""
+        学習用の便利forward。元のモジュールのforwardを呼び出す
+        / convenient forward for training. call the forward of the original module
+        """
+        if self.multiplier == 0.0 or self.cond_emb is None:
+            return self.org_forward(x)
+
+        cx = self.cond_emb
+
+        if not self.batch_cond_only and x.shape[0] // 2 == cx.shape[0]:  # inference only
+            cx = cx.repeat(2, 1, 1, 1) if self.is_conv2d else cx.repeat(2, 1, 1)
+            if self.use_zeros_for_batch_uncond:
+                cx[0::2] = 0.0  # uncond is zero
+        # logger.info(f"C {self.lllite_name}, x.shape={x.shape}, cx.shape={cx.shape}")
+
+        # downで入力の次元数を削減し、conditioning image embeddingと結合する
+        # 加算ではなくchannel方向に結合することで、うまいこと混ぜてくれることを期待している
+        # down reduces the number of input dimensions and combines it with conditioning image embedding
+        # we expect that it will mix well by combining in the channel direction instead of adding
+
+        cx = torch.cat([cx, self.down(x if not self.batch_cond_only else x[1::2])], dim=1 if self.is_conv2d else 2)
+        cx = self.mid(cx)
+
+        if self.dropout is not None and self.training:
+            cx = torch.nn.functional.dropout(cx, p=self.dropout)
+
+        cx = self.up(cx) * self.multiplier
+
+        # residual (x) を加算して元のforwardを呼び出す / add residual (x) and call the original forward
+        if self.batch_cond_only:
+            zx = torch.zeros_like(x)
+            zx[1::2] += cx
+            cx = zx
+
+        x = self.org_forward(x + cx)  # ここで元のモジュールを呼び出す / call the original module here
+        return x
+
+
+class ControlNetLLLite(torch.nn.Module):
+    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
+    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
+
+    def __init__(
+        self,
+        unet: sdxl_original_unet.SdxlUNet2DConditionModel,
+        cond_emb_dim: int = 16,
+        mlp_dim: int = 16,
+        dropout: Optional[float] = None,
+        varbose: Optional[bool] = False,
+        multiplier: Optional[float] = 1.0,
+    ) -> None:
+        super().__init__()
+        # self.unets = [unet]
+
+        def create_modules(
+            root_module: torch.nn.Module,
+            target_replace_modules: List[str],  # class names to match
+            module_class: Type[object],
+        ) -> List[torch.nn.Module]:
+            prefix = "lllite_unet"
+
+            modules = []
+            for name, module in root_module.named_modules():
+                if module.__class__.__name__ in target_replace_modules:
+                    for child_name, child_module in module.named_modules():
+                        is_linear = child_module.__class__.__name__ == "Linear"
+                        is_conv2d = child_module.__class__.__name__ == "Conv2d"
+
+                        if is_linear or (is_conv2d and not SKIP_CONV2D):
+                            # block indexからdepthを計算: depthはconditioningのサイズやチャネルを計算するのに使う
+                            # block index to depth: depth is used to calculate conditioning size and channels
+                            block_name, index1, index2 = (name + "." + child_name).split(".")[:3]
+                            index1 = int(index1)
+                            if block_name == "input_blocks":
+                                if SKIP_INPUT_BLOCKS:
+                                    continue
+                                depth = 1 if index1 <= 2 else (2 if index1 <= 5 else 3)
+                            elif block_name == "middle_block":
+                                depth = 3
+                            elif block_name == "output_blocks":
+                                if SKIP_OUTPUT_BLOCKS:
+                                    continue
+                                depth = 3 if index1 <= 2 else (2 if index1 <= 5 else 1)
+                                if int(index2) >= 2:
+                                    depth -= 1
+                            else:
+                                raise NotImplementedError()
+
+                            lllite_name = prefix + "." + name + "." + child_name
+                            lllite_name = lllite_name.replace(".", "_")
+
+                            if TRANSFORMER_MAX_BLOCK_INDEX is not None:
+                                p = lllite_name.find("transformer_blocks")
+                                if p >= 0:
+                                    tf_index = int(lllite_name[p:].split("_")[2])
+                                    if tf_index > TRANSFORMER_MAX_BLOCK_INDEX:
+                                        continue
+
+                            #  time embは適用外とする
+                            # attn2のconditioning (CLIPからの入力) はshapeが違うので適用できない
+                            # time emb is not applied
+                            # attn2 conditioning (input from CLIP) cannot be applied because the shape is different
+                            if "emb_layers" in lllite_name or (
+                                "attn2" in lllite_name and ("to_k" in lllite_name or "to_v" in lllite_name)
+                            ):
+                                continue
+
+                            if ATTN1_2_ONLY:
+                                if not ("attn1" in lllite_name or "attn2" in lllite_name):
+                                    continue
+                                if ATTN_QKV_ONLY:
+                                    if "to_out" in lllite_name:
+                                        continue
+
+                            if ATTN1_ETC_ONLY:
+                                if "proj_out" in lllite_name:
+                                    pass
+                                elif "attn1" in lllite_name and (
+                                    "to_k" in lllite_name or "to_v" in lllite_name or "to_out" in lllite_name
+                                ):
+                                    pass
+                                elif "ff_net_2" in lllite_name:
+                                    pass
+                                else:
+                                    continue
+
+                            module = module_class(
+                                depth,
+                                cond_emb_dim,
+                                lllite_name,
+                                child_module,
+                                mlp_dim,
+                                dropout=dropout,
+                                multiplier=multiplier,
+                            )
+                            modules.append(module)
+            return modules
+
+        target_modules = ControlNetLLLite.UNET_TARGET_REPLACE_MODULE
+        if not TRANSFORMER_ONLY:
+            target_modules = target_modules + ControlNetLLLite.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3
+
+        # create module instances
+        self.unet_modules: List[LLLiteModule] = create_modules(unet, target_modules, LLLiteModule)
+        logger.info(f"create ControlNet LLLite for U-Net: {len(self.unet_modules)} modules.")
+
+    def forward(self, x):
+        return x  # dummy
+
+    def set_cond_image(self, cond_image):
+        r"""
+        中でモデルを呼び出すので必要ならwith torch.no_grad()で囲む
+        / this calls the model internally; wrap the call in torch.no_grad() if necessary
+        """
+        for module in self.unet_modules:
+            module.set_cond_image(cond_image)
+
+    def set_batch_cond_only(self, cond_only, zeros):
+        for module in self.unet_modules:
+            module.set_batch_cond_only(cond_only, zeros)
+
+    def set_multiplier(self, multiplier):
+        for module in self.unet_modules:
+            module.multiplier = multiplier
+
+    def load_weights(self, file):
+        if os.path.splitext(file)[1] == ".safetensors":
+            from safetensors.torch import load_file
+
+            weights_sd = load_file(file)
+        else:
+            weights_sd = torch.load(file, map_location="cpu")
+
+        info = self.load_state_dict(weights_sd, False)
+        return info
+
+    def apply_to(self):
+        logger.info("applying LLLite for U-Net...")
+        for module in self.unet_modules:
+            module.apply_to()
+            self.add_module(module.lllite_name, module)
+
+    # マージできるかどうかを返す / return whether this network can be merged
+    def is_mergeable(self):
+        return False
+
+    def merge_to(self, text_encoder, unet, weights_sd, dtype, device):
+        raise NotImplementedError()
+
+    def enable_gradient_checkpointing(self):
+        # not supported
+        pass
+
+    def prepare_optimizer_params(self):
+        self.requires_grad_(True)
+        return self.parameters()
+
+    def prepare_grad_etc(self):
+        self.requires_grad_(True)
+
+    def on_epoch_start(self):
+        self.train()
+
+    def get_trainable_params(self):
+        return self.parameters()
+
+    def save_weights(self, file, dtype, metadata):
+        if metadata is not None and len(metadata) == 0:
+            metadata = None
+
+        state_dict = self.state_dict()
+
+        if dtype is not None:
+            for key in list(state_dict.keys()):
+                v = state_dict[key]
+                v = v.detach().clone().to("cpu").to(dtype)
+                state_dict[key] = v
+
+        if os.path.splitext(file)[1] == ".safetensors":
+            from safetensors.torch import save_file
+
+            save_file(state_dict, file, metadata)
+        else:
+            torch.save(state_dict, file)
+
+
+if __name__ == "__main__":
+    # デバッグ用 / for debug
+
+    # sdxl_original_unet.USE_REENTRANT = False
+
+    # test shape etc
+    logger.info("create unet")
+    unet = sdxl_original_unet.SdxlUNet2DConditionModel()
+    unet.to("cuda").to(torch.float16)
+
+    logger.info("create ControlNet-LLLite")
+    control_net = ControlNetLLLite(unet, 32, 64)
+    control_net.apply_to()
+    control_net.to("cuda")
+
+    logger.info(control_net)
+
+    # log the number of trainable parameters
+    logger.info(f"number of parameters {sum(p.numel() for p in control_net.parameters() if p.requires_grad)}")
+
+    input()
+
+    unet.set_use_memory_efficient_attention(True, False)
+    unet.set_gradient_checkpointing(True)
+    unet.train()  # for gradient checkpointing
+
+    control_net.train()
+
+    # # visualize
+    # import torchviz
+    # logger.info("run visualize")
+    # controlnet.set_control(conditioning_image)
+    # output = unet(x, t, ctx, y)
+    # logger.info("make_dot")
+    # image = torchviz.make_dot(output, params=dict(controlnet.named_parameters()))
+    # logger.info("render")
+    # image.format = "svg" # "png"
+    # image.render("NeuralNet") # すごく時間がかかるので注意 / be careful because it takes a long time
+    # input()
+
+    import bitsandbytes
+
+    optimizer = bitsandbytes.adam.Adam8bit(control_net.prepare_optimizer_params(), 1e-3)
+
+    scaler = torch.cuda.amp.GradScaler(enabled=True)
+
+    logger.info("start training")
+    steps = 10
+
+    sample_param = [p for p in control_net.named_parameters() if "up" in p[0]][0]
+    for step in range(steps):
+        logger.info(f"step {step}")
+
+        batch_size = 1
+        conditioning_image = torch.rand(batch_size, 3, 1024, 1024).cuda() * 2.0 - 1.0
+        x = torch.randn(batch_size, 4, 128, 128).cuda()
+        t = torch.randint(low=0, high=10, size=(batch_size,)).cuda()
+        ctx = torch.randn(batch_size, 77, 2048).cuda()
+        y = torch.randn(batch_size, sdxl_original_unet.ADM_IN_CHANNELS).cuda()
+
+        with torch.cuda.amp.autocast(enabled=True):
+            control_net.set_cond_image(conditioning_image)
+
+            output = unet(x, t, ctx, y)
+            target = torch.randn_like(output)
+            loss = torch.nn.functional.mse_loss(output, target)
+
+        scaler.scale(loss).backward()
+        scaler.step(optimizer)
+        scaler.update()
+        optimizer.zero_grad(set_to_none=True)
+        logger.info(f"{sample_param}")
+
+    # from safetensors.torch import save_file
+
+    # save_file(control_net.state_dict(), "logs/control_net.safetensors")
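
The block-index-to-depth rule used in `create_modules` above can be exercised on its own. The following is a minimal sketch, not part of the patch: `depth_for` and `COND_DOWNSCALE` are hypothetical names, and the downscale factors assume the conditioning strides shown in `add_lllite_modules` of `networks/control_net_lllite_for_train.py` (4·2, 4·4 and 4·4·2).

```python
# Hypothetical helper mirroring the depth rule in create_modules; not part of the patch.
def depth_for(block_name: str, index1: int, index2: int) -> int:
    """Map a U-Net block position to the LLLite conditioning depth (1 to 3)."""
    if block_name == "input_blocks":
        return 1 if index1 <= 2 else (2 if index1 <= 5 else 3)
    if block_name == "middle_block":
        return 3
    if block_name == "output_blocks":
        depth = 3 if index1 <= 2 else (2 if index1 <= 5 else 1)
        if index2 >= 2:  # same adjustment as in the source
            depth -= 1
        return depth
    raise NotImplementedError(block_name)


# Assumed total stride of the conditioning convolutions per depth:
# depth 1 = 4 * 2, depth 2 = 4 * 4, depth 3 = 4 * 4 * 2.
COND_DOWNSCALE = {1: 8, 2: 16, 3: 32}

if __name__ == "__main__":
    for pos in [("input_blocks", 4, 1), ("middle_block", 1, 0), ("output_blocks", 2, 2)]:
        d = depth_for(*pos)
        print(pos, "depth", d, "cond size for a 1024px image:", 1024 // COND_DOWNSCALE[d])
```

Under these assumptions a 1024×1024 conditioning image is embedded at 128, 64 or 32 pixels per side, which is why the depth has to be known before the conditioning layers are built.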
--- a/networks/control_net_lllite_for_train.py
+++ b/networks/control_net_lllite_for_train.py
@@ -0,0 +1,501 @@
+# cond_imageをU-Netのforwardで渡すバージョンのControlNet-LLLite検証用実装
+# ControlNet-LLLite implementation for verification with cond_image passed in U-Net's forward
+
+import os
+import re
+from typing import Optional, List, Type
+import torch
+from library import sdxl_original_unet
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+# input_blocksに適用するかどうか / if True, LLLite is not applied to input_blocks
+SKIP_INPUT_BLOCKS = False
+
+# output_blocksに適用するかどうか / if True, LLLite is not applied to output_blocks
+SKIP_OUTPUT_BLOCKS = True
+
+# conv2dに適用するかどうか / if True, LLLite is not applied to conv2d layers
+SKIP_CONV2D = False
+
+# transformer_blocksのみに適用するかどうか。Trueの場合、ResBlockには適用されない
+# if True, apply only to transformer_blocks, not to ResBlocks
+TRANSFORMER_ONLY = True  # if True, SKIP_CONV2D is ignored because conv2d is not used in transformer_blocks
+
+# Trueならattn1とattn2にのみ適用し、ffなどには適用しない / if True, apply only to attn1 and attn2, not to ff etc.
+ATTN1_2_ONLY = True
+
+# Trueならattn1のQKV、attn2のQにのみ適用する、ATTN1_2_ONLY指定時のみ有効 / if True, apply only to attn1 QKV and attn2 Q, only valid when ATTN1_2_ONLY is specified
+ATTN_QKV_ONLY = True
+
+# Trueならattn1やffなどにのみ適用し、attn2などには適用しない / if True, apply only to attn1 and ff, not to attn2
+# ATTN1_2_ONLYと同時にTrueにできない / cannot be True at the same time as ATTN1_2_ONLY
+ATTN1_ETC_ONLY = False  # True
+
+# transformer_blocksの最大インデックス。Noneなら全てのtransformer_blocksに適用
+# max index of transformer_blocks. if None, apply to all transformer_blocks
+TRANSFORMER_MAX_BLOCK_INDEX = None
+
+ORIGINAL_LINEAR = torch.nn.Linear
+ORIGINAL_CONV2D = torch.nn.Conv2d
+
+
+def add_lllite_modules(module: torch.nn.Module, in_dim: int, depth, cond_emb_dim, mlp_dim) -> None:
+    # conditioning1はconditioning imageを embedding する。timestepごとに呼ばれない
+    # conditioning1 embeds conditioning image. it is not called for each timestep
+    modules = []
+    modules.append(ORIGINAL_CONV2D(3, cond_emb_dim // 2, kernel_size=4, stride=4, padding=0))  # to latent (from VAE) size
+    if depth == 1:
+        modules.append(torch.nn.ReLU(inplace=True))
+        modules.append(ORIGINAL_CONV2D(cond_emb_dim // 2, cond_emb_dim, kernel_size=2, stride=2, padding=0))
+    elif depth == 2:
+        modules.append(torch.nn.ReLU(inplace=True))
+        modules.append(ORIGINAL_CONV2D(cond_emb_dim // 2, cond_emb_dim, kernel_size=4, stride=4, padding=0))
+    elif depth == 3:
+        # kernel size 8は大きすぎるので、4にする / kernel size 8 is too large, so set it to 4
+        modules.append(torch.nn.ReLU(inplace=True))
+        modules.append(ORIGINAL_CONV2D(cond_emb_dim // 2, cond_emb_dim // 2, kernel_size=4, stride=4, padding=0))
+        modules.append(torch.nn.ReLU(inplace=True))
+        modules.append(ORIGINAL_CONV2D(cond_emb_dim // 2, cond_emb_dim, kernel_size=2, stride=2, padding=0))
+
+    module.lllite_conditioning1 = torch.nn.Sequential(*modules)
+
+    # downで入力の次元数を削減する。LoRAにヒントを得ていることにする
+    # midでconditioning image embeddingと入力を結合する
+    # upで元の次元数に戻す
+    # これらはtimestepごとに呼ばれる
+    # reduce the number of input dimensions with down. inspired by LoRA
+    # combine conditioning image embedding and input with mid
+    # restore to the original dimension with up
+    # these are called for each timestep
+
+    module.lllite_down = torch.nn.Sequential(
+        ORIGINAL_LINEAR(in_dim, mlp_dim),
+        torch.nn.ReLU(inplace=True),
+    )
+    module.lllite_mid = torch.nn.Sequential(
+        ORIGINAL_LINEAR(mlp_dim + cond_emb_dim, mlp_dim),
+        torch.nn.ReLU(inplace=True),
+    )
+    module.lllite_up = torch.nn.Sequential(
+        ORIGINAL_LINEAR(mlp_dim, in_dim),
+    )
+
+    # Zero-Convにする / set to Zero-Conv
+    torch.nn.init.zeros_(module.lllite_up[0].weight)  # zero conv
+
+
+class LLLiteLinear(ORIGINAL_LINEAR):
+    def __init__(self, in_features: int, out_features: int, **kwargs):
+        super().__init__(in_features, out_features, **kwargs)
+        self.enabled = False
+
+    def set_lllite(self, depth, cond_emb_dim, name, mlp_dim, dropout=None, multiplier=1.0):
+        self.enabled = True
+        self.lllite_name = name
+        self.cond_emb_dim = cond_emb_dim
+        self.dropout = dropout
+        self.multiplier = multiplier  # ignored
+
+        in_dim = self.in_features
+        add_lllite_modules(self, in_dim, depth, cond_emb_dim, mlp_dim)
+
+        self.cond_image = None
+
+    def set_cond_image(self, cond_image):
+        self.cond_image = cond_image
+
+    def forward(self, x):
+        if not self.enabled:
+            return super().forward(x)
+
+        cx = self.lllite_conditioning1(self.cond_image)  # make forward and backward compatible
+
+        # reshape / b,c,h,w -> b,h*w,c
+        n, c, h, w = cx.shape
+        cx = cx.view(n, c, h * w).permute(0, 2, 1)
+
+        cx = torch.cat([cx, self.lllite_down(x)], dim=2)
+        cx = self.lllite_mid(cx)
+
+        if self.dropout is not None and self.training:
+            cx = torch.nn.functional.dropout(cx, p=self.dropout)
+
+        cx = self.lllite_up(cx) * self.multiplier
+
+        x = super().forward(x + cx)  # ここで元のモジュールを呼び出す / call the original module here
+        return x
+
+
+class LLLiteConv2d(ORIGINAL_CONV2D):
+    def __init__(self, in_channels: int, out_channels: int, kernel_size, **kwargs):
+        super().__init__(in_channels, out_channels, kernel_size, **kwargs)
+        self.enabled = False
+
+    def set_lllite(self, depth, cond_emb_dim, name, mlp_dim, dropout=None, multiplier=1.0):
+        self.enabled = True
+        self.lllite_name = name
+        self.cond_emb_dim = cond_emb_dim
+        self.dropout = dropout
+        self.multiplier = multiplier  # ignored
+
+        in_dim = self.in_channels
+        add_lllite_modules(self, in_dim, depth, cond_emb_dim, mlp_dim)
+
+        self.cond_image = None
+        self.cond_emb = None
+
+    def set_cond_image(self, cond_image):
+        self.cond_image = cond_image
+        self.cond_emb = None
+
+    def forward(self, x):  # , cond_image=None):
+        if not self.enabled:
+            return super().forward(x)
+
+        cx = self.lllite_conditioning1(self.cond_image)
+
+        cx = torch.cat([cx, self.lllite_down(x)], dim=1)
+        cx = self.lllite_mid(cx)
+
+        if self.dropout is not None and self.training:
+            cx = torch.nn.functional.dropout(cx, p=self.dropout)
+
+        cx = self.lllite_up(cx) * self.multiplier
+
+        x = super().forward(x + cx)  # ここで元のモジュールを呼び出す / call the original module here
+        return x
+
+
+class SdxlUNet2DConditionModelControlNetLLLite(sdxl_original_unet.SdxlUNet2DConditionModel):
+    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
+    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
+    LLLITE_PREFIX = "lllite_unet"
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+    def apply_lllite(
+        self,
+        cond_emb_dim: int = 16,
+        mlp_dim: int = 16,
+        dropout: Optional[float] = None,
+        varbose: Optional[bool] = False,
+        multiplier: Optional[float] = 1.0,
+    ) -> None:
+        def apply_to_modules(
+            root_module: torch.nn.Module,
+            target_replace_modules: List[str],  # class names to match
+        ) -> List[torch.nn.Module]:
+            prefix = "lllite_unet"
+
+            modules = []
+            for name, module in root_module.named_modules():
+                if module.__class__.__name__ in target_replace_modules:
+                    for child_name, child_module in module.named_modules():
+                        is_linear = child_module.__class__.__name__ == "LLLiteLinear"
+                        is_conv2d = child_module.__class__.__name__ == "LLLiteConv2d"
+
+                        if is_linear or (is_conv2d and not SKIP_CONV2D):
+                            # block indexからdepthを計算: depthはconditioningのサイズやチャネルを計算するのに使う
+                            # block index to depth: depth is used to calculate conditioning size and channels
+                            block_name, index1, index2 = (name + "." + child_name).split(".")[:3]
+                            index1 = int(index1)
+                            if block_name == "input_blocks":
+                                if SKIP_INPUT_BLOCKS:
+                                    continue
+                                depth = 1 if index1 <= 2 else (2 if index1 <= 5 else 3)
+                            elif block_name == "middle_block":
+                                depth = 3
+                            elif block_name == "output_blocks":
+                                if SKIP_OUTPUT_BLOCKS:
+                                    continue
+                                depth = 3 if index1 <= 2 else (2 if index1 <= 5 else 1)
+                                if int(index2) >= 2:
+                                    depth -= 1
+                            else:
+                                raise NotImplementedError()
+
+                            lllite_name = prefix + "." + name + "." + child_name
+                            lllite_name = lllite_name.replace(".", "_")
+
+                            if TRANSFORMER_MAX_BLOCK_INDEX is not None:
+                                p = lllite_name.find("transformer_blocks")
+                                if p >= 0:
+                                    tf_index = int(lllite_name[p:].split("_")[2])
+                                    if tf_index > TRANSFORMER_MAX_BLOCK_INDEX:
+                                        continue
+
+                            #  time embは適用外とする
+                            # attn2のconditioning (CLIPからの入力) はshapeが違うので適用できない
+                            # time emb is not applied
+                            # attn2 conditioning (input from CLIP) cannot be applied because the shape is different
+                            if "emb_layers" in lllite_name or (
+                                "attn2" in lllite_name and ("to_k" in lllite_name or "to_v" in lllite_name)
+                            ):
+                                continue
+
+                            if ATTN1_2_ONLY:
+                                if not ("attn1" in lllite_name or "attn2" in lllite_name):
+                                    continue
+                                if ATTN_QKV_ONLY:
+                                    if "to_out" in lllite_name:
+                                        continue
+
+                            if ATTN1_ETC_ONLY:
+                                if "proj_out" in lllite_name:
+                                    pass
+                                elif "attn1" in lllite_name and (
+                                    "to_k" in lllite_name or "to_v" in lllite_name or "to_out" in lllite_name
+                                ):
+                                    pass
+                                elif "ff_net_2" in lllite_name:
+                                    pass
+                                else:
+                                    continue
+
+                            child_module.set_lllite(depth, cond_emb_dim, lllite_name, mlp_dim, dropout, multiplier)
+                            modules.append(child_module)
+
+            return modules
+
+        target_modules = SdxlUNet2DConditionModelControlNetLLLite.UNET_TARGET_REPLACE_MODULE
+        if not TRANSFORMER_ONLY:
+            target_modules = target_modules + SdxlUNet2DConditionModelControlNetLLLite.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3
+
+        # create module instances
+        self.lllite_modules = apply_to_modules(self, target_modules)
+        logger.info(f"enable ControlNet LLLite for U-Net: {len(self.lllite_modules)} modules.")
+
+    # def prepare_optimizer_params(self):
+    def prepare_params(self):
+        train_params = []
+        non_train_params = []
+        for name, p in self.named_parameters():
+            if "lllite" in name:
+                train_params.append(p)
+            else:
+                non_train_params.append(p)
+        logger.info(f"count of trainable parameters: {len(train_params)}")
+        logger.info(f"count of non-trainable parameters: {len(non_train_params)}")
+
+        for p in non_train_params:
+            p.requires_grad_(False)
+
+        # without this, an error occurs in the optimizer
+        #       RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
+        non_train_params[0].requires_grad_(True)
+
+        for p in train_params:
+            p.requires_grad_(True)
+
+        return train_params
+
+    # def prepare_grad_etc(self):
+    #     self.requires_grad_(True)
+
+    # def on_epoch_start(self):
+    #     self.train()
+
+    def get_trainable_params(self):
+        return [p[1] for p in self.named_parameters() if "lllite" in p[0]]
+
+    def save_lllite_weights(self, file, dtype, metadata):
+        if metadata is not None and len(metadata) == 0:
+            metadata = None
+
+        org_state_dict = self.state_dict()
+
+        # copy LLLite keys from org_state_dict to state_dict with key conversion
+        state_dict = {}
+        for key in org_state_dict.keys():
+            # split with ".lllite"
+            pos = key.find(".lllite")
+            if pos < 0:
+                continue
+            lllite_key = SdxlUNet2DConditionModelControlNetLLLite.LLLITE_PREFIX + "." + key[:pos]
+            lllite_key = lllite_key.replace(".", "_") + key[pos:]
+            lllite_key = lllite_key.replace(".lllite_", ".")
+            state_dict[lllite_key] = org_state_dict[key]
+
+        if dtype is not None:
+            for key in list(state_dict.keys()):
+                v = state_dict[key]
+                v = v.detach().clone().to("cpu").to(dtype)
+                state_dict[key] = v
+
+        if os.path.splitext(file)[1] == ".safetensors":
+            from safetensors.torch import save_file
+
+            save_file(state_dict, file, metadata)
+        else:
+            torch.save(state_dict, file)
+
+    def load_lllite_weights(self, file, non_lllite_unet_sd=None):
+        r"""
+        LLLiteの重みを読み込まない（initされた値を使う）場合はfileにNoneを指定する。
+        この場合、non_lllite_unet_sdにはU-Netのstate_dictを指定する。
+
+        If you do not want to load LLLite weights (use initialized values), specify None for file.
+        In this case, specify the state_dict of U-Net for non_lllite_unet_sd.
+        """
+        if not file:
+            state_dict = self.state_dict()
+            for key in non_lllite_unet_sd:
+                if key in state_dict:
+                    state_dict[key] = non_lllite_unet_sd[key]
+            info = self.load_state_dict(state_dict, False)
+            return info
+
+        if os.path.splitext(file)[1] == ".safetensors":
+            from safetensors.torch import load_file
+
+            weights_sd = load_file(file)
+        else:
+            weights_sd = torch.load(file, map_location="cpu")
+
+        # module_name = module_name.replace("_block", "@blocks")
+        # module_name = module_name.replace("_layer", "@layer")
+        # module_name = module_name.replace("to_", "to@")
+        # module_name = module_name.replace("time_embed", "time@embed")
+        # module_name = module_name.replace("label_emb", "label@emb")
+        # module_name = module_name.replace("skip_connection", "skip@connection")
+        # module_name = module_name.replace("proj_in", "proj@in")
+        # module_name = module_name.replace("proj_out", "proj@out")
+        pattern = re.compile(r"(_block|_layer|to_|time_embed|label_emb|skip_connection|proj_in|proj_out)")
+
+        # convert to lllite with U-Net state dict
+        state_dict = non_lllite_unet_sd.copy() if non_lllite_unet_sd is not None else {}
+        for key in weights_sd.keys():
+            # split with "."
+            pos = key.find(".")
+            if pos < 0:
+                continue
+
+            module_name = key[:pos]
+            weight_name = key[pos + 1 :]  # exclude "."
+            module_name = module_name.replace(SdxlUNet2DConditionModelControlNetLLLite.LLLITE_PREFIX + "_", "")
+
+            # これはうまくいかない。逆変換を考えなかった設計が悪い / this does not work well. bad design because I didn't think about inverse conversion
+            # module_name = module_name.replace("_", ".")
+
+            # ださいけどSDXLのU-Netの "_" を "@" に変換する / ugly but convert "_" of SDXL U-Net to "@"
+            matches = pattern.findall(module_name)
+            if matches is not None:
+                for m in matches:
+                    logger.info(f"{module_name} {m}")
+                    module_name = module_name.replace(m, m.replace("_", "@"))
+            module_name = module_name.replace("_", ".")
+            module_name = module_name.replace("@", "_")
+
+            lllite_key = module_name + ".lllite_" + weight_name
+
+            state_dict[lllite_key] = weights_sd[key]
+
+        info = self.load_state_dict(state_dict, False)
+        return info
+
+    def forward(self, x, timesteps=None, context=None, y=None, cond_image=None, **kwargs):
+        for m in self.lllite_modules:
+            m.set_cond_image(cond_image)
+        return super().forward(x, timesteps, context, y, **kwargs)
+
+
+def replace_unet_linear_and_conv2d():
+    logger.info("replace torch.nn.Linear and torch.nn.Conv2d with LLLiteLinear and LLLiteConv2d in U-Net")
+    sdxl_original_unet.torch.nn.Linear = LLLiteLinear
+    sdxl_original_unet.torch.nn.Conv2d = LLLiteConv2d
+
+
+if __name__ == "__main__":
+    # デバッグ用 / for debug
+
+    # sdxl_original_unet.USE_REENTRANT = False
+    replace_unet_linear_and_conv2d()
+
+    # test shape etc
+    logger.info("create unet")
+    unet = SdxlUNet2DConditionModelControlNetLLLite()
+
+    logger.info("enable ControlNet-LLLite")
+    unet.apply_lllite(32, 64, None, False, 1.0)
+    unet.to("cuda")  # .to(torch.float16)
+
+    # from safetensors.torch import load_file
+
+    # model_sd = load_file(r"E:\Work\SD\Models\sdxl\sd_xl_base_1.0_0.9vae.safetensors")
+    # unet_sd = {}
+
+    # # copy U-Net keys from unet_state_dict to state_dict
+    # prefix = "model.diffusion_model."
+    # for key in model_sd.keys():
+    #     if key.startswith(prefix):
+    #         converted_key = key[len(prefix) :]
+    #         unet_sd[converted_key] = model_sd[key]
+
+    # info = unet.load_lllite_weights("r:/lllite_from_unet.safetensors", unet_sd)
+    # logger.info(info)
+
+    # logger.info(unet)
+
+    # log the number of trainable parameters
+    params = unet.prepare_params()
+    logger.info(f"number of parameters {sum(p.numel() for p in params)}")
+    # logger.info("type any key to continue")
+    # input()
+
+    unet.set_use_memory_efficient_attention(True, False)
+    unet.set_gradient_checkpointing(True)
+    unet.train()  # for gradient checkpointing
+
+    # # visualize
+    # import torchviz
+    # logger.info("run visualize")
+    # controlnet.set_control(conditioning_image)
+    # output = unet(x, t, ctx, y)
+    # logger.info("make_dot")
+    # image = torchviz.make_dot(output, params=dict(controlnet.named_parameters()))
+    # logger.info("render")
+    # image.format = "svg" # "png"
+    # image.render("NeuralNet") # be careful because it takes a long time
+    # input()
+
+    import bitsandbytes
+
+    optimizer = bitsandbytes.adam.Adam8bit(params, 1e-3)
+
+    scaler = torch.cuda.amp.GradScaler(enabled=True)
+
+    logger.info("start training")
+    steps = 10
+    batch_size = 1
+
+    sample_param = [p for p in unet.named_parameters() if ".lllite_up." in p[0]][0]
+    for step in range(steps):
+        logger.info(f"step {step}")
+
+        conditioning_image = torch.rand(batch_size, 3, 1024, 1024).cuda() * 2.0 - 1.0
+        x = torch.randn(batch_size, 4, 128, 128).cuda()
+        t = torch.randint(low=0, high=10, size=(batch_size,)).cuda()
+        ctx = torch.randn(batch_size, 77, 2048).cuda()
+        y = torch.randn(batch_size, sdxl_original_unet.ADM_IN_CHANNELS).cuda()
+
+        with torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
+            output = unet(x, t, ctx, y, conditioning_image)
+            target = torch.randn_like(output)
+            loss = torch.nn.functional.mse_loss(output, target)
+
+        scaler.scale(loss).backward()
+        scaler.step(optimizer)
+        scaler.update()
+        optimizer.zero_grad(set_to_none=True)
+        logger.info(sample_param)
+
+    # from safetensors.torch import save_file
+
+    # logger.info("save weights")
+    # unet.save_lllite_weights("r:/lllite_from_unet.safetensors", torch.float16, None)
--- a/networks/convert_flux_lora.py
+++ b/networks/convert_flux_lora.py
@@ -0,0 +1,434 @@
+# convert key mapping and data format from some LoRA format to another
+"""
+Original LoRA format: Based on Black Forest Labs, QKV and MLP are unified into one module
+alpha is scalar for each LoRA module
+
+0 to 18
+lora_unet_double_blocks_0_img_attn_proj.alpha torch.Size([])
+lora_unet_double_blocks_0_img_attn_proj.lora_down.weight torch.Size([4, 3072])
+lora_unet_double_blocks_0_img_attn_proj.lora_up.weight torch.Size([3072, 4])
+lora_unet_double_blocks_0_img_attn_qkv.alpha torch.Size([])
+lora_unet_double_blocks_0_img_attn_qkv.lora_down.weight torch.Size([4, 3072])
+lora_unet_double_blocks_0_img_attn_qkv.lora_up.weight torch.Size([9216, 4])
+lora_unet_double_blocks_0_img_mlp_0.alpha torch.Size([])
+lora_unet_double_blocks_0_img_mlp_0.lora_down.weight torch.Size([4, 3072])
+lora_unet_double_blocks_0_img_mlp_0.lora_up.weight torch.Size([12288, 4])
+lora_unet_double_blocks_0_img_mlp_2.alpha torch.Size([])
+lora_unet_double_blocks_0_img_mlp_2.lora_down.weight torch.Size([4, 12288])
+lora_unet_double_blocks_0_img_mlp_2.lora_up.weight torch.Size([3072, 4])
+lora_unet_double_blocks_0_img_mod_lin.alpha torch.Size([])
+lora_unet_double_blocks_0_img_mod_lin.lora_down.weight torch.Size([4, 3072])
+lora_unet_double_blocks_0_img_mod_lin.lora_up.weight torch.Size([18432, 4])
+lora_unet_double_blocks_0_txt_attn_proj.alpha torch.Size([])
+lora_unet_double_blocks_0_txt_attn_proj.lora_down.weight torch.Size([4, 3072])
+lora_unet_double_blocks_0_txt_attn_proj.lora_up.weight torch.Size([3072, 4])
+lora_unet_double_blocks_0_txt_attn_qkv.alpha torch.Size([])
+lora_unet_double_blocks_0_txt_attn_qkv.lora_down.weight torch.Size([4, 3072])
+lora_unet_double_blocks_0_txt_attn_qkv.lora_up.weight torch.Size([9216, 4])
+lora_unet_double_blocks_0_txt_mlp_0.alpha torch.Size([])
+lora_unet_double_blocks_0_txt_mlp_0.lora_down.weight torch.Size([4, 3072])
+lora_unet_double_blocks_0_txt_mlp_0.lora_up.weight torch.Size([12288, 4])
+lora_unet_double_blocks_0_txt_mlp_2.alpha torch.Size([])
+lora_unet_double_blocks_0_txt_mlp_2.lora_down.weight torch.Size([4, 12288])
+lora_unet_double_blocks_0_txt_mlp_2.lora_up.weight torch.Size([3072, 4])
+lora_unet_double_blocks_0_txt_mod_lin.alpha torch.Size([])
+lora_unet_double_blocks_0_txt_mod_lin.lora_down.weight torch.Size([4, 3072])
+lora_unet_double_blocks_0_txt_mod_lin.lora_up.weight torch.Size([18432, 4])
+
+0 to 37
+lora_unet_single_blocks_0_linear1.alpha torch.Size([])
+lora_unet_single_blocks_0_linear1.lora_down.weight torch.Size([4, 3072])
+lora_unet_single_blocks_0_linear1.lora_up.weight torch.Size([21504, 4])
+lora_unet_single_blocks_0_linear2.alpha torch.Size([])
+lora_unet_single_blocks_0_linear2.lora_down.weight torch.Size([4, 15360])
+lora_unet_single_blocks_0_linear2.lora_up.weight torch.Size([3072, 4])
+lora_unet_single_blocks_0_modulation_lin.alpha torch.Size([])
+lora_unet_single_blocks_0_modulation_lin.lora_down.weight torch.Size([4, 3072])
+lora_unet_single_blocks_0_modulation_lin.lora_up.weight torch.Size([9216, 4])
+"""
+"""
+ai-toolkit: Based on Diffusers, QKV and MLP are separated into 3 modules.
+A is down, B is up. No alpha for each LoRA module.
+
+0 to 18
+transformer.transformer_blocks.0.attn.add_k_proj.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.attn.add_k_proj.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.attn.add_q_proj.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.attn.add_q_proj.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.attn.add_v_proj.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.attn.add_v_proj.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.attn.to_add_out.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.attn.to_add_out.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.attn.to_k.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.attn.to_k.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.attn.to_out.0.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.attn.to_out.0.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.attn.to_q.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.attn.to_q.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.attn.to_v.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.attn.to_v.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.ff.net.0.proj.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.ff.net.0.proj.lora_B.weight torch.Size([12288, 16])
+transformer.transformer_blocks.0.ff.net.2.lora_A.weight torch.Size([16, 12288])
+transformer.transformer_blocks.0.ff.net.2.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.ff_context.net.0.proj.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.ff_context.net.0.proj.lora_B.weight torch.Size([12288, 16])
+transformer.transformer_blocks.0.ff_context.net.2.lora_A.weight torch.Size([16, 12288])
+transformer.transformer_blocks.0.ff_context.net.2.lora_B.weight torch.Size([3072, 16])
+transformer.transformer_blocks.0.norm1.linear.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.norm1.linear.lora_B.weight torch.Size([18432, 16])
+transformer.transformer_blocks.0.norm1_context.linear.lora_A.weight torch.Size([16, 3072])
+transformer.transformer_blocks.0.norm1_context.linear.lora_B.weight torch.Size([18432, 16])
+
+0 to 37
+transformer.single_transformer_blocks.0.attn.to_k.lora_A.weight torch.Size([16, 3072])
+transformer.single_transformer_blocks.0.attn.to_k.lora_B.weight torch.Size([3072, 16])
+transformer.single_transformer_blocks.0.attn.to_q.lora_A.weight torch.Size([16, 3072])
+transformer.single_transformer_blocks.0.attn.to_q.lora_B.weight torch.Size([3072, 16])
+transformer.single_transformer_blocks.0.attn.to_v.lora_A.weight torch.Size([16, 3072])
+transformer.single_transformer_blocks.0.attn.to_v.lora_B.weight torch.Size([3072, 16])
+transformer.single_transformer_blocks.0.norm.linear.lora_A.weight torch.Size([16, 3072])
+transformer.single_transformer_blocks.0.norm.linear.lora_B.weight torch.Size([9216, 16])
+transformer.single_transformer_blocks.0.proj_mlp.lora_A.weight torch.Size([16, 3072])
+transformer.single_transformer_blocks.0.proj_mlp.lora_B.weight torch.Size([12288, 16])
+transformer.single_transformer_blocks.0.proj_out.lora_A.weight torch.Size([16, 15360])
+transformer.single_transformer_blocks.0.proj_out.lora_B.weight torch.Size([3072, 16])
+"""
+"""
+xlabs: Unknown format.
+0 to 18
+double_blocks.0.processor.proj_lora1.down.weight torch.Size([16, 3072])
+double_blocks.0.processor.proj_lora1.up.weight torch.Size([3072, 16])
+double_blocks.0.processor.proj_lora2.down.weight torch.Size([16, 3072])
+double_blocks.0.processor.proj_lora2.up.weight torch.Size([3072, 16])
+double_blocks.0.processor.qkv_lora1.down.weight torch.Size([16, 3072])
+double_blocks.0.processor.qkv_lora1.up.weight torch.Size([9216, 16])
+double_blocks.0.processor.qkv_lora2.down.weight torch.Size([16, 3072])
+double_blocks.0.processor.qkv_lora2.up.weight torch.Size([9216, 16])
+"""
+
+
+import argparse
+from safetensors.torch import save_file
+from safetensors import safe_open
+import torch
+
+
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def convert_to_sd_scripts(sds_sd, ait_sd, sds_key, ait_key):
+    ait_down_key = ait_key + ".lora_A.weight"
+    if ait_down_key not in ait_sd:
+        return
+    ait_up_key = ait_key + ".lora_B.weight"
+
+    down_weight = ait_sd.pop(ait_down_key)
+    sds_sd[sds_key + ".lora_down.weight"] = down_weight
+    sds_sd[sds_key + ".lora_up.weight"] = ait_sd.pop(ait_up_key)
+    rank = down_weight.shape[0]
+    sds_sd[sds_key + ".alpha"] = torch.scalar_tensor(rank, dtype=down_weight.dtype, device=down_weight.device)
+
+
+def convert_to_sd_scripts_cat(sds_sd, ait_sd, sds_key, ait_keys):
+    ait_down_keys = [k + ".lora_A.weight" for k in ait_keys]
+    if ait_down_keys[0] not in ait_sd:
+        return
+    ait_up_keys = [k + ".lora_B.weight" for k in ait_keys]
+
+    down_weights = [ait_sd.pop(k) for k in ait_down_keys]
+    up_weights = [ait_sd.pop(k) for k in ait_up_keys]
+
+    # lora_down is concatenated along dim=0, so rank is multiplied by the number of splits
+    rank = down_weights[0].shape[0]
+    num_splits = len(ait_keys)
+    sds_sd[sds_key + ".lora_down.weight"] = torch.cat(down_weights, dim=0)
+
+    merged_up_weights = torch.zeros(
+        (sum(w.shape[0] for w in up_weights), rank * num_splits),
+        dtype=up_weights[0].dtype,
+        device=up_weights[0].device,
+    )
+
+    i = 0
+    for j, up_weight in enumerate(up_weights):
+        merged_up_weights[i : i + up_weight.shape[0], j * rank : (j + 1) * rank] = up_weight
+        i += up_weight.shape[0]
+
+    sds_sd[sds_key + ".lora_up.weight"] = merged_up_weights
+
+    # set alpha to new_rank
+    new_rank = rank * num_splits
+    sds_sd[sds_key + ".alpha"] = torch.scalar_tensor(new_rank, dtype=down_weights[0].dtype, device=down_weights[0].device)
+
+
+def convert_ai_toolkit_to_sd_scripts(ait_sd):
+    sds_sd = {}
+    for i in range(19):
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_img_attn_proj", f"transformer.transformer_blocks.{i}.attn.to_out.0"
+        )
+        convert_to_sd_scripts_cat(
+            sds_sd,
+            ait_sd,
+            f"lora_unet_double_blocks_{i}_img_attn_qkv",
+            [
+                f"transformer.transformer_blocks.{i}.attn.to_q",
+                f"transformer.transformer_blocks.{i}.attn.to_k",
+                f"transformer.transformer_blocks.{i}.attn.to_v",
+            ],
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_img_mlp_0", f"transformer.transformer_blocks.{i}.ff.net.0.proj"
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_img_mlp_2", f"transformer.transformer_blocks.{i}.ff.net.2"
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_img_mod_lin", f"transformer.transformer_blocks.{i}.norm1.linear"
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_txt_attn_proj", f"transformer.transformer_blocks.{i}.attn.to_add_out"
+        )
+        convert_to_sd_scripts_cat(
+            sds_sd,
+            ait_sd,
+            f"lora_unet_double_blocks_{i}_txt_attn_qkv",
+            [
+                f"transformer.transformer_blocks.{i}.attn.add_q_proj",
+                f"transformer.transformer_blocks.{i}.attn.add_k_proj",
+                f"transformer.transformer_blocks.{i}.attn.add_v_proj",
+            ],
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_txt_mlp_0", f"transformer.transformer_blocks.{i}.ff_context.net.0.proj"
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_txt_mlp_2", f"transformer.transformer_blocks.{i}.ff_context.net.2"
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_txt_mod_lin", f"transformer.transformer_blocks.{i}.norm1_context.linear"
+        )
+
+    for i in range(38):
+        convert_to_sd_scripts_cat(
+            sds_sd,
+            ait_sd,
+            f"lora_unet_single_blocks_{i}_linear1",
+            [
+                f"transformer.single_transformer_blocks.{i}.attn.to_q",
+                f"transformer.single_transformer_blocks.{i}.attn.to_k",
+                f"transformer.single_transformer_blocks.{i}.attn.to_v",
+                f"transformer.single_transformer_blocks.{i}.proj_mlp",
+            ],
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_single_blocks_{i}_linear2", f"transformer.single_transformer_blocks.{i}.proj_out"
+        )
+        convert_to_sd_scripts(
+            sds_sd, ait_sd, f"lora_unet_single_blocks_{i}_modulation_lin", f"transformer.single_transformer_blocks.{i}.norm.linear"
+        )
+
+    if len(ait_sd) > 0:
+        logger.warning(f"Unsupported keys for sd-scripts: {ait_sd.keys()}")
+    return sds_sd
+
+
+def convert_to_ai_toolkit(sds_sd, ait_sd, sds_key, ait_key):
+    if sds_key + ".lora_down.weight" not in sds_sd:
+        return
+    down_weight = sds_sd.pop(sds_key + ".lora_down.weight")
+
+    # scale weight by alpha and dim
+    rank = down_weight.shape[0]
+    alpha = sds_sd.pop(sds_key + ".alpha").item()  # alpha is scalar
+    scale = alpha / rank  # LoRA is scaled by 'alpha / rank' in forward pass, so we need to scale it back here
+    # print(f"rank: {rank}, alpha: {alpha}, scale: {scale}")
+
+    # split scale so that scale_down * scale_up == scale. if scale is 0.25, scale_down is 0.5 and scale_up is 0.5; if scale >= 0.5 the loop does not run (scale_down == scale, scale_up == 1.0)
+    scale_down = scale
+    scale_up = 1.0
+    while scale_down * 2 < scale_up:
+        scale_down *= 2
+        scale_up /= 2
+    # print(f"scale: {scale}, scale_down: {scale_down}, scale_up: {scale_up}")
+
+    ait_sd[ait_key + ".lora_A.weight"] = down_weight * scale_down
+    ait_sd[ait_key + ".lora_B.weight"] = sds_sd.pop(sds_key + ".lora_up.weight") * scale_up
+
+
+def convert_to_ai_toolkit_cat(sds_sd, ait_sd, sds_key, ait_keys, dims=None):
+    if sds_key + ".lora_down.weight" not in sds_sd:
+        return
+    down_weight = sds_sd.pop(sds_key + ".lora_down.weight")
+    up_weight = sds_sd.pop(sds_key + ".lora_up.weight")
+    sd_lora_rank = down_weight.shape[0]
+
+    # scale weight by alpha and dim
+    alpha = sds_sd.pop(sds_key + ".alpha")
+    scale = alpha / sd_lora_rank
+
+    # calculate scale_down and scale_up
+    scale_down = scale
+    scale_up = 1.0
+    while scale_down * 2 < scale_up:
+        scale_down *= 2
+        scale_up /= 2
+
+    down_weight = down_weight * scale_down
+    up_weight = up_weight * scale_up
+
+    # calculate dims if not provided
+    num_splits = len(ait_keys)
+    if dims is None:
+        dims = [up_weight.shape[0] // num_splits] * num_splits
+    else:
+        assert sum(dims) == up_weight.shape[0]
+
+    # check whether up_weight is sparse (block-diagonal) or not
+    is_sparse = False
+    if sd_lora_rank % num_splits == 0:
+        ait_rank = sd_lora_rank // num_splits
+        is_sparse = True
+        i = 0
+        for j in range(len(dims)):
+            for k in range(len(dims)):
+                if j == k:
+                    continue
+                is_sparse = is_sparse and torch.all(up_weight[i : i + dims[j], k * ait_rank : (k + 1) * ait_rank] == 0)
+            i += dims[j]
+        if is_sparse:
+            logger.info(f"weight is sparse: {sds_key}")
+
+    # make ai-toolkit weight
+    ait_down_keys = [k + ".lora_A.weight" for k in ait_keys]
+    ait_up_keys = [k + ".lora_B.weight" for k in ait_keys]
+    if not is_sparse:
+        # down_weight is copied to each split
+        ait_sd.update({k: down_weight for k in ait_down_keys})
+
+        # up_weight is split to each split
+        ait_sd.update({k: v for k, v in zip(ait_up_keys, torch.split(up_weight, dims, dim=0))})
+    else:
+        # down_weight is chunked to each split
+        ait_sd.update({k: v for k, v in zip(ait_down_keys, torch.chunk(down_weight, num_splits, dim=0))})
+
+        # up_weight is sparse: only non-zero values are copied to each split
+        i = 0
+        for j in range(len(dims)):
+            ait_sd[ait_up_keys[j]] = up_weight[i : i + dims[j], j * ait_rank : (j + 1) * ait_rank].contiguous()
+            i += dims[j]
+
+
+def convert_sd_scripts_to_ai_toolkit(sds_sd):
+    ait_sd = {}
+    for i in range(19):
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_img_attn_proj", f"transformer.transformer_blocks.{i}.attn.to_out.0"
+        )
+        convert_to_ai_toolkit_cat(
+            sds_sd,
+            ait_sd,
+            f"lora_unet_double_blocks_{i}_img_attn_qkv",
+            [
+                f"transformer.transformer_blocks.{i}.attn.to_q",
+                f"transformer.transformer_blocks.{i}.attn.to_k",
+                f"transformer.transformer_blocks.{i}.attn.to_v",
+            ],
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_img_mlp_0", f"transformer.transformer_blocks.{i}.ff.net.0.proj"
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_img_mlp_2", f"transformer.transformer_blocks.{i}.ff.net.2"
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_img_mod_lin", f"transformer.transformer_blocks.{i}.norm1.linear"
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_txt_attn_proj", f"transformer.transformer_blocks.{i}.attn.to_add_out"
+        )
+        convert_to_ai_toolkit_cat(
+            sds_sd,
+            ait_sd,
+            f"lora_unet_double_blocks_{i}_txt_attn_qkv",
+            [
+                f"transformer.transformer_blocks.{i}.attn.add_q_proj",
+                f"transformer.transformer_blocks.{i}.attn.add_k_proj",
+                f"transformer.transformer_blocks.{i}.attn.add_v_proj",
+            ],
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_txt_mlp_0", f"transformer.transformer_blocks.{i}.ff_context.net.0.proj"
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_txt_mlp_2", f"transformer.transformer_blocks.{i}.ff_context.net.2"
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_double_blocks_{i}_txt_mod_lin", f"transformer.transformer_blocks.{i}.norm1_context.linear"
+        )
+
+    for i in range(38):
+        convert_to_ai_toolkit_cat(
+            sds_sd,
+            ait_sd,
+            f"lora_unet_single_blocks_{i}_linear1",
+            [
+                f"transformer.single_transformer_blocks.{i}.attn.to_q",
+                f"transformer.single_transformer_blocks.{i}.attn.to_k",
+                f"transformer.single_transformer_blocks.{i}.attn.to_v",
+                f"transformer.single_transformer_blocks.{i}.proj_mlp",
+            ],
+            dims=[3072, 3072, 3072, 12288],
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_single_blocks_{i}_linear2", f"transformer.single_transformer_blocks.{i}.proj_out"
+        )
+        convert_to_ai_toolkit(
+            sds_sd, ait_sd, f"lora_unet_single_blocks_{i}_modulation_lin", f"transformer.single_transformer_blocks.{i}.norm.linear"
+        )
+
+    if len(sds_sd) > 0:
+        logger.warning(f"Unsupported keys for ai-toolkit: {sds_sd.keys()}")
+    return ait_sd
+
+
+def main(args):
+    # load source safetensors
+    logger.info(f"Loading source file {args.src_path}")
+    state_dict = {}
+    with safe_open(args.src_path, framework="pt") as f:
+        metadata = f.metadata()
+        for k in f.keys():
+            state_dict[k] = f.get_tensor(k)
+
+    logger.info(f"Converting {args.src} to {args.dst} format")
+    if args.src == "ai-toolkit" and args.dst == "sd-scripts":
+        state_dict = convert_ai_toolkit_to_sd_scripts(state_dict)
+    elif args.src == "sd-scripts" and args.dst == "ai-toolkit":
+        state_dict = convert_sd_scripts_to_ai_toolkit(state_dict)
+
+        # eliminate 'shared tensors' (views and reused tensors), which safetensors cannot save
+        for k in list(state_dict.keys()):
+            state_dict[k] = state_dict[k].detach().clone()
+    else:
+        raise NotImplementedError(f"Conversion from {args.src} to {args.dst} is not supported")
+
+    # save destination safetensors
+    logger.info(f"Saving destination file {args.dst_path}")
+    save_file(state_dict, args.dst_path, metadata=metadata)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Convert LoRA format")
+    parser.add_argument("--src", type=str, default="ai-toolkit", help="source format, ai-toolkit or sd-scripts")
+    parser.add_argument("--dst", type=str, default="sd-scripts", help="destination format, ai-toolkit or sd-scripts")
+    parser.add_argument("--src_path", type=str, default=None, help="source path")
+    parser.add_argument("--dst_path", type=str, default=None, help="destination path")
+    args = parser.parse_args()
+    main(args)
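Review note on `convert_to_sd_scripts_cat` above: the merge relies on the fact that stacking the per-projection `lora_A` matrices row-wise and placing the `lora_B` matrices on a block diagonal reproduces each projection's delta weight exactly, with alpha set to the new rank so that `alpha / rank` stays 1. The following is a minimal sketch of that identity, not the converter itself. It uses plain Python lists in place of torch tensors so it runs without dependencies, and the helper names `matmul` and `merge_block_diagonal` and the toy q/k weights are hypothetical, introduced only for illustration.

```python
# Illustrative sketch only: plain Python lists stand in for torch tensors.


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_block_diagonal(downs, ups):
    """Merge per-projection LoRA pairs into one fused (down, up) pair.

    downs[i] has shape (rank, in_dim); ups[i] has shape (out_dim_i, rank).
    The fused down is the row-wise concatenation of all downs, and the fused
    up is block-diagonal, so each projection only reads its own rank slice.
    """
    rank = len(downs[0])
    total_rank = rank * len(downs)
    merged_down = [row for down in downs for row in down]
    merged_up = []
    for j, up in enumerate(ups):
        for row in up:
            merged_row = [0.0] * total_rank
            merged_row[j * rank:(j + 1) * rank] = row
            merged_up.append(merged_row)
    return merged_down, merged_up


# Two hypothetical projections (say q and k) with rank 1, in_dim 2, out_dim 2.
q_down, q_up = [[1.0, 2.0]], [[3.0], [4.0]]
k_down, k_up = [[5.0, 6.0]], [[7.0], [8.0]]

merged_down, merged_up = merge_block_diagonal([q_down, k_down], [q_up, k_up])

# The fused delta weight equals the per-projection deltas stacked along the output axis.
fused_delta = matmul(merged_up, merged_down)
stacked_delta = matmul(q_up, q_down) + matmul(k_up, k_down)
```

The zero blocks are also what `convert_to_ai_toolkit_cat` looks for in its sparsity check: when every off-diagonal block of the fused up weight is zero, the original per-projection pairs can be recovered exactly instead of copying the full down weight to each split.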
--- a/networks/dylora.py
+++ b/networks/dylora.py
@@ -0,0 +1,529 @@
+# some codes are copied from:
+# https://github.com/huawei-noah/KD-NLP/blob/main/DyLoRA/
+
+# Copyright (C) 2022. Huawei Technologies Co., Ltd. All rights reserved.
+# Changes made to the original code:
+# 2022.08.20 - Integrate the DyLoRA layer for the LoRA Linear layer
+#  ------------------------------------------------------------------------------------------
+#  Copyright (c) Microsoft Corporation. All rights reserved.
+#  Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
+#  ------------------------------------------------------------------------------------------
+
+import math
+import os
+import random
+from typing import Dict, List, Optional, Tuple, Type, Union
+from diffusers import AutoencoderKL
+from transformers import CLIPTextModel
+import torch
+from torch import nn
+from library.utils import setup_logging
+
+setup_logging()
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+class DyLoRAModule(torch.nn.Module):
+    """
+    replaces forward method of the original Linear, instead of replacing the original Linear module.
+    """
+
+    # NOTE: support dropout in future
+    def __init__(self, lora_name, org_module: torch.nn.Module, multiplier=1.0, lora_dim=4, alpha=1, unit=1):
+        super().__init__()
+        self.lora_name = lora_name
+        self.lora_dim = lora_dim
+        self.unit = unit
+        assert self.lora_dim % self.unit == 0, "rank must be a multiple of unit"
+
+        if org_module.__class__.__name__ == "Conv2d":
+            in_dim = org_module.in_channels
+            out_dim = org_module.out_channels
+        else:
+            in_dim = org_module.in_features
+            out_dim = org_module.out_features
+
+        if type(alpha) == torch.Tensor:
+            alpha = alpha.detach().float().numpy()  # without casting, bf16 causes error
+        alpha = self.lora_dim if alpha is None or alpha == 0 else alpha
+        self.scale = alpha / self.lora_dim
+        self.register_buffer("alpha", torch.tensor(alpha))  # can be treated as a constant
+
+        self.is_conv2d = org_module.__class__.__name__ == "Conv2d"
+        self.is_conv2d_3x3 = self.is_conv2d and org_module.kernel_size == (3, 3)
+
+        if self.is_conv2d and self.is_conv2d_3x3:
+            kernel_size = org_module.kernel_size
+            self.stride = org_module.stride
+            self.padding = org_module.padding
+            self.lora_A = nn.ParameterList([org_module.weight.new_zeros((1, in_dim, *kernel_size)) for _ in range(self.lora_dim)])
+            self.lora_B = nn.ParameterList([org_module.weight.new_zeros((out_dim, 1, 1, 1)) for _ in range(self.lora_dim)])
+        else:
+            self.lora_A = nn.ParameterList([org_module.weight.new_zeros((1, in_dim)) for _ in range(self.lora_dim)])
+            self.lora_B = nn.ParameterList([org_module.weight.new_zeros((out_dim, 1)) for _ in range(self.lora_dim)])
+
+        # same as microsoft's
+        for lora in self.lora_A:
+            torch.nn.init.kaiming_uniform_(lora, a=math.sqrt(5))
+        for lora in self.lora_B:
+            torch.nn.init.zeros_(lora)
+
+        self.multiplier = multiplier
+        self.org_module = org_module  # removed in apply_to()
+
+    def apply_to(self):
+        self.org_forward = self.org_module.forward
+        self.org_module.forward = self.forward
+        del self.org_module
+
+    def forward(self, x):
+        result = self.org_forward(x)
+
+        # specify the dynamic rank
+        trainable_rank = random.randint(0, self.lora_dim - 1)
+        trainable_rank = trainable_rank - trainable_rank % self.unit  # make sure the rank is a multiple of unit
+
+        # freeze some of the parameters and train the rest
+        for i in range(0, trainable_rank):
+            self.lora_A[i].requires_grad = False
+            self.lora_B[i].requires_grad = False
+        for i in range(trainable_rank, trainable_rank + self.unit):
+            self.lora_A[i].requires_grad = True
+            self.lora_B[i].requires_grad = True
+        for i in range(trainable_rank + self.unit, self.lora_dim):
+            self.lora_A[i].requires_grad = False
+            self.lora_B[i].requires_grad = False
+
+        lora_A = torch.cat(tuple(self.lora_A), dim=0)
+        lora_B = torch.cat(tuple(self.lora_B), dim=1)
+
+        # calculate with lora_A and lora_B
+        if self.is_conv2d_3x3:
+            ab = torch.nn.functional.conv2d(x, lora_A, stride=self.stride, padding=self.padding)
+            ab = torch.nn.functional.conv2d(ab, lora_B)
+        else:
+            ab = x
+            if self.is_conv2d:
+                ab = ab.reshape(ab.size(0), ab.size(1), -1).transpose(1, 2)  # (N, C, H, W) -> (N, H*W, C)
+
+            ab = torch.nn.functional.linear(ab, lora_A)
+            ab = torch.nn.functional.linear(ab, lora_B)
+
+            if self.is_conv2d:
+                ab = ab.transpose(1, 2).reshape(ab.size(0), -1, *x.size()[2:])  # (N, H*W, C) -> (N, C, H, W)
+
+        # the last term is (probably) a scaling factor that boosts the contribution of lower ranks
+        result = result + ab * self.scale * math.sqrt(self.lora_dim / (trainable_rank + self.unit))
+
+        # NOTE: it might be faster to add to the weight first and then call linear/conv2d
+        return result
+
+    def state_dict(self, destination=None, prefix="", keep_vars=False):
+        # make the state dict the same as a normal LoRA:
+        # nn.ParameterList produces names like `.lora_A.0`, so concatenate them as in forward and swap the keys
+        sd = super().state_dict(destination=destination, prefix=prefix, keep_vars=keep_vars)
+
+        lora_A_weight = torch.cat(tuple(self.lora_A), dim=0)
+        if self.is_conv2d and not self.is_conv2d_3x3:
+            lora_A_weight = lora_A_weight.unsqueeze(-1).unsqueeze(-1)
+
+        lora_B_weight = torch.cat(tuple(self.lora_B), dim=1)
+        if self.is_conv2d and not self.is_conv2d_3x3:
+            lora_B_weight = lora_B_weight.unsqueeze(-1).unsqueeze(-1)
+
+        sd[self.lora_name + ".lora_down.weight"] = lora_A_weight if keep_vars else lora_A_weight.detach()
+        sd[self.lora_name + ".lora_up.weight"] = lora_B_weight if keep_vars else lora_B_weight.detach()
+
+        i = 0
+        while True:
+            key_a = f"{self.lora_name}.lora_A.{i}"
+            key_b = f"{self.lora_name}.lora_B.{i}"
+            if key_a in sd:
+                sd.pop(key_a)
+                sd.pop(key_b)
+            else:
+                break
+            i += 1
+        return sd
+
+    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs):
+        # make it possible to load the same state dict as a normal LoRA: this approach was suggested by ChatGPT
+        lora_A_weight = state_dict.pop(self.lora_name + ".lora_down.weight", None)
+        lora_B_weight = state_dict.pop(self.lora_name + ".lora_up.weight", None)
+
+        if lora_A_weight is None or lora_B_weight is None:
+            if strict:
+                raise KeyError(f"{self.lora_name}.lora_down/up.weight is not found")
+            else:
+                return
+
+        if self.is_conv2d and not self.is_conv2d_3x3:
+            lora_A_weight = lora_A_weight.squeeze(-1).squeeze(-1)
+            lora_B_weight = lora_B_weight.squeeze(-1).squeeze(-1)
+
+        state_dict.update(
+            {f"{self.lora_name}.lora_A.{i}": nn.Parameter(lora_A_weight[i].unsqueeze(0)) for i in range(lora_A_weight.size(0))}
+        )
+        state_dict.update(
+            {f"{self.lora_name}.lora_B.{i}": nn.Parameter(lora_B_weight[:, i].unsqueeze(1)) for i in range(lora_B_weight.size(1))}
+        )
+
+        super()._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
+
+
+def create_network(
+    multiplier: float,
+    network_dim: Optional[int],
+    network_alpha: Optional[float],
+    vae: AutoencoderKL,
+    text_encoder: Union[CLIPTextModel, List[CLIPTextModel]],
+    unet,
+    **kwargs,
+):
+    if network_dim is None:
+        network_dim = 4  # default
+    if network_alpha is None:
+        network_alpha = 1.0
+
+    # extract dim/alpha for conv2d, and block dim
+    conv_dim = kwargs.get("conv_dim", None)
+    conv_alpha = kwargs.get("conv_alpha", None)
+    unit = kwargs.get("unit", None)
+    if conv_dim is not None:
+        conv_dim = int(conv_dim)
+        assert conv_dim == network_dim, "conv_dim must be same as network_dim"
+        if conv_alpha is None:
+            conv_alpha = 1.0
+        else:
+            conv_alpha = float(conv_alpha)
+
+    if unit is not None:
+        unit = int(unit)
+    else:
+        unit = 1
+
+    network = DyLoRANetwork(
+        text_encoder,
+        unet,
+        multiplier=multiplier,
+        lora_dim=network_dim,
+        alpha=network_alpha,
+        apply_to_conv=conv_dim is not None,
+        unit=unit,
+        varbose=True,
+    )
+
+    loraplus_lr_ratio = kwargs.get("loraplus_lr_ratio", None)
+    loraplus_unet_lr_ratio = kwargs.get("loraplus_unet_lr_ratio", None)
+    loraplus_text_encoder_lr_ratio = kwargs.get("loraplus_text_encoder_lr_ratio", None)
+    loraplus_lr_ratio = float(loraplus_lr_ratio) if loraplus_lr_ratio is not None else None
+    loraplus_unet_lr_ratio = float(loraplus_unet_lr_ratio) if loraplus_unet_lr_ratio is not None else None
+    loraplus_text_encoder_lr_ratio = float(loraplus_text_encoder_lr_ratio) if loraplus_text_encoder_lr_ratio is not None else None
+    if loraplus_lr_ratio is not None or loraplus_unet_lr_ratio is not None or loraplus_text_encoder_lr_ratio is not None:
+        network.set_loraplus_lr_ratio(loraplus_lr_ratio, loraplus_unet_lr_ratio, loraplus_text_encoder_lr_ratio)
+
+    return network
+
+
+# Create network from weights for inference; weights are not loaded here because they may be merged instead
+def create_network_from_weights(multiplier, file, vae, text_encoder, unet, weights_sd=None, for_inference=False, **kwargs):
+    if weights_sd is None:
+        if os.path.splitext(file)[1] == ".safetensors":
+            from safetensors.torch import load_file, safe_open
+
+            weights_sd = load_file(file)
+        else:
+            weights_sd = torch.load(file, map_location="cpu")
+
+    # get dim/alpha mapping
+    modules_dim = {}
+    modules_alpha = {}
+    for key, value in weights_sd.items():
+        if "." not in key:
+            continue
+
+        lora_name = key.split(".")[0]
+        if "alpha" in key:
+            modules_alpha[lora_name] = value
+        elif "lora_down" in key:
+            dim = value.size()[0]
+            modules_dim[lora_name] = dim
+            # logger.info(f"{lora_name} {value.size()} {dim}")
+
+    # support old LoRA without alpha
+    for key in modules_dim.keys():
+        if key not in modules_alpha:
+            modules_alpha[key] = modules_dim[key]
+
+    module_class = DyLoRAModule
+
+    network = DyLoRANetwork(
+        text_encoder, unet, multiplier=multiplier, modules_dim=modules_dim, modules_alpha=modules_alpha, module_class=module_class
+    )
+    return network, weights_sd
+
+
+class DyLoRANetwork(torch.nn.Module):
+    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
+    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
+    LORA_PREFIX_UNET = "lora_unet"
+    LORA_PREFIX_TEXT_ENCODER = "lora_te"
+
+    def __init__(
+        self,
+        text_encoder,
+        unet,
+        multiplier=1.0,
+        lora_dim=4,
+        alpha=1,
+        apply_to_conv=False,
+        modules_dim=None,
+        modules_alpha=None,
+        unit=1,
+        module_class=DyLoRAModule,
+        varbose=False,
+    ) -> None:
+        super().__init__()
+        self.multiplier = multiplier
+
+        self.lora_dim = lora_dim
+        self.alpha = alpha
+        self.apply_to_conv = apply_to_conv
+
+        self.loraplus_lr_ratio = None
+        self.loraplus_unet_lr_ratio = None
+        self.loraplus_text_encoder_lr_ratio = None
+
+        if modules_dim is not None:
+            logger.info("create LoRA network from weights")
+        else:
+            logger.info(f"create LoRA network. base dim (rank): {lora_dim}, alpha: {alpha}, unit: {unit}")
+            if self.apply_to_conv:
+                logger.info("apply LoRA to Conv2d with kernel size (3,3).")
+
+        # create module instances
+        def create_modules(is_unet, root_module: torch.nn.Module, target_replace_modules) -> List[DyLoRAModule]:
+            prefix = DyLoRANetwork.LORA_PREFIX_UNET if is_unet else DyLoRANetwork.LORA_PREFIX_TEXT_ENCODER
+            loras = []
+            for name, module in root_module.named_modules():
+                if module.__class__.__name__ in target_replace_modules:
+                    for child_name, child_module in module.named_modules():
+                        is_linear = child_module.__class__.__name__ == "Linear"
+                        is_conv2d = child_module.__class__.__name__ == "Conv2d"
+                        is_conv2d_1x1 = is_conv2d and child_module.kernel_size == (1, 1)
+
+                        if is_linear or is_conv2d:
+                            lora_name = prefix + "." + name + "." + child_name
+                            lora_name = lora_name.replace(".", "_")
+
+                            dim = None
+                            alpha = None
+                            if modules_dim is not None:
+                                if lora_name in modules_dim:
+                                    dim = modules_dim[lora_name]
+                                    alpha = modules_alpha[lora_name]
+                            else:
+                                if is_linear or is_conv2d_1x1 or apply_to_conv:
+                                    dim = self.lora_dim
+                                    alpha = self.alpha
+
+                            if dim is None or dim == 0:
+                                continue
+
+                            # dropout and fan_in_fan_out is default
+                            lora = module_class(lora_name, child_module, self.multiplier, dim, alpha, unit)
+                            loras.append(lora)
+            return loras
+
+        text_encoders = text_encoder if type(text_encoder) == list else [text_encoder]
+
+        self.text_encoder_loras = []
+        for i, text_encoder in enumerate(text_encoders):
+            if len(text_encoders) > 1:
+                index = i + 1
+                logger.info(f"create LoRA for Text Encoder {index}")
+            else:
+                index = None
+                logger.info("create LoRA for Text Encoder")
+
+            text_encoder_loras = create_modules(False, text_encoder, DyLoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE)
+            self.text_encoder_loras.extend(text_encoder_loras)
+
+        # self.text_encoder_loras = create_modules(False, text_encoder, DyLoRANetwork.TEXT_ENCODER_TARGET_REPLACE_MODULE)
+        logger.info(f"create LoRA for Text Encoder: {len(self.text_encoder_loras)} modules.")
+
+        # extend U-Net target modules if conv2d 3x3 is enabled, or load from weights
+        target_modules = list(DyLoRANetwork.UNET_TARGET_REPLACE_MODULE)  # copy: += below must not mutate the class attribute
+        if modules_dim is not None or self.apply_to_conv:
+            target_modules += DyLoRANetwork.UNET_TARGET_REPLACE_MODULE_CONV2D_3X3
+
+        self.unet_loras = create_modules(True, unet, target_modules)
+        logger.info(f"create LoRA for U-Net: {len(self.unet_loras)} modules.")
+
+    def set_loraplus_lr_ratio(self, loraplus_lr_ratio, loraplus_unet_lr_ratio, loraplus_text_encoder_lr_ratio):
+        self.loraplus_lr_ratio = loraplus_lr_ratio
+        self.loraplus_unet_lr_ratio = loraplus_unet_lr_ratio
+        self.loraplus_text_encoder_lr_ratio = loraplus_text_encoder_lr_ratio
+
+        logger.info(f"LoRA+ UNet LR Ratio: {self.loraplus_unet_lr_ratio or self.loraplus_lr_ratio}")
+        logger.info(f"LoRA+ Text Encoder LR Ratio: {self.loraplus_text_encoder_lr_ratio or self.loraplus_lr_ratio}")
+
+    def set_multiplier(self, multiplier):
+        self.multiplier = multiplier
+        for lora in self.text_encoder_loras + self.unet_loras:
+            lora.multiplier = self.multiplier
+
+    def load_weights(self, file):
+        if os.path.splitext(file)[1] == ".safetensors":
+            from safetensors.torch import load_file
+
+            weights_sd = load_file(file)
+        else:
+            weights_sd = torch.load(file, map_location="cpu")
+
+        info = self.load_state_dict(weights_sd, False)
+        return info
+
+    def apply_to(self, text_encoder, unet, apply_text_encoder=True, apply_unet=True):
+        if apply_text_encoder:
+            logger.info("enable LoRA for text encoder")
+        else:
+            self.text_encoder_loras = []
+
+        if apply_unet:
+            logger.info("enable LoRA for U-Net")
+        else:
+            self.unet_loras = []
+
+        for lora in self.text_encoder_loras + self.unet_loras:
+            lora.apply_to()
+            self.add_module(lora.lora_name, lora)
+
+    """
+    def merge_to(self, text_encoder, unet, weights_sd, dtype, device):
+        apply_text_encoder = apply_unet = False
+        for key in weights_sd.keys():
+            if key.startswith(DyLoRANetwork.LORA_PREFIX_TEXT_ENCODER):
+                apply_text_encoder = True
+            elif key.startswith(DyLoRANetwork.LORA_PREFIX_UNET):
+                apply_unet = True
+
+        if apply_text_encoder:
+            logger.info("enable LoRA for text encoder")
+        else:
+            self.text_encoder_loras = []
+
+        if apply_unet:
+            logger.info("enable LoRA for U-Net")
+        else:
+            self.unet_loras = []
+
+        for lora in self.text_encoder_loras + self.unet_loras:
+            sd_for_lora = {}
+            for key in weights_sd.keys():
+                if key.startswith(lora.lora_name):
+                    sd_for_lora[key[len(lora.lora_name) + 1 :]] = weights_sd[key]
+            lora.merge_to(sd_for_lora, dtype, device)
+
+        logger.info(f"weights are merged")
+    """
+
+    # It might be good to allow separate learning rates for the two Text Encoders
+    def prepare_optimizer_params(self, text_encoder_lr, unet_lr, default_lr):
+        self.requires_grad_(True)
+        all_params = []
+
+        def assemble_params(loras, lr, ratio):
+            param_groups = {"lora": {}, "plus": {}}
+            for lora in loras:
+                for name, param in lora.named_parameters():
+                    if ratio is not None and "lora_B" in name:
+                        param_groups["plus"][f"{lora.lora_name}.{name}"] = param
+                    else:
+                        param_groups["lora"][f"{lora.lora_name}.{name}"] = param
+
+            params = []
+            for key in param_groups.keys():
+                param_data = {"params": param_groups[key].values()}
+
+                if len(param_data["params"]) == 0:
+                    continue
+
+                if lr is not None:
+                    if key == "plus":
+                        param_data["lr"] = lr * ratio
+                    else:
+                        param_data["lr"] = lr
+
+                if param_data.get("lr", None) == 0 or param_data.get("lr", None) is None:
+                    continue
+
+                params.append(param_data)
+
+            return params
+
+        if self.text_encoder_loras:
+            params = assemble_params(
+                self.text_encoder_loras,
+                text_encoder_lr if text_encoder_lr is not None else default_lr,
+                self.loraplus_text_encoder_lr_ratio or self.loraplus_lr_ratio,
+            )
+            all_params.extend(params)
+
+        if self.unet_loras:
+            params = assemble_params(
+                self.unet_loras, default_lr if unet_lr is None else unet_lr, self.loraplus_unet_lr_ratio or self.loraplus_lr_ratio
+            )
+            all_params.extend(params)
+
+        return all_params
+
+    def enable_gradient_checkpointing(self):
+        # not supported
+        pass
+
+    def prepare_grad_etc(self, text_encoder, unet):
+        self.requires_grad_(True)
+
+    def on_epoch_start(self, text_encoder, unet):
+        self.train()
+
+    def get_trainable_params(self):
+        return self.parameters()
+
+    def save_weights(self, file, dtype, metadata):
+        if metadata is not None and len(metadata) == 0:
+            metadata = None
+
+        state_dict = self.state_dict()
+
+        if dtype is not None:
+            for key in list(state_dict.keys()):
+                v = state_dict[key]
+                v = v.detach().clone().to("cpu").to(dtype)
+                state_dict[key] = v
+
+        if os.path.splitext(file)[1] == ".safetensors":
+            from safetensors.torch import save_file
+            from library import train_util
+
+            # Precalculate model hashes to save time on indexing
+            if metadata is None:
+                metadata = {}
+            model_hash, legacy_hash = train_util.precalculate_safetensors_hashes(state_dict, metadata)
+            metadata["sshs_model_hash"] = model_hash
+            metadata["sshs_legacy_hash"] = legacy_hash
+
+            save_file(state_dict, file, metadata)
+        else:
+            torch.save(state_dict, file)
+
+    # mask is a tensor with values from 0 to 1
+    def set_region(self, sub_prompt_index, is_last_network, mask):
+        pass
+
+    def set_current_generation(self, batch_size, num_sub_prompts, width, height, shared):
+        pass
--- a/networks/extract_lora_from_dylora.py
+++ b/networks/extract_lora_from_dylora.py
@@ -0,0 +1,128 @@
+# Split a DyLoRA model into multiple lower-rank LoRA models, one per multiple of --unit below the trained rank
+# This code is based off the extract_lora_from_models.py file which is based on https://github.com/cloneofsimo/lora/blob/develop/lora_diffusion/cli_svd.py
+# Thanks to cloneofsimo
+
+import argparse
+import math
+import os
+import torch
+from safetensors.torch import load_file, save_file, safe_open
+from tqdm import tqdm
+from library import train_util, model_util
+import numpy as np
+from library.utils import setup_logging
+setup_logging()
+import logging
+logger = logging.getLogger(__name__)
+
+def load_state_dict(file_name):
+    if model_util.is_safetensors(file_name):
+        sd = load_file(file_name)
+        with safe_open(file_name, framework="pt") as f:
+            metadata = f.metadata()
+    else:
+        sd = torch.load(file_name, map_location="cpu")
+        metadata = None
+
+    return sd, metadata
+
+
+def save_to_file(file_name, model, metadata):
+    if model_util.is_safetensors(file_name):
+        save_file(model, file_name, metadata)
+    else:
+        torch.save(model, file_name)
+
+
+def split_lora_model(lora_sd, unit):
+    max_rank = 0
+
+    # Extract loaded lora dim and alpha
+    for key, value in lora_sd.items():
+        if "lora_down" in key:
+            rank = value.size()[0]
+            if rank > max_rank:
+                max_rank = rank
+    logger.info(f"Max rank: {max_rank}")
+
+    rank = unit
+    split_models = []
+    new_alpha = None
+    while rank < max_rank:
+        logger.info(f"Splitting rank {rank}")
+        new_sd = {}
+        for key, value in lora_sd.items():
+            if "lora_down" in key:
+                new_sd[key] = value[:rank].contiguous()
+            elif "lora_up" in key:
+                new_sd[key] = value[:, :rank].contiguous()
+            else:
+                # for some reason, scaling alpha breaks the result...
+                # this_rank = lora_sd[key.replace("alpha", "lora_down.weight")].size()[0]
+                # scale = math.sqrt(this_rank / rank)  # rank is > unit
+                # logger.info(key, value.size(), this_rank, rank, value, scale)
+                # new_alpha = value * scale  # always same
+                # new_sd[key] = new_alpha
+                new_sd[key] = value
+
+        split_models.append((new_sd, rank, new_alpha))
+        rank += unit
+
+    return max_rank, split_models
+
+
+def split(args):
+    logger.info("loading Model...")
+    lora_sd, metadata = load_state_dict(args.model)
+
+    logger.info("Splitting Model...")
+    original_rank, split_models = split_lora_model(lora_sd, args.unit)
+
+    comment = metadata.get("ss_training_comment", "") if metadata is not None else ""
+    for state_dict, new_rank, new_alpha in split_models:
+        # update metadata
+        if metadata is None:
+            new_metadata = {}
+        else:
+            new_metadata = metadata.copy()
+
+        new_metadata["ss_training_comment"] = f"split from DyLoRA, rank {original_rank} to {new_rank}; {comment}"
+        new_metadata["ss_network_dim"] = str(new_rank)
+        # new_metadata["ss_network_alpha"] = str(new_alpha.float().numpy())
+
+        model_hash, legacy_hash = train_util.precalculate_safetensors_hashes(state_dict, new_metadata)
+        new_metadata["sshs_model_hash"] = model_hash
+        new_metadata["sshs_legacy_hash"] = legacy_hash
+
+        filename, ext = os.path.splitext(args.save_to)
+        model_file_name = filename + f"-{new_rank:04d}{ext}"
+
+        logger.info(f"saving model to: {model_file_name}")
+        save_to_file(model_file_name, state_dict, new_metadata)
+
+
+def setup_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument("--unit", type=int, default=None, help="size of rank to split into")
+    parser.add_argument(
+        "--save_to",
+        type=str,
+        default=None,
+        help="destination base file name: ckpt or safetensors file",
+    )
+    parser.add_argument(
+        "--model",
+        type=str,
+        default=None,
+        help="DyLoRA model to load and split: ckpt or safetensors file",
+    )
+
+    return parser
+
+
+if __name__ == "__main__":
+    parser = setup_parser()
+
+    args = parser.parse_args()
+    split(args)