Merge pull request #1067 from kohya-ss/dev

Dev
update readme
2026-04-06 21:52:27 +00:00 · 2024-01-23 21:04:16 +09:00 · 2024-01-23 21:02:40 +09:00 · 2024-01-23 20:33:55 +09:00 · 2024-01-23 20:25:37 +09:00 · 2024-01-23 20:17:19 +09:00
30 changed files with 1383 additions and 1011 deletions
--- a/README.md
+++ b/README.md
@@ -249,98 +249,101 @@ ControlNet-LLLite, a novel method for ControlNet with SDXL, is added. See [docum

 ## Change History

-### Dec 24, 2023 / 2023/12/24
+### Jan 23, 2024 / 2024/1/23: v0.8.2

- Fixed to work `tools/convert_diffusers20_original_sd.py`. Thanks to Disty0! PR [#1016](https://github.com/kohya-ss/sd-scripts/pull/1016)
+- [Experimental] The `--fp8_base` option is added to the training scripts for LoRA etc. The base model (U-Net, and Text Encoder when training modules for Text Encoder) can be trained with fp8. PR [#1057](https://github.com/kohya-ss/sd-scripts/pull/1057) Thanks to KohakuBlueleaf!
+  - Please specify `--fp8_base` in `train_network.py` or `sdxl_train_network.py`.
+  - PyTorch 2.1 or later is required.
+  - If you use xformers with PyTorch 2.1, please see [xformers repository](https://github.com/facebookresearch/xformers) and install the appropriate version according to your CUDA version.
+  - The sample image generation during training consumes a lot of memory. It is recommended to turn it off.

- `tools/convert_diffusers20_original_sd.py` が動かなくなっていたのが修正されました。Disty0 氏に感謝します。 PR [#1016](https://github.com/kohya-ss/sd-scripts/pull/1016) 
+- [Experimental] The network multiplier can be specified for each dataset in the training scripts for LoRA etc.
+  - This is an experimental option and may be removed or changed in the future.
+  - For example, if you train with state A as `1.0` and state B as `-1.0`, you may be able to generate by switching between state A and B depending on the LoRA application rate.
+  - Also, if you prepare five states and train them as `0.2`, `0.4`, `0.6`, `0.8`, and `1.0`, you may be able to generate by switching the states smoothly depending on the application rate.
+  - Please specify `network_multiplier` in `[[datasets]]` in `.toml` file.
+- Some options are added to `networks/extract_lora_from_models.py` to reduce the memory usage.
+  - `--load_precision` option can be used to specify the precision when loading the model. If the model is saved in fp16, you can reduce the memory usage by specifying `--load_precision fp16` without losing precision.
+  - `--load_original_model_to` option can be used to specify the device to load the original model. `--load_tuned_model_to` option can be used to specify the device to load the derived model. The default is `cpu` for both options, but you can specify `cuda` etc. You can reduce the memory usage by loading one of them to GPU. This option is available only for SDXL.
+
+- The gradient synchronization in LoRA training with multi-GPU is improved. PR [#1064](https://github.com/kohya-ss/sd-scripts/pull/1064) Thanks to KohakuBlueleaf!
+- The code for Intel IPEX support is improved. PR [#1060](https://github.com/kohya-ss/sd-scripts/pull/1060) Thanks to akx!
+- Fixed a bug in multi-GPU Textual Inversion training.
+
+- （実験的）　LoRA等の学習スクリプトで、ベースモデル（U-Net、および Text Encoder のモジュール学習時は Text Encoder も）の重みを fp8 にして学習するオプションが追加されました。 PR [#1057](https://github.com/kohya-ss/sd-scripts/pull/1057) KohakuBlueleaf 氏に感謝します。
+  - `train_network.py` または `sdxl_train_network.py` で `--fp8_base` を指定してください。
+  - PyTorch 2.1 以降が必要です。
+  - PyTorch 2.1 で xformers を使用する場合は、[xformers のリポジトリ](https://github.com/facebookresearch/xformers) を参照し、CUDA バージョンに応じて適切なバージョンをインストールしてください。
+  - 学習中のサンプル画像生成はメモリを大量に消費するため、オフにすることをお勧めします。
+- (実験的)　LoRA 等の学習で、データセットごとに異なるネットワーク適用率を指定できるようになりました。 
+  - 実験的オプションのため、将来的に削除または仕様変更される可能性があります。
+  - たとえば状態 A を `1.0`、状態 B を `-1.0` として学習すると、LoRA の適用率に応じて状態 A と B を切り替えつつ生成できるかもしれません。
+  - また、五段階の状態を用意し、それぞれ `0.2`、`0.4`、`0.6`、`0.8`、`1.0` として学習すると、適用率でなめらかに状態を切り替えて生成できるかもしれません。 
+  - `.toml` ファイルで `[[datasets]]` に `network_multiplier` を指定してください。
+- `networks/extract_lora_from_models.py` に使用メモリ量を削減するいくつかのオプションを追加しました。 
+  - `--load_precision` で読み込み時の精度を指定できます。モデルが fp16 で保存されている場合は `--load_precision fp16` を指定して精度を変えずにメモリ量を削減できます。
+  - `--load_original_model_to` で元モデルを読み込むデバイスを、`--load_tuned_model_to` で派生モデルを読み込むデバイスを指定できます。デフォルトは両方とも `cpu` ですがそれぞれ `cuda` 等を指定できます。片方を GPU に読み込むことでメモリ量を削減できます。SDXL の場合のみ有効です。
+- マルチ GPU での LoRA 等の学習時に勾配の同期が改善されました。 PR [#1064](https://github.com/kohya-ss/sd-scripts/pull/1064) KohakuBlueleaf 氏に感謝します。
+- Intel IPEX サポートのコードが改善されました。PR [#1060](https://github.com/kohya-ss/sd-scripts/pull/1060) akx 氏に感謝します。
+- マルチ GPU での Textual Inversion 学習の不具合を修正しました。
+
+- `.toml` example for network multiplier / ネットワーク適用率の `.toml` の記述例
+
+```toml
+[general]
+[[datasets]]
+resolution = 512
+batch_size = 8
+network_multiplier = 1.0
+
+... subset settings ...
+
+[[datasets]]
+resolution = 512
+batch_size = 8
+network_multiplier = -1.0
+
+... subset settings ...
+```


-### Dec 21, 2023 / 2023/12/21
+### Jan 17, 2024 / 2024/1/17: v0.8.1

- The issues in multi-GPU training are fixed. Thanks to Isotr0py! PR [#989](https://github.com/kohya-ss/sd-scripts/pull/989) and [#1000](https://github.com/kohya-ss/sd-scripts/pull/1000)
-  - `--ddp_gradient_as_bucket_view` and `--ddp_bucket_view`options are added to `sdxl_train.py`. Please specify these options for multi-GPU training.
- IPEX support is updated. Thanks to Disty0!
- Fixed the bug that the size of the bucket becomes less than `min_bucket_reso`. Thanks to Cauldrath! PR [#1008](https://github.com/kohya-ss/sd-scripts/pull/1008)
- `--sample_at_first` option is added to each training script. This option is useful to generate images at the first step, before training. Thanks to shirayu! PR [#907](https://github.com/kohya-ss/sd-scripts/pull/907)
- `--ss` option is added to the sampling prompt in training. You can specify the scheduler for the sampling like `--ss euler_a`. Thanks to shirayu! PR [#906](https://github.com/kohya-ss/sd-scripts/pull/906)
- `keep_tokens_separator` is added to the dataset config. This option is useful to keep (prevent from shuffling) the tokens in the captions. See [#975](https://github.com/kohya-ss/sd-scripts/pull/975) for details. Thanks to Linaqruf!
-  - You can specify the separator with an option like `--keep_tokens_separator "|||"` or with `keep_tokens_separator: "|||"` in `.toml`. The tokens before `|||` are not shuffled.
- Attention processor hook is added. See [#961](https://github.com/kohya-ss/sd-scripts/pull/961) for details. Thanks to rockerBOO!
- The optimizer `PagedAdamW` is added. Thanks to xzuyn! PR [#955](https://github.com/kohya-ss/sd-scripts/pull/955)
- NaN replacement in SDXL VAE is sped up. Thanks to liubo0902! PR [#1009](https://github.com/kohya-ss/sd-scripts/pull/1009)
- Fixed the path error in `finetune/make_captions.py`. Thanks to CjangCjengh! PR [#986](https://github.com/kohya-ss/sd-scripts/pull/986)
+- Fixed a bug where VRAM usage was larger than before when the Text Encoder is not trained, in the training scripts for LoRA etc. (`train_network.py`, `sdxl_train_network.py`).
+  - Text Encoders were not moved to CPU.
+- Fixed typos. Thanks to akx! [PR #1053](https://github.com/kohya-ss/sd-scripts/pull/1053)

- マルチGPUでの学習の不具合を修正しました。Isotr0py 氏に感謝します。 PR [#989](https://github.com/kohya-ss/sd-scripts/pull/989) および [#1000](https://github.com/kohya-ss/sd-scripts/pull/1000)
-  - `sdxl_train.py` に `--ddp_gradient_as_bucket_view` と `--ddp_bucket_view` オプションが追加されました。マルチGPUでの学習時にはこれらのオプションを指定してください。
- IPEX サポートが更新されました。Disty0 氏に感謝します。
- Aspect Ratio Bucketing で bucket のサイズが `min_bucket_reso` 未満になる不具合を修正しました。Cauldrath 氏に感謝します。 PR [#1008](https://github.com/kohya-ss/sd-scripts/pull/1008)
- 各学習スクリプトに `--sample_at_first` オプションが追加されました。学習前に画像を生成することで、学習結果が比較しやすくなります。shirayu 氏に感謝します。 PR [#907](https://github.com/kohya-ss/sd-scripts/pull/907)
- 学習時のプロンプトに `--ss` オプションが追加されました。`--ss euler_a` のようにスケジューラを指定できます。shirayu 氏に感謝します。 PR [#906](https://github.com/kohya-ss/sd-scripts/pull/906)
- データセット設定に `keep_tokens_separator` が追加されました。キャプション内のトークンをどの位置までシャッフルしないかを指定できます。詳細は [#975](https://github.com/kohya-ss/sd-scripts/pull/975) を参照してください。Linaqruf 氏に感謝します。
-  - オプションで `--keep_tokens_separator "|||"` のように指定するか、`.toml` で `keep_tokens_separator: "|||"` のように指定します。`|||` の前のトークンはシャッフルされません。
- Attention processor hook が追加されました。詳細は [#961](https://github.com/kohya-ss/sd-scripts/pull/961) を参照してください。rockerBOO 氏に感謝します。
- オプティマイザ `PagedAdamW` が追加されました。xzuyn 氏に感謝します。 PR [#955](https://github.com/kohya-ss/sd-scripts/pull/955)
- 学習時、SDXL VAE で NaN が発生した時の置き換えが高速化されました。liubo0902 氏に感謝します。 PR [#1009](https://github.com/kohya-ss/sd-scripts/pull/1009)
- `finetune/make_captions.py` で相対パス指定時のエラーが修正されました。CjangCjengh 氏に感謝します。 PR [#986](https://github.com/kohya-ss/sd-scripts/pull/986)
+- LoRA 等の学習スクリプト（`train_network.py`、`sdxl_train_network.py`）で、Text Encoder を学習しない場合の VRAM 使用量が以前に比べて大きくなっていた不具合を修正しました。 
+  - Text Encoder が GPU に保持されたままになっていました。
+- 誤字が修正されました。 [PR #1053](https://github.com/kohya-ss/sd-scripts/pull/1053) akx 氏に感謝します。

-### Dec 3, 2023 / 2023/12/3
+### Jan 15, 2024 / 2024/1/15: v0.8.0

- `finetune\tag_images_by_wd14_tagger.py` now supports the separator other than `,` with `--caption_separator` option. Thanks to KohakuBlueleaf! PR [#913](https://github.com/kohya-ss/sd-scripts/pull/913)
- Min SNR Gamma with V-predicition (SD 2.1) is fixed. Thanks to feffy380! PR[#934](https://github.com/kohya-ss/sd-scripts/pull/934)
-  - See [#673](https://github.com/kohya-ss/sd-scripts/issues/673) for details.
- `--min_diff` and `--clamp_quantile` options are added to `networks/extract_lora_from_models.py`. Thanks to wkpark! PR [#936](https://github.com/kohya-ss/sd-scripts/pull/936)
-  - The default values are same as the previous version.
- Deep Shrink hires fix is supported in `sdxl_gen_img.py` and `gen_img_diffusers.py`.
-  - `--ds_timesteps_1` and `--ds_timesteps_2` options denote the timesteps of the Deep Shrink for the first and second stages.
-  - `--ds_depth_1` and `--ds_depth_2` options denote the depth (block index) of the Deep Shrink for the first and second stages.
-  - `--ds_ratio` option denotes the ratio of the Deep Shrink. `0.5` means the half of the original latent size for the Deep Shrink.
-  - `--dst1`, `--dst2`, `--dsd1`, `--dsd2` and `--dsr` prompt options are also available.
+- Diffusers, Accelerate, Transformers and other related libraries have been updated. Please update the libraries with [Upgrade](#upgrade).
+  - Some model files (Text Encoder without position_id) based on the latest Transformers can be loaded.
+- `torch.compile` is supported (experimental). PR [#1024](https://github.com/kohya-ss/sd-scripts/pull/1024) Thanks to p1atdev!
+  - This feature works only on Linux or WSL.
+  - Please specify `--torch_compile` option in each training script.
+  - You can select the backend with `--dynamo_backend` option. The default is `"inductor"`. `inductor` or `eager` seems to work.
+  - Please use `--sdpa` option instead of `--xformers` option.
+  - PyTorch 2.1 or later is recommended.
+  - Please see [PR](https://github.com/kohya-ss/sd-scripts/pull/1024) for details.
+- The session name for wandb can be specified with `--wandb_run_name` option. PR [#1032](https://github.com/kohya-ss/sd-scripts/pull/1032) Thanks to hopl1t!
+- IPEX library is updated. PR [#1030](https://github.com/kohya-ss/sd-scripts/pull/1030) Thanks to Disty0!
+- Fixed a bug where models could not be saved in Diffusers format.

- `finetune\tag_images_by_wd14_tagger.py` で `--caption_separator` オプションでカンマ以外の区切り文字を指定できるようになりました。KohakuBlueleaf 氏に感謝します。 PR [#913](https://github.com/kohya-ss/sd-scripts/pull/913)
- V-predicition (SD 2.1) での Min SNR Gamma が修正されました。feffy380 氏に感謝します。 PR[#934](https://github.com/kohya-ss/sd-scripts/pull/934)
-  - 詳細は [#673](https://github.com/kohya-ss/sd-scripts/issues/673) を参照してください。
- `networks/extract_lora_from_models.py` に `--min_diff` と `--clamp_quantile` オプションが追加されました。wkpark 氏に感謝します。 PR [#936](https://github.com/kohya-ss/sd-scripts/pull/936)
-  - デフォルト値は前のバージョンと同じです。
- `sdxl_gen_img.py` と `gen_img_diffusers.py` で Deep Shrink hires fix をサポートしました。
-  - `--ds_timesteps_1` と `--ds_timesteps_2` オプションは Deep Shrink の第一段階と第二段階の timesteps を指定します。
-  - `--ds_depth_1` と `--ds_depth_2` オプションは Deep Shrink の第一段階と第二段階の深さ（ブロックの index）を指定します。
-  - `--ds_ratio` オプションは Deep Shrink の比率を指定します。`0.5` を指定すると Deep Shrink 適用時の latent は元のサイズの半分になります。
-  - `--dst1`、`--dst2`、`--dsd1`、`--dsd2`、`--dsr` プロンプトオプションも使用できます。
-
-### Nov 5, 2023 / 2023/11/5
-
- `sdxl_train.py` now supports different learning rates for each Text Encoder.
-  - Example:
-    - `--learning_rate 1e-6`: train U-Net only
-    - `--train_text_encoder --learning_rate 1e-6`: train U-Net and two Text Encoders with the same learning rate (same as the previous version)
-    - `--train_text_encoder --learning_rate 1e-6 --learning_rate_te1 1e-6 --learning_rate_te2 1e-6`: train U-Net and two Text Encoders with the different learning rates
-    - `--train_text_encoder --learning_rate 0 --learning_rate_te1 1e-6 --learning_rate_te2 1e-6`: train two Text Encoders only 
-    - `--train_text_encoder --learning_rate 1e-6 --learning_rate_te1 1e-6 --learning_rate_te2 0`: train U-Net and one Text Encoder only
-    - `--train_text_encoder --learning_rate 0 --learning_rate_te1 0 --learning_rate_te2 1e-6`: train one Text Encoder only
-
- `train_db.py` and `fine_tune.py` now support different learning rates for Text Encoder. Specify with `--learning_rate_te` option. 
-  - To train Text Encoder with `fine_tune.py`, specify `--train_text_encoder` option too. `train_db.py` trains Text Encoder by default.
-
- Fixed the bug that Text Encoder is not trained when block lr is specified in `sdxl_train.py`.
-
- Debiased Estimation loss is added to each training script. Thanks to sdbds!
-  - Specify `--debiased_estimation_loss` option to enable it. See PR [#889](https://github.com/kohya-ss/sd-scripts/pull/889) for details.
- Training of Text Encoder is improved in `train_network.py` and `sdxl_train_network.py`. Thanks to KohakuBlueleaf! PR [#895](https://github.com/kohya-ss/sd-scripts/pull/895)
- The moving average of the loss is now displayed in the progress bar in each training script. Thanks to shirayu! PR [#899](https://github.com/kohya-ss/sd-scripts/pull/899)
- PagedAdamW32bit optimizer is supported. Specify `--optimizer_type=PagedAdamW32bit`. Thanks to xzuyn! PR [#900](https://github.com/kohya-ss/sd-scripts/pull/900)
- Other bug fixes and improvements.
-
- `sdxl_train.py` で、二つのText Encoderそれぞれに独立した学習率が指定できるようになりました。サンプルは上の英語版を参照してください。
- `train_db.py` および `fine_tune.py` で Text Encoder に別の学習率を指定できるようになりました。`--learning_rate_te` オプションで指定してください。
-  - `fine_tune.py` で Text Encoder を学習するには `--train_text_encoder` オプションをあわせて指定してください。`train_db.py` はデフォルトで学習します。
- `sdxl_train.py` で block lr を指定すると Text Encoder が学習されない不具合を修正しました。
- Debiased Estimation loss が各学習スクリプトに追加されました。sdbsd 氏に感謝します。
-  - `--debiased_estimation_loss` を指定すると有効になります。詳細は PR [#889](https://github.com/kohya-ss/sd-scripts/pull/889) を参照してください。
- `train_network.py` と `sdxl_train_network.py` でText Encoderの学習が改善されました。KohakuBlueleaf 氏に感謝します。 PR [#895](https://github.com/kohya-ss/sd-scripts/pull/895)
- 各学習スクリプトで移動平均のlossがプログレスバーに表示されるようになりました。shirayu 氏に感謝します。 PR [#899](https://github.com/kohya-ss/sd-scripts/pull/899)
- PagedAdamW32bit オプティマイザがサポートされました。`--optimizer_type=PagedAdamW32bit` と指定してください。xzuyn 氏に感謝します。 PR [#900](https://github.com/kohya-ss/sd-scripts/pull/900)
- その他のバグ修正と改善。
+- Diffusers、Accelerate、Transformers 等の関連ライブラリを更新しました。[Upgrade](#upgrade) を参照し更新をお願いします。
+  - 最新の Transformers を前提とした一部のモデルファイル（Text Encoder が position_id を持たないもの）が読み込めるようになりました。
+- `torch.compile` がサポートされました（実験的）。 PR [#1024](https://github.com/kohya-ss/sd-scripts/pull/1024) p1atdev 氏に感謝します。
+  - Linux または WSL でのみ動作します。
+  - 各学習スクリプトで `--torch_compile` オプションを指定してください。
+  - `--dynamo_backend` オプションで使用される backend を選択できます。デフォルトは `"inductor"` です。 `inductor` または `eager` が動作するようです。
+  - `--xformers` オプションとは互換性がありません。 代わりに `--sdpa` オプションを使用してください。
+  - PyTorch 2.1以降を推奨します。
+  - 詳細は [PR](https://github.com/kohya-ss/sd-scripts/pull/1024) をご覧ください。
+- wandb 保存時のセッション名が各学習スクリプトの `--wandb_run_name` オプションで指定できるようになりました。 PR [#1032](https://github.com/kohya-ss/sd-scripts/pull/1032) hopl1t 氏に感謝します。
+- IPEX ライブラリが更新されました。[PR #1030](https://github.com/kohya-ss/sd-scripts/pull/1030) Disty0 氏に感謝します。
+- Diffusers 形式でのモデル保存ができなくなっていた不具合を修正しました。


 Please read [Releases](https://github.com/kohya-ss/sd-scripts/releases) for recent updates.
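
Note on the network multiplier entry in the README hunk above: one way to read it is that the LoRA contribution is scaled by the per-dataset multiplier during training, so a dataset trained at `-1.0` pushes the adapter in the opposite direction from one trained at `1.0`. The following is a minimal pure-Python sketch of that scaling for a single linear layer. `matvec` and `lora_forward` are hypothetical names, not functions from sd-scripts, and the alpha/rank scale that real LoRA modules apply is left out for brevity.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: Vector, base: Matrix, down: Matrix, up: Matrix, multiplier: float) -> Vector:
    """Base layer output plus the LoRA delta scaled by `multiplier` (illustrative only)."""
    base_out = matvec(base, x)
    delta = matvec(up, matvec(down, x))
    return [b + multiplier * d for b, d in zip(base_out, delta)]


base = [[1.0, 0.0], [0.0, 1.0]]  # identity base weight
down = [[1.0, 1.0]]              # rank-1 down projection: 2 -> 1
up = [[0.5], [-0.5]]             # rank-1 up projection: 1 -> 2
x = [2.0, 4.0]

state_a = lora_forward(x, base, down, up, 1.0)   # dataset with network_multiplier = 1.0
state_b = lora_forward(x, base, down, up, -1.0)  # dataset with network_multiplier = -1.0
neutral = lora_forward(x, base, down, up, 0.0)   # LoRA contribution switched off
```

With the same weights, the two multipliers move the output in opposite directions around the base output, which is the property the README suggests exploiting to switch between two states at generation time.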
--- a/XTI_hijack.py
+++ b/XTI_hijack.py
@@ -1,11 +1,7 @@
 import torch
-try:
-    import intel_extension_for_pytorch as ipex
-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
-        ipex_init()
-except Exception:
-    pass
+from library.ipex_interop import init_ipex
+
+init_ipex()
 from typing import Union, List, Optional, Dict, Any, Tuple
 from diffusers.models.unet_2d_condition import UNet2DConditionOutput
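
Note: `library/ipex_interop.py` itself is not shown in this part of the diff. Judging only from the `try`/`except` block removed here, the helper plausibly wraps the same logic in one function. The sketch below is an assumption reconstructed from those removed lines, not the actual file; the boolean return value is an addition for illustration.

```python
def init_ipex() -> bool:
    """Try to set up Intel IPEX; return True only if the XPU backend was initialized.

    Any failure (missing torch, missing intel_extension_for_pytorch, no XPU device)
    is swallowed, mirroring the `except Exception: pass` in the removed code.
    """
    try:
        import torch
        import intel_extension_for_pytorch as ipex  # noqa: F401  (imported for its side effects)

        if torch.xpu.is_available():
            from library.ipex import ipex_init  # module path taken from the removed lines

            ipex_init()
            return True
    except Exception:
        pass
    return False


# Safe to call on machines without IPEX: it simply reports False.
ipex_ready = init_ipex()
```

Centralizing the guard this way means each entry script needs only the import and one call, which matches the three added lines in this hunk.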

--- a/fine_tune.py
+++ b/fine_tune.py
@@ -11,15 +11,10 @@ import toml
 from tqdm import tqdm
 import torch

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler

@@ -291,6 +286,8 @@ def train(args):

    if accelerator.is_main_process:
        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs['wandb'] = {'name': args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers("finetuning" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs)
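
Review note: in the hunk as shown, `init_kwargs` is reassigned from `toml.load(args.log_tracker_config)` when a tracker config is given, so a `wandb` name set two lines earlier would be replaced in that case. One possible way to keep both is to merge them. The sketch below assumes the tracker config has already been loaded into a dict whose tracker entries are tables; `build_init_kwargs` is a hypothetical helper, not part of the repository, and letting the command-line name take precedence is a design choice made here for illustration.

```python
from typing import Any, Dict, Optional


def build_init_kwargs(
    wandb_run_name: Optional[str],
    tracker_config: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge a loaded tracker config with an optional wandb run name (illustrative)."""
    init_kwargs: Dict[str, Any] = {}
    if tracker_config is not None:
        # Shallow-copy each table so the caller's dict is not mutated.
        init_kwargs = {
            key: dict(value) if isinstance(value, dict) else value
            for key, value in tracker_config.items()
        }
    if wandb_run_name:
        # Add or override only the run name, keeping other wandb settings.
        init_kwargs.setdefault("wandb", {})["name"] = wandb_run_name
    return init_kwargs


only_name = build_init_kwargs("run-1", None)
merged = build_init_kwargs("run-1", {"wandb": {"project": "demo"}})
empty = build_init_kwargs(None, None)
```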
--- a/gen_img_diffusers.py
+++ b/gen_img_diffusers.py
@@ -66,15 +66,10 @@ import diffusers
 import numpy as np
 import torch

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 import torchvision
 from diffusers import (
    AutoencoderKL,
--- a/library/config_util.py
+++ b/library/config_util.py
--- a/library/ipex/init.py
+++ b/library/ipex/init.py
@@ -140,6 +140,7 @@ def ipex_init(): # pylint: disable=too-many-statements

        # C
        torch._C._cuda_getCurrentRawStream = ipex._C._getCurrentStream
+        ipex._C._DeviceProperties.multi_processor_count = ipex._C._DeviceProperties.gpu_eu_count
        ipex._C._DeviceProperties.major = 2023
        ipex._C._DeviceProperties.minor = 2

--- a/library/ipex/attention.py
+++ b/library/ipex/attention.py
@@ -1,41 +1,98 @@
+import os
 import torch
 import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
+from functools import cache

 # pylint: disable=protected-access, missing-function-docstring, line-too-long

-original_torch_bmm = torch.bmm
-def torch_bmm_32_bit(input, mat2, *, out=None):
-    # ARC GPUs can't allocate more than 4GB to a single block, Slice it:
-    batch_size_attention, input_tokens, mat2_shape = input.shape[0], input.shape[1], mat2.shape[2]
-    block_multiply = input.element_size()
-    slice_block_size = input_tokens * mat2_shape / 1024 / 1024 * block_multiply
+# ARC GPUs can't allocate more than 4GB to a single block so we slice the attention layers
+
+sdpa_slice_trigger_rate = float(os.environ.get('IPEX_SDPA_SLICE_TRIGGER_RATE', 4))
+attention_slice_rate = float(os.environ.get('IPEX_ATTENTION_SLICE_RATE', 4))
+
+# Find something divisible with the input_tokens
+@cache
+def find_slice_size(slice_size, slice_block_size):
+    while (slice_size * slice_block_size) > attention_slice_rate:
+        slice_size = slice_size // 2
+        if slice_size <= 1:
+            slice_size = 1
+            break
+    return slice_size
+
+# Find slice sizes for SDPA
+@cache
+def find_sdpa_slice_sizes(query_shape, query_element_size):
+    if len(query_shape) == 3:
+        batch_size_attention, query_tokens, shape_three = query_shape
+        shape_four = 1
+    else:
+        batch_size_attention, query_tokens, shape_three, shape_four = query_shape
+
+    slice_block_size = query_tokens * shape_three * shape_four / 1024 / 1024 * query_element_size
    block_size = batch_size_attention * slice_block_size

    split_slice_size = batch_size_attention
-    if block_size > 4:
-        do_split = True
-        # Find something divisible with the input_tokens
-        while (split_slice_size * slice_block_size) > 4:
-            split_slice_size = split_slice_size // 2
-            if split_slice_size <= 1:
-                split_slice_size = 1
-                break
-        split_2_slice_size = input_tokens
-        if split_slice_size * slice_block_size > 4:
-            slice_block_size_2 = split_slice_size * mat2_shape / 1024 / 1024 * block_multiply
-            do_split_2 = True
-            # Find something divisible with the input_tokens
-            while (split_2_slice_size * slice_block_size_2) > 4:
-                split_2_slice_size = split_2_slice_size // 2
-                if split_2_slice_size <= 1:
-                    split_2_slice_size = 1
-                    break
-        else:
-            do_split_2 = False
-    else:
-        do_split = False
+    split_2_slice_size = query_tokens
+    split_3_slice_size = shape_three

+    do_split = False
+    do_split_2 = False
+    do_split_3 = False
+
+    if block_size > sdpa_slice_trigger_rate:
+        do_split = True
+        split_slice_size = find_slice_size(split_slice_size, slice_block_size)
+        if split_slice_size * slice_block_size > attention_slice_rate:
+            slice_2_block_size = split_slice_size * shape_three * shape_four / 1024 / 1024 * query_element_size
+            do_split_2 = True
+            split_2_slice_size = find_slice_size(split_2_slice_size, slice_2_block_size)
+            if split_2_slice_size * slice_2_block_size > attention_slice_rate:
+                slice_3_block_size = split_slice_size * split_2_slice_size * shape_four / 1024 / 1024 * query_element_size
+                do_split_3 = True
+                split_3_slice_size = find_slice_size(split_3_slice_size, slice_3_block_size)
+
+    return do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size
+
+# Find slice sizes for BMM
+@cache
+def find_bmm_slice_sizes(input_shape, input_element_size, mat2_shape):
+    batch_size_attention, input_tokens, mat2_atten_shape = input_shape[0], input_shape[1], mat2_shape[2]
+    slice_block_size = input_tokens * mat2_atten_shape / 1024 / 1024 * input_element_size
+    block_size = batch_size_attention * slice_block_size
+
+    split_slice_size = batch_size_attention
+    split_2_slice_size = input_tokens
+    split_3_slice_size = mat2_atten_shape
+
+    do_split = False
+    do_split_2 = False
+    do_split_3 = False
+
+    if block_size > attention_slice_rate:
+        do_split = True
+        split_slice_size = find_slice_size(split_slice_size, slice_block_size)
+        if split_slice_size * slice_block_size > attention_slice_rate:
+            slice_2_block_size = split_slice_size * mat2_atten_shape / 1024 / 1024 * input_element_size
+            do_split_2 = True
+            split_2_slice_size = find_slice_size(split_2_slice_size, slice_2_block_size)
+            if split_2_slice_size * slice_2_block_size > attention_slice_rate:
+                slice_3_block_size = split_slice_size * split_2_slice_size / 1024 / 1024 * input_element_size
+                do_split_3 = True
+                split_3_slice_size = find_slice_size(split_3_slice_size, slice_3_block_size)
+
+    return do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size
+
+
+original_torch_bmm = torch.bmm
+def torch_bmm_32_bit(input, mat2, *, out=None):
+    if input.device.type != "xpu":
+        return original_torch_bmm(input, mat2, out=out)
+    do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_bmm_slice_sizes(input.shape, input.element_size(), mat2.shape)
+
+    # Slice BMM
    if do_split:
+        batch_size_attention, input_tokens, mat2_atten_shape = input.shape[0], input.shape[1], mat2.shape[2]
        hidden_states = torch.zeros(input.shape[0], input.shape[1], mat2.shape[2], device=input.device, dtype=input.dtype)
        for i in range(batch_size_attention // split_slice_size):
            start_idx = i * split_slice_size
@@ -44,11 +101,21 @@ def torch_bmm_32_bit(input, mat2, *, out=None):
                for i2 in range(input_tokens // split_2_slice_size): # pylint: disable=invalid-name
                    start_idx_2 = i2 * split_2_slice_size
                    end_idx_2 = (i2 + 1) * split_2_slice_size
-                    hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = original_torch_bmm(
-                        input[start_idx:end_idx, start_idx_2:end_idx_2],
-                        mat2[start_idx:end_idx, start_idx_2:end_idx_2],
-                        out=out
-                    )
+                    if do_split_3:
+                        for i3 in range(mat2_atten_shape // split_3_slice_size): # pylint: disable=invalid-name
+                            start_idx_3 = i3 * split_3_slice_size
+                            end_idx_3 = (i3 + 1) * split_3_slice_size
+                            hidden_states[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] = original_torch_bmm(
+                                input[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
+                                mat2[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
+                                out=out
+                            )
+                    else:
+                        hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = original_torch_bmm(
+                            input[start_idx:end_idx, start_idx_2:end_idx_2],
+                            mat2[start_idx:end_idx, start_idx_2:end_idx_2],
+                            out=out
+                        )
            else:
                hidden_states[start_idx:end_idx] = original_torch_bmm(
                    input[start_idx:end_idx],
@@ -61,54 +128,13 @@ def torch_bmm_32_bit(input, mat2, *, out=None):

 original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention
 def scaled_dot_product_attention_32_bit(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False):
-    # ARC GPUs can't allocate more than 4GB to a single block, Slice it:
-    if len(query.shape) == 3:
-        batch_size_attention, query_tokens, shape_three = query.shape
-        shape_four = 1
-    else:
-        batch_size_attention, query_tokens, shape_three, shape_four = query.shape
-
-    block_multiply = query.element_size()
-    slice_block_size = query_tokens * shape_three * shape_four / 1024 / 1024 * block_multiply
-    block_size = batch_size_attention * slice_block_size
-
-    split_slice_size = batch_size_attention
-    if block_size > 4:
-        do_split = True
-        # Find something divisible with the batch_size_attention
-        while (split_slice_size * slice_block_size) > 4:
-            split_slice_size = split_slice_size // 2
-            if split_slice_size <= 1:
-                split_slice_size = 1
-                break
-        split_2_slice_size = query_tokens
-        if split_slice_size * slice_block_size > 4:
-            slice_block_size_2 = split_slice_size * shape_three * shape_four / 1024 / 1024 * block_multiply
-            do_split_2 = True
-            # Find something divisible with the query_tokens
-            while (split_2_slice_size * slice_block_size_2) > 4:
-                split_2_slice_size = split_2_slice_size // 2
-                if split_2_slice_size <= 1:
-                    split_2_slice_size = 1
-                    break
-            split_3_slice_size = shape_three
-            if split_2_slice_size * slice_block_size_2 > 4:
-                slice_block_size_3 = split_slice_size * split_2_slice_size * shape_four / 1024 / 1024 * block_multiply
-                do_split_3 = True
-                # Find something divisible with the shape_three
-                while (split_3_slice_size * slice_block_size_3) > 4:
-                    split_3_slice_size = split_3_slice_size // 2
-                    if split_3_slice_size <= 1:
-                        split_3_slice_size = 1
-                        break
-            else:
-                do_split_3 = False
-        else:
-            do_split_2 = False
-    else:
-        do_split = False
+    if query.device.type != "xpu":
+        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)
+    do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_sdpa_slice_sizes(query.shape, query.element_size())

+    # Slice SDPA
    if do_split:
+        batch_size_attention, query_tokens, shape_three = query.shape[0], query.shape[1], query.shape[2]
        hidden_states = torch.zeros(query.shape, device=query.device, dtype=query.dtype)
        for i in range(batch_size_attention // split_slice_size):
            start_idx = i * split_slice_size
@@ -145,7 +171,5 @@ def scaled_dot_product_attention_32_bit(query, key, value, attn_mask=None, dropo
                    dropout_p=dropout_p, is_causal=is_causal
                )
    else:
-        return original_scaled_dot_product_attention(
-            query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal
-        )
+        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)
    return hidden_states
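
Note: the halving search that this file now shares between the BMM and SDPA paths can be traced by hand. The sketch below is a standalone restatement of `find_slice_size` for experimentation, with two deviations that are assumptions made here: the threshold is passed as an explicit `rate` argument instead of being read from the module-level `attention_slice_rate`, and `functools.lru_cache` is used in place of `functools.cache`. `bmm_block_size` is a hypothetical helper that repeats the size arithmetic from `find_bmm_slice_sizes`.

```python
import os
from functools import lru_cache
from typing import Tuple

# Same default as in the diff; can be overridden through the environment.
attention_slice_rate = float(os.environ.get("IPEX_ATTENTION_SLICE_RATE", 4))


@lru_cache(maxsize=None)
def find_slice_size(slice_size: int, slice_block_size: float, rate: float) -> int:
    """Halve slice_size until slice_size * slice_block_size fits under rate, or 1 is reached."""
    while slice_size * slice_block_size > rate:
        slice_size //= 2
        if slice_size <= 1:
            slice_size = 1
            break
    return slice_size


def bmm_block_size(
    input_shape: Tuple[int, ...], element_size: int, mat2_shape: Tuple[int, ...]
) -> Tuple[float, float]:
    """Return (block_size, slice_block_size) using the same arithmetic as the BMM path."""
    batch, tokens, out_dim = input_shape[0], input_shape[1], mat2_shape[2]
    slice_block_size = tokens * out_dim / 1024 / 1024 * element_size
    return batch * slice_block_size, slice_block_size


# fp16 (2-byte) batch of 16 with 1024 tokens and a 1024-wide output.
block_size, per_item = bmm_block_size((16, 1024, 64), 2, (16, 64, 1024))
first_split = find_slice_size(16, per_item, 4.0)  # 16 -> 8 -> 4 -> 2
forced_to_one = find_slice_size(8, 32.0, 4.0)     # 8 -> 4 -> 2 -> 1, still over the rate
```

When the first search bottoms out at 1 and the product is still above the rate, as in the second call, the code in the diff moves on to splitting the next dimension. Because the results are cached, the arguments must be hashable, which is presumably why the refactored functions take shapes and element sizes rather than tensors.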
--- a/library/ipex/diffusers.py
+++ b/library/ipex/diffusers.py
@@ -1,10 +1,62 @@
+import os
 import torch
 import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
 import diffusers #0.24.0 # pylint: disable=import-error
 from diffusers.models.attention_processor import Attention
+from diffusers.utils import USE_PEFT_BACKEND
+from functools import cache

 # pylint: disable=protected-access, missing-function-docstring, line-too-long

+attention_slice_rate = float(os.environ.get('IPEX_ATTENTION_SLICE_RATE', 4))
+
+@cache
+def find_slice_size(slice_size, slice_block_size):
+    while (slice_size * slice_block_size) > attention_slice_rate:
+        slice_size = slice_size // 2
+        if slice_size <= 1:
+            slice_size = 1
+            break
+    return slice_size
+
+@cache
+def find_attention_slice_sizes(query_shape, query_element_size, query_device_type, slice_size=None):
+    if len(query_shape) == 3:
+        batch_size_attention, query_tokens, shape_three = query_shape
+        shape_four = 1
+    else:
+        batch_size_attention, query_tokens, shape_three, shape_four = query_shape
+    if slice_size is not None:
+        batch_size_attention = slice_size
+
+    slice_block_size = query_tokens * shape_three * shape_four / 1024 / 1024 * query_element_size
+    block_size = batch_size_attention * slice_block_size
+
+    split_slice_size = batch_size_attention
+    split_2_slice_size = query_tokens
+    split_3_slice_size = shape_three
+
+    do_split = False
+    do_split_2 = False
+    do_split_3 = False
+
+    if query_device_type != "xpu":
+        return do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size
+
+    if block_size > attention_slice_rate:
+        do_split = True
+        split_slice_size = find_slice_size(split_slice_size, slice_block_size)
+        if split_slice_size * slice_block_size > attention_slice_rate:
+            slice_2_block_size = split_slice_size * shape_three * shape_four / 1024 / 1024 * query_element_size
+            do_split_2 = True
+            split_2_slice_size = find_slice_size(split_2_slice_size, slice_2_block_size)
+            if split_2_slice_size * slice_2_block_size > attention_slice_rate:
+                slice_3_block_size = split_slice_size * split_2_slice_size * shape_four / 1024 / 1024 * query_element_size
+                do_split_3 = True
+                split_3_slice_size = find_slice_size(split_3_slice_size, slice_3_block_size)
+
+    return do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size
+
 class SlicedAttnProcessor: # pylint: disable=too-few-public-methods
    r"""
    Processor for implementing sliced attention.
@@ -18,7 +70,9 @@ class SlicedAttnProcessor: # pylint: disable=too-few-public-methods
    def __init__(self, slice_size):
        self.slice_size = slice_size

-    def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, attention_mask=None): # pylint: disable=too-many-statements, too-many-locals, too-many-branches
+    def __call__(self, attn: Attention, hidden_states: torch.FloatTensor,
+    encoder_hidden_states=None, attention_mask=None) -> torch.FloatTensor: # pylint: disable=too-many-statements, too-many-locals, too-many-branches
+
        residual = hidden_states

        input_ndim = hidden_states.ndim
@@ -54,49 +108,61 @@ class SlicedAttnProcessor: # pylint: disable=too-few-public-methods
            (batch_size_attention, query_tokens, dim // attn.heads), device=query.device, dtype=query.dtype
        )

-        #ARC GPUs can't allocate more than 4GB to a single block, Slice it:
-        block_multiply = query.element_size()
-        slice_block_size = self.slice_size * shape_three / 1024 / 1024 * block_multiply
-        block_size = query_tokens * slice_block_size
-        split_2_slice_size = query_tokens
-        if block_size > 4:
-            do_split_2 = True
-            #Find something divisible with the query_tokens
-            while (split_2_slice_size * slice_block_size) > 4:
-                split_2_slice_size = split_2_slice_size // 2
-                if split_2_slice_size <= 1:
-                    split_2_slice_size = 1
-                    break
-        else:
-            do_split_2 = False
-
-        for i in range(batch_size_attention // self.slice_size):
-            start_idx = i * self.slice_size
-            end_idx = (i + 1) * self.slice_size
+        ####################################################################
+        # ARC GPUs can't allocate more than 4GB to a single block, so slice it:
+        _, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_attention_slice_sizes(query.shape, query.element_size(), query.device.type, slice_size=self.slice_size)

+        for i in range(batch_size_attention // split_slice_size):
+            start_idx = i * split_slice_size
+            end_idx = (i + 1) * split_slice_size
            if do_split_2:
                for i2 in range(query_tokens // split_2_slice_size): # pylint: disable=invalid-name
                    start_idx_2 = i2 * split_2_slice_size
                    end_idx_2 = (i2 + 1) * split_2_slice_size
+                    if do_split_3:
+                        for i3 in range(shape_three // split_3_slice_size): # pylint: disable=invalid-name
+                            start_idx_3 = i3 * split_3_slice_size
+                            end_idx_3 = (i3 + 1) * split_3_slice_size

-                    query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2]
-                    key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2]
-                    attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2] if attention_mask is not None else None
+                            query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3]
+                            key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3]
+                            attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] if attention_mask is not None else None

-                    attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
-                    attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2])
+                            attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
+                            del query_slice
+                            del key_slice
+                            del attn_mask_slice
+                            attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3])

-                    hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = attn_slice
+                            hidden_states[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] = attn_slice
+                            del attn_slice
+                    else:
+                        query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2]
+                        key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2]
+                        attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2] if attention_mask is not None else None
+
+                        attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
+                        del query_slice
+                        del key_slice
+                        del attn_mask_slice
+                        attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2])
+
+                        hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = attn_slice
+                        del attn_slice
            else:
                query_slice = query[start_idx:end_idx]
                key_slice = key[start_idx:end_idx]
                attn_mask_slice = attention_mask[start_idx:end_idx] if attention_mask is not None else None

                attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
-
+                del query_slice
+                del key_slice
+                del attn_mask_slice
                attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx])

                hidden_states[start_idx:end_idx] = attn_slice
+                del attn_slice
+        ####################################################################

        hidden_states = attn.batch_to_head_dim(hidden_states)

@@ -115,6 +181,130 @@ class SlicedAttnProcessor: # pylint: disable=too-few-public-methods

        return hidden_states

+
+class AttnProcessor:
+    r"""
+    Default processor for performing attention-related computations.
+    """
+
+    def __call__(self, attn: Attention, hidden_states: torch.FloatTensor,
+    encoder_hidden_states=None, attention_mask=None,
+    temb=None, scale: float = 1.0) -> torch.Tensor: # pylint: disable=too-many-statements, too-many-locals, too-many-branches
+
+        residual = hidden_states
+
+        args = () if USE_PEFT_BACKEND else (scale,)
+
+        if attn.spatial_norm is not None:
+            hidden_states = attn.spatial_norm(hidden_states, temb)
+
+        input_ndim = hidden_states.ndim
+
+        if input_ndim == 4:
+            batch_size, channel, height, width = hidden_states.shape
+            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
+
+        batch_size, sequence_length, _ = (
+            hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
+        )
+        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
+
+        if attn.group_norm is not None:
+            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
+
+        query = attn.to_q(hidden_states, *args)
+
+        if encoder_hidden_states is None:
+            encoder_hidden_states = hidden_states
+        elif attn.norm_cross:
+            encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
+
+        key = attn.to_k(encoder_hidden_states, *args)
+        value = attn.to_v(encoder_hidden_states, *args)
+
+        query = attn.head_to_batch_dim(query)
+        key = attn.head_to_batch_dim(key)
+        value = attn.head_to_batch_dim(value)
+
+        ####################################################################
+        # ARC GPUs can't allocate more than 4GB to a single block, so slice it:
+        batch_size_attention, query_tokens, shape_three = query.shape[0], query.shape[1], query.shape[2]
+        hidden_states = torch.zeros(query.shape, device=query.device, dtype=query.dtype)
+        do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_attention_slice_sizes(query.shape, query.element_size(), query.device.type)
+
+        if do_split:
+            for i in range(batch_size_attention // split_slice_size):
+                start_idx = i * split_slice_size
+                end_idx = (i + 1) * split_slice_size
+                if do_split_2:
+                    for i2 in range(query_tokens // split_2_slice_size): # pylint: disable=invalid-name
+                        start_idx_2 = i2 * split_2_slice_size
+                        end_idx_2 = (i2 + 1) * split_2_slice_size
+                        if do_split_3:
+                            for i3 in range(shape_three // split_3_slice_size): # pylint: disable=invalid-name
+                                start_idx_3 = i3 * split_3_slice_size
+                                end_idx_3 = (i3 + 1) * split_3_slice_size
+
+                                query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3]
+                                key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3]
+                                attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] if attention_mask is not None else None
+
+                                attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
+                                del query_slice
+                                del key_slice
+                                del attn_mask_slice
+                                attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3])
+
+                                hidden_states[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] = attn_slice
+                                del attn_slice
+                        else:
+                            query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2]
+                            key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2]
+                            attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2] if attention_mask is not None else None
+
+                            attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
+                            del query_slice
+                            del key_slice
+                            del attn_mask_slice
+                            attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2])
+
+                            hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = attn_slice
+                            del attn_slice
+                else:
+                    query_slice = query[start_idx:end_idx]
+                    key_slice = key[start_idx:end_idx]
+                    attn_mask_slice = attention_mask[start_idx:end_idx] if attention_mask is not None else None
+
+                    attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
+                    del query_slice
+                    del key_slice
+                    del attn_mask_slice
+                    attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx])
+
+                    hidden_states[start_idx:end_idx] = attn_slice
+                    del attn_slice
+        else:
+            attention_probs = attn.get_attention_scores(query, key, attention_mask)
+            hidden_states = torch.bmm(attention_probs, value)
+        ####################################################################
+        hidden_states = attn.batch_to_head_dim(hidden_states)
+
+        # linear proj
+        hidden_states = attn.to_out[0](hidden_states, *args)
+        # dropout
+        hidden_states = attn.to_out[1](hidden_states)
+
+        if input_ndim == 4:
+            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
+
+        if attn.residual_connection:
+            hidden_states = hidden_states + residual
+
+        hidden_states = hidden_states / attn.rescale_output_factor
+
+        return hidden_states
+
 def ipex_diffusers():
    #ARC GPUs can't allocate more than 4GB to a single block:
    diffusers.models.attention_processor.SlicedAttnProcessor = SlicedAttnProcessor
+    diffusers.models.attention_processor.AttnProcessor = AttnProcessor
--- a/library/ipex/hijacks.py
+++ b/library/ipex/hijacks.py
@@ -1,67 +1,9 @@
 import contextlib
-import importlib
 import torch
 import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import

 # pylint: disable=protected-access, missing-function-docstring, line-too-long, unnecessary-lambda, no-else-return

-class CondFunc: # pylint: disable=missing-class-docstring
-    def __new__(cls, orig_func, sub_func, cond_func):
-        self = super(CondFunc, cls).__new__(cls)
-        if isinstance(orig_func, str):
-            func_path = orig_func.split('.')
-            for i in range(len(func_path)-1, -1, -1):
-                try:
-                    resolved_obj = importlib.import_module('.'.join(func_path[:i]))
-                    break
-                except ImportError:
-                    pass
-            for attr_name in func_path[i:-1]:
-                resolved_obj = getattr(resolved_obj, attr_name)
-            orig_func = getattr(resolved_obj, func_path[-1])
-            setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
-        self.__init__(orig_func, sub_func, cond_func)
-        return lambda *args, **kwargs: self(*args, **kwargs)
-    def __init__(self, orig_func, sub_func, cond_func):
-        self.__orig_func = orig_func
-        self.__sub_func = sub_func
-        self.__cond_func = cond_func
-    def __call__(self, *args, **kwargs):
-        if not self.__cond_func or self.__cond_func(self.__orig_func, *args, **kwargs):
-            return self.__sub_func(self.__orig_func, *args, **kwargs)
-        else:
-            return self.__orig_func(*args, **kwargs)
-
-_utils = torch.utils.data._utils
-def _shutdown_workers(self):
-    if torch.utils.data._utils is None or torch.utils.data._utils.python_exit_status is True or torch.utils.data._utils.python_exit_status is None:
-        return
-    if hasattr(self, "_shutdown") and not self._shutdown:
-        self._shutdown = True
-        try:
-            if hasattr(self, '_pin_memory_thread'):
-                self._pin_memory_thread_done_event.set()
-                self._worker_result_queue.put((None, None))
-                self._pin_memory_thread.join()
-                self._worker_result_queue.cancel_join_thread()
-                self._worker_result_queue.close()
-            self._workers_done_event.set()
-            for worker_id in range(len(self._workers)):
-                if self._persistent_workers or self._workers_status[worker_id]:
-                    self._mark_worker_as_unavailable(worker_id, shutdown=True)
-            for w in self._workers: # pylint: disable=invalid-name
-                w.join(timeout=torch.utils.data._utils.MP_STATUS_CHECK_INTERVAL)
-            for q in self._index_queues: # pylint: disable=invalid-name
-                q.cancel_join_thread()
-                q.close()
-        finally:
-            if self._worker_pids_set:
-                torch.utils.data._utils.signal_handling._remove_worker_pids(id(self))
-                self._worker_pids_set = False
-            for w in self._workers: # pylint: disable=invalid-name
-                if w.is_alive():
-                    w.terminate()
-
 class DummyDataParallel(torch.nn.Module): # pylint: disable=missing-class-docstring, unused-argument, too-few-public-methods
    def __new__(cls, module, device_ids=None, output_device=None, dim=0): # pylint: disable=unused-argument
        if isinstance(device_ids, list) and len(device_ids) > 1:
@@ -71,17 +13,18 @@ class DummyDataParallel(torch.nn.Module): # pylint: disable=missing-class-docstr
 def return_null_context(*args, **kwargs): # pylint: disable=unused-argument
    return contextlib.nullcontext()

+@property
+def is_cuda(self):
+    return self.device.type == 'xpu' or self.device.type == 'cuda'
+
 def check_device(device):
    return bool((isinstance(device, torch.device) and device.type == "cuda") or (isinstance(device, str) and "cuda" in device) or isinstance(device, int))

 def return_xpu(device):
    return f"xpu:{device.split(':')[-1]}" if isinstance(device, str) and ":" in device else f"xpu:{device}" if isinstance(device, int) else torch.device("xpu") if isinstance(device, torch.device) else "xpu"

-def ipex_no_cuda(orig_func, *args, **kwargs):
-    torch.cuda.is_available = lambda: False
-    orig_func(*args, **kwargs)
-    torch.cuda.is_available = torch.xpu.is_available

+# Autocast
 original_autocast = torch.autocast
 def ipex_autocast(*args, **kwargs):
    if len(args) > 0 and args[0] == "cuda":
@@ -89,15 +32,7 @@ def ipex_autocast(*args, **kwargs):
    else:
        return original_autocast(*args, **kwargs)

-# Embedding BF16
-original_torch_cat = torch.cat
-def torch_cat(tensor, *args, **kwargs):
-    if len(tensor) == 3 and (tensor[0].dtype != tensor[1].dtype or tensor[2].dtype != tensor[1].dtype):
-        return original_torch_cat([tensor[0].to(tensor[1].dtype), tensor[1], tensor[2].to(tensor[1].dtype)], *args, **kwargs)
-    else:
-        return original_torch_cat(tensor, *args, **kwargs)
-
-# Latent antialias:
+# Latent Antialias CPU Offload:
 original_interpolate = torch.nn.functional.interpolate
 def interpolate(tensor, size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None, antialias=False): # pylint: disable=too-many-arguments
    if antialias or align_corners is not None:
@@ -109,19 +44,19 @@ def interpolate(tensor, size=None, scale_factor=None, mode='nearest', align_corn
        return original_interpolate(tensor, size=size, scale_factor=scale_factor, mode=mode,
        align_corners=align_corners, recompute_scale_factor=recompute_scale_factor, antialias=antialias)

-original_linalg_solve = torch.linalg.solve
-def linalg_solve(A, B, *args, **kwargs): # pylint: disable=invalid-name
-    if A.device != torch.device("cpu") or B.device != torch.device("cpu"):
-        return_device = A.device
-        return original_linalg_solve(A.to("cpu"), B.to("cpu"), *args, **kwargs).to(return_device)
+# Diffusers Float64 (Alchemist GPUs don't support 64 bit):
+original_from_numpy = torch.from_numpy
+def from_numpy(ndarray):
+    if ndarray.dtype == float:
+        return original_from_numpy(ndarray.astype('float32'))
    else:
-        return original_linalg_solve(A, B, *args, **kwargs)
+        return original_from_numpy(ndarray)

 if torch.xpu.has_fp64_dtype():
    original_torch_bmm = torch.bmm
    original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention
 else:
-    # 64 bit attention workarounds for Alchemist:
+    # 32 bit attention workarounds for Alchemist:
    try:
        from .attention import torch_bmm_32_bit as original_torch_bmm
        from .attention import scaled_dot_product_attention_32_bit as original_scaled_dot_product_attention
@@ -129,7 +64,8 @@ else:
        original_torch_bmm = torch.bmm
        original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention

-# dtype errors:
+
+# Data Type Errors:
 def torch_bmm(input, mat2, *, out=None):
    if input.dtype != mat2.dtype:
        mat2 = mat2.to(input.dtype)
@@ -142,111 +78,171 @@ def scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.
        value = value.to(dtype=query.dtype)
    return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)

-@property
-def is_cuda(self):
-    return self.device.type == 'xpu'
+# A1111 FP16
+original_functional_group_norm = torch.nn.functional.group_norm
+def functional_group_norm(input, num_groups, weight=None, bias=None, eps=1e-05):
+    if weight is not None and input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and weight is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_group_norm(input, num_groups, weight=weight, bias=bias, eps=eps)

-def ipex_hijacks():
-    CondFunc('torch.tensor',
-        lambda orig_func, *args, device=None, **kwargs: orig_func(*args, device=return_xpu(device), **kwargs),
-        lambda orig_func, *args, device=None, **kwargs: check_device(device))
-    CondFunc('torch.Tensor.to',
-        lambda orig_func, self, device=None, *args, **kwargs: orig_func(self, return_xpu(device), *args, **kwargs),
-        lambda orig_func, self, device=None, *args, **kwargs: check_device(device))
-    CondFunc('torch.Tensor.cuda',
-        lambda orig_func, self, device=None, *args, **kwargs: orig_func(self, return_xpu(device), *args, **kwargs),
-        lambda orig_func, self, device=None, *args, **kwargs: check_device(device))
-    CondFunc('torch.UntypedStorage.__init__',
-        lambda orig_func, *args, device=None, **kwargs: orig_func(*args, device=return_xpu(device), **kwargs),
-        lambda orig_func, *args, device=None, **kwargs: check_device(device))
-    CondFunc('torch.UntypedStorage.cuda',
-        lambda orig_func, self, device=None, *args, **kwargs: orig_func(self, return_xpu(device), *args, **kwargs),
-        lambda orig_func, self, device=None, *args, **kwargs: check_device(device))
-    CondFunc('torch.empty',
-        lambda orig_func, *args, device=None, **kwargs: orig_func(*args, device=return_xpu(device), **kwargs),
-        lambda orig_func, *args, device=None, **kwargs: check_device(device))
-    CondFunc('torch.randn',
-        lambda orig_func, *args, device=None, **kwargs: orig_func(*args, device=return_xpu(device), **kwargs),
-        lambda orig_func, *args, device=None, **kwargs: check_device(device))
-    CondFunc('torch.ones',
-        lambda orig_func, *args, device=None, **kwargs: orig_func(*args, device=return_xpu(device), **kwargs),
-        lambda orig_func, *args, device=None, **kwargs: check_device(device))
-    CondFunc('torch.zeros',
-        lambda orig_func, *args, device=None, **kwargs: orig_func(*args, device=return_xpu(device), **kwargs),
-        lambda orig_func, *args, device=None, **kwargs: check_device(device))
-    CondFunc('torch.linspace',
-        lambda orig_func, *args, device=None, **kwargs: orig_func(*args, device=return_xpu(device), **kwargs),
-        lambda orig_func, *args, device=None, **kwargs: check_device(device))
-    CondFunc('torch.load',
-        lambda orig_func, f, map_location=None, pickle_module=None, *, weights_only=False, mmap=None, **kwargs:
-        orig_func(orig_func, f, map_location=return_xpu(map_location), pickle_module=pickle_module, weights_only=weights_only, mmap=mmap, **kwargs),
-        lambda orig_func, f, map_location=None, pickle_module=None, *, weights_only=False, mmap=None, **kwargs: check_device(map_location))
-    if hasattr(torch.xpu, "Generator"):
-        CondFunc('torch.Generator',
-            lambda orig_func, device=None: torch.xpu.Generator(return_xpu(device)),
-            lambda orig_func, device=None: device is not None and device != torch.device("cpu") and device != "cpu")
+# A1111 BF16
+original_functional_layer_norm = torch.nn.functional.layer_norm
+def functional_layer_norm(input, normalized_shape, weight=None, bias=None, eps=1e-05):
+    if weight is not None and input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and weight is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_layer_norm(input, normalized_shape, weight=weight, bias=bias, eps=eps)
+
+# Training
+original_functional_linear = torch.nn.functional.linear
+def functional_linear(input, weight, bias=None):
+    if input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_linear(input, weight, bias=bias)
+
+original_functional_conv2d = torch.nn.functional.conv2d
+def functional_conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
+    if input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_conv2d(input, weight, bias=bias, stride=stride, padding=padding, dilation=dilation, groups=groups)
+
+# A1111 Embedding BF16
+original_torch_cat = torch.cat
+def torch_cat(tensor, *args, **kwargs):
+    if len(tensor) == 3 and (tensor[0].dtype != tensor[1].dtype or tensor[2].dtype != tensor[1].dtype):
+        return original_torch_cat([tensor[0].to(tensor[1].dtype), tensor[1], tensor[2].to(tensor[1].dtype)], *args, **kwargs)
    else:
-        CondFunc('torch.Generator',
-            lambda orig_func, device=None: orig_func(return_xpu(device)),
-            lambda orig_func, device=None: check_device(device))
+        return original_torch_cat(tensor, *args, **kwargs)

-    # TiledVAE and ControlNet:
-    CondFunc('torch.batch_norm',
-        lambda orig_func, input, weight, bias, *args, **kwargs: orig_func(input,
-        weight if weight is not None else torch.ones(input.size()[1], device=input.device),
-        bias if bias is not None else torch.zeros(input.size()[1], device=input.device), *args, **kwargs),
-        lambda orig_func, input, *args, **kwargs: input.device != torch.device("cpu"))
-    CondFunc('torch.instance_norm',
-        lambda orig_func, input, weight, bias, *args, **kwargs: orig_func(input,
-        weight if weight is not None else torch.ones(input.size()[1], device=input.device),
-        bias if bias is not None else torch.zeros(input.size()[1], device=input.device), *args, **kwargs),
-        lambda orig_func, input, *args, **kwargs: input.device != torch.device("cpu"))
+# SwinIR BF16:
+original_functional_pad = torch.nn.functional.pad
+def functional_pad(input, pad, mode='constant', value=None):
+    if mode == 'reflect' and input.dtype == torch.bfloat16:
+        return original_functional_pad(input.to(torch.float32), pad, mode=mode, value=value).to(dtype=torch.bfloat16)
+    else:
+        return original_functional_pad(input, pad, mode=mode, value=value)

-    # Functions with dtype errors:
-    CondFunc('torch.nn.modules.GroupNorm.forward',
-        lambda orig_func, self, input: orig_func(self, input.to(self.weight.data.dtype)),
-        lambda orig_func, self, input: input.dtype != self.weight.data.dtype)
-    # Training:
-    CondFunc('torch.nn.modules.linear.Linear.forward',
-        lambda orig_func, self, input: orig_func(self, input.to(self.weight.data.dtype)),
-        lambda orig_func, self, input: input.dtype != self.weight.data.dtype)
-    CondFunc('torch.nn.modules.conv.Conv2d.forward',
-        lambda orig_func, self, input: orig_func(self, input.to(self.weight.data.dtype)),
-        lambda orig_func, self, input: input.dtype != self.weight.data.dtype)
-    # BF16:
-    CondFunc('torch.nn.functional.layer_norm',
-        lambda orig_func, input, normalized_shape=None, weight=None, *args, **kwargs:
-        orig_func(input.to(weight.data.dtype), normalized_shape, weight, *args, **kwargs),
-        lambda orig_func, input, normalized_shape=None, weight=None, *args, **kwargs:
-        weight is not None and input.dtype != weight.data.dtype)
-    # SwinIR BF16:
-    CondFunc('torch.nn.functional.pad',
-        lambda orig_func, input, pad, mode='constant', value=None: orig_func(input.to(torch.float32), pad, mode=mode, value=value).to(dtype=torch.bfloat16),
-        lambda orig_func, input, pad, mode='constant', value=None: mode == 'reflect' and input.dtype == torch.bfloat16)

-    # Diffusers Float64 (Alchemist GPUs doesn't support 64 bit):
-    if not torch.xpu.has_fp64_dtype():
-        CondFunc('torch.from_numpy',
-        lambda orig_func, ndarray: orig_func(ndarray.astype('float32')),
-        lambda orig_func, ndarray: ndarray.dtype == float)
+original_torch_tensor = torch.tensor
+def torch_tensor(*args, device=None, **kwargs):
+    if check_device(device):
+        return original_torch_tensor(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_tensor(*args, device=device, **kwargs)

-    # Broken functions when torch.cuda.is_available is True:
-    # Pin Memory:
-    CondFunc('torch.utils.data.dataloader._BaseDataLoaderIter.__init__',
-        lambda orig_func, *args, **kwargs: ipex_no_cuda(orig_func, *args, **kwargs),
-        lambda orig_func, *args, **kwargs: True)
+original_Tensor_to = torch.Tensor.to
+def Tensor_to(self, device=None, *args, **kwargs):
+    if check_device(device):
+        return original_Tensor_to(self, return_xpu(device), *args, **kwargs)
+    else:
+        return original_Tensor_to(self, device, *args, **kwargs)

-    # Functions that make compile mad with CondFunc:
-    torch.nn.DataParallel = DummyDataParallel
-    torch.utils.data.dataloader._MultiProcessingDataLoaderIter._shutdown_workers = _shutdown_workers
+original_Tensor_cuda = torch.Tensor.cuda
+def Tensor_cuda(self, device=None, *args, **kwargs):
+    if check_device(device):
+        return original_Tensor_cuda(self, return_xpu(device), *args, **kwargs)
+    else:
+        return original_Tensor_cuda(self, device, *args, **kwargs)
+
+original_UntypedStorage_init = torch.UntypedStorage.__init__
+def UntypedStorage_init(*args, device=None, **kwargs):
+    if check_device(device):
+        return original_UntypedStorage_init(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_UntypedStorage_init(*args, device=device, **kwargs)
+
+original_UntypedStorage_cuda = torch.UntypedStorage.cuda
+def UntypedStorage_cuda(self, device=None, *args, **kwargs):
+    if check_device(device):
+        return original_UntypedStorage_cuda(self, return_xpu(device), *args, **kwargs)
+    else:
+        return original_UntypedStorage_cuda(self, device, *args, **kwargs)
+
+original_torch_empty = torch.empty
+def torch_empty(*args, device=None, **kwargs):
+    if check_device(device):
+        return original_torch_empty(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_empty(*args, device=device, **kwargs)
+
+original_torch_randn = torch.randn
+def torch_randn(*args, device=None, **kwargs):
+    if check_device(device):
+        return original_torch_randn(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_randn(*args, device=device, **kwargs)
+
+original_torch_ones = torch.ones
+def torch_ones(*args, device=None, **kwargs):
+    if check_device(device):
+        return original_torch_ones(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_ones(*args, device=device, **kwargs)
+
+original_torch_zeros = torch.zeros
+def torch_zeros(*args, device=None, **kwargs):
+    if check_device(device):
+        return original_torch_zeros(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_zeros(*args, device=device, **kwargs)
+
+original_torch_linspace = torch.linspace
+def torch_linspace(*args, device=None, **kwargs):
+    if check_device(device):
+        return original_torch_linspace(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_linspace(*args, device=device, **kwargs)
+
+original_torch_Generator = torch.Generator
+def torch_Generator(device=None):
+    if check_device(device):
+        return original_torch_Generator(return_xpu(device))
+    else:
+        return original_torch_Generator(device)
+
+original_torch_load = torch.load
+def torch_load(f, map_location=None, pickle_module=None, *, weights_only=False, mmap=None, **kwargs):
+    if check_device(map_location):
+        return original_torch_load(f, map_location=return_xpu(map_location), pickle_module=pickle_module, weights_only=weights_only, mmap=mmap, **kwargs)
+    else:
+        return original_torch_load(f, map_location=map_location, pickle_module=pickle_module, weights_only=weights_only, mmap=mmap, **kwargs)
+
+# Hijack Functions:
+def ipex_hijacks():
+    torch.tensor = torch_tensor
+    torch.Tensor.to = Tensor_to
+    torch.Tensor.cuda = Tensor_cuda
+    torch.UntypedStorage.__init__ = UntypedStorage_init
+    torch.UntypedStorage.cuda = UntypedStorage_cuda
+    torch.empty = torch_empty
+    torch.randn = torch_randn
+    torch.ones = torch_ones
+    torch.zeros = torch_zeros
+    torch.linspace = torch_linspace
+    torch.Generator = torch_Generator
+    torch.load = torch_load

-    torch.autocast = ipex_autocast
    torch.backends.cuda.sdp_kernel = return_null_context
+    torch.nn.DataParallel = DummyDataParallel
    torch.UntypedStorage.is_cuda = is_cuda
+    torch.autocast = ipex_autocast

+    torch.nn.functional.scaled_dot_product_attention = scaled_dot_product_attention
+    torch.nn.functional.group_norm = functional_group_norm
+    torch.nn.functional.layer_norm = functional_layer_norm
+    torch.nn.functional.linear = functional_linear
+    torch.nn.functional.conv2d = functional_conv2d
    torch.nn.functional.interpolate = interpolate
-    torch.linalg.solve = linalg_solve
+    torch.nn.functional.pad = functional_pad

    torch.bmm = torch_bmm
    torch.cat = torch_cat
-    torch.nn.functional.scaled_dot_product_attention = scaled_dot_product_attention
+    if not torch.xpu.has_fp64_dtype():
+        torch.from_numpy = from_numpy
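The wrappers above for `torch.empty`, `torch.randn`, `torch.ones`, `torch.zeros` and `torch.linspace` all repeat one shape: intercept the `device` keyword, rewrite it if it names a CUDA device, and delegate to the saved original. A minimal sketch of that pattern as a single decorator follows. The `check_device` and `return_xpu` bodies here are simplified, string-only stand-ins (the real helpers are defined elsewhere in the file and are not shown in this hunk), and `redirect_device_kwarg` and `fake_empty` are hypothetical names used only for illustration.

```python
import functools


def check_device(device):
    """Simplified stand-in: True when `device` is a string naming a CUDA device."""
    return isinstance(device, str) and "cuda" in device


def return_xpu(device):
    """Simplified stand-in: map 'cuda' / 'cuda:N' to 'xpu' / 'xpu:N'."""
    return device.replace("cuda", "xpu")


def redirect_device_kwarg(original):
    """Wrap a factory so a CUDA `device=` keyword is rewritten before the call."""

    @functools.wraps(original)
    def wrapper(*args, device=None, **kwargs):
        if check_device(device):
            device = return_xpu(device)
        return original(*args, device=device, **kwargs)

    return wrapper


def fake_empty(*size, device=None):
    """Toy factory standing in for torch.empty; it only records its arguments."""
    return {"size": size, "device": device}


patched_empty = redirect_device_kwarg(fake_empty)
print(patched_empty(2, 3, device="cuda:0"))
```

The diff keeps one explicit wrapper per function instead, which is more verbose but keeps each saved original under its own module-level name.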
--- a/library/ipex_interop.py
+++ b/library/ipex_interop.py
@@ -0,0 +1,24 @@
+import torch
+
+
+def init_ipex():
+    """
+    Try to import `intel_extension_for_pytorch`, and apply
+    the hijacks using `library.ipex.ipex_init`.
+
+    If IPEX is not installed, this function does nothing.
+    """
+    try:
+        import intel_extension_for_pytorch as ipex  # noqa
+    except ImportError:
+        return
+
+    try:
+        from library.ipex import ipex_init
+
+        if torch.xpu.is_available():
+            is_initialized, error_message = ipex_init()
+            if not is_initialized:
+                print("failed to initialize ipex:", error_message)
+    except Exception as e:
+        print("failed to initialize ipex:", e)
--- a/library/lpw_stable_diffusion.py
+++ b/library/lpw_stable_diffusion.py
@@ -9,7 +9,7 @@ import numpy as np
 import PIL.Image
 import torch
 from packaging import version
-from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
+from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer, CLIPVisionModelWithProjection

 import diffusers
 from diffusers import SchedulerMixin, StableDiffusionPipeline
@@ -520,6 +520,7 @@ class StableDiffusionLongPromptWeightingPipeline(StableDiffusionPipeline):
        safety_checker: StableDiffusionSafetyChecker,
        feature_extractor: CLIPFeatureExtractor,
        requires_safety_checker: bool = True,
+        image_encoder: CLIPVisionModelWithProjection = None,
        clip_skip: int = 1,
    ):
        super().__init__(
@@ -531,32 +532,11 @@ class StableDiffusionLongPromptWeightingPipeline(StableDiffusionPipeline):
            safety_checker=safety_checker,
            feature_extractor=feature_extractor,
            requires_safety_checker=requires_safety_checker,
+            image_encoder=image_encoder,
        )
-        self.clip_skip = clip_skip
+        self.custom_clip_skip = clip_skip
        self.__init__additional__()

-    # else:
-    #     def __init__(
-    #         self,
-    #         vae: AutoencoderKL,
-    #         text_encoder: CLIPTextModel,
-    #         tokenizer: CLIPTokenizer,
-    #         unet: UNet2DConditionModel,
-    #         scheduler: SchedulerMixin,
-    #         safety_checker: StableDiffusionSafetyChecker,
-    #         feature_extractor: CLIPFeatureExtractor,
-    #     ):
-    #         super().__init__(
-    #             vae=vae,
-    #             text_encoder=text_encoder,
-    #             tokenizer=tokenizer,
-    #             unet=unet,
-    #             scheduler=scheduler,
-    #             safety_checker=safety_checker,
-    #             feature_extractor=feature_extractor,
-    #         )
-    #         self.__init__additional__()
-
    def __init__additional__(self):
        if not hasattr(self, "vae_scale_factor"):
            setattr(self, "vae_scale_factor", 2 ** (len(self.vae.config.block_out_channels) - 1))
@@ -624,7 +604,7 @@ class StableDiffusionLongPromptWeightingPipeline(StableDiffusionPipeline):
            prompt=prompt,
            uncond_prompt=negative_prompt if do_classifier_free_guidance else None,
            max_embeddings_multiples=max_embeddings_multiples,
-            clip_skip=self.clip_skip,
+            clip_skip=self.custom_clip_skip,
        )
        bs_embed, seq_len, _ = text_embeddings.shape
        text_embeddings = text_embeddings.repeat(1, num_images_per_prompt, 1)
--- a/library/model_util.py
+++ b/library/model_util.py
@@ -4,13 +4,10 @@
 import math
 import os
 import torch
-try:
-    import intel_extension_for_pytorch as ipex
-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
-        ipex_init()
-except Exception:
-    pass
+
+from library.ipex_interop import init_ipex
+
+init_ipex()
 import diffusers
 from transformers import CLIPTextModel, CLIPTokenizer, CLIPTextConfig, logging
 from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline  # , UNet2DConditionModel
@@ -571,9 +568,9 @@ def convert_ldm_clip_checkpoint_v1(checkpoint):
        if key.startswith("cond_stage_model.transformer"):
            text_model_dict[key[len("cond_stage_model.transformer.") :]] = checkpoint[key]

-    # support checkpoint without position_ids (invalid checkpoint)
-    if "text_model.embeddings.position_ids" not in text_model_dict:
-        text_model_dict["text_model.embeddings.position_ids"] = torch.arange(77).unsqueeze(0)  # 77 is the max length of the text
+    # remove position_ids: newer versions of transformers raise an error when this key is present
+    if "text_model.embeddings.position_ids" in text_model_dict:
+        text_model_dict.pop("text_model.embeddings.position_ids")

    return text_model_dict

@@ -1242,8 +1239,13 @@ def save_diffusers_checkpoint(v2, output_dir, text_encoder, unet, pretrained_mod
    if vae is None:
        vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")

+    # original U-Net cannot be saved, so we need to convert it to the Diffusers version
+    # TODO this consumes a lot of memory
+    diffusers_unet = diffusers.UNet2DConditionModel.from_pretrained(pretrained_model_name_or_path, subfolder="unet")
+    diffusers_unet.load_state_dict(unet.state_dict())
+
    pipeline = StableDiffusionPipeline(
-        unet=unet,
+        unet=diffusers_unet,
        text_encoder=text_encoder,
        vae=vae,
        scheduler=scheduler,
--- a/library/original_unet.py
+++ b/library/original_unet.py
@@ -1262,9 +1262,9 @@ class CrossAttnUpBlock2D(nn.Module):
        for attn in self.attentions:
            attn.set_use_memory_efficient_attention(xformers, mem_eff)

-    def set_use_sdpa(self, spda):
+    def set_use_sdpa(self, sdpa):
        for attn in self.attentions:
-            attn.set_use_sdpa(spda)
+            attn.set_use_sdpa(sdpa)

    def forward(
        self,
--- a/library/sdxl_lpw_stable_diffusion.py
+++ b/library/sdxl_lpw_stable_diffusion.py
@@ -923,7 +923,11 @@ class SdxlStableDiffusionLongPromptWeightingPipeline:
            if up1 is not None:
                uncond_pool = up1

-        dtype = self.unet.dtype
+        unet_dtype = self.unet.dtype
+        dtype = unet_dtype
+        if hasattr(dtype, "itemsize") and dtype.itemsize == 1:  # fp8
+            dtype = torch.float16
+            self.unet.to(dtype)

        # 4. Preprocess image and mask
        if isinstance(image, PIL.Image.Image):
@@ -1028,6 +1032,7 @@ class SdxlStableDiffusionLongPromptWeightingPipeline:
                if is_cancelled_callback is not None and is_cancelled_callback():
                    return None

+        self.unet.to(unet_dtype)
        return latents

    def latents_to_image(self, latents):
--- a/library/sdxl_model_util.py
+++ b/library/sdxl_model_util.py
@@ -100,7 +100,7 @@ def convert_sdxl_text_encoder_2_checkpoint(checkpoint, max_length):
            key = key.replace(".ln_final", ".final_layer_norm")
        # ckpt from comfy has this key: text_model.encoder.text_model.embeddings.position_ids
        elif ".embeddings.position_ids" in key:
-            key = None  # remove this key: make position_ids by ourselves
+            key = None  # remove this key: position_ids is not used in newer transformers
        return key

    keys = list(checkpoint.keys())
@@ -126,10 +126,6 @@ def convert_sdxl_text_encoder_2_checkpoint(checkpoint, max_length):
            new_sd[key_pfx + "k_proj" + key_suffix] = values[1]
            new_sd[key_pfx + "v_proj" + key_suffix] = values[2]

-    # original SD にはないので、position_idsを追加
-    position_ids = torch.Tensor([list(range(max_length))]).to(torch.int64)
-    new_sd["text_model.embeddings.position_ids"] = position_ids
-
    # logit_scale はDiffusersには含まれないが、保存時に戻したいので別途返す
    logit_scale = checkpoint.get(SDXL_KEY_PREFIX + "logit_scale", None)

@@ -265,9 +261,9 @@ def load_models_from_sdxl_checkpoint(model_version, ckpt_path, map_location, dty
        elif k.startswith("conditioner.embedders.1.model."):
            te2_sd[k] = state_dict.pop(k)

-    # 一部のposition_idsがないモデルへの対応 / add position_ids for some models
-    if "text_model.embeddings.position_ids" not in te1_sd:
-        te1_sd["text_model.embeddings.position_ids"] = torch.arange(77).unsqueeze(0)
+    # remove position_ids, because the latest transformers raises an error when it is included
+    if "text_model.embeddings.position_ids" in te1_sd:
+        te1_sd.pop("text_model.embeddings.position_ids")

    info1 = _load_state_dict_on_device(text_model1, te1_sd, device=map_location)  # remain fp32
    print("text encoder 1:", info1)
--- a/library/train_util.py
+++ b/library/train_util.py
@@ -558,6 +558,7 @@ class BaseDataset(torch.utils.data.Dataset):
        tokenizer: Union[CLIPTokenizer, List[CLIPTokenizer]],
        max_token_length: int,
        resolution: Optional[Tuple[int, int]],
+        network_multiplier: float,
        debug_dataset: bool,
    ) -> None:
        super().__init__()
@@ -567,6 +568,7 @@ class BaseDataset(torch.utils.data.Dataset):
        self.max_token_length = max_token_length
        # width/height is used when enable_bucket==False
        self.width, self.height = (None, None) if resolution is None else resolution
+        self.network_multiplier = network_multiplier
        self.debug_dataset = debug_dataset

        self.subsets: List[Union[DreamBoothSubset, FineTuningSubset]] = []
@@ -1106,7 +1108,9 @@ class BaseDataset(torch.utils.data.Dataset):
        for image_key in bucket[image_index : image_index + bucket_batch_size]:
            image_info = self.image_data[image_key]
            subset = self.image_to_subset[image_key]
-            loss_weights.append(self.prior_loss_weight if image_info.is_reg else 1.0)
+            loss_weights.append(
+                self.prior_loss_weight if image_info.is_reg else 1.0
+            )  # in case of fine tuning, is_reg is always False

            flipped = subset.flip_aug and random.random() < 0.5  # not flipped or flipped with 50% chance

@@ -1272,6 +1276,8 @@ class BaseDataset(torch.utils.data.Dataset):
        example["target_sizes_hw"] = torch.stack([torch.LongTensor(x) for x in target_sizes_hw])
        example["flippeds"] = flippeds

+        example["network_multipliers"] = torch.FloatTensor([self.network_multiplier] * len(captions))
+
        if self.debug_dataset:
            example["image_keys"] = bucket[image_index : image_index + self.batch_size]
        return example
@@ -1346,15 +1352,16 @@ class DreamBoothDataset(BaseDataset):
        tokenizer,
        max_token_length,
        resolution,
+        network_multiplier: float,
        enable_bucket: bool,
        min_bucket_reso: int,
        max_bucket_reso: int,
        bucket_reso_steps: int,
        bucket_no_upscale: bool,
        prior_loss_weight: float,
-        debug_dataset,
+        debug_dataset: bool,
    ) -> None:
-        super().__init__(tokenizer, max_token_length, resolution, debug_dataset)
+        super().__init__(tokenizer, max_token_length, resolution, network_multiplier, debug_dataset)

        assert resolution is not None, f"resolution is required / resolution（解像度）指定は必須です"

@@ -1520,14 +1527,15 @@ class FineTuningDataset(BaseDataset):
        tokenizer,
        max_token_length,
        resolution,
+        network_multiplier: float,
        enable_bucket: bool,
        min_bucket_reso: int,
        max_bucket_reso: int,
        bucket_reso_steps: int,
        bucket_no_upscale: bool,
-        debug_dataset,
+        debug_dataset: bool,
    ) -> None:
-        super().__init__(tokenizer, max_token_length, resolution, debug_dataset)
+        super().__init__(tokenizer, max_token_length, resolution, network_multiplier, debug_dataset)

        self.batch_size = batch_size

@@ -1724,14 +1732,15 @@ class ControlNetDataset(BaseDataset):
        tokenizer,
        max_token_length,
        resolution,
+        network_multiplier: float,
        enable_bucket: bool,
        min_bucket_reso: int,
        max_bucket_reso: int,
        bucket_reso_steps: int,
        bucket_no_upscale: bool,
-        debug_dataset,
+        debug_dataset: bool,
    ) -> None:
-        super().__init__(tokenizer, max_token_length, resolution, debug_dataset)
+        super().__init__(tokenizer, max_token_length, resolution, network_multiplier, debug_dataset)

        db_subsets = []
        for subset in subsets:
@@ -2039,6 +2048,8 @@ def debug_dataset(train_dataset, show_input_ids=False):
                print(
                    f'{ik}, size: {train_dataset.image_data[ik].image_size}, loss weight: {lw}, caption: "{cap}", original size: {orgsz}, crop top left: {crptl}, target size: {trgsz}, flipped: {flpdz}'
                )
+                if "network_multipliers" in example:
+                    print(f"network multiplier: {example['network_multipliers'][j]}")

                if show_input_ids:
                    print(f"input ids: {iid}")
@@ -2105,8 +2116,8 @@ def glob_images_pathlib(dir_path, recursive):


 class MinimalDataset(BaseDataset):
-    def __init__(self, tokenizer, max_token_length, resolution, debug_dataset=False):
-        super().__init__(tokenizer, max_token_length, resolution, debug_dataset)
+    def __init__(self, tokenizer, max_token_length, resolution, network_multiplier, debug_dataset=False):
+        super().__init__(tokenizer, max_token_length, resolution, network_multiplier, debug_dataset)

        self.num_train_images = 0  # update in subclass
        self.num_reg_images = 0  # update in subclass
@@ -2848,6 +2859,17 @@ def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth:
        action="store_true",
        help="use memory efficient attention for CrossAttention / CrossAttentionに省メモリ版attentionを使う",
    )
+    parser.add_argument("--torch_compile", action="store_true", help="use torch.compile (requires PyTorch 2.0) / torch.compile を使う")
+    parser.add_argument(
+        "--dynamo_backend",
+        type=str,
+        default="inductor",
+        # available backends:
+        # https://github.com/huggingface/accelerate/blob/d1abd59114ada8ba673e1214218cb2878c13b82d/src/accelerate/utils/dataclasses.py#L376-L388C5
+        # https://pytorch.org/docs/stable/torch.compiler.html
+        choices=["eager", "aot_eager", "inductor", "aot_ts_nvfuser", "nvprims_nvfuser", "cudagraphs", "ofi", "fx2trt", "onnxrt"],
+        help="dynamo backend type (default is inductor) / dynamoのbackendの種類（デフォルトは inductor）",
+    )
    parser.add_argument("--xformers", action="store_true", help="use xformers for CrossAttention / CrossAttentionにxformersを使う")
    parser.add_argument(
        "--sdpa",
@@ -2893,6 +2915,7 @@ def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth:
    parser.add_argument(
        "--full_bf16", action="store_true", help="bf16 training including gradients / 勾配も含めてbf16で学習する"
    )  # TODO move to SDXL training, because it is not supported by SD1/2
+    parser.add_argument("--fp8_base", action="store_true", help="use fp8 for base model / base modelにfp8を使う")
    parser.add_argument(
        "--ddp_timeout",
        type=int,
@@ -3876,6 +3899,11 @@ def prepare_accelerator(args: argparse.Namespace):
            if args.wandb_api_key is not None:
                wandb.login(key=args.wandb_api_key)

+    # torch.compile option: when this is "NO", torch.compile is not used
+    dynamo_backend = "NO"
+    if args.torch_compile:
+        dynamo_backend = args.dynamo_backend
+
    kwargs_handlers = (
        InitProcessGroupKwargs(timeout=datetime.timedelta(minutes=args.ddp_timeout)) if args.ddp_timeout else None,
        DistributedDataParallelKwargs(gradient_as_bucket_view=args.ddp_gradient_as_bucket_view, static_graph=args.ddp_static_graph)
@@ -3889,6 +3917,7 @@ def prepare_accelerator(args: argparse.Namespace):
        log_with=log_with,
        project_dir=logging_dir,
        kwargs_handlers=kwargs_handlers,
+        dynamo_backend=dynamo_backend,
    )
    return accelerator
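The two new arguments interact as follows: `--dynamo_backend` only takes effect when `--torch_compile` is also given, and otherwise the backend passed on is `"NO"`. The sketch below reproduces just that selection logic with `argparse`, without creating an `Accelerator`. `build_parser` and `resolve_dynamo_backend` are hypothetical helper names, not functions from the repository.

```python
import argparse


def build_parser():
    """Parser holding only the two options added in the hunk above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--torch_compile", action="store_true")
    parser.add_argument(
        "--dynamo_backend",
        type=str,
        default="inductor",
        choices=["eager", "aot_eager", "inductor", "aot_ts_nvfuser", "nvprims_nvfuser", "cudagraphs", "ofi", "fx2trt", "onnxrt"],
    )
    return parser


def resolve_dynamo_backend(args):
    """Return the chosen backend, or "NO" when torch.compile is not requested."""
    return args.dynamo_backend if args.torch_compile else "NO"


parser = build_parser()
print(resolve_dynamo_backend(parser.parse_args([])))
print(resolve_dynamo_backend(parser.parse_args(["--torch_compile"])))
print(resolve_dynamo_backend(parser.parse_args(["--torch_compile", "--dynamo_backend", "eager"])))
```

Note that passing `--dynamo_backend` without `--torch_compile` is accepted but has no effect.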

--- a/networks/extract_lora_from_models.py
+++ b/networks/extract_lora_from_models.py
@@ -43,6 +43,9 @@ def svd(
    clamp_quantile=0.99,
    min_diff=0.01,
    no_metadata=False,
+    load_precision=None,
+    load_original_model_to=None,
+    load_tuned_model_to=None,
 ):
    def str_to_dtype(p):
        if p == "float":
@@ -57,28 +60,51 @@ def svd(
    if v_parameterization is None:
        v_parameterization = v2

+    load_dtype = str_to_dtype(load_precision) if load_precision else None
    save_dtype = str_to_dtype(save_precision)
+    work_device = "cpu"

    # load models
    if not sdxl:
        print(f"loading original SD model : {model_org}")
        text_encoder_o, _, unet_o = model_util.load_models_from_stable_diffusion_checkpoint(v2, model_org)
        text_encoders_o = [text_encoder_o]
+        if load_dtype is not None:
+            text_encoder_o = text_encoder_o.to(load_dtype)
+            unet_o = unet_o.to(load_dtype)
+
        print(f"loading tuned SD model : {model_tuned}")
        text_encoder_t, _, unet_t = model_util.load_models_from_stable_diffusion_checkpoint(v2, model_tuned)
        text_encoders_t = [text_encoder_t]
+        if load_dtype is not None:
+            text_encoder_t = text_encoder_t.to(load_dtype)
+            unet_t = unet_t.to(load_dtype)
+
        model_version = model_util.get_model_version_str_for_sd1_sd2(v2, v_parameterization)
    else:
+        device_org = load_original_model_to if load_original_model_to else "cpu"
+        device_tuned = load_tuned_model_to if load_tuned_model_to else "cpu"
+
        print(f"loading original SDXL model : {model_org}")
        text_encoder_o1, text_encoder_o2, _, unet_o, _, _ = sdxl_model_util.load_models_from_sdxl_checkpoint(
-            sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0, model_org, "cpu"
+            sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0, model_org, device_org
        )
        text_encoders_o = [text_encoder_o1, text_encoder_o2]
+        if load_dtype is not None:
+            text_encoder_o1 = text_encoder_o1.to(load_dtype)
+            text_encoder_o2 = text_encoder_o2.to(load_dtype)
+            unet_o = unet_o.to(load_dtype)
+
        print(f"loading original SDXL model : {model_tuned}")
        text_encoder_t1, text_encoder_t2, _, unet_t, _, _ = sdxl_model_util.load_models_from_sdxl_checkpoint(
-            sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0, model_tuned, "cpu"
+            sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0, model_tuned, device_tuned
        )
        text_encoders_t = [text_encoder_t1, text_encoder_t2]
+        if load_dtype is not None:
+            text_encoder_t1 = text_encoder_t1.to(load_dtype)
+            text_encoder_t2 = text_encoder_t2.to(load_dtype)
+            unet_t = unet_t.to(load_dtype)
+
        model_version = sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0

    # create LoRA network to extract weights: Use dim (rank) as alpha
@@ -100,38 +126,54 @@ def svd(
        lora_name = lora_o.lora_name
        module_o = lora_o.org_module
        module_t = lora_t.org_module
-        diff = module_t.weight - module_o.weight
+        diff = module_t.weight.to(work_device) - module_o.weight.to(work_device)
+
+        # clear weight to save memory
+        module_o.weight = None
+        module_t.weight = None

        # Text Encoder might be same
        if not text_encoder_different and torch.max(torch.abs(diff)) > min_diff:
            text_encoder_different = True
            print(f"Text encoder is different. {torch.max(torch.abs(diff))} > {min_diff}")

-        diff = diff.float()
        diffs[lora_name] = diff

+    # clear target Text Encoder to save memory
+    for text_encoder in text_encoders_t:
+        del text_encoder
+
    if not text_encoder_different:
        print("Text encoder is same. Extract U-Net only.")
        lora_network_o.text_encoder_loras = []
-        diffs = {}
+        diffs = {}  # clear diffs

    for i, (lora_o, lora_t) in enumerate(zip(lora_network_o.unet_loras, lora_network_t.unet_loras)):
        lora_name = lora_o.lora_name
        module_o = lora_o.org_module
        module_t = lora_t.org_module
-        diff = module_t.weight - module_o.weight
-        diff = diff.float()
+        diff = module_t.weight.to(work_device) - module_o.weight.to(work_device)

-        if args.device:
-            diff = diff.to(args.device)
+        # clear weight to save memory
+        module_o.weight = None
+        module_t.weight = None

        diffs[lora_name] = diff

+    # clear LoRA network, target U-Net to save memory
+    del lora_network_o
+    del lora_network_t
+    del unet_t
+
    # make LoRA with svd
    print("calculating by svd")
    lora_weights = {}
    with torch.no_grad():
        for lora_name, mat in tqdm(list(diffs.items())):
+            if args.device:
+                mat = mat.to(args.device)
+            mat = mat.to(torch.float)  # calc by float
+
            # if conv_dim is None, diffs do not include LoRAs for conv2d-3x3
            conv2d = len(mat.size()) == 4
            kernel_size = None if not conv2d else mat.size()[2:4]
@@ -171,8 +213,8 @@ def svd(
                U = U.reshape(out_dim, rank, 1, 1)
                Vh = Vh.reshape(rank, in_dim, kernel_size[0], kernel_size[1])

-            U = U.to("cpu").contiguous()
-            Vh = Vh.to("cpu").contiguous()
+            U = U.to(work_device, dtype=save_dtype).contiguous()
+            Vh = Vh.to(work_device, dtype=save_dtype).contiguous()

            lora_weights[lora_name] = (U, Vh)

@@ -230,6 +272,13 @@ def setup_parser() -> argparse.ArgumentParser:
    parser.add_argument(
        "--sdxl", action="store_true", help="load Stable Diffusion SDXL base model / Stable Diffusion SDXL baseのモデルを読み込む"
    )
+    parser.add_argument(
+        "--load_precision",
+        type=str,
+        default=None,
+        choices=[None, "float", "fp16", "bf16"],
+        help="precision in loading, model default if omitted / 読み込み時に精度を変更して読み込む、省略時はモデルファイルによる"
+    )
    parser.add_argument(
        "--save_precision",
        type=str,
@@ -285,6 +334,18 @@ def setup_parser() -> argparse.ArgumentParser:
        help="do not save sai modelspec metadata (minimum ss_metadata for LoRA is saved) / "
        + "sai modelspecのメタデータを保存しない（LoRAの最低限のss_metadataは保存される）",
    )
+    parser.add_argument(
+        "--load_original_model_to",
+        type=str,
+        default=None,
+        help="location to load original model, cpu or cuda, cuda:0, etc, default is cpu, only for SDXL / 元モデル読み込み先、cpuまたはcuda、cuda:0など、省略時はcpu、SDXLのみ有効",
+    )
+    parser.add_argument(
+        "--load_tuned_model_to",
+        type=str,
+        default=None,
+        help="location to load tuned model, cpu or cuda, cuda:0, etc, default is cpu, only for SDXL / 派生モデル読み込み先、cpuまたはcuda、cuda:0など、省略時はcpu、SDXLのみ有効",
+    )

    return parser

--- a/requirements.txt
+++ b/requirements.txt
@@ -1,10 +1,10 @@
-accelerate==0.23.0
-transformers==4.30.2
-diffusers[torch]==0.21.2
+accelerate==0.25.0
+transformers==4.36.2
+diffusers[torch]==0.25.0
 ftfy==6.1.1
 # albumentations==1.3.0
 opencv-python==4.7.0.68
-einops==0.6.0
+einops==0.6.1
 pytorch-lightning==1.9.0
 # bitsandbytes==0.39.1
 tensorboard==2.10.1
@@ -14,7 +14,7 @@ altair==4.2.2
 easygui==0.98.3
 toml==0.10.2
 voluptuous==0.13.1
-huggingface-hub==0.15.1
+huggingface-hub==0.20.1
 # for BLIP captioning
 # requests==2.28.2
 # timm==0.6.12
--- a/sdxl_gen_img.py
+++ b/sdxl_gen_img.py
@@ -18,15 +18,10 @@ import diffusers
 import numpy as np
 import torch

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 import torchvision
 from diffusers import (
    AutoencoderKL,
--- a/sdxl_minimal_inference.py
+++ b/sdxl_minimal_inference.py
@@ -9,13 +9,11 @@ import random
 from einops import repeat
 import numpy as np
 import torch
-try:
-    import intel_extension_for_pytorch as ipex
-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
-        ipex_init()
-except Exception:
-    pass
+
+from library.ipex_interop import init_ipex
+
+init_ipex()
+
 from tqdm import tqdm
 from transformers import CLIPTokenizer
 from diffusers import EulerDiscreteScheduler
--- a/sdxl_train.py
+++ b/sdxl_train.py
@@ -11,15 +11,10 @@ import toml
 from tqdm import tqdm
 import torch

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler
 from library import sdxl_model_util
@@ -457,6 +452,8 @@ def train(args):

    if accelerator.is_main_process:
        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs['wandb'] = {'name': args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers("finetuning" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs)
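In the hunk above, `init_kwargs` is reassigned from `toml.load(...)` whenever `args.log_tracker_config` is set, so a `wandb` name stored just before it is replaced by the contents of the config file. If both settings should be kept, one possible approach is to start from the loaded config and merge the run name into it. The sketch below works on plain dicts. `build_init_kwargs` is a hypothetical helper, and letting the command-line name take precedence over a name in the config is an assumption, not behavior from the repository.

```python
def build_init_kwargs(wandb_run_name=None, tracker_config=None):
    """Merge an optional W&B run name into an optional, already-loaded tracker config."""
    # Shallow copy so the caller's dict is not modified.
    init_kwargs = dict(tracker_config) if tracker_config else {}
    if wandb_run_name:
        # Keep any existing wandb settings and set (or override) only the name.
        init_kwargs["wandb"] = {**init_kwargs.get("wandb", {}), "name": wandb_run_name}
    return init_kwargs


config = {"wandb": {"project": "demo"}, "tensorboard": {"flush_secs": 30}}
print(build_init_kwargs("run-a", config))
```

Here `tracker_config` stands for the dict that `toml.load(args.log_tracker_config)` would return.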
--- a/sdxl_train_control_net_lllite.py
+++ b/sdxl_train_control_net_lllite.py
@@ -14,13 +14,11 @@ import toml

 from tqdm import tqdm
 import torch
-try:
-    import intel_extension_for_pytorch as ipex
-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
-        ipex_init()
-except Exception:
-    pass
+
+from library.ipex_interop import init_ipex
+
+init_ipex()
+
 from torch.nn.parallel import DistributedDataParallel as DDP
 from accelerate.utils import set_seed
 import accelerate
@@ -342,6 +340,8 @@ def train(args):

    if accelerator.is_main_process:
        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs['wandb'] = {'name': args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers(
--- a/sdxl_train_control_net_lllite_old.py
+++ b/sdxl_train_control_net_lllite_old.py
@@ -11,13 +11,11 @@ import toml

 from tqdm import tqdm
 import torch
-try:
-    import intel_extension_for_pytorch as ipex
-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
-        ipex_init()
-except Exception:
-    pass
+
+from library.ipex_interop import init_ipex
+
+init_ipex()
+
 from torch.nn.parallel import DistributedDataParallel as DDP
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler, ControlNetModel
--- a/sdxl_train_network.py
+++ b/sdxl_train_network.py
@@ -1,15 +1,10 @@
 import argparse
 import torch

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 from library import sdxl_model_util, sdxl_train_util, train_util
 import train_network

@@ -95,8 +90,8 @@ class SdxlNetworkTrainer(train_network.NetworkTrainer):
                unet.to(org_unet_device)
        else:
            # Text Encoderから毎回出力を取得するので、GPUに乗せておく
-            text_encoders[0].to(accelerator.device)
-            text_encoders[1].to(accelerator.device)
+            text_encoders[0].to(accelerator.device, dtype=weight_dtype)
+            text_encoders[1].to(accelerator.device, dtype=weight_dtype)

    def get_text_cond(self, args, accelerator, batch, tokenizers, text_encoders, weight_dtype):
        if "text_encoder_outputs1_list" not in batch or batch["text_encoder_outputs1_list"] is None:
--- a/sdxl_train_textual_inversion.py
+++ b/sdxl_train_textual_inversion.py
@@ -3,13 +3,9 @@ import os

 import regex
 import torch
-try:
-    import intel_extension_for_pytorch as ipex
-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
-        ipex_init()
-except Exception:
-    pass
+from library.ipex_interop import init_ipex
+
+init_ipex()
 import open_clip
 from library import sdxl_model_util, sdxl_train_util, train_util

--- a/train_controlnet.py
+++ b/train_controlnet.py
@@ -12,15 +12,10 @@ import toml
 from tqdm import tqdm
 import torch

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 from torch.nn.parallel import DistributedDataParallel as DDP
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler, ControlNetModel
@@ -336,6 +331,8 @@ def train(args):
    )
    if accelerator.is_main_process:
        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs['wandb'] = {'name': args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers(
--- a/train_db.py
+++ b/train_db.py
@@ -12,15 +12,10 @@ import toml
 from tqdm import tqdm
 import torch

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler

@@ -268,6 +263,8 @@ def train(args):

    if accelerator.is_main_process:
        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs['wandb'] = {'name': args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers("dreambooth" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs)
--- a/train_network.py
+++ b/train_network.py
@@ -14,15 +14,10 @@ from tqdm import tqdm
 import torch
 from torch.nn.parallel import DistributedDataParallel as DDP

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler
 from library import model_util
@@ -117,7 +112,7 @@ class NetworkTrainer:
        self, args, accelerator, unet, vae, tokenizers, text_encoders, data_loader, weight_dtype
    ):
        for t_enc in text_encoders:
-            t_enc.to(accelerator.device)
+            t_enc.to(accelerator.device, dtype=weight_dtype)

    def get_text_cond(self, args, accelerator, batch, tokenizers, text_encoders, weight_dtype):
        input_ids = batch["input_ids"].to(accelerator.device)
@@ -278,6 +273,7 @@ class NetworkTrainer:
            accelerator.wait_for_everyone()

        # 必要ならテキストエンコーダーの出力をキャッシュする: Text Encoderはcpuまたはgpuへ移される
+        # cache text encoder outputs if needed: Text Encoder is moved to cpu or gpu
        self.cache_text_encoder_outputs_if_needed(
            args, accelerator, unet, vae, tokenizers, text_encoders, train_dataset_group, weight_dtype
        )
@@ -309,6 +305,7 @@ class NetworkTrainer:
            )
        if network is None:
            return
+        network_has_multiplier = hasattr(network, "set_multiplier")

        if hasattr(network, "prepare_network"):
            network.prepare_network(args)
@@ -389,17 +386,33 @@ class NetworkTrainer:
            accelerator.print("enable full bf16 training.")
            network.to(weight_dtype)

+        unet_weight_dtype = te_weight_dtype = weight_dtype
+        # Experimental Feature: Put base model into fp8 to save vram
+        if args.fp8_base:
+            assert torch.__version__ >= "2.1.0", "fp8_base requires torch>=2.1.0 / fp8を使う場合はtorch>=2.1.0が必要です。"
+            assert (
+                args.mixed_precision != "no"
+            ), "fp8_base requires mixed precision='fp16' or 'bf16' / fp8を使う場合はmixed_precision='fp16'または'bf16'が必要です。"
+            accelerator.print("enable fp8 training.")
+            unet_weight_dtype = torch.float8_e4m3fn
+            te_weight_dtype = torch.float8_e4m3fn
+
        unet.requires_grad_(False)
-        unet.to(dtype=weight_dtype)
+        unet.to(dtype=unet_weight_dtype)
        for t_enc in text_encoders:
            t_enc.requires_grad_(False)

-        # acceleratorがなんかよろしくやってくれるらしい
-        # TODO めちゃくちゃ冗長なのでコードを整理する
+            # in case of cpu, dtype is already set to fp32 because cpu does not support fp8/fp16/bf16
+            if t_enc.device.type != "cpu":
+                t_enc.to(dtype=te_weight_dtype)
+                # nn.Embedding not support FP8
+                t_enc.text_model.embeddings.to(dtype=(weight_dtype if te_weight_dtype != weight_dtype else te_weight_dtype))
+
+        # acceleratorがなんかよろしくやってくれるらしい / accelerator will do something good
        if train_unet:
            unet = accelerator.prepare(unet)
        else:
-            unet.to(accelerator.device, dtype=weight_dtype)  # move to device because unet is not prepared by accelerator
+            unet.to(accelerator.device, dtype=unet_weight_dtype)  # move to device because unet is not prepared by accelerator
        if train_text_encoder:
            if len(text_encoders) > 1:
                text_encoder = text_encoders = [accelerator.prepare(t_enc) for t_enc in text_encoders]
@@ -407,8 +420,8 @@ class NetworkTrainer:
                text_encoder = accelerator.prepare(text_encoder)
                text_encoders = [text_encoder]
        else:
-            for t_enc in text_encoders:
-                t_enc.to(accelerator.device, dtype=weight_dtype)
+            pass  # if text_encoder is not trained, no need to prepare. and device and dtype are already set
+
        network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(network, optimizer, train_dataloader, lr_scheduler)

        if args.gradient_checkpointing:
@@ -421,9 +434,6 @@ class NetworkTrainer:
                if train_text_encoder:
                    t_enc.text_model.embeddings.requires_grad_(True)

-            # set top parameter requires_grad = True for gradient checkpointing works
-            if not train_text_encoder:  # train U-Net only
-                unet.parameters().__next__().requires_grad_(True)
        else:
            unet.eval()
            for t_enc in text_encoders:
@@ -685,7 +695,7 @@ class NetworkTrainer:
        if accelerator.is_main_process:
            init_kwargs = {}
            if args.wandb_run_name:
-                init_kwargs['wandb'] = {'name': args.wandb_run_name}
+                init_kwargs["wandb"] = {"name": args.wandb_run_name}
            if args.log_tracker_config is not None:
                init_kwargs = toml.load(args.log_tracker_config)
            accelerator.init_trackers(
@@ -754,7 +764,17 @@ class NetworkTrainer:
                                accelerator.print("NaN found in latents, replacing with zeros")
                                latents = torch.nan_to_num(latents, 0, out=latents)
                        latents = latents * self.vae_scale_factor
-                    b_size = latents.shape[0]
+
+                    # get multiplier for each sample
+                    if network_has_multiplier:
+                        multipliers = batch["network_multipliers"]
+                        # if all multipliers are same, use single multiplier
+                        if torch.all(multipliers == multipliers[0]):
+                            multipliers = multipliers[0].item()
+                        else:
+                            raise NotImplementedError("multipliers for each sample is not supported yet")
+                        # print(f"set multiplier: {multipliers}")
+                        network.set_multiplier(multipliers)

                    with torch.set_grad_enabled(train_text_encoder), accelerator.autocast():
                        # Get the text embedding for conditioning
@@ -778,10 +798,24 @@ class NetworkTrainer:
                        args, noise_scheduler, latents
                    )

+                    # ensure the hidden state will require grad
+                    if args.gradient_checkpointing:
+                        for x in noisy_latents:
+                            x.requires_grad_(True)
+                        for t in text_encoder_conds:
+                            t.requires_grad_(True)
+
                    # Predict the noise residual
                    with accelerator.autocast():
                        noise_pred = self.call_unet(
-                            args, accelerator, unet, noisy_latents, timesteps, text_encoder_conds, batch, weight_dtype
+                            args,
+                            accelerator,
+                            unet,
+                            noisy_latents.requires_grad_(train_unet),
+                            timesteps,
+                            text_encoder_conds,
+                            batch,
+                            weight_dtype,
                        )

                    if args.v_parameterization:
@@ -808,10 +842,11 @@ class NetworkTrainer:
                    loss = loss.mean()  # 平均なのでbatch_sizeで割る必要なし

                    accelerator.backward(loss)
-                    self.all_reduce_network(accelerator, network)  # sync DDP grad manually
-                    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
-                        params_to_clip = accelerator.unwrap_model(network).get_trainable_params()
-                        accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
+                    if accelerator.sync_gradients:
+                        self.all_reduce_network(accelerator, network)  # sync DDP grad manually
+                        if args.max_grad_norm != 0.0:
+                            params_to_clip = accelerator.unwrap_model(network).get_trainable_params()
+                            accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)

                    optimizer.step()
                    lr_scheduler.step()
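
Review note on the `--fp8_base` guard in the `train_network.py` hunk above: `torch.__version__ >= "2.1.0"` compares two strings, so the ordering is lexicographic rather than numeric. That gives the intended result for the 2.x releases, but a version with a multi-digit major component would sort before `"2.1.0"`. A minimal sketch of a numeric comparison is below. `version_tuple` and `meets_minimum` are hypothetical helper names, not part of this PR, and the sketch uses only the standard library so it can be run without PyTorch installed.

```python
import re


def version_tuple(version: str) -> tuple:
    """Return the leading numeric (major, minor, patch) part of a version string.

    Suffixes such as "+cu118" or ".dev20230901" are ignored; a missing patch
    component is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version: str, minimum: str = "2.1.0") -> bool:
    """Compare versions numerically, component by component."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    for candidate in ["2.0.1", "2.1.0+cu118", "2.10.0", "10.0.0"]:
        print(candidate, "string:", candidate >= "2.1.0", "numeric:", meets_minimum(candidate))
```

In the training script this would be called as `meets_minimum(torch.__version__)`. The two comparisons only disagree for the last candidate, so this is a robustness point rather than a fix for a failure seen today.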
--- a/train_textual_inversion.py
+++ b/train_textual_inversion.py
@@ -8,15 +8,10 @@ import toml
 from tqdm import tqdm
 import torch

-try:
-    import intel_extension_for_pytorch as ipex
+from library.ipex_interop import init_ipex

-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
+init_ipex()

-        ipex_init()
-except Exception:
-    pass
 from accelerate.utils import set_seed
 from diffusers import DDPMScheduler
 from transformers import CLIPTokenizer
@@ -441,9 +436,10 @@ class TextualInversionTrainer:

            # Freeze all parameters except for the token embeddings in text encoder
            text_encoder.requires_grad_(True)
-            text_encoder.text_model.encoder.requires_grad_(False)
-            text_encoder.text_model.final_layer_norm.requires_grad_(False)
-            text_encoder.text_model.embeddings.position_embedding.requires_grad_(False)
+            unwrapped_text_encoder = accelerator.unwrap_model(text_encoder)
+            unwrapped_text_encoder.text_model.encoder.requires_grad_(False)
+            unwrapped_text_encoder.text_model.final_layer_norm.requires_grad_(False)
+            unwrapped_text_encoder.text_model.embeddings.position_embedding.requires_grad_(False)
            # text_encoder.text_model.embeddings.token_embedding.requires_grad_(True)

        unet.requires_grad_(False)
@@ -503,6 +499,8 @@ class TextualInversionTrainer:

        if accelerator.is_main_process:
            init_kwargs = {}
+            if args.wandb_run_name:
+                init_kwargs["wandb"] = {"name": args.wandb_run_name}
            if args.log_tracker_config is not None:
                init_kwargs = toml.load(args.log_tracker_config)
            accelerator.init_trackers(
@@ -603,7 +601,7 @@ class TextualInversionTrainer:

                    accelerator.backward(loss)
                    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
-                        params_to_clip = text_encoder.get_input_embeddings().parameters()
+                        params_to_clip = accelerator.unwrap_model(text_encoder).get_input_embeddings().parameters()
                        accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)

                    optimizer.step()
@@ -615,9 +613,11 @@ class TextualInversionTrainer:
                        for text_encoder, orig_embeds_params, index_no_updates in zip(
                            text_encoders, orig_embeds_params_list, index_no_updates_list
                        ):
-                            accelerator.unwrap_model(text_encoder).get_input_embeddings().weight[
+                            # if full_fp16/bf16, input_embeddings_weight is fp16/bf16, orig_embeds_params is fp32
+                            input_embeddings_weight = accelerator.unwrap_model(text_encoder).get_input_embeddings().weight
+                            input_embeddings_weight[index_no_updates] = orig_embeds_params.to(input_embeddings_weight.dtype)[
                                index_no_updates
-                            ] = orig_embeds_params[index_no_updates]
+                            ]

                # Checks if the accelerator has performed an optimization step behind the scenes
                if accelerator.sync_gradients:
@@ -725,14 +725,13 @@ class TextualInversionTrainer:
        is_main_process = accelerator.is_main_process
        if is_main_process:
            text_encoder = accelerator.unwrap_model(text_encoder)
+            updated_embs = text_encoder.get_input_embeddings().weight[token_ids].data.detach().clone()

        accelerator.end_training()

        if args.save_state and is_main_process:
            train_util.save_state_on_train_end(args, accelerator)

-        updated_embs = text_encoder.get_input_embeddings().weight[token_ids].data.detach().clone()
-
        if is_main_process:
            ckpt_name = train_util.get_last_ckpt_name(args, "." + args.save_model_as)
            save_model(ckpt_name, updated_embs_list, global_step, num_train_epochs, force_sync_upload=True)
--- a/train_textual_inversion_XTI.py
+++ b/train_textual_inversion_XTI.py
@@ -8,13 +8,11 @@ from multiprocessing import Value

 from tqdm import tqdm
 import torch
-try:
-    import intel_extension_for_pytorch as ipex
-    if torch.xpu.is_available():
-        from library.ipex import ipex_init
-        ipex_init()
-except Exception:
-    pass
+
+from library.ipex_interop import init_ipex
+
+init_ipex()
+
 from accelerate.utils import set_seed
 import diffusers
 from diffusers import DDPMScheduler
@@ -394,6 +392,8 @@ def train(args):

    if accelerator.is_main_process:
        init_kwargs = {}
+        if args.wandb_run_name:
+            init_kwargs['wandb'] = {'name': args.wandb_run_name}
        if args.log_tracker_config is not None:
            init_kwargs = toml.load(args.log_tracker_config)
        accelerator.init_trackers("textual_inversion" if args.log_tracker_name is None else args.log_tracker_name, init_kwargs=init_kwargs)
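
Review note on the `wandb_run_name` hunks above (the same pattern appears in `train_controlnet.py`, `train_db.py`, `train_network.py`, `train_textual_inversion.py` and `train_textual_inversion_XTI.py`): as written, `init_kwargs = toml.load(args.log_tracker_config)` rebinds `init_kwargs`, so a run name set from `--wandb_run_name` on the preceding lines is replaced whenever a tracker config file is also given. Whether that is intended is not stated in the diff. If both sources should be combined, one option is sketched below. `build_init_kwargs` is a hypothetical helper, not part of this PR, and letting the config file win over the command-line name is an assumption made only for illustration. The helper takes the already-parsed config dict, so it does not depend on `toml`.

```python
from typing import Any, Dict, Optional


def build_init_kwargs(
    wandb_run_name: Optional[str],
    tracker_config: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge a parsed tracker config with an optional wandb run name.

    The config is copied one level deep so the caller's dict is not mutated.
    Assumption: a "name" already present under [wandb] in the config is kept.
    """
    init_kwargs: Dict[str, Any] = {}
    for section, values in (tracker_config or {}).items():
        init_kwargs[section] = dict(values) if isinstance(values, dict) else values

    if wandb_run_name:
        # Fill in the run name only where the config file did not set one.
        init_kwargs.setdefault("wandb", {}).setdefault("name", wandb_run_name)
    return init_kwargs


if __name__ == "__main__":
    print(build_init_kwargs("run-a", {"wandb": {"project": "demo"}}))
```

In the scripts this would be used as `init_kwargs = build_init_kwargs(args.wandb_run_name, toml.load(args.log_tracker_config) if args.log_tracker_config is not None else None)` before calling `accelerator.init_trackers`.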
Author	SHA1	Message	Date
Kohya S	d5ab97b69b	Merge pull request #1067 from kohya-ss/dev Dev	2024-01-23 21:04:16 +09:00
Kohya S	7cb44e4502	update readme	2024-01-23 21:02:40 +09:00
Kohya S	7a20df5ad5	Merge pull request #1064 from KohakuBlueleaf/fix-grad-sync Avoid grad sync on each step even when doing accumulation	2024-01-23 20:33:55 +09:00
Kohya S	bea4362e21	Merge pull request #1060 from akx/refactor-xpu-init Deduplicate ipex initialization code	2024-01-23 20:25:37 +09:00
Kohya S	6805cafa9b	fix TI training crashes in multigpu #1019	2024-01-23 20:17:19 +09:00
Kohaku-Blueleaf	711b40ccda	Avoid always sync	2024-01-23 11:49:03 +08:00
Kohya S	696dd7f668	Fix dtype issue in PyTorch 2.0 for generating samples in training sdxl network	2024-01-22 12:43:37 +09:00
Kohya S	e0a3c69223	update readme	2024-01-20 18:47:10 +09:00
Kohya S	c59249a664	Add options to reduce memory usage in extract_lora_from_models.py closes #1059	2024-01-20 18:45:54 +09:00
Kohya S	fef172966f	Add network_multiplier for dataset and train LoRA	2024-01-20 16:24:43 +09:00
Kohya S	5a1ebc4c7c	format by black	2024-01-20 13:10:45 +09:00
Kohya S	2a0f45aea9	update readme	2024-01-20 11:08:20 +09:00
Kohya S	1f77bb6e73	fix to work sample generation in fp8 ref #1057	2024-01-20 10:57:42 +09:00
Kohya S	a7ef6422b6	fix to work with torch 2.0	2024-01-20 10:00:30 +09:00
Kohaku-Blueleaf	9cfa68c92f	[Experimental Feature] FP8 weight dtype for base model when running train_network (or sdxl_train_network) (#1057 ) * Add fp8 support * remove some debug prints * Better implementation for te * Fix some misunderstanding * as same as unet, add explicit convert * better impl for convert TE to fp8 * fp8 for not only unet * Better cache TE and TE lr * match arg name * Fix with list * Add timeout settings * Fix arg style * Add custom seperator * Fix typo * Fix typo again * Fix dtype error * Fix gradient problem * Fix req grad * fix merge * Fix merge * Resolve merge * arrangement and document * Resolve merge error * Add assert for mixed precision	2024-01-20 09:46:53 +09:00
Aarni Koskela	6f3f701d3d	Deduplicate ipex initialization code	2024-01-19 18:07:36 +02:00
Kohya S	d2a99a19d4	Merge pull request #1056 from kohya-ss/dev fix vram usage in LoRA training	2024-01-17 21:41:36 +09:00
Kohya S	0395a35543	Merge branch 'main' into dev	2024-01-17 21:39:13 +09:00
Kohya S	987d4a969d	update readme	2024-01-17 21:38:49 +09:00
Kohya S	976d092c68	fix text encodes are on gpu even when not trained	2024-01-17 21:31:50 +09:00
Kohya S	e6b15c7e4a	Merge pull request #1053 from akx/sdpa Fix typo `--spda` (it's `--sdpa`)	2024-01-16 21:50:45 +09:00
Aarni Koskela	ef50436464	Fix typo `--spda` (it's `--sdpa`)	2024-01-16 14:32:48 +02:00
Kohya S	26d35794e3	Merge pull request #1052 from kohya-ss/dev merge dev	2024-01-15 21:39:02 +09:00
Kohya S	dcf0eeb5b6	update readme	2024-01-15 21:35:26 +09:00
Kohya S	32b759a328	Add wandb_run_name parameter to init_kwargs #1032	2024-01-14 22:02:03 +09:00
Kohya S	09ef3ffa8b	Merge branch 'main' into dev	2024-01-14 21:49:25 +09:00
Kohya S	aab265e431	Fix an issue with saving as diffusers sd1/2 model close #1033	2024-01-04 21:43:50 +09:00
Kohya S	716bad188b	Update dependencies ref #1024	2024-01-04 19:53:25 +09:00
Kohya S	07bf2a21ac	Merge pull request #1024 from p1atdev/main Add support for `torch.compile`	2024-01-04 10:49:52 +09:00
Kohya S	8ac2d2a92f	Merge pull request #1030 from Disty0/dev Update IPEX Libs	2024-01-04 10:46:07 +09:00
Kohya S	76aee71257	Merge branch 'main' into dev	2024-01-04 10:42:16 +09:00
Kohya S	663b481029	fix TI training with full_fp16/bf16 ref #1019	2024-01-03 23:22:00 +09:00
Kohya S	1ab6493268	Merge branch 'main' into dev	2024-01-03 21:36:31 +09:00
Disty0	b9d2181192	Cleanup	2024-01-02 11:51:29 +03:00
Disty0	49148eb36e	Disable Diffusers slicing if device is not XPU	2024-01-02 11:50:08 +03:00
Disty0	479bac447e	Fix typo	2024-01-01 12:51:23 +03:00
Disty0	15d5e78ac2	Update IPEX Libs	2024-01-01 12:44:26 +03:00
Plat	62e7516537	feat: support torch.compile	2023-12-27 02:17:24 +09:00
Plat	20296b4f0e	chore: bump eniops version due to support torch.compile	2023-12-27 02:17:24 +09:00
Kohya S	5cae6db804	Fix to work with DDP TextualInversionTrainer ref #1019	2023-12-24 22:05:56 +09:00