How to boost Stable Diffusion Web UI speed

18 November 2023

When you are generating images in large sizes and batches, knowing how to improve the performance of Stable Diffusion Web UI means a significant reduction in the generation time required.

The minimum requirement for Stable Diffusion Web UI is 2GB VRAM, but generation will be slow and you will run out of memory once you try to create images larger than 512 x 512. Fortunately, there are several ways to optimise Stable Diffusion Web UI to speed up the image generation process.

From my experience, the best setup for Stable Diffusion is a Windows machine with an Nvidia GPU that meets the recommended 6GB of VRAM.

Bear in mind that many variables affect the optimisation options, so it is best to test different combinations to find what gives you the best performance. Test each setting with the same checkpoint, generating 512 x 512 images with 20 steps using the Euler sampling method, and compare how long Web UI takes to generate an image. The sketch below shows one way to time each run.
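
If you launch Web UI with the --api flag, you can time runs programmatically instead of with a stopwatch. Here is a minimal sketch that calls the txt2img API endpoint; the local URL and the prompt are placeholders, so adjust them to your setup.

    import time
    import requests

    # Assumes Web UI is running locally with the --api flag enabled
    URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"

    payload = {
        "prompt": "a photo of a cat",  # keep the prompt identical across runs
        "steps": 20,
        "width": 512,
        "height": 512,
        "sampler_name": "Euler",
    }

    start = time.perf_counter()
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    print(f"Generated in {time.perf_counter() - start:.2f}s")

Run it once to warm up (the first generation includes model loading), then average a few runs per setting.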

Cross-attention optimisation

One of the critical operations Stable Diffusion performs is the cross-attention calculation. It involves the interaction between two sets of vectors: the queries and the keys. Cross-attention can consume a significant amount of memory and time.

Imagine you have a box of building blocks, and you want to build a tall tower. Some blocks are important for making it tall and stable, while others are not so important. You have a pair of special glasses that make the important blocks glow when you look at them through the glasses.

Cross-attention is like using the special glasses: it lets the model focus on the parts of the input data that matter most for generating the image.
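
In code, cross-attention boils down to a scaled dot product between queries and keys, followed by a softmax that weights the values. Here is a minimal PyTorch sketch of the idea; the shapes are illustrative, loosely matching a 64 x 64 latent and a 77-token prompt.

    import torch
    import torch.nn.functional as F

    def cross_attention(query, key, value):
        # query comes from the image latents; key and value from the text prompt
        scale = query.shape[-1] ** -0.5                  # 1 / sqrt(head dimension)
        scores = query @ key.transpose(-2, -1) * scale   # similarity of every query to every key
        weights = F.softmax(scores, dim=-1)              # the "special glasses": what matters most
        return weights @ value                           # weighted mix of the values

    q = torch.randn(1, 4096, 320)   # image tokens (64 x 64 latent)
    kv = torch.randn(1, 77, 320)    # text tokens from the prompt encoder
    out = cross_attention(q, kv, kv)
    print(out.shape)                # torch.Size([1, 4096, 320])

The scores matrix grows with the product of the two sequence lengths, which is why attention dominates memory use at larger image sizes and why it is worth optimising.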

Setting cross-attention optimisation

Because the cross-attention calculation has such an impact, optimising it is key to speeding up Stable Diffusion. You can set the cross-attention optimisation method in the Stable Diffusion Web UI.

  1. Launch Stable Diffusion Web UI.

  2. Go to the Settings tab and select Optimization in the sidebar.

  3. Choose your preferred cross-attention optimisation from the dropdown menu. The default is set to Automatic.

  4. Click Apply Settings to save the settings.

Doggettx

This is a historical improvement to cross-attention operations that offers a decent performance boost, but has been surpassed by newer options. Doggettx submitted the improvements to the original implementation in Stable Diffusion.

xFormers

Meta AI developed xFormers (pronounced "transformers"). It is a transformer library that increases the speed of the attention operation while reducing memory usage, through the memory-efficient attention and Flash Attention techniques.

Transformers are a type of neural network architecture that uses self-attention to determine the importance of different parts of the input data. xFormers integrates with PyTorch and CUDA libraries. CUDA is limited to Nvidia hardware, and hence xFormers is only available if you are using an Nvidia GPU.

Memory-efficient attention uses an algorithm that takes fewer steps and less memory to compute the attention operation, making it more efficient for large models and inputs.

Flash Attention uses tiling to compute attention one small piece at a time, reducing memory usage and speeding up calculations.
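
If you have xFormers installed, you can call its memory-efficient attention directly. A minimal sketch, assuming an Nvidia GPU and half-precision tensors (which most of its fast kernels require):

    import torch
    from xformers.ops import memory_efficient_attention

    # Shapes: [batch, sequence, heads, head_dim], fp16 on CUDA
    q = torch.randn(1, 4096, 8, 40, device="cuda", dtype=torch.float16)
    k = torch.randn(1, 77, 8, 40, device="cuda", dtype=torch.float16)
    v = torch.randn(1, 77, 8, 40, device="cuda", dtype=torch.float16)

    # Same result as standard attention, computed without materialising
    # the full attention matrix; xFormers picks the best available kernel
    out = memory_efficient_attention(q, k, v)
    print(out.shape)  # torch.Size([1, 4096, 8, 40])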

Scaled Dot-Product (SDP) Attention

SDP attention is an alternative implementation of memory-efficient attention and Flash Attention that is native to PyTorch, available in PyTorch 2 and newer. Depending on your hardware setup, you might get better performance with SDP attention than with xFormers. Note that it uses more VRAM than xFormers, so you might run into out-of-memory issues at larger image sizes.
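
SDP attention is exposed in PyTorch 2 as torch.nn.functional.scaled_dot_product_attention, which dispatches to a Flash, memory-efficient, or plain math kernel depending on your hardware. A minimal sketch:

    import torch
    import torch.nn.functional as F

    # Shapes: [batch, heads, sequence, head_dim]
    q = torch.randn(1, 8, 4096, 40, device="cuda", dtype=torch.float16)
    k = torch.randn(1, 8, 77, 40, device="cuda", dtype=torch.float16)
    v = torch.randn(1, 8, 77, 40, device="cuda", dtype=torch.float16)

    # PyTorch chooses the fastest available backend automatically
    out = F.scaled_dot_product_attention(q, k, v)
    print(out.shape)  # torch.Size([1, 8, 4096, 40])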

SDP attention gives non-deterministic output, meaning that the results are not reproducible. This is a problem if you want to be able to reproduce the same image from the same parameters.

If you are using Stable Diffusion to create art or images for general use, you generally won’t need deterministic output in your workflow. It is only crucial in research.

SDP Attention without Memory-Efficient Attention (SDP-no-mem)

SDP-no-mem is an implementation of SDP attention without the memory-efficient attention technique. This makes it produce deterministic output, and hence allows you to reproduce the results with the same parameters.

The drawback of using SDP-no-mem is sacrificing the memory-efficient optimisations in exchange for deterministic output.
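
In your own PyTorch 2 code, you can approximate what SDP-no-mem does by disallowing the memory-efficient backend, so PyTorch falls back to deterministic kernels. A sketch using the torch.backends.cuda.sdp_kernel context manager (PyTorch 2.0/2.1; newer releases replace it with torch.nn.attention.sdpa_kernel):

    import torch
    import torch.nn.functional as F

    q = torch.randn(1, 8, 4096, 40, device="cuda", dtype=torch.float16)
    k = torch.randn(1, 8, 77, 40, device="cuda", dtype=torch.float16)
    v = torch.randn(1, 8, 77, 40, device="cuda", dtype=torch.float16)

    # Disallow the memory-efficient backend, the source of the
    # non-determinism described above
    with torch.backends.cuda.sdp_kernel(enable_mem_efficient=False):
        out = F.scaled_dot_product_attention(q, k, v)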

Sub-Quadratic (sub-quad) Attention

Sub-quad attention is another implementation of memory-efficient attention. It significantly reduces the required memory, but this comes at the cost of speed.

This is useful if you're unable to run xFormers or SDP. Sub-quad attention also lets you generate larger images if you are on macOS.
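
The core idea is chunking: instead of computing the full attention matrix at once, only a slice of it exists in memory at any time. This toy sketch chunks the query dimension only; the real sub-quadratic implementation also chunks the keys and values:

    import torch
    import torch.nn.functional as F

    def chunked_attention(q, k, v, chunk_size=1024):
        # q: [seq_q, dim]; k, v: [seq_kv, dim]
        scale = q.shape[-1] ** -0.5
        out = []
        for i in range(0, q.shape[0], chunk_size):
            q_chunk = q[i:i + chunk_size]
            # Only a [chunk_size, seq_kv] slice of the attention
            # matrix is materialised per iteration
            weights = F.softmax(q_chunk @ k.T * scale, dim=-1)
            out.append(weights @ v)
        return torch.cat(out, dim=0)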

Split-Attention v1

Split-attention v1 is an older implementation of memory-efficient attention that has been surpassed by newer techniques like xFormers and SDP.

You should use xFormers or SDP where possible. Split-attention v1 uses less VRAM, so it can be a useful option if your hardware has limited memory, but it imposes a lower limit on the maximum image size you can generate.

Invoke AI

Invoke AI is an alternative GUI for Stable Diffusion. Its cross-attention optimisation is useful for macOS machines without Nvidia GPUs.

Token merging

Token merging (ToMe) is a new technique that accelerates Stable Diffusion by reducing the number of tokens that need processing. It does this by identifying and combining redundant tokens. Merging tokens changes the prompt processed, and hence changes the image output. This could be an issue if you are trying to reproduce the same image with the same parameters.
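
Outside Web UI, token merging is available as the tomesd package from the ToMe authors. A sketch applying it to a diffusers pipeline; the model name and the 0.5 merging ratio (merge up to half the tokens) are example values:

    import torch
    import tomesd
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Patch the pipeline so redundant tokens are merged before attention
    tomesd.apply_patch(pipe, ratio=0.5)

    image = pipe("a photo of a cat").images[0]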

I personally find it a better habit to practise good prompt engineering and optimise your prompt length. Be mindful when creating prompts and avoid redundant tokens.

You’ll find that many prompts out there are very badly structured. Instead of just copying prompts, take the time to remove redundancies. If you have a sample image to refer to, remove the tokens that don’t appear in the output that you want to generate.

With fewer tokens to process, generation is naturally faster. However, token merging doesn't seem to deliver as much improvement as the cross-attention optimisations. I would avoid using it unless you are getting very long generation times with your setup.

Setting token merging

  1. Launch Stable Diffusion Web UI.

  2. Go to the Settings tab and select Optimization in the sidebar.

  3. Choose your preferred token merging ratio by dragging the slider or keying in the ratio value.

  4. Click Apply Settings to save the settings.

Negative guidance minimum sigma

Negative guidance minimum sigma is an optimisation that adjusts sigma, a parameter representing the amount of noise in the generation process. By increasing the minimum sigma value, you increase the chance that the generation process skips the negative prompt for some steps when the image is almost ready.

Increasing the sigma value reduces the generation time, though I find the performance boost on par with token merging. Negative guidance minimum sigma alters the image output, but to a lesser extent than token merging. If you had to choose between the two, I would suggest going with negative guidance minimum sigma.

Again, I would avoid using this unless you are getting very slow performance with your setup.

Setting negative guidance minimum sigma

  1. Launch Stable Diffusion Web UI.

  2. Go to the Settings tab and select Optimization in the sidebar.

  3. Choose your preferred negative guidance minimum sigma by dragging the slider or keying in the sigma value.

  4. Click Apply Settings to save the settings.

Command line arguments

Stable Diffusion Web UI is launched from the command line, so you can provide command-line arguments to configure it at startup. Some of these arguments can be combined to improve the performance of Stable Diffusion Web UI.

If you launch Web UI from the terminal, you can add the arguments to the command. If you launch Web UI by double-clicking the webui-user.bat or run.bat files, edit webui-user.bat (Windows) or webui-user.sh (Mac or Linux) in a text editor and add the arguments there.

In webui-user.bat, add the arguments to the line set COMMANDLINE_ARGS=.
In webui-user.sh, add the arguments to the line export COMMANDLINE_ARGS=.

For example, set COMMANDLINE_ARGS=--skip-torch-cuda-test --no-half-vae --api --opt-sdp-attention

There is a full list of command line arguments you can use with Stable Diffusion Web UI on GitHub.

Optimisation method arguments

These are the arguments that enable the optimisations mentioned in this article:

  • --opt-sdp-attention – Enables SDP attention optimisation

  • --opt-sdp-no-mem-attention – Enables SDP-no-mem

  • --xformers – Enables xFormers

  • --force-enable-xformers – Enables xFormers regardless of whether the program thinks you can run it

  • --opt-split-attention – Enables cross-attention layer optimisation; enabled by default for torch.cuda for both Nvidia and AMD cards

  • --disable-opt-split-attention – Disables the cross-attention optimisation

  • --opt-sub-quad-attention – Enables sub-quad attention optimisation

  • --opt-split-attention-v1 – Enables split attention v1

Performance options arguments

You can also add other arguments to improve the performance of Stable Diffusion Web UI:

  • --medvram – Splits the Stable Diffusion model into three parts and loads only one into VRAM at a time, keeping the others in CPU RAM. It slows down generation but lets you generate images with a lower VRAM ceiling.

  • --medvram-sdxl – Enables --medvram only for SDXL models

  • --lowvram – An even more thorough optimisation that splits the third part, the unet, into many modules, keeping only one module in VRAM at a time. Generation becomes very, very slow.

  • --lowram – Load Stable Diffusion checkpoint weights to VRAM instead of RAM for machines that have limited RAM

  • --upcast-sampling – Improves generation speed for machines that need to run with --no-half. Better performance and VRAM usage than --no-half.
