How much VRAM do I need to train a LoRA in 2026?

SD 1.5: 6-8GB. SDXL: 10GB with fused backward pass and bf16 (Kohya SS v0.9+), 24GB without. FLUX.1: 16-24GB depending on quantization. FLUX.2 dev: 24GB+ (80GB recommended for full dev). FLUX.2 klein: 12-16GB.

What is the fused backward pass and when should I use it?

Fused backward pass combines the optimizer step with the backward pass for each parameter, significantly reducing VRAM usage. It requires Kohya SS v0.9+ and only works with the Adafactor optimizer. It can reduce SDXL training from 24GB to approximately 10GB with bf16 precision.

What is the recommended LoRA+ ratio for character training?

A 16x ratio is the most commonly recommended starting point for LoRA+ (loraplus_lr_ratio=16). Some users report 32x works better for fine detail, but 16x is the documented default across community guides.

Which tool should I use for FLUX.2 LoRA training?

AI-Toolkit by Ostris is the primary tool for FLUX.2 training, supporting dev and klein variants. FluxGym and Kohya SS support FLUX.1. For FLUX.2 dev, you need 24GB+ VRAM. For FLUX.2 klein (4B params), 12-16GB is sufficient.

What optimizer should I use with fused backward pass?

Fused backward pass requires the Adafactor optimizer. AdamW8bit and other optimizers are not compatible with this feature. If you need to use AdamW8bit, use optimizer groups (--fused_optimizer_groups=8) instead.

Why does my LoRA training run out of VRAM even with fused backward pass?

Check your resolution and batch size settings first, since they dominate VRAM usage. With fused backward pass on Kohya SS v0.9+, you should be able to train SDXL at 1024x1024 with a batch size of 2 on 12GB. If you're still getting OOM, enable gradient checkpointing as a last resort: it reduces VRAM by roughly 2GB but increases training time by about 30%.

What are the best LoRA training parameters for stable results?

For content/concept LoRAs: rank=16-32, alpha=8, base learning rate of 0.0004 with LoRA+ ratio=16. For face/character LoRAs: rank=64-128, alpha=32, base learning rate of 0.0001 with AdamW8bit. These settings consistently produce coherent results across 10-20 training images on both Kohya SS and AI-Toolkit.

HOW TO TRAIN A LORA IN 2026: KOHYA SS, FLUX & VRAM OPTIMIZATION (COMPLETE GUIDE)

20/5/2026
Updated 10/7/2026

Training LoRAs on consumer GPUs in 2026 is practical across all major base models — SD 1.5, SDXL, FLUX.1, and FLUX.2. The key advances that made this possible are the fused backward pass (Kohya SS v0.9+, January 2025), LoRA+ for better convergence, and FLUX.2 klein for lightweight FLUX training. If you’ve hit a VRAM ceiling or your latest LoRA came out looking like abstract art, you’re in the right place.

This guide covers the tool ecosystem, VRAM optimization techniques, and per-model parameter recommendations. It draws on the Kohya SS repository, the AI-Toolkit by Ostris, and community-validated settings from r/StableDiffusion and RunDiffusion guides.

For character LoRA training from scratch including dataset preparation and inference, see the character LoRA training guide. This guide focuses on tool selection, VRAM optimization, and advanced training parameters.

Who Is This Guide For?

You have some experience with Stable Diffusion or FLUX and want to train your own LoRAs. You may be on a consumer GPU with 8-16GB VRAM and need to know what’s possible. You want documented, verified settings — not trial and error.

By the End of This, You’ll Know

Which training tool to use for your base model and GPU
How to reduce VRAM usage from 24GB to 10GB using fused backward pass
The optimal LoRA+ ratio and how to configure it
Complete setup steps for SD 1.5, SDXL, FLUX.1, and FLUX.2
How to diagnose and fix common training issues

Tool Ecosystem

Three primary tools cover the LoRA training landscape in 2026.

Kohya SS (sd-scripts) — Standard for SD 1.5, SDXL, FLUX.1

Kohya SS sd-scripts is the most widely used LoRA training framework. Version 0.9.0 (January 2025) introduced the fused backward pass, which is the primary VRAM optimization for SDXL training. The Kohya SS GUI by bmaltais provides a graphical interface wrapping the core scripts.

Key features in v0.9+:

Fused backward pass — combines optimizer step with backward pass, requiring Adafactor optimizer. Documented in the Kohya SD-Scripts SDXL training guide
LoRA+ — separate learning rates for LoRA-A and LoRA-B matrices via loraplus_lr_ratio
Block-wise training and alpha mask training for transparent layers
Support for SD 1.5, SDXL, and FLUX.1

AI-Toolkit by Ostris — Standard for FLUX.2

AI-Toolkit is the primary training suite for FLUX.2 models. It provides a web UI and supports FLUX.2 dev (32B parameters) and FLUX.2 klein (4B parameters). The RunComfy FLUX.2 dev training guide recommends 1024x1024 resolution with 20-60+ images and 1000-3000 steps.

VRAM requirements per the AI-Toolkit training guide:

FLUX.2 dev: 24GB minimum, 80GB recommended for full training runs
FLUX.2 klein: 12-16GB
Qwen Image: 32GB minimum
Z-Image: 12-16GB
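
The minimums above can be turned into a quick lookup. This is a minimal sketch, not part of AI-Toolkit: the table only restates the figures listed above (taking the lower bound of each range), and the names MIN_VRAM_GB and models_that_fit are hypothetical.

```python
# Minimum VRAM in GB, copied from the list above (lower bound of each range).
MIN_VRAM_GB = {
    "FLUX.2 dev": 24,
    "FLUX.2 klein": 12,
    "Qwen Image": 32,
    "Z-Image": 12,
}


def models_that_fit(vram_gb):
    """Return the models whose stated minimum fits in vram_gb, sorted by name."""
    return sorted(name for name, need in MIN_VRAM_GB.items() if need <= vram_gb)


if __name__ == "__main__":
    for card in (12, 16, 24):
        print(f"{card}GB:", ", ".join(models_that_fit(card)) or "none")
```

Meeting the minimum only means a run can start; the 80GB recommendation for full FLUX.2 dev runs still applies.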

FluxGym — Web UI for FLUX.1

FluxGym provides a simpler web UI specifically for FLUX.1 LoRA training. It explicitly supports 12GB/16GB/20GB VRAM configurations and wraps Kohya sd-scripts under the hood.

VRAM Optimization

The fused backward pass in Kohya SS v0.9+ is the most impactful optimization for consumer GPU training. It runs the optimizer step for each parameter during the backward pass instead of after it, so the gradients do not all have to be held in memory at once. This reduces SDXL VRAM usage from approximately 24GB to roughly 17GB at standard precision, or about 10GB with bf16. The Kohya SS releases page documents this as the headline feature of v0.9.0.

Requirements:

Kohya SS v0.9.0 or later
PyTorch 2.1 or newer
Adafactor optimizer (incompatible with AdamW8bit)
Flag: --fused_backward_pass
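
To get a feel for where the saving comes from, consider gradient storage alone. The sketch below is a deliberately simplified model, not Kohya's implementation: it assumes a conventional loop keeps every gradient until the optimizer step, while a fused loop frees each gradient as soon as its parameter has been updated. The function name and the tensor sizes are made up for illustration.

```python
def peak_gradient_bytes(param_counts, bytes_per_value=4, fused=False):
    """Estimate peak gradient memory for a list of per-tensor parameter counts.

    Conventional training keeps all gradients alive until the optimizer step,
    so the peak is the sum. With a fused step, only one tensor's gradient is
    alive at a time, so the peak is the largest single tensor.
    """
    sizes = [count * bytes_per_value for count in param_counts]
    if not sizes:
        return 0
    return max(sizes) if fused else sum(sizes)


if __name__ == "__main__":
    layers = [1_000_000, 4_000_000, 2_000_000]  # hypothetical tensor sizes
    print("conventional:", peak_gradient_bytes(layers))
    print("fused:       ", peak_gradient_bytes(layers, fused=True))
```

Real savings also depend on weights, activations, and optimizer state, which this model ignores.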

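SDXL training command with fused backward pass (sdxl_train_network.py needs --network_module, and the --optimizer_args line makes Adafactor use the fixed --learning_rate instead of its own relative-step schedule):

accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --train_data_dir="./your_training_images" \
  --output_dir="./output" \
  --network_module="networks.lora" \
  --fused_backward_pass \
  --optimizer_type="adafactor" \
  --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
  --mixed_precision="bf16" \
  --learning_rate=1e-4 \
  --max_train_epochs=10 \
  --save_every_n_epochs=2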

If you need to use AdamW8bit, the alternative is optimizer groups via --fused_optimizer_groups=8. This splits the parameters into groups and steps each group separately, which reduces memory without changing the optimizer. It cannot be combined with --fused_backward_pass. Values between 4 and 10 groups provide the best balance.

Additional optimizations for all tools:

Mixed precision (bf16) can cut training time by roughly half. Use --mixed_precision=bf16 on RTX 30-series and newer GPUs, where it is more stable than fp16.
xFormers reduces memory and speeds up attention computation. Install matching your CUDA version: pip install xformers. Add --xformers to the training command.
Gradient checkpointing trades compute for memory: --gradient_checkpointing
Reduce resolution for faster training. 512x512 instead of 768x768 is roughly 2x faster on most models.
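
The speed-up from lowering resolution follows roughly from the pixel count. As a back-of-the-envelope sketch, assuming cost scales linearly with pixels (which ignores attention's steeper scaling and any fixed overhead):

```python
def pixel_ratio(high_res, low_res):
    """Ratio of pixel counts between two square resolutions."""
    return (high_res * high_res) / (low_res * low_res)


if __name__ == "__main__":
    print(pixel_ratio(768, 512))   # 2.25
    print(pixel_ratio(1024, 512))  # 4.0
```

A 768x768 image has 2.25 times the pixels of a 512x512 one, which is where the "roughly 2x" figure comes from.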

LoRA+

LoRA+ trains the LoRA-B (up) matrices with a higher learning rate than the LoRA-A (down) matrices, which improves convergence speed and detail capture. The multiplier is set with loraplus_lr_ratio, passed as a network argument:

--network_args "loraplus_lr_ratio=16"

A 16x ratio is the most commonly recommended starting point across community guides. The ratio can also be set per component with loraplus_unet_lr_ratio and loraplus_text_encoder_lr_ratio for finer control.
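
To make the ratio concrete, here is a minimal sketch of the learning rates it implies. It assumes the LoRA+ convention that the B (up) matrices train at the base rate multiplied by the ratio while the A (down) matrices keep the base rate. The function name is hypothetical and this is not Kohya's code.

```python
def loraplus_learning_rates(base_lr, ratio=16):
    """Return the per-matrix learning rates implied by a LoRA+ ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return {"lora_A": base_lr, "lora_B": base_lr * ratio}


if __name__ == "__main__":
    rates = loraplus_learning_rates(1e-4, ratio=16)
    for name, lr in rates.items():
        print(f"{name}: {lr:.1e}")
```

With a base rate of 1e-4 and a 16x ratio, the B matrices train at 1.6e-3, which is why a base rate that was stable without LoRA+ may need lowering.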

Per-Model Training Configurations

SD 1.5

Best for low-VRAM scenarios. Runs on 6-8GB GPUs.

accelerate launch train_network.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
  --train_data_dir="./dataset" \
  --output_dir="./output" \
  --caption_extension=".txt" \
  --network_module="networks.lora" \
  --network_args "loraplus_lr_ratio=16" \
  --network_dim=64 \
  --network_alpha=32 \
  --learning_rate=2e-4 \
  --lr_scheduler="cosine_with_restarts" \
  --max_train_epochs=15 \
  --mixed_precision="fp16"

SDXL

Requires fused backward pass on 10-12GB GPUs. Without it, 24GB is needed.

Refer to the command in the VRAM optimization section above. Use --network_dim=128, --network_alpha=64, --learning_rate=1e-4, and --max_train_epochs=10. The r/StableDiffusion training primer documents these as validated parameters.

FLUX.1

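FLUX.1 training in Kohya SS requires 16-24GB VRAM. In sd-scripts, FLUX.1 LoRAs are trained with flux_train_network.py and the networks.lora_flux module, and the base model, text encoders, and autoencoder are passed as local files (the ./models paths below are placeholders). Load the base model in fp8 with --fp8_base to fit on 16GB cards:

accelerate launch flux_train_network.py \
  --pretrained_model_name_or_path="./models/flux1-dev.safetensors" \
  --clip_l="./models/clip_l.safetensors" \
  --t5xxl="./models/t5xxl_fp16.safetensors" \
  --ae="./models/ae.safetensors" \
  --train_data_dir="./dataset" \
  --output_dir="./output" \
  --network_module="networks.lora_flux" \
  --network_dim=128 \
  --network_alpha=128 \
  --learning_rate=1e-4 \
  --max_train_steps=2000 \
  --mixed_precision="bf16" \
  --fp8_base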

For commercial use, target black-forest-labs/FLUX.1-schnell, which is released under Apache 2.0. It is a distilled model that does not use guidance scales, so sample with 1-4 steps.

FLUX.2

FLUX.2 ships in several variants, two of which matter for local LoRA training. The dev model (32B parameters) requires 24GB+ VRAM and is best trained through AI-Toolkit. The klein variant (4B parameters) is viable on 12-16GB consumer GPUs.

AI-Toolkit setup for FLUX.2:

git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
python3 -m venv venv
source venv/bin/activate
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip3 install -r requirements.txt

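Copy an example config and edit it for your dataset. The file named below is the 24GB FLUX example; if config/examples contains a FLUX.2-specific example, start from that one instead: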

cp config/examples/train_lora_flux_24gb.yaml config/my_training.yml
python run.py config/my_training.yml

Key parameters for FLUX.2 via AI-Toolkit:

Resolution: 1024x1024 (default for FLUX.2 output)
Training steps: 1000-3000 (per RunComfy’s guide)
Dataset size: 20-60+ images
VRAM: 24GB+ for dev, 12-16GB for klein
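
It helps to translate a step budget into passes over the dataset. A minimal sketch, assuming each step consumes one batch and no per-image repeats are configured (the function name is hypothetical):

```python
def dataset_passes(total_steps, num_images, batch_size=1):
    """Approximate number of times each image is seen during training."""
    if num_images <= 0 or batch_size <= 0:
        raise ValueError("num_images and batch_size must be positive")
    return total_steps * batch_size / num_images


if __name__ == "__main__":
    # 2000 steps over 40 images at batch size 1
    print(dataset_passes(2000, 40))  # 50.0
```

At the extremes of the ranges above, 1000 steps over 60 images is about 17 passes, while 3000 steps over 20 images is 150, so the step count should be scaled with the dataset rather than fixed.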

Dataset Captioning with WD14 Tagger v3

The WD14 Tagger v3 by SmilingWolf is the standard captioning tool in 2026. Run it from Kohya SS’s sd-scripts directory; the script lives in the finetune folder:

python finetune/tag_images_by_wd14_tagger.py \
  --batch_size=4 \
  --repo_id="SmilingWolf/wd-vit-tagger-v3" \
  --model_dir="./wd14_models" \
  --onnx \
  --use_rating_tags \
  --character_tags_first \
  --always_first_tags="1girl,1boy" \
  ./path/to/images

The v3 tagger supports forcing character tags to the beginning of the caption string, which reduces concept bleeding. Always review and manually edit auto-generated captions — they frequently miss your trigger word or describe irrelevant background elements.
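
Part of that review can be automated. The following is a minimal sketch, assuming comma-separated tag captions stored as .txt files next to the images; the helper names and the trigger word "mychar" are hypothetical. It moves the trigger word to the front of each caption and removes duplicates of it, and it does not replace a manual read-through.

```python
from pathlib import Path


def ensure_trigger_word(caption, trigger):
    """Put trigger at the front of a comma-separated caption, without duplicates."""
    tags = [tag.strip() for tag in caption.split(",") if tag.strip()]
    tags = [tag for tag in tags if tag != trigger]
    return ", ".join([trigger] + tags)


def fix_caption_files(folder, trigger):
    """Rewrite every .txt caption in folder in place; return how many were changed."""
    changed = 0
    for path in sorted(Path(folder).glob("*.txt")):
        original = path.read_text(encoding="utf-8")
        updated = ensure_trigger_word(original, trigger)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed


if __name__ == "__main__":
    print(ensure_trigger_word("1girl, red hair, outdoors", "mychar"))
```

Run fix_caption_files on a copy of the dataset first, since it overwrites the caption files.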

Troubleshooting

CUDA Out of Memory during training. Enable fused backward pass with Adafactor (Kohya SS v0.9+). If already using it, reduce batch size to 1 and enable gradient checkpointing. For FLUX models, enable quantization. The Kohya SS discussion on 8GB SDXL training documents VRAM reduction strategies.

LoRA not appearing in generated images. Verify the LoRA file is in the correct directory (models/Lora/ for AUTOMATIC1111). Check network alpha — higher values produce stronger effects. If using LoRA+, the ratio may be too high; try 8x instead of 16x.
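
The alpha advice is easier to apply with the scaling factor in mind. In the standard LoRA formulation the learned update is multiplied by alpha / rank, so alpha only has meaning relative to rank. A minimal sketch using the rank and alpha pairs from this guide (the function name is hypothetical):

```python
def lora_scale(alpha, rank):
    """Scaling factor applied to the LoRA update in the standard formulation."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / rank


if __name__ == "__main__":
    for label, rank, alpha in [
        ("SD 1.5 command", 64, 32),
        ("SDXL settings", 128, 64),
        ("FLUX.1 command", 128, 128),
    ]:
        print(f"{label}: rank={rank}, alpha={alpha}, scale={lora_scale(alpha, rank)}")
```

Doubling rank while leaving alpha unchanged halves the scale, which can make a LoRA look weaker even though it has more capacity.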

Model diverging during training. Learning rate is too high. For SDXL with LoRA+, try 5e-5. For standard training, 2e-4 for SD 1.5 and 1e-4 for SDXL are documented baselines.

Slow training speed. Enable bf16 mixed precision on RTX 30xx+ GPUs. Install xFormers matching your CUDA version. Reduce network dim from 128 to 64.

Fused backward pass not working. Verify PyTorch 2.1+, Kohya SS v0.9.0+, and Adafactor optimizer. The flag --fused_backward_pass will be ignored with other optimizers.
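
Those checks can be scripted before launching a run. The following is a minimal sketch that encodes only the two requirements stated in this guide (PyTorch 2.1+ and the Adafactor optimizer); the function name is hypothetical and it does not inspect the Kohya SS version.

```python
import re


def fused_backward_pass_problems(torch_version, optimizer_type):
    """Return a list of reasons the fused backward pass would not apply."""
    problems = []
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        problems.append(f"cannot parse PyTorch version: {torch_version!r}")
    elif (int(match.group(1)), int(match.group(2))) < (2, 1):
        problems.append("PyTorch 2.1 or newer is required")
    if optimizer_type.lower() != "adafactor":
        problems.append("optimizer must be Adafactor")
    return problems


if __name__ == "__main__":
    for issue in fused_backward_pass_problems("2.0.1", "AdamW8bit"):
        print("problem:", issue)
```

In a real environment you could pass torch.__version__ as the first argument; an empty list means both requirements are met.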

What You Can Actually Use Today

Kohya SS sd-scripts — full training framework for SD 1.5, SDXL, FLUX.1. GitHub. v0.9.0+ required for fused backward pass.
Kohya SS GUI — graphical interface. GitHub.
AI-Toolkit by Ostris — primary tool for FLUX.2, Z-Image, and Qwen Image training. GitHub. Also available as AI-Studio for a streamlined UI.
FluxGym — web UI for FLUX.1 with low-VRAM support (12GB/16GB/20GB). GitHub.
OneTrainer — independent trainer (not a Kohya fork) with an all-in-one GUI, easier to use than raw sd-scripts. GitHub.
Shakker AI — web-based LoRA training platform with a guided UI, no local GPU required. Website.
Draw Things — on-device LoRA training on iPhone, iPad, and Mac (Apple Silicon). Supports SDXL fine-tuning at ~10.3GiB peak memory. Website.
MimicPC — cloud LoRA training platform with pay-per-use pricing (~$1.19/run). Website.
RunPod / Massed Compute — cloud GPU providers with pre-configured Kohya environments for FLUX training at $0.50-$2.69/hr.
WD14 Tagger v3 — dataset captioning. Hugging Face.
Hugging Face Diffusers LoRA docs — official parameter reference. Documentation.

For step-by-step character LoRA training including dataset preparation and inference, see the character LoRA training guide. For dataset optimization and multi-concept training, read Advanced LoRA Self-Portraits.
