LORA TRAINING GUIDE 2026: KOHYA SS, FLUX.1/FLUX.2 & VRAM OPTIMIZATION (8GB-80GB)
Training LoRAs on consumer GPUs in 2026 is practical across all major base models — SD 1.5, SDXL, FLUX.1, and FLUX.2. The key advances that made this possible are the fused backward pass (Kohya SS v0.9+, January 2025), LoRA+ for better convergence, and FLUX.2 klein for lightweight FLUX training.
This guide covers the tool ecosystem, VRAM optimization techniques, and per-model parameter recommendations. It draws on the Kohya SS repository, the AI-Toolkit by Ostris, and community-validated settings from r/StableDiffusion and RunDiffusion guides.
For character LoRA training from scratch including dataset preparation and inference, see the character LoRA training guide. This guide focuses on tool selection, VRAM optimization, and advanced training parameters.
Who Is This Guide For?
You have some experience with Stable Diffusion or FLUX and want to train your own LoRAs. You may be on a consumer GPU with 8-16GB VRAM and need to know what’s possible. You want documented, verified settings — not trial and error.
By the End of This, You’ll Know
- Which training tool to use for your base model and GPU
- How to reduce SDXL VRAM usage from roughly 24GB to roughly 10GB using the fused backward pass with bf16
- The optimal LoRA+ ratio and how to configure it
- Complete setup steps for SD 1.5, SDXL, FLUX.1, and FLUX.2
- How to diagnose and fix common training issues
Tool Ecosystem
Three primary tools cover the LoRA training landscape in 2026.
Kohya SS (sd-scripts) — Standard for SD 1.5, SDXL, FLUX.1
Kohya SS sd-scripts is the most widely used LoRA training framework. Version 0.9.0 (January 2025) introduced the fused backward pass, which is the primary VRAM optimization for SDXL training. The Kohya SS GUI by bmaltais provides a graphical interface wrapping the core scripts.
Key features in v0.9+:
- Fused backward pass: combines the optimizer step with the backward pass and requires the Adafactor optimizer. Documented in the Kohya SD-Scripts SDXL training guide.
- LoRA+: separate learning rates for the LoRA-A and LoRA-B matrices via loraplus_lr_ratio
- Block-wise training and alpha mask training for transparent layers
- Support for SD 1.5, SDXL, and FLUX.1
AI-Toolkit by Ostris — Standard for FLUX.2
AI-Toolkit is the primary training suite for FLUX.2 models. It provides a web UI and supports FLUX.2 dev (32B parameters) and FLUX.2 klein (4B parameters). The RunComfy FLUX.2 dev training guide recommends 1024x1024 resolution with 20-60+ images and 1000-3000 steps.
VRAM requirements per the AI-Toolkit training guide:
- FLUX.2 dev: 24GB minimum, 80GB recommended for full training runs
- FLUX.2 klein: 12-16GB
- Qwen Image: 32GB minimum
- Z-Image: 12-16GB
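To make the list above easier to apply, here is a minimal lookup sketch. The table simply restates the minimums from the list; taking the lower bound of each range is an assumption of this sketch, and the function name `trainable_models` is hypothetical, not part of AI-Toolkit.

```python
# Minimum VRAM (GB) per model, restated from the list above.
# Where the guide gives a range (12-16GB), this sketch uses the lower bound.
MIN_VRAM_GB = {
    "FLUX.2 dev": 24,
    "FLUX.2 klein": 12,
    "Qwen Image": 32,
    "Z-Image": 12,
}


def trainable_models(vram_gb: float) -> list[str]:
    """Return the models whose stated minimum fits in vram_gb, sorted by name."""
    return sorted(name for name, need in MIN_VRAM_GB.items() if vram_gb >= need)


for card in (8, 16, 24, 32):
    print(f"{card}GB:", trainable_models(card) or "none of the listed models")
```

A card that only meets the minimum will usually need the lower-memory options discussed in the rest of this guide.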
FluxGym — Web UI for FLUX.1
FluxGym provides a simpler web UI specifically for FLUX.1 LoRA training. It explicitly supports 12GB/16GB/20GB VRAM configurations and wraps Kohya sd-scripts under the hood.
VRAM Optimization
The fused backward pass in Kohya SS v0.9+ is the most impactful optimization for consumer GPU training. It runs each parameter's optimizer step during the backward pass, so a gradient can be freed as soon as it has been applied instead of being held until the whole backward pass finishes. This reduces SDXL VRAM usage from approximately 24GB to roughly 17GB at standard precision, or 10GB with bf16. The Kohya SS releases page documents this as the headline feature of v0.9.0.
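As a rough illustration of why fusing the optimizer step into the backward pass saves memory, the toy model below tracks only gradient memory in arbitrary units. It is a sketch, not how sd-scripts is implemented: it ignores weights, activations, and optimizer state, and the layer sizes are made up.

```python
# Toy model of gradient memory, in arbitrary units. No real tensors involved.

def peak_grad_memory_standard(param_sizes):
    """Standard loop: backward() produces every gradient, then optimizer.step().

    All gradients are alive at the same time just before the step.
    """
    live = 0
    peak = 0
    for size in reversed(param_sizes):  # backward visits layers last-to-first
        live += size                    # gradient is kept until the step
        peak = max(peak, live)
    return peak


def peak_grad_memory_fused(param_sizes):
    """Fused loop: each parameter is updated as soon as its gradient exists,
    and that gradient is released before the next one is computed."""
    peak = 0
    for size in reversed(param_sizes):
        live = size                     # only the current gradient is alive
        peak = max(peak, live)
    return peak


layer_sizes = [400, 300, 200, 100]      # hypothetical per-layer gradient sizes
print(peak_grad_memory_standard(layer_sizes))  # 1000
print(peak_grad_memory_fused(layer_sizes))     # 400
```

In the standard loop the peak is the sum of all gradients; in the fused loop it is only the largest single gradient. Real savings are smaller than this ratio suggests because gradients are only one part of total VRAM.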
Requirements:
- Kohya SS v0.9.0 or later
- PyTorch 2.1 or newer
- Adafactor optimizer (incompatible with AdamW8bit)
- Flag: --fused_backward_pass
SDXL training command with fused backward pass:
accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--train_data_dir="./your_training_images" \
--output_dir="./output" \
--fused_backward_pass \
--optimizer_type="adafactor" \
--mixed_precision="bf16" \
--learning_rate=1e-4 \
--max_train_epochs=10 \
--save_every_n_epochs=2
If you need to use AdamW8bit, the alternative is optimizer groups via --fused_optimizer_groups=8. This splits the parameters into groups to reduce memory without changing the optimizer. Values between 4 and 10 groups provide the best balance.
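The choice between the two options can be sketched as a small helper that builds the relevant command-line arguments. The function name `memory_saving_args` is hypothetical; the flags are the ones named in this section, and rejecting group counts outside 4 to 10 is a choice of this sketch rather than a rule of the tool.

```python
def memory_saving_args(optimizer_type: str, groups: int = 8) -> list[str]:
    """Pick one of the two memory-saving options described above.

    Adafactor gets --fused_backward_pass; any other optimizer falls back to
    --fused_optimizer_groups. The two are treated as alternatives here.
    """
    if optimizer_type.lower() == "adafactor":
        return ["--fused_backward_pass", "--optimizer_type=adafactor"]
    if not 4 <= groups <= 10:
        # This sketch enforces the range the guide suggests.
        raise ValueError("the guide suggests 4 to 10 groups")
    return [f"--fused_optimizer_groups={groups}", f"--optimizer_type={optimizer_type}"]


print(" ".join(memory_saving_args("adafactor")))
print(" ".join(memory_saving_args("AdamW8bit")))
```

The returned list can be appended to the rest of an accelerate launch command.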
Additional optimizations for all tools:
- Mixed precision (bf16) cuts training time by roughly half. Use --mixed_precision=bf16 on RTX 30-series and newer GPUs for better stability than fp16.
- xFormers reduces memory and speeds up attention computation. Install a build matching your CUDA version: pip install xformers. Add --xformers to the training command.
- Gradient checkpointing trades compute for memory: --gradient_checkpointing
- Reduce resolution for faster training. 512x512 instead of 768x768 is roughly 2x faster on most models.
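The "roughly 2x" figure for the resolution change can be sanity-checked with pixel counts. This is a back-of-envelope sketch that assumes training cost scales roughly with the number of pixels, which is a simplification (attention layers scale worse than linearly).

```python
def pixel_ratio(high: int, low: int) -> float:
    """Ratio of pixel counts between two square training resolutions."""
    return (high * high) / (low * low)


print(pixel_ratio(768, 512))   # 2.25
print(pixel_ratio(1024, 512))  # 4.0
```

Dropping from 768x768 to 512x512 removes a little over half the pixels per image, which lines up with the rough 2x speedup quoted above.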
LoRA+
LoRA+ trains the LoRA-B matrix with a higher learning rate than the LoRA-A matrix, improving convergence speed and detail capture. The parameter is loraplus_lr_ratio, which sets how many times higher the LoRA-B learning rate is, and it is applied as a network argument:
--network_args "loraplus_lr_ratio=16"
A 16x ratio is the most commonly recommended starting point across community guides. The ratio can be tuned per-component using separate U-Net and text encoder ratios for finer control.
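To show what the ratio does to the optimizer setup, here is a minimal sketch that splits LoRA parameters into two groups with different learning rates. It assumes, for illustration, that parameter names containing "lora_up" are the B matrices and everything else is treated as A; the function name and the example parameter names are hypothetical.

```python
def loraplus_param_groups(param_names, base_lr, ratio=16):
    """Split LoRA parameter names into two optimizer groups.

    Assumption: names containing "lora_up" are the B matrices and all other
    names (e.g. "lora_down") are treated as A. B gets base_lr * ratio.
    """
    a_names = [n for n in param_names if "lora_up" not in n]
    b_names = [n for n in param_names if "lora_up" in n]
    return [
        {"params": a_names, "lr": base_lr},
        {"params": b_names, "lr": base_lr * ratio},
    ]


names = ["attn.lora_down.weight", "attn.lora_up.weight"]  # hypothetical names
for group in loraplus_param_groups(names, base_lr=1e-4):
    print(group)
```

With a base learning rate of 1e-4 and a ratio of 16, the B matrices train at 16 times the A learning rate. Lowering the ratio narrows that gap without touching the base rate.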
Per-Model Training Configurations
SD 1.5
Best for low-VRAM scenarios. Runs on 6-8GB GPUs.
accelerate launch train_network.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--train_data_dir="./dataset" \
--output_dir="./output" \
--caption_extension=".txt" \
--network_module="networks.lora" \
--network_args "loraplus_lr_ratio=16" \
--network_dim=64 \
--network_alpha=32 \
--learning_rate=2e-4 \
--lr_scheduler="cosine_with_restarts" \
--max_train_epochs=15 \
--mixed_precision="fp16"
SDXL
SDXL requires the fused backward pass to fit on 8-12GB GPUs. Without it, plan for about 24GB.
Refer to the command in the VRAM optimization section above. Use --network_dim=128, --network_alpha=64, --learning_rate=1e-4, and --max_train_epochs=10. The r/StableDiffusion training primer documents these as validated parameters.
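The network_dim and network_alpha pairs in this section's Kohya configurations are easier to compare as a single number: the LoRA update is scaled by alpha divided by dim. The sketch below just computes that ratio for the values this guide uses; the helper name `lora_scale` is hypothetical.

```python
def lora_scale(network_alpha: float, network_dim: int) -> float:
    """Multiplier applied to the LoRA update: alpha / dim."""
    return network_alpha / network_dim


# (alpha, dim) pairs taken from the configurations in this guide.
configs = {
    "SD 1.5": (32, 64),
    "SDXL": (64, 128),
    "FLUX.1": (128, 128),
}
for model, (alpha, dim) in configs.items():
    print(model, lora_scale(alpha, dim))
```

The SD 1.5 and SDXL settings both give a scale of 0.5, while the FLUX.1 setting gives 1.0. If you change the dim, adjusting alpha to keep the same ratio keeps the effective update strength comparable.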
FLUX.1
FLUX.1 training in Kohya SS requires 16-24GB VRAM. Enable quantization to fit on 16GB cards:
accelerate launch train_network.py \
--pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
--train_data_dir="./dataset" \
--output_dir="./output" \
--network_module="networks.lora" \
--network_dim=128 \
--network_alpha=128 \
--learning_rate=1e-4 \
--max_train_steps=2000 \
--mixed_precision="bf16" \
--quantize=True
For commercial use, target black-forest-labs/FLUX.1-schnell, which is released under Apache 2.0 (FLUX.1-dev carries a non-commercial license). Schnell does not use guidance scale, so sample with 1-4 steps.
FLUX.2
FLUX.2 offers three tiers. Training on the dev model (32B parameters) requires 24GB+ VRAM and is best done through AI-Toolkit. The klein variant (4B parameters) is viable on 12-16GB consumer GPUs.
AI-Toolkit setup for FLUX.2:
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
python3 -m venv venv
source venv/bin/activate
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip3 install -r requirements.txt
Copy an example config and edit it for your dataset:
cp config/examples/train_lora_flux_24gb.yaml config/my_training.yml
python run.py config/my_training.yml
Key parameters for FLUX.2 via AI-Toolkit:
- Resolution: 1024x1024 (default for FLUX.2 output)
- Training steps: 1000-3000 (per RunComfy’s guide)
- Dataset size: 20-60+ images
- VRAM: 24GB+ for dev, 12-16GB for klein
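To relate the step and dataset ranges above, the sketch below converts a target step count into passes over the dataset. It assumes one optimizer step per batch and ignores gradient accumulation; `repeats` and `batch_size` are illustrative parameters of this helper, not AI-Toolkit config keys.

```python
import math


def epochs_for_steps(target_steps: int, num_images: int,
                     batch_size: int = 1, repeats: int = 1) -> int:
    """Epochs needed to reach at least target_steps optimizer steps."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return math.ceil(target_steps / steps_per_epoch)


print(epochs_for_steps(1000, 20))                # 50
print(epochs_for_steps(3000, 60))                # 50
print(epochs_for_steps(2000, 40, batch_size=2))  # 100
```

Both ends of the recommended ranges (1000 steps on 20 images, 3000 steps on 60 images) work out to about 50 passes over each image at batch size 1.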
Dataset Captioning with WD14 Tagger v3
The WD14 Tagger v3 by SmilingWolf is the standard captioning tool in 2026. Run it from within Kohya SS’s sd-scripts directory:
python tag_images_by_wd14_tagger.py \
--batch_size=4 \
--repo_id="SmilingWolf/wd-vit-tagger-v3" \
--model_dir="./wd14_models" \
--onnx \
--use_rating_tags \
--character_tags_first \
--always_first_tags="1girl,1boy" \
./path/to/images
The v3 tagger supports forcing character tags to the beginning of the caption string, which reduces concept bleeding. Always review and manually edit auto-generated captions — they frequently miss your trigger word or describe irrelevant background elements.
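Part of that manual review can be automated. The sketch below prepends a trigger word to any caption file that lacks it, assuming captions are comma-separated tags in .txt files next to the images. The function name `ensure_trigger_word` and the trigger "mychar" are hypothetical, and the script rewrites files in place, so run it on a copy of the dataset first.

```python
from pathlib import Path


def ensure_trigger_word(caption_dir, trigger, extension=".txt"):
    """Prepend trigger to every caption file that does not already contain it.

    Returns the names of the files that were modified.
    """
    changed = []
    for path in sorted(Path(caption_dir).glob(f"*{extension}")):
        text = path.read_text(encoding="utf-8").strip()
        # Captions are assumed to be comma-separated tags.
        tags = [t.strip() for t in text.split(",") if t.strip()]
        if trigger not in tags:
            path.write_text(", ".join([trigger] + tags) + "\n", encoding="utf-8")
            changed.append(path.name)
    return changed
```

Calling `ensure_trigger_word("./path/to/images", "mychar")` returns the list of files it changed, which is a convenient shortlist for a manual read-through.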
Troubleshooting
CUDA Out of Memory during training. Enable fused backward pass with Adafactor (Kohya SS v0.9+). If already using it, reduce batch size to 1 and enable gradient checkpointing. For FLUX models, enable quantization. The Kohya SS discussion on 8GB SDXL training documents VRAM reduction strategies.
LoRA not appearing in generated images. Verify the LoRA file is in the correct directory (models/Lora/ for AUTOMATIC1111). Check network alpha — higher values produce stronger effects. If using LoRA+, the ratio may be too high; try 8x instead of 16x.
Model diverging during training. Learning rate is too high. For SDXL with LoRA+, try 5e-5. For standard training, 2e-4 for SD 1.5 and 1e-4 for SDXL are documented baselines.
Slow training speed. Enable bf16 mixed precision on RTX 30xx+ GPUs. Install xFormers matching your CUDA version. Reduce network dim from 128 to 64.
Fused backward pass not working. Verify PyTorch 2.1+, Kohya SS v0.9.0+, and Adafactor optimizer. The flag --fused_backward_pass will be ignored with other optimizers.
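The checks in the last item can be sketched as a small preflight function. It covers only the PyTorch version and optimizer requirements stated in this guide; the function name is hypothetical and it does not inspect the installed Kohya SS version.

```python
def fused_backward_pass_problems(torch_version: str, optimizer_type: str) -> list[str]:
    """Return human-readable reasons the flag would not take effect."""
    problems = []
    # Drop any local build suffix such as "+cu118", then compare major.minor.
    major, minor = (int(part) for part in torch_version.split("+")[0].split(".")[:2])
    if (major, minor) < (2, 1):
        problems.append(f"PyTorch {torch_version} is older than 2.1")
    if optimizer_type.lower() != "adafactor":
        problems.append(f"optimizer {optimizer_type} is not Adafactor")
    return problems


print(fused_backward_pass_problems("2.0.1+cu118", "AdamW8bit"))
print(fused_backward_pass_problems("2.3.0", "adafactor"))  # []
```

In a real environment the first argument would typically be `torch.__version__`. An empty list means both stated requirements are met.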
What You Can Actually Use Today
- Kohya SS sd-scripts — full training framework for SD 1.5, SDXL, FLUX.1. GitHub. v0.9.0+ required for fused backward pass.
- Kohya SS GUI — graphical interface. GitHub.
- AI-Toolkit by Ostris — primary tool for FLUX.2, Z-Image, and Qwen Image training. GitHub. Also available as AI-Studio for a streamlined UI.
- FluxGym — web UI for FLUX.1 with low-VRAM support (12GB/16GB/20GB). GitHub.
- OneTrainer: standalone trainer (a separate project, not a Kohya fork) with an all-in-one GUI, easier to use than raw sd-scripts. GitHub.
- Shakker AI — web-based LoRA training platform with a guided UI, no local GPU required. Website.
- Draw Things — on-device LoRA training on iPhone, iPad, and Mac (Apple Silicon). Supports SDXL fine-tuning at ~10.3GiB peak memory. Website.
- MimicPC — cloud LoRA training platform with pay-per-use pricing (~$1.19/run). Website.
- RunPod / Massed Compute — cloud GPU providers with pre-configured Kohya environments for FLUX training at $0.50-$2.69/hr.
- WD14 Tagger v3 — dataset captioning. Hugging Face.
- Hugging Face Diffusers LoRA docs — official parameter reference. Documentation.
For step-by-step character LoRA training including dataset preparation and inference, see the character LoRA training guide. For dataset optimization and multi-concept training, read Advanced LoRA Self-Portraits.