Training models is as much an art as it is a science. Machine learning is the science: you formulate hypotheses and run tests, and you can monitor extensive quantitative data and adjust countless small parameters. The quality of the model and its output is the art, and that is entirely subjective. This is where you assume the role of an art director.
Train multiple models, then test each one in numerous ways by generating extensive contact sheets and making selections. Good taste is often underappreciated. In this domain, the answer to most questions is: "It depends."
These are my notes based on my experience. I've included some common advice I've encountered, even if it didn't produce the best results for me.
Assuming you don't want to wait forever:
A beefy GPU with plenty of VRAM and CUDA support: an RTX 3090, 4090, or better.
Apple Silicon like an M3 Max with a boatload of unified memory is good, but without CUDA or easy-to-use alternative optimizations it's too slow for diffusion training.
Cloud compute has even beefier GPUs by the hour. Do your own cost-benefit analysis.
Can you do it with less? Sure, but who's got that kind of time?
You need a dataset of the concept or subject you want to train on.
It doesn't have to be that large. Some say as few as 10-15 images. 40 is good.
You're trading off training time. Fewer images, faster training. More images, longer training.
Regularization images: some say as many as 10x your training data. They are there to keep the model from overfitting a general concept to your specific training subject.
But if you're training on a face, you may want it to overfit, depending on how you want to use the LoRA in your workflow (e.g. loading and unloading it as needed).
Your dataset needs captions: text describing each image. Don't write these by hand. Use image-to-text or multimodal models like Florence-2 and/or WD14, then review and edit by hand as necessary.
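Before training, it helps to confirm that every image actually got a caption. Here is a minimal sketch, assuming your trainer reads captions from a sidecar text file with the same name as the image (for example `photo01.png` next to `photo01.txt`). The `.txt` extension, the list of image extensions, and the function name `find_uncaptioned` are my assumptions, so adjust them to your setup.

```python
from pathlib import Path

# Assumed set of image extensions; extend it if your dataset uses others.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def find_uncaptioned(dataset_dir, caption_ext=".txt"):
    """Return the names of images that have no non-empty sidecar caption."""
    missing = []
    for image_path in sorted(Path(dataset_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip captions and any other non-image files
        caption_path = image_path.with_suffix(caption_ext)
        has_caption = (
            caption_path.is_file()
            and caption_path.read_text(encoding="utf-8").strip() != ""
        )
        if not has_caption:
            missing.append(image_path.name)
    return missing
```

Calling `find_uncaptioned("path/to/dataset")` returns a list of file names to caption or re-caption. An empty list means every image has some text attached, though it says nothing about whether that text is any good.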
Kohya_ss or OneTrainer. Some say OneTrainer is simpler. Kohya_ss is versatile.
Which model checkpoint are you starting from? Choose the one best suited to the type of images you want to optimize for. You've got two main choices: something that leans towards concept art and/or anime, or something that leans towards photography and realism. Some say to train on the original SDXL checkpoint since it will make your LoRA more compatible with any model. I say be intentional and opinionated, and focus on your intended use case above all other considerations.
Learning rate: between 0.0001 and 0.00001 (1e-4 down to 1e-5).
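Assuming the range above is the learning rate, one way to explore it is to test a handful of values spaced evenly on a log scale rather than a linear one, since the range spans a full order of magnitude. This is only a sketch of how I'd pick candidates; the helper name `log_spaced` and the choice of five values are illustrative.

```python
import math


def log_spaced(low, high, count):
    """Return `count` values spaced evenly on a log10 scale from low to high."""
    if count < 2:
        raise ValueError("count must be at least 2")
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (count - 1)
    return [10 ** (log_low + i * step) for i in range(count)]


# Five candidate learning rates between 1e-5 and 1e-4.
candidates = log_spaced(1e-5, 1e-4, 5)
for lr in candidates:
    print(f"{lr:.2e}")
```

Train one short run per candidate, generate a contact sheet for each, and let your eye decide.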
An RTX 3090 or 4090 can easily handle a batch size of 6. 8 is better. A larger batch size saves time, especially if you're training on more images and running more epochs. Some say to use a batch size of 1 for focused training. I wouldn't on any dataset larger than 15 images.
The bigger the rank, the bigger your LoRA file.
A rank of 256 makes a 1.7 GB LoRA, but captures a lot more nuanced features.
A rank of 96 makes an 800 MB LoRA.
There's little impact on training time in my experience.
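Why does rank drive file size? For each layer it adapts, a LoRA stores two small matrices whose shared dimension is the rank, so the parameter count for a linear layer is rank x (inputs + outputs). The sketch below shows that relationship. The layer shapes are made up for illustration and are not the real SDXL layer list, and real files also depend on which layers are trained and the precision they are saved in, so don't expect it to reproduce the exact sizes quoted above.

```python
def lora_params(d_in, d_out, rank):
    """Parameters LoRA adds to one linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def lora_size_bytes(layer_shapes, rank, bytes_per_param=2):
    """Rough size estimate; bytes_per_param=2 assumes 16-bit weights."""
    total = sum(lora_params(d_in, d_out, rank) for d_in, d_out in layer_shapes)
    return total * bytes_per_param


# Hypothetical (d_in, d_out) shapes, NOT taken from any real checkpoint.
HYPOTHETICAL_LAYERS = [(640, 640)] * 40 + [(1280, 1280)] * 100

for rank in (96, 256):
    size_mb = lora_size_bytes(HYPOTHETICAL_LAYERS, rank) / 1e6
    print(f"rank {rank}: {size_mb:.1f} MB")
```

The useful takeaway is the scaling: in this simplified model, size grows linearly with rank, so halving the rank roughly halves the file.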
Scheduler: Adafactor. Some say constant.
Optimizer: Adafactor. All the default settings work fine.
Adafactor is going to choose for you. If you choose Constant Scheduler: 0.00005.
Adafactor is going to choose for you. If you choose Constant Scheduler: 0.0001.
1 is the default. I use 1.1. If you want to 'overfit' your LoRA, you can push this number higher.
Some say 10 is enough for a small dataset, but 20 yields better results in most of my use cases. If you've got the time, the sky is the limit here.
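To get a feel for how dataset size, batch size, and epochs trade off against time, you can estimate the number of optimizer steps before you start. This is a back-of-the-envelope sketch, assuming the 10 and 20 above are epoch counts. The `repeats` parameter (how many times each image is seen per epoch) is my assumption about how your trainer is configured, and the estimate ignores regularization images.

```python
import math


def total_steps(num_images, epochs, batch_size, repeats=1):
    """Estimate optimizer steps for a training run.

    `repeats` is an assumed per-image repeat count; set it to match
    your trainer's dataset settings.
    """
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


# Compare batch sizes for 40 images trained for 20 epochs.
for batch_size in (1, 6, 8):
    steps = total_steps(num_images=40, epochs=20, batch_size=batch_size)
    print(f"batch size {batch_size}: {steps} steps")
```

With 40 images and 20 epochs, going from a batch size of 1 to 8 cuts the run from 800 steps to 100, although each step does more work.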
Because everything "depends", you have to be willing to experiment, test, and iterate.
There's no single magic formula.
Get out there and test for yourself.