The challenge is to run Stable Diffusion 1.5, which includes a large transformer model with almost 1 billion parameters, on a Raspberry Pi Zero 2, which is a microcomputer with 512MB of RAM, without adding more swap space and without offloading intermediate results to disk. For comparison, the recommended minimum RAM/VRAM for Stable Diffusion 1.5 is typically 8GB.

Major machine learning frameworks and libraries are generally focused on minimizing inference latency and/or maximizing throughput, usually at the cost of RAM usage. So I decided to write a super small and hackable inference library specifically focused on minimizing memory consumption: OnnxStream.

OnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights, which is a class derived from WeightsProvider. A WeightsProvider specialization can implement any type of loading, caching, and prefetching of the model parameters.
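To make the decoupling idea concrete, here is a minimal C++ sketch of a provider interface plus a disk-streaming specialization. The names and signatures below (`get_weights`, `DiskStreamingProvider`, the per-tensor `.bin` file layout) are simplified assumptions for illustration only, not OnnxStream's actual API:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// NOTE: illustrative sketch only. OnnxStream's real WeightsProvider
// interface differs; the names and signatures here are assumptions.
class WeightsProvider {
public:
    virtual ~WeightsProvider() = default;

    // Called by the inference engine whenever it needs the weights of a
    // single tensor: the engine never holds the full parameter set in RAM.
    virtual std::vector<float> get_weights(const std::string& tensor_name,
                                           std::size_t element_count) = 0;
};

// One possible specialization: stream each tensor from disk on demand,
// so only the weights of the operation currently being executed are
// resident in memory.
class DiskStreamingProvider : public WeightsProvider {
public:
    explicit DiskStreamingProvider(std::string dir) : m_dir(std::move(dir)) {}

    std::vector<float> get_weights(const std::string& tensor_name,
                                   std::size_t element_count) override {
        std::vector<float> data(element_count);
        const std::string path = m_dir + "/" + tensor_name + ".bin";
        if (std::FILE* f = std::fopen(path.c_str(), "rb")) {
            std::fread(data.data(), sizeof(float), element_count, f);
            std::fclose(f);
        }
        return data; // can be freed by the engine as soon as the op completes
    }

private:
    std::string m_dir;
};
```

Because the engine only ever pulls one tensor's weights at a time through this interface, peak memory stays bounded by the working set of the current operation rather than by the size of the whole model.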
Features
- OnnxStream can consume as little as 1/55th of the memory required by OnnxRuntime, with only a 50% to 200% increase in latency
- Documentation available
- Inference engine decoupled from the component responsible for providing the model weights (a class derived from WeightsProvider); see the prefetching sketch after this list
- Examples available
- The OnnxStream Stable Diffusion example implementation now supports SDXL 1.0
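Building on the sketch above, a WeightsProvider specialization can also prefetch. The class below is a hypothetical example, assuming the engine requests tensors in a known, repeatable order (so the provider can load tensor i+1 on a background thread while the engine computes with tensor i); it reuses the `WeightsProvider` and `DiskStreamingProvider` classes from the earlier sketch and, again, does not reproduce OnnxStream's real classes:

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical prefetching specialization (builds on the sketch above).
class PrefetchingProvider : public WeightsProvider {
public:
    PrefetchingProvider(std::string dir, std::vector<std::string> order)
        : m_disk(std::move(dir)), m_order(std::move(order)) {}

    std::vector<float> get_weights(const std::string& tensor_name,
                                   std::size_t element_count) override {
        std::vector<float> result;
        if (m_next.valid() && tensor_name == m_prefetched_name)
            result = m_next.get(); // already loaded in the background
        else
            result = m_disk.get_weights(tensor_name, element_count);

        // Kick off the load of the next tensor, overlapping disk I/O with
        // the computation the engine is about to perform.
        ++m_index;
        if (m_index < m_order.size()) {
            m_prefetched_name = m_order[m_index];
            const std::string next_name = m_prefetched_name;
            // In a real implementation the element count would come from the
            // model metadata; it is reused here only to keep the sketch short.
            m_next = std::async(std::launch::async,
                                [this, next_name, element_count] {
                return m_disk.get_weights(next_name, element_count);
            });
        }
        return result;
    }

private:
    DiskStreamingProvider m_disk;
    std::vector<std::string> m_order;
    std::size_t m_index = 0;
    std::string m_prefetched_name;
    std::future<std::vector<float>> m_next;
};
```

The point of the design is that strategies like this live entirely in the provider: the inference engine is unchanged whether weights come from RAM, from disk on demand, or from a background prefetch thread.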