A quantitative analysis of techniques and stacked architectures for deploying a memory-constrained, low-latency SDXL-based generator on mobile devices.
If we wanted to eliminate server-side creature-generation costs by moving image generation on-device, we would run into a few challenges. I have done some preliminary investigation into the recent (2024+) R&D techniques that show the most promise for reducing the compute and memory footprint of a Stable Diffusion image generator. In theory, these approaches can be stacked to reduce compute and RAM usage while maintaining overall fidelity.
Standard SDXL inference requires 20-50 denoising steps, resulting in multi-second to minute-scale latencies on mobile hardware, which is unacceptable for an interactive user experience. The primary objective is to reduce inference time to sub-second levels for 512-768px images.
The baseline SDXL model architecture (U-Net, dual text encoders, VAE) consumes >6 GB of RAM at FP16, exceeding the resource budget of typical mobile devices. The secondary objective is to reduce the on-device memory footprint to a target of <1.5 GB, with stretch goals for a sub-1 GB deployment.
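The >6 GB figure follows directly from the component parameter counts. The back-of-the-envelope check below uses approximate, rounded counts (assumed from the published model cards) and counts weights only, so real peak RAM is higher once activations, attention buffers, and framework overhead are added.

```python
# Rough FP16 weight footprint for baseline SDXL (parameter counts are approximate).
FP16_BYTES = 2

components = {
    "unet": 2.6e9,                          # ~2.6B parameters
    "text_encoder (CLIP ViT-L)": 0.12e9,    # ~123M parameters
    "text_encoder_2 (OpenCLIP ViT-bigG)": 0.69e9,
    "vae": 0.08e9,
}

total_gib = 0.0
for name, params in components.items():
    gib = params * FP16_BYTES / 2**30
    total_gib += gib
    print(f"{name:<38} {gib:5.2f} GiB")
print(f"{'total weights':<38} {total_gib:5.2f} GiB")
# Weights alone land around 6.5 GiB; activations and runtime overhead push the
# observed on-device footprint past that, hence the >6 GB baseline above.
```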
Source: architectural details are derived from the official SDXL research paper (Podell et al., 2023, arXiv:2307.01952).
Stack 1: DMD2 one-step distillation + 4-bit U-Net quantization + text-encoder offload (code sketch after the table).

| Step | Action | Compute Impact | Memory Impact | Risk |
|---|---|---|---|---|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Apply 4-bit quantization to U-Net | Minimal | ~4x U-Net weight reduction | Low |
| 3 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |
Total On-Device RAM: ~1.0 - 1.5 GB. Latency: Interactive (<1s compute + network). Path: Quickest to a stable, shippable artifact.
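For concreteness, here is a minimal sketch of what the Stack 1 runtime loop could look like, written against diffusers on a workstation rather than the eventual mobile runtime (the Core ML / TFLite export and the 4-bit weight quantization are not shown). The checkpoint name, embedding endpoint, response fields, and scheduler choice are assumptions, not verified artifacts, and whether a given diffusers version accepts `text_encoder=None` here should be checked.

```python
import torch
import requests
from diffusers import StableDiffusionXLPipeline, LCMScheduler

# Placeholders -- swap in the actual distilled checkpoint and embedding service.
DISTILLED_SDXL = "your-org/sdxl-dmd2-one-step"                 # hypothetical one-step SDXL
EMBED_ENDPOINT = "https://api.example.com/v1/sdxl-embeddings"  # hypothetical encoder API

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

def fetch_prompt_embeddings(prompt: str):
    """Offloaded text encoders: the server runs both SDXL encoders and returns embeddings."""
    resp = requests.post(EMBED_ENDPOINT, json={"prompt": prompt}, timeout=5)
    resp.raise_for_status()
    data = resp.json()
    prompt_embeds = torch.tensor(data["prompt_embeds"], dtype=dtype)         # (1, 77, 2048)
    pooled_embeds = torch.tensor(data["pooled_prompt_embeds"], dtype=dtype)  # (1, 1280)
    return prompt_embeds.to(device), pooled_embeds.to(device)

# Load the pipeline without its text encoders so ~1.6 GB of weights never ship on-device.
pipe = StableDiffusionXLPipeline.from_pretrained(
    DISTILLED_SDXL,
    text_encoder=None,
    text_encoder_2=None,
    torch_dtype=dtype,
).to(device)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)  # match to the distilled model actually used

prompt_embeds, pooled_embeds = fetch_prompt_embeddings("a small turquoise dragon creature")
image = pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_embeds,
    num_inference_steps=1,   # one-step distilled sampling
    guidance_scale=0.0,      # guidance is baked into the distilled model
    height=768,
    width=768,
).images[0]
image.save("creature.png")
```

The key property is that the dual text encoders never load on-device: the pipeline only receives `prompt_embeds` / `pooled_prompt_embeds` from the server, and the distilled U-Net runs a single denoising step with classifier-free guidance disabled.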
Stack 2: DMD2 one-step distillation + architectural U-Net trimming with knowledge distillation + 4-bit quantization + text-encoder offload (distillation-loss sketch after the table).

| Step | Action | Compute Impact | Memory Impact | Risk |
|---|---|---|---|---|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Architecturally trim U-Net (BK-SDM style) + KD | ~30-50% U-Net param/MAC reduction | ~30-50% U-Net weight reduction | Medium |
| 3 | Apply 4-bit quantization to trimmed U-Net | Minimal | ~4x reduction on already smaller U-Net | Low |
| 4 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |
Total On-Device RAM: ~0.8 - 1.2 GB. Latency: Interactive. Path: Higher optimization ceiling, requires architectural modification and knowledge distillation.
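The new cost in Stack 2 is the knowledge-distillation retraining of the trimmed U-Net. As a rough illustration of a BK-SDM-style objective (task loss plus output-level and feature-level distillation against the frozen original U-Net), a simplified loss sketch follows; the loss weights, the feature hook points, and the exact blocks removed from the student are assumptions to be tuned, not the published recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, noise_target,
                      student_feats, teacher_feats,
                      w_task=1.0, w_out=1.0, w_feat=1.0):
    """Simplified BK-SDM-style objective for a trimmed student U-Net:
      - task loss:   student predicts the usual diffusion noise target
      - output KD:   student matches the frozen teacher U-Net's prediction
      - feature KD:  student matches selected intermediate teacher activations
    Weights and hook points are placeholders to be tuned."""
    task = F.mse_loss(student_out, noise_target)
    out_kd = F.mse_loss(student_out, teacher_out)
    feat_kd = sum(F.mse_loss(fs, ft) for fs, ft in zip(student_feats, teacher_feats))
    return w_task * task + w_out * out_kd + w_feat * feat_kd

# Toy tensors just to show the call; in training, both U-Nets see the same noisy
# latents, timestep, and text conditioning, with the teacher run under torch.no_grad().
student_out = torch.randn(2, 4, 96, 96)
teacher_out = torch.randn(2, 4, 96, 96)
noise = torch.randn(2, 4, 96, 96)
student_feats = [torch.randn(2, 320, 96, 96)]
teacher_feats = [torch.randn(2, 320, 96, 96)]
print(distillation_loss(student_out, teacher_out, noise, student_feats, teacher_feats))
```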
Stack 3: DMD2 one-step distillation + BitsFusion-style ~2-bit U-Net quantization + text-encoder offload (quantization sketch after the table).

| Step | Action | Compute Impact | Memory Impact | Risk |
|---|---|---|---|---|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Apply BitsFusion-style ~2-bit quantization to U-Net | Minimal | ~7.9x U-Net weight reduction | High |
| 3 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |
Total On-Device RAM: < 1.0 GB feasible. Latency: Interactive. Path: Highest potential memory savings, but requires significant R&D to port training recipe and validate quality.
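BitsFusion's actual recipe mixes per-layer bit widths (averaging roughly 2 bits) with distillation-based fine-tuning to recover quality, which is where the R&D risk sits. To make the memory arithmetic concrete, here is a crude uniform low-bit weight quantize/dequantize sketch; it is an illustrative stand-in, not the BitsFusion method, and a real deployment would pack the integer codes rather than store them as int8.

```python
import torch

def quantize_weight(w: torch.Tensor, bits: int):
    """Symmetric per-output-channel uniform quantization.
    Returns integer codes (stored as int8 here for simplicity) plus one scale per channel."""
    qmax = 2 ** (bits - 1) - 1                                  # 7 for 4-bit, 1 for 2-bit
    w2d = w.reshape(w.shape[0], -1)                             # (out_channels, everything else)
    scale = w2d.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round(w2d / scale), -qmax - 1, qmax).to(torch.int8)
    return codes, scale

def dequantize_weight(codes: torch.Tensor, scale: torch.Tensor, shape):
    return (codes.float() * scale).reshape(shape)

# Memory math for a ~2.6B-parameter U-Net (weights only):
#   FP16:   2.6e9 * 16 bits ~= 5.2 GB
#   4-bit:  2.6e9 *  4 bits ~= 1.3 GB   (the ~4x reduction used in Stacks 1-2)
#   ~2-bit: 2.6e9 *  2 bits ~= 0.65 GB  (the ~7.9x BitsFusion-style figure above)
w = torch.randn(1280, 640, 3, 3)          # a typical conv weight shape
codes, scale = quantize_weight(w, bits=4)
w_hat = dequantize_weight(codes, scale, w.shape)
print("reconstruction MSE:", torch.mean((w - w_hat) ** 2).item())
```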
This roadmap prioritizes time-to-device and de-risking, assuming Stack 1 or 2 as the initial target. Each stage is a prerequisite for the next.