On-Device Image Gen Optimization

A quantitative analysis of techniques and stacked architectures for deploying a memory-constrained, low-latency SDXL-based generator on mobile devices.

tl;dr

If we wanted to eliminate creature generation costs by moving image generation on-device, we'd run into a few challenges. I've done a preliminary investigation into recent (2024+) R&D techniques that show the most promise for reducing the compute and memory footprint of a Stable Diffusion image generator. These approaches can, in theory, be stacked to reduce compute and RAM usage while maintaining overall fidelity.

Problem Definition

Primary Constraint: Compute Cost (Latency)

Standard SDXL inference requires 20-50 denoising steps, resulting in multi-second to minute-scale latencies on mobile hardware, which is unacceptable for an interactive user experience. The primary objective is to reduce inference time to sub-second levels for 512-768px images.

Secondary Constraint: Memory Footprint (RAM)

The baseline SDXL model architecture (U-Net, dual text encoders, VAE) consumes >6 GB of RAM at FP16, exceeding the resource budget of typical mobile devices. The secondary objective is to reduce the on-device memory footprint to a target of <1.5 GB, with stretch goals for a sub-1 GB deployment.

SDXL Architecture & Cost Analysis

[Figure: SDXL architecture diagram showing the three main components: the dual text encoders, the U-Net, and the VAE. Component size and color indicate relative resource cost (RAM & compute), from high (U-Net) to low (VAE).]

RAM Footprint Analysis

  1. U-Net: Dominant consumer at ~5.1 GB (FP16).
  2. Text Encoders: Secondary consumer at ~1.6 GB (FP16) combined.
  3. VAE: Minor consumer at ~168 MB (FP16).
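As a quick cross-check against the >6 GB baseline figure, the FP16 component sizes above can simply be summed. This is a weights-only back-of-envelope calculation; activations and framework overhead come on top.

```python
# Back-of-envelope FP16 RAM budget from the component sizes above.
# Weights only: activations, caches, and framework overhead are excluded.
components_gb = {
    "unet": 5.1,
    "text_encoders": 1.6,
    "vae": 0.168,
}
total_gb = sum(components_gb.values())
print(f"total weights at FP16: {total_gb:.2f} GB")  # ~6.9 GB, i.e. >6 GB
```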

Compute Cost Analysis

  1. U-Net: Dominant source. Executed iteratively (20-50x) per generation.
  2. Text Encoders: Secondary source. Executed once per prompt.
  3. VAE: Minor source. Executed once to decode the final image.

Source: Architectural details are derived from the official SDXL research paper (Podell et al., 2023, arXiv:2307.01952).

Technical Primer: Core Optimization Methodologies

Distribution Matching Distillation (DMD2)

  • Definition: A distillation technique that trains a student generator to match the output distribution of a teacher model (SDXL) in one or few steps. This is the primary method for latency reduction, as interactive applications cannot tolerate a multi-step denoising process.
  • Mechanism for Optimization: Addresses compute cost by collapsing the iterative denoising loop (N steps) into a single forward pass (1 step).
  • Quantified Impact: ~10-50x reduction in U-Net FLOPs. No reduction in model memory.
  • Pro: Drastic latency reduction with a single, upfront training cost. Maintains original model architecture, simplifying integration.
  • Con: Does not reduce memory footprint. Quality is sensitive to the distillation process.
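To make the mechanism concrete, the sketch below contrasts a conventional iterative sampler with a one-step distilled student. The `TinyUNet` module and the simplified update rule are placeholders for illustration, not the real SDXL U-Net or the DMD2 training objective; the point is only that the student replaces 20-50 U-Net evaluations with one.

```python
# Illustrative sketch of the latency win from one-step distillation.
# TinyUNet is a stand-in denoiser, not the real SDXL U-Net.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stand-in latent-space denoiser (same call signature in both paths)."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, latents, t, text_emb):
        # t and text_emb are ignored in this toy model; kept for signature parity.
        return self.net(latents)

@torch.no_grad()
def multi_step_generate(unet, latents, text_emb, steps=30):
    # Baseline: the U-Net runs `steps` times per image (20-50x in SDXL).
    for t in torch.linspace(1.0, 0.0, steps):
        noise_pred = unet(latents, t, text_emb)
        latents = latents - noise_pred / steps  # simplified update rule
    return latents

@torch.no_grad()
def one_step_generate(student, noise, text_emb):
    # DMD2-style student: a single forward pass maps noise -> clean latents.
    return student(noise, torch.tensor(1.0), text_emb)

noise = torch.randn(1, 4, 64, 64)
text_emb = torch.randn(1, 77, 2048)
teacher, student = TinyUNet(), TinyUNet()
_ = multi_step_generate(teacher, noise, text_emb)  # 30 U-Net evaluations
_ = one_step_generate(student, noise, text_emb)    # 1 U-Net evaluation
```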

Architectural Pruning (BK-SDM)

  • Definition: A method that systematically removes entire residual or attention blocks from the U-Net. This is desirable because it reduces the fundamental cost *before* other techniques like quantization are applied, leading to a better final Pareto point.
  • Mechanism for Optimization: Addresses both compute cost (fewer FLOPs) and memory footprint (fewer parameters).
  • Quantified Impact: ~30-50% U-Net parameter reduction (e.g., from ~5.1 GB to ~2.5-3.6 GB before quantization).
  • Pro: Reduces memory and compute simultaneously. Preserves standard U-Net structure for compatibility with tools like LoRA.
  • Con: Requires an additional, non-trivial knowledge distillation step. Pruning is an empirical process.
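The sketch below shows the shape of the operation on a stand-in model: whole residual blocks are dropped from the block list and the parameter count is re-measured. `ToyUNet`, the block selection, and the ~50% figure are illustrative assumptions; choosing which blocks to remove and re-training the pruned model with knowledge distillation are the non-trivial parts of the BK-SDM recipe.

```python
# Illustrative sketch of block-level pruning (BK-SDM style) on a toy model.
import torch.nn as nn

def res_block(ch: int) -> nn.Module:
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
                         nn.Conv2d(ch, ch, 3, padding=1))

class ToyUNet(nn.Module):
    def __init__(self, ch: int = 64, n_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(res_block(ch) for _ in range(n_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)  # residual connection
        return x

def prune_blocks(model: ToyUNet, keep: list[int]) -> ToyUNet:
    # Keep only the listed block indices; in practice the selection is
    # empirical and the pruned model is re-trained via knowledge distillation.
    model.blocks = nn.ModuleList(model.blocks[i] for i in keep)
    return model

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

full = ToyUNet()
before = n_params(full)
pruned = prune_blocks(full, keep=[0, 2, 4, 6])  # drop every other block
print(f"params: {before:,} -> {n_params(pruned):,} (~50% reduction)")
```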

Weight Quantization (4-bit / 8-bit)

  • Definition: Reduces the precision of model weights from FP16 to low-bit integers (INT4/INT8). This is the industry-standard method for footprint reduction due to its reliability and broad hardware support.
  • Mechanism for Optimization: Addresses memory footprint by storing each weight using fewer bits.
  • Quantified Impact: ~4x reduction for 4-bit (e.g., U-Net from ~5.1 GB to ~1.28 GB).
  • Pro: Standard technique providing predictable memory savings. PTQ is fast to implement. LoRA can be kept in high precision over a quantized base.
  • Con: Can introduce minor quality degradation. QAT adds a fine-tuning step.
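A minimal sketch of the underlying mechanism follows, assuming simple per-tensor symmetric post-training quantization; production stacks typically use per-channel scales and packed INT4 storage, but the memory arithmetic is the same.

```python
# Minimal sketch of symmetric post-training weight quantization (PTQ).
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale                   # approximate original weights

w = torch.randn(4096, 4096)                    # one full-precision weight matrix
q, scale = quantize_symmetric(w, bits=8)
w_hat = dequantize(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())

# Memory math behind the ~4x figure for 4-bit:
fp16_gb = w.numel() * 2 / 2**30                # 2 bytes per FP16 weight
int4_gb = w.numel() * 0.5 / 2**30              # 0.5 bytes per packed INT4 weight
print(f"FP16 {fp16_gb:.3f} GB -> INT4 {int4_gb:.3f} GB (4x smaller)")
```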

Mixed-Precision Quantization (BitsFusion)

  • Definition: An advanced, layer-wise quantization scheme that assigns an optimal bit-width to each layer. This approach is pursued when memory is the absolute priority and standard quantization is insufficient.
  • Mechanism for Optimization: Aggressively targets memory footprint with a more granular approach.
  • Quantified Impact (projected from SD1.5): ~7.9x reduction, potentially taking the U-Net from ~5.1 GB to ~650 MB.
  • Pro: Highest potential memory compression.
  • Con: Highest R&D cost. Requires implementing a complex fine-tuning recipe and porting from SD1.5, presenting a significant engineering risk.
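The sketch below illustrates the bit-allocation idea with made-up layer names and sensitivity scores; the real BitsFusion recipe derives per-layer bit-widths from error analysis and pairs them with a fine-tuning stage, reaching ~1.99 average bits on the SD1.5 U-Net, which is where the ~7.9x figure comes from.

```python
# Sketch of mixed-precision (layer-wise) bit allocation in the spirit of
# BitsFusion. Layer names, sizes, and sensitivity scores are illustrative.
layers = {
    # name: (params in millions, sensitivity score from a calibration pass)
    "time_embed":        (20,  0.95),
    "down_blocks.attn":  (600, 0.80),
    "mid_block":         (400, 0.40),
    "up_blocks.attn":    (700, 0.75),
    "up_blocks.resnets": (800, 0.20),
}

def assign_bits(sensitivity: float) -> int:
    # Simple threshold policy; BitsFusion derives this from per-layer error
    # analysis plus a quantization-aware fine-tuning recipe.
    if sensitivity > 0.9:
        return 8
    if sensitivity > 0.5:
        return 4
    return 2

total_params = sum(p for p, _ in layers.values())
total_bits = sum(p * assign_bits(s) for p, s in layers.values())
avg_bits = total_bits / total_params
print(f"average bit-width: {avg_bits:.2f} bits")
print(f"compression vs FP16: {16 / avg_bits:.1f}x")
```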

Comparative Analysis of Optimization Stacks

Stack 1: Latency First via Quantization (Low Risk)

| Step | Action | Compute Impact | Memory Impact | Risk |
|------|--------|----------------|---------------|------|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Apply 4-bit quantization to U-Net | Minimal | ~4x U-Net weight reduction | Low |
| 3 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |

Stack 1 - Final Estimated State:

Total On-Device RAM: ~1.0 - 1.5 GB. Latency: Interactive (<1s compute + network). Path: Quickest to a stable, shippable artifact.

Stack 2: Architectural Pruning + Quantization (Medium Risk)

| Step | Action | Compute Impact | Memory Impact | Risk |
|------|--------|----------------|---------------|------|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Architecturally trim U-Net (BK-SDM style) + KD | ~30-50% U-Net param/MAC reduction | ~30-50% U-Net weight reduction | Medium |
| 3 | Apply 4-bit quantization to trimmed U-Net | Minimal | ~4x reduction on already smaller U-Net | Low |
| 4 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |

Stack 2 - Final Estimated State:

Total On-Device RAM: ~0.8 - 1.2 GB. Latency: Interactive. Path: Higher optimization ceiling, requires architectural modification and knowledge distillation.

Stack 3: Extreme Quantization (High Risk)

| Step | Action | Compute Impact | Memory Impact | Risk |
|------|--------|----------------|---------------|------|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Apply BitsFusion-style ~2-bit quantization to U-Net | Minimal | ~7.9x U-Net weight reduction | High |
| 3 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |

Stack 3 - Final Estimated State:

Total On-Device RAM: < 1.0 GB feasible. Latency: Interactive. Path: Highest potential memory savings, but requires significant R&D to port training recipe and validate quality.
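As a rough cross-check of the three final-state estimates above, the sketch below applies each stack's headline compression factor to the FP16 component sizes. It counts weights only and ignores activations, runtime buffers, and quantization metadata, which is part of why the tables quote ranges rather than point estimates.

```python
# Rough cross-check of the final-state RAM estimates for each stack.
# Weights only; activations, runtime buffers, and quantization metadata excluded.
UNET_GB, VAE_GB = 5.1, 0.168          # FP16 sizes; text encoders are offloaded

def stack1():                          # 4-bit U-Net
    return UNET_GB / 4 + VAE_GB

def stack2(prune_ratio=0.4):           # ~30-50% pruning, then 4-bit
    return UNET_GB * (1 - prune_ratio) / 4 + VAE_GB

def stack3():                          # BitsFusion-style ~7.9x compression
    return UNET_GB / 7.9 + VAE_GB

for name, gb in [("Stack 1", stack1()), ("Stack 2", stack2()), ("Stack 3", stack3())]:
    print(f"{name}: ~{gb:.2f} GB of weights on device")
# Prints roughly 1.44 GB, 0.93 GB, and 0.81 GB, consistent with the ranges above.
```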

Staged Deployment Roadmap

This roadmap prioritizes time-to-device and de-risking, assuming Stack 1 or 2 as the initial target. Each stage is a prerequisite for the next.

Phase 0: Baseline Validation. Establish performance benchmarks (latency, RAM) for the unoptimized SDXL teacher model on target hardware. Finalize objective quality metric suite (FID, etc.) and subjective human rating rubric.
Phase 1: DMD2 Compute Validation. Using the pre-trained DMD2 student model, validate the latency reduction on a high-memory device (e.g., desktop GPU, iPad Pro) to isolate compute improvements from memory constraints.
Phase 2: Memory Footprint Reduction. Execute chosen footprint reduction strategy (e.g., Stack 2, Steps 2-3: U-Net pruning followed by 4-bit quantization). Validate that quality metrics remain within acceptable deviation from the Phase 1 baseline.
Phase 3: Specialization & LoRA Integration. Fine-tune LoRA adapters on the compressed base model from Phase 2. Validate adapter performance and confirm no catastrophic forgetting on general prompts.
Phase 4: Encoder Offloading. Implement the client-side API call for text embeddings and remove on-device text encoders. Conduct end-to-end testing, including network failure modes and caching strategies.
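A hypothetical sketch of the Phase 4 client path follows: the device requests prompt embeddings from a server-side text-encoder service and caches them per prompt. The endpoint URL, payload shape, and caching/fallback policy are assumptions to be settled during implementation and end-to-end testing.

```python
# Hypothetical Phase 4 client: fetch SDXL text embeddings from a server-side
# encoder service and cache them per prompt. Endpoint and response schema
# below are assumptions, not an existing API.
import hashlib
import numpy as np
import requests

EMBEDDING_ENDPOINT = "https://api.example.com/v1/text-embeddings"  # hypothetical
_cache: dict[str, np.ndarray] = {}

def get_text_embeddings(prompt: str, timeout_s: float = 2.0) -> np.ndarray:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:                  # avoid repeat network calls per prompt
        return _cache[key]
    try:
        resp = requests.post(
            EMBEDDING_ENDPOINT,
            json={"prompt": prompt, "model": "sdxl-text-encoders"},
            timeout=timeout_s,
        )
        resp.raise_for_status()
        emb = np.asarray(resp.json()["embeddings"], dtype=np.float16)
    except requests.RequestException:
        # Network failure mode: retry, fall back to a cached embedding, or
        # surface an error to the caller; policy decided in Phase 4 testing.
        raise
    _cache[key] = emb
    return emb
```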