On-Device Image Gen Optimization

A quantitative analysis of techniques and stacked architectures for deploying a memory-constrained, low-latency SDXL-based generator on mobile devices.

tl;dr

If we wanted to eliminate creature generation costs by moving image generation on-device, we'd run into a few challenges. I've done a preliminary investigation into recent (2024+) R&D techniques that show the most promise for reducing the compute and memory footprint of a Stable Diffusion image generator. These approaches can, in theory, be stacked to reduce compute and RAM usage while maintaining overall fidelity.

Problem Definition

Primary Constraint: Compute Cost (Latency)

Standard SDXL inference requires 20-50 denoising steps, resulting in multi-second to minute-scale latencies on mobile hardware, which is unacceptable for an interactive user experience. The primary objective is to reduce inference time to sub-second levels for 512-768px images.

Secondary Constraint: Memory Footprint (RAM)

The baseline SDXL model architecture (U-Net, dual text encoders, VAE) consumes >6 GB of RAM at FP16, exceeding the resource budget of typical mobile devices. The secondary objective is to reduce the on-device memory footprint to a target of <1.5 GB, with stretch goals for a sub-1 GB deployment.

SDXL Architecture & Cost Analysis

[Figure: SDXL architecture diagram showing the three main components: the dual text encoders, the U-Net, and the VAE. Component size and color indicate relative resource cost (RAM & compute), from high (U-Net) to low (VAE).]

RAM Footprint Analysis

  1. U-Net: Dominant consumer at ~5.1 GB (FP16).
  2. Text Encoders: Secondary consumer at ~1.6 GB (FP16) combined.
  3. VAE: Minor consumer at ~168 MB (FP16).
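As a quick cross-check against the >6 GB baseline figure, the FP16 component sizes above can simply be summed. This is a weights-only back-of-envelope calculation; activations and framework overhead come on top.

```python
# Back-of-envelope FP16 RAM budget from the component sizes above.
# Weights only: activations, caches, and framework overhead are excluded.
components_gb = {
    "unet": 5.1,
    "text_encoders": 1.6,
    "vae": 0.168,
}
total_gb = sum(components_gb.values())
print(f"total weights at FP16: {total_gb:.2f} GB")  # ~6.9 GB, i.e. >6 GB
```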

Compute Cost Analysis

  1. U-Net: Dominant source. Executed iteratively (20-50x) per generation.
  2. Text Encoders: Secondary source. Executed once per prompt.
  3. VAE: Minor source. Executed once to decode the final image.

Source: Architectural details are derived from the official SDXL research paper (Podell et al., 2023, arXiv:2307.01952).

Technical Primer: Core Optimization Methodologies

Distribution Matching Distillation (DMD2)

  • Definition: A distillation technique that trains a student generator to match the output distribution of a teacher model (SDXL) in one or few steps. This is the primary method for latency reduction, as interactive applications cannot tolerate a multi-step denoising process.
  • Mechanism for Optimization: Addresses compute cost by collapsing the iterative denoising loop (N steps) into a single forward pass (1 step).
  • Quantified Impact: ~10-50x reduction in U-Net FLOPs. No reduction in model memory.
  • Pro: Drastic latency reduction with a single, upfront training cost. Maintains original model architecture, simplifying integration.
  • Con: Does not reduce memory footprint. Quality is sensitive to the distillation process.
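To make the mechanism concrete, the sketch below contrasts a conventional iterative sampler with a one-step distilled student. The `TinyUNet` module and the simplified update rule are placeholders for illustration, not the real SDXL U-Net or the DMD2 training objective; the point is only that the student replaces 20-50 U-Net evaluations with one.

```python
# Illustrative sketch of the latency win from one-step distillation.
# TinyUNet is a stand-in denoiser, not the real SDXL U-Net.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stand-in latent-space denoiser (same call signature in both paths)."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, latents, t, text_emb):
        # t and text_emb are ignored in this toy model; kept for signature parity.
        return self.net(latents)

@torch.no_grad()
def multi_step_generate(unet, latents, text_emb, steps=30):
    # Baseline: the U-Net runs `steps` times per image (20-50x in SDXL).
    for t in torch.linspace(1.0, 0.0, steps):
        noise_pred = unet(latents, t, text_emb)
        latents = latents - noise_pred / steps  # simplified update rule
    return latents

@torch.no_grad()
def one_step_generate(student, noise, text_emb):
    # DMD2-style student: a single forward pass maps noise -> clean latents.
    return student(noise, torch.tensor(1.0), text_emb)

noise = torch.randn(1, 4, 64, 64)
text_emb = torch.randn(1, 77, 2048)
teacher, student = TinyUNet(), TinyUNet()
_ = multi_step_generate(teacher, noise, text_emb)  # 30 U-Net evaluations
_ = one_step_generate(student, noise, text_emb)    # 1 U-Net evaluation
```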

Architectural Pruning (BK-SDM)

  • Definition: A method that systematically removes entire residual or attention blocks from the U-Net. This is desirable because it reduces the fundamental cost *before* other techniques like quantization are applied, leading to a better final Pareto point.
  • Mechanism for Optimization: Addresses both compute cost (fewer FLOPs) and memory footprint (fewer parameters).
  • Quantified Impact: ~30-50% U-Net parameter reduction (e.g., from ~5.1 GB to ~2.5-3.6 GB before quantization).
  • Pro: Reduces memory and compute simultaneously. Preserves standard U-Net structure for compatibility with tools like LoRA.
  • Con: Requires an additional, non-trivial knowledge distillation step. Pruning is an empirical process.
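The sketch below shows the shape of the operation on a stand-in model: whole residual blocks are dropped from the block list and the parameter count is re-measured. `ToyUNet`, the block selection, and the ~50% figure are illustrative assumptions; choosing which blocks to remove and re-training the pruned model with knowledge distillation are the non-trivial parts of the BK-SDM recipe.

```python
# Illustrative sketch of block-level pruning (BK-SDM style) on a toy model.
import torch.nn as nn

def res_block(ch: int) -> nn.Module:
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
                         nn.Conv2d(ch, ch, 3, padding=1))

class ToyUNet(nn.Module):
    def __init__(self, ch: int = 64, n_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(res_block(ch) for _ in range(n_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)  # residual connection
        return x

def prune_blocks(model: ToyUNet, keep: list[int]) -> ToyUNet:
    # Keep only the listed block indices; in practice the selection is
    # empirical and the pruned model is re-trained via knowledge distillation.
    model.blocks = nn.ModuleList(model.blocks[i] for i in keep)
    return model

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

full = ToyUNet()
before = n_params(full)
pruned = prune_blocks(full, keep=[0, 2, 4, 6])  # drop every other block
print(f"params: {before:,} -> {n_params(pruned):,} (~50% reduction)")
```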

Weight Quantization (4-bit / 8-bit)

  • Definition: Reduces the precision of model weights from FP16 to low-bit integers (INT4/INT8). This is the industry-standard method for footprint reduction due to its reliability and broad hardware support.
  • Mechanism for Optimization: Addresses memory footprint by storing each weight using fewer bits.
  • Quantified Impact: ~4x reduction for 4-bit (e.g., U-Net from ~5.1 GB to ~1.28 GB).
  • Pro: Standard technique providing predictable memory savings. PTQ is fast to implement. LoRA can be kept in high precision over a quantized base.
  • Con: Can introduce minor quality degradation. QAT adds a fine-tuning step.
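A minimal sketch of the underlying mechanism follows, assuming simple per-tensor symmetric post-training quantization; production stacks typically use per-channel scales and packed INT4 storage, but the memory arithmetic is the same.

```python
# Minimal sketch of symmetric post-training weight quantization (PTQ).
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale                   # approximate original weights

w = torch.randn(4096, 4096)                    # one full-precision weight matrix
q, scale = quantize_symmetric(w, bits=8)
w_hat = dequantize(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())

# Memory math behind the ~4x figure for 4-bit:
fp16_gb = w.numel() * 2 / 2**30                # 2 bytes per FP16 weight
int4_gb = w.numel() * 0.5 / 2**30              # 0.5 bytes per packed INT4 weight
print(f"FP16 {fp16_gb:.3f} GB -> INT4 {int4_gb:.3f} GB (4x smaller)")
```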

Mixed-Precision Quantization (BitsFusion)

  • Definition: An advanced, layer-wise quantization scheme that assigns an optimal bit-width to each layer. This approach is pursued when memory is the absolute priority and standard quantization is insufficient.
  • Mechanism for Optimization: Aggressively targets memory footprint with a more granular approach.
  • Quantified Impact (projected from SD1.5): ~7.9x reduction, potentially taking the U-Net from ~5.1 GB to ~650 MB.
  • Pro: Highest potential memory compression.
  • Con: Highest R&D cost. Requires implementing a complex fine-tuning recipe and porting from SD1.5, presenting a significant engineering risk.
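The sketch below illustrates the bit-allocation idea with made-up layer names and sensitivity scores; the real BitsFusion recipe derives per-layer bit-widths from error analysis and pairs them with a fine-tuning stage, reaching ~1.99 average bits on the SD1.5 U-Net, which is where the ~7.9x figure comes from.

```python
# Sketch of mixed-precision (layer-wise) bit allocation in the spirit of
# BitsFusion. Layer names, sizes, and sensitivity scores are illustrative.
layers = {
    # name: (params in millions, sensitivity score from a calibration pass)
    "time_embed":        (20,  0.95),
    "down_blocks.attn":  (600, 0.80),
    "mid_block":         (400, 0.40),
    "up_blocks.attn":    (700, 0.75),
    "up_blocks.resnets": (800, 0.20),
}

def assign_bits(sensitivity: float) -> int:
    # Simple threshold policy; BitsFusion derives this from per-layer error
    # analysis plus a quantization-aware fine-tuning recipe.
    if sensitivity > 0.9:
        return 8
    if sensitivity > 0.5:
        return 4
    return 2

total_params = sum(p for p, _ in layers.values())
total_bits = sum(p * assign_bits(s) for p, s in layers.values())
avg_bits = total_bits / total_params
print(f"average bit-width: {avg_bits:.2f} bits")
print(f"compression vs FP16: {16 / avg_bits:.1f}x")
```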

Comparative Analysis of Optimization Stacks

Stack 1: Latency First via Quantization (Low Risk)

| Step | Action | Compute Impact | Memory Impact | Risk |
|------|--------|----------------|---------------|------|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Apply 4-bit quantization to U-Net | Minimal | ~4x U-Net weight reduction | Low |
| 3 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |

Stack 1 - Final Estimated State:

Total On-Device RAM: ~1.0 - 1.5 GB. Latency: Interactive (<1s compute + network). Path: Quickest to a stable, shippable artifact.

Stack 2: Architectural Pruning + Quantization (Medium Risk)

| Step | Action | Compute Impact | Memory Impact | Risk |
|------|--------|----------------|---------------|------|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Architecturally trim U-Net (BK-SDM style) + KD | ~30-50% U-Net param/MAC reduction | ~30-50% U-Net weight reduction | Medium |
| 3 | Apply 4-bit quantization to trimmed U-Net | Minimal | ~4x reduction on already smaller U-Net | Low |
| 4 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |

Stack 2 - Final Estimated State:

Total On-Device RAM: ~0.8 - 1.2 GB. Latency: Interactive. Path: Higher optimization ceiling, requires architectural modification and knowledge distillation.

Stack 3: Extreme Quantization (High Risk)

| Step | Action | Compute Impact | Memory Impact | Risk |
|------|--------|----------------|---------------|------|
| 1 | Apply DMD2 one-step distillation | ~10-50x latency reduction | None | Low |
| 2 | Apply BitsFusion-style ~2-bit quantization to U-Net | Minimal | ~7.9x U-Net weight reduction | High |
| 3 | Offload dual text encoders via API | Adds network latency (~1s) | Removes ~1.6 GB from device | Low |

Stack 3 - Final Estimated State:

Total On-Device RAM: < 1.0 GB feasible. Latency: Interactive. Path: Highest potential memory savings, but requires significant R&D to port training recipe and validate quality.
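As a rough cross-check of the three final-state estimates above, the sketch below applies each stack's headline compression factor to the FP16 component sizes. It counts weights only and ignores activations, runtime buffers, and quantization metadata, which is part of why the tables quote ranges rather than point estimates.

```python
# Rough cross-check of the final-state RAM estimates for each stack.
# Weights only; activations, runtime buffers, and quantization metadata excluded.
UNET_GB, VAE_GB = 5.1, 0.168          # FP16 sizes; text encoders are offloaded

def stack1():                          # 4-bit U-Net
    return UNET_GB / 4 + VAE_GB

def stack2(prune_ratio=0.4):           # ~30-50% pruning, then 4-bit
    return UNET_GB * (1 - prune_ratio) / 4 + VAE_GB

def stack3():                          # BitsFusion-style ~7.9x compression
    return UNET_GB / 7.9 + VAE_GB

for name, gb in [("Stack 1", stack1()), ("Stack 2", stack2()), ("Stack 3", stack3())]:
    print(f"{name}: ~{gb:.2f} GB of weights on device")
# Prints roughly 1.44 GB, 0.93 GB, and 0.81 GB, consistent with the ranges above.
```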

Staged Deployment Roadmap

This roadmap prioritizes time-to-device and de-risking, assuming Stack 1 or 2 as the initial target. Each stage is a prerequisite for the next.

Phase 0: Baseline Validation. Establish performance benchmarks (latency, RAM) for the unoptimized SDXL teacher model on target hardware. Finalize objective quality metric suite (FID, etc.) and subjective human rating rubric.
Phase 1: DMD2 Compute Validation. Using the pre-trained DMD2 student model, validate the latency reduction on a high-memory device (e.g., desktop GPU, iPad Pro) to isolate compute improvements from memory constraints.
Phase 2: Memory Footprint Reduction. Execute chosen footprint reduction strategy (e.g., Stack 2, Steps 2-3: U-Net pruning followed by 4-bit quantization). Validate that quality metrics remain within acceptable deviation from the Phase 1 baseline.
Phase 3: Specialization & LoRA Integration. Fine-tune LoRA adapters on the compressed base model from Phase 2. Validate adapter performance and confirm no catastrophic forgetting on general prompts.
Phase 4: Encoder Offloading. Implement the client-side API call for text embeddings and remove on-device text encoders. Conduct end-to-end testing, including network failure modes and caching strategies.
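A hypothetical sketch of the Phase 4 client path follows: the device requests prompt embeddings from a server-side text-encoder service and caches them per prompt. The endpoint URL, payload shape, and caching/fallback policy are assumptions to be settled during implementation and end-to-end testing.

```python
# Hypothetical Phase 4 client: fetch SDXL text embeddings from a server-side
# encoder service and cache them per prompt. Endpoint and response schema
# below are assumptions, not an existing API.
import hashlib
import numpy as np
import requests

EMBEDDING_ENDPOINT = "https://api.example.com/v1/text-embeddings"  # hypothetical
_cache: dict[str, np.ndarray] = {}

def get_text_embeddings(prompt: str, timeout_s: float = 2.0) -> np.ndarray:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:                  # avoid repeat network calls per prompt
        return _cache[key]
    try:
        resp = requests.post(
            EMBEDDING_ENDPOINT,
            json={"prompt": prompt, "model": "sdxl-text-encoders"},
            timeout=timeout_s,
        )
        resp.raise_for_status()
        emb = np.asarray(resp.json()["embeddings"], dtype=np.float16)
    except requests.RequestException:
        # Network failure mode: retry, fall back to a cached embedding, or
        # surface an error to the caller; policy decided in Phase 4 testing.
        raise
    _cache[key] = emb
    return emb
```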