AI Image Generation Roadmap

PixArt → SANA → SANA-Optimized → Production

Target Device
iPhone 15 Pro (8GB RAM)
~2GB total memory footprint

Timeline Overview

Phase 1: 1-2 days
Phase 2: 1-2 weeks
Phase 3: 1 week
Phase 4: 2-3 weeks
Phase 5: 1-2 weeks
Total: 6-10 weeks

Phase 1: Complete PixArt Branch

Current Phase

1.1 PixArt-Alpha/Sigma Feature Complete

  • Pipeline Config API (Phase 4; sketched below)
  • VAE mode selection (Full/Tiny)
  • Adapter weight per-request
  • Godot settings UI
  • Merge to main ai-engine branch
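
A minimal sketch of what the per-request config could look like on the Rust side; all names here (PipelineConfig, VaeMode, adapter_weight) are illustrative assumptions, not the actual ai-engine API:

```rust
// Illustrative only: the real Pipeline Config API may differ.
pub enum VaeMode {
    Full, // full VAE decoder, best quality
    Tiny, // Tiny VAE, lower memory and faster decode
}

pub struct PipelineConfig {
    pub vae_mode: VaeMode,
    /// Per-request adapter (LoRA) strength, e.g. 0.0..=1.0.
    pub adapter_weight: f32,
}
```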

Phase 2: SANA Base Implementation

Branch: feature/sana-model

2.1 SANA DiT in Candle/Rust

  • Linear attention (O(N) vs O(N²)); sketched below
  • SanaLinearTransformerBlock
  • MultiHeadCrossAttention
  • GLUMBConv feedforward
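
For reference, a minimal sketch of the ReLU linear-attention formulation, assuming (batch, heads, seq, head_dim) tensors and Candle's tensor API; the function name and epsilon are illustrative, not the branch's actual block code:

```rust
use candle_core::{Result, Tensor};

/// q, k, v: (batch, heads, seq, head_dim). Cost is O(N * d^2) instead of the
/// O(N^2 * d) of softmax attention, because K^T V is reduced before Q is applied.
fn relu_linear_attention(q: &Tensor, k: &Tensor, v: &Tensor) -> Result<Tensor> {
    let q = q.relu()?; // ReLU feature map replaces softmax
    let k = k.relu()?;

    // (batch, heads, head_dim, head_dim): independent of sequence length.
    let kv = k.transpose(2, 3)?.contiguous()?.matmul(v)?;

    // Numerator: (batch, heads, seq, head_dim).
    let numer = q.matmul(&kv)?;

    // Normalizer: Q projected onto sum_n(K), shape (batch, heads, seq, 1).
    let k_sum = k.sum_keepdim(2)?; // (batch, heads, 1, head_dim)
    let denom = q.matmul(&k_sum.transpose(2, 3)?.contiguous()?)?;

    numer.broadcast_div(&denom.affine(1.0, 1e-6)?)
}
```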

2.2 DC-AE Decoder

  • 32× spatial compression (shape example below)
  • Port from SANA reference
  • NHWC Metal optimization
~600M params → ~1.2 GB
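
A back-of-envelope helper for what 32× spatial compression means for latent geometry; the latent channel count is a placeholder assumption here:

```rust
// Each spatial axis shrinks by 32x, so a 1024x1024 image maps to a 32x32 latent grid.
fn dcae_latent_shape(height: usize, width: usize, latent_channels: usize) -> (usize, usize, usize) {
    const SPATIAL_COMPRESSION: usize = 32;
    (latent_channels, height / SPATIAL_COMPRESSION, width / SPATIAL_COMPRESSION)
}

// Example: dcae_latent_shape(1024, 1024, 32) == (32, 32, 32)
```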

2.3 Gemma-2-2B Text Encoder

  • Full BF16 initially (~4.7GB)
  • SANA pipeline integration
  • HuggingFace support

2.4 SANA Full Pipeline

  • End-to-end inference
  • Parity tests vs Python (sketch below)
  • Basic generation verified
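
A minimal sketch of the kind of check the parity tests could run against tensors dumped from the Python reference; the flattened-f32 representation and tolerance are assumptions:

```rust
/// Compare a Rust pipeline output against a reference dump, element by element.
fn max_abs_diff(ours: &[f32], reference: &[f32]) -> f32 {
    assert_eq!(ours.len(), reference.len(), "shape mismatch");
    ours.iter()
        .zip(reference)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0_f32, f32::max)
}

// e.g. assert!(max_abs_diff(&rust_latents, &python_latents) < 1e-2);
```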

Phase 3: Benchmarking & Comparison

Decision Point

Three Competing Models

  • PixArt-Alpha + DistillT5 + Tiny VAE
  • PixArt-Sigma + DistillT5 + Tiny VAE
  • SANA-0.6B/1.6B + Gemma-2-2B + DC-AE

3.1 Benchmark Suite

  • ms/step on M3 Max (timing harness sketched below)
  • ms/step on iPad M4 16GB
  • Memory footprint
  • Quality metrics (FID, CLIP)
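
A minimal timing-harness sketch for the ms/step numbers, assuming the denoising loop is measured in isolation (text encoding and VAE decode excluded); run_one_step is a placeholder for the real DiT forward call:

```rust
use std::time::Instant;

/// Average milliseconds per denoising step over `steps` iterations.
fn ms_per_step(steps: u32, mut run_one_step: impl FnMut()) -> f64 {
    let start = Instant::now();
    for _ in 0..steps {
        run_one_step();
    }
    start.elapsed().as_secs_f64() * 1000.0 / f64::from(steps)
}
```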

3.2 LoRA Training Comparison

  • AIBG-style LoRA on PixArt-Sigma
  • AIBG-style LoRA on SANA-0.6B
  • Compare quality & speed

DECISION POINT: SANA or PixArt?

Phase 4: SANA Miniaturization

Branch: feature/sana-optimized

Parallel Workstreams

4A. DistillGemma (longest pole)

Gemma-2-2B → ~270M params via the DistillT5 approach
~540 MB (BF16) → ~135 MB (W4A16)
1-2 weeks

4B. SVDQuant DiT

W4A4 SANA DiT
Port from Nunchaku
~370MB (0.6B) / ~990MB (1.6B)
2-3 days

4C. Hybrid VAE

Keep the DC-AE decoder at full quality
~1.2 GB

4.1 SANA-Optimized Integration

DistillGemma + SVDQuant DiT + DC-AE
Target: ~2GB total memory

Phase 5: SANA DMD & Production LoRA

Final Phase

5.1 SANA LoRA Training

  • Train AIBG LoRA on SANA
  • Verify LoRA + SVDQuant merge
  • Low-rank branch absorbs the LoRA delta (sketch below)
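
SVDQuant keeps each weight as a small 16-bit low-rank branch plus a 4-bit residual, W ≈ L1·L2 + Q, so a LoRA delta B·A can be folded in by widening the low-rank branch instead of re-quantizing. A hedged sketch of that merge using Candle tensors; shapes and names are assumptions, not the Nunchaku port's actual code:

```rust
use candle_core::{Result, Tensor, D};

/// l1: (out, r), l2: (r, in) are the 16-bit low-rank factors; the 4-bit residual
/// is untouched. LoRA factors b: (out, r_lora), a: (r_lora, in); any LoRA scaling
/// (alpha/rank, per-request weight) is assumed to be pre-folded into b.
fn absorb_lora(l1: &Tensor, l2: &Tensor, b: &Tensor, a: &Tensor) -> Result<(Tensor, Tensor)> {
    // New rank = r + r_lora; no re-quantization of the residual is needed.
    let l1_merged = Tensor::cat(&[l1, b], D::Minus1)?; // (out, r + r_lora)
    let l2_merged = Tensor::cat(&[l2, a], 0)?;         // (r + r_lora, in)
    Ok((l1_merged, l2_merged))
}
```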

5.2 DMD Distillation

  • Distill SANA + LoRA to 4-step
  • DMD2 approach
  • Single-step variant for preview

5.3 Production Pipeline

  • SVDQuant + DistillGemma + DMD
  • 4-step inference
  • ~2 GB, <500 ms

Memory Budget Progression

Phase 2: Full SANA (iPad M4 16GB)

  • DiT 1.6B (BF16): 3.3 GB
  • Gemma-2-2B (BF16): 4.7 GB
  • DC-AE (BF16): 1.2 GB
  • Runtime: ~2 GB
  • Total: ~11 GB

Phase 4: Optimized SANA (iPhone 15 Pro 8GB)

  • DiT 0.6B (W4A4): ~370 MB
  • DistillGemma (W4): ~135 MB
  • DC-AE (BF16): ~1.2 GB
  • Runtime: ~300 MB
  • Total: ~2 GB
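
The budget lines above follow from params × bits plus overheads (activation scales, the SVDQuant low-rank branch, runtime buffers). A quick sanity-check helper:

```rust
/// Approximate weight footprint in MB: bytes ≈ params * bits / 8.
fn weight_mb(params: f64, bits_per_param: f64) -> f64 {
    params * bits_per_param / 8.0 / (1024.0 * 1024.0)
}

// e.g. weight_mb(0.6e9, 4.0)  ≈ 286 MB  (plus low-rank/scale overhead → ~370 MB)
//      weight_mb(270e6, 4.0)  ≈ 129 MB  (~135 MB for DistillGemma W4)
//      weight_mb(600e6, 16.0) ≈ 1144 MB (~1.2 GB for DC-AE in BF16)
```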

Key Risks & Mitigations

DistillGemma quality loss

Benchmark before committing. Fallback: W4A16 full Gemma-2-2B (~1.2 GB).

SANA LoRA quality vs PixArt

Compare early in Phase 3. Keep PixArt as a fallback.

SVDQuant Metal port complexity

Start with BF16; add quantization later.

DMD training instability

Use the proven DMD2 recipe. 4-step distillation is more stable than single-step.

Production Target

  • Memory: ~2 GB
  • Generation: <500 ms
  • Steps: 4