Target Device
- iPhone 15 Pro (8GB RAM)
- ~2GB total memory footprint
Timeline Overview
- Phase 1: 1-2 days
- Phase 2: 1-2 weeks
- Phase 3: 1 week
- Phase 4: 2-3 weeks
- Phase 5: 1-2 weeks
- Total: 6-10 weeks
Phase 1: Complete PixArt Branch (Current Phase)
1.1 PixArt Alpha/Sigma Feature Complete
- ○ Pipeline Config API (Phase 4; see the config sketch below)
- ✓ VAE mode selection (Full/Tiny)
- ○ Adapter weight per-request
- ○ Godot settings UI
- ○ Merge to main ai-engine branch
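A possible shape for the per-request pipeline configuration covering VAE mode selection and per-request adapter weight. This is a minimal sketch; all type and field names are assumptions, not the existing ai-engine API.

```rust
/// Hypothetical per-request pipeline configuration; names are assumptions.
#[derive(Clone, Debug)]
pub enum VaeMode {
    Full,
    Tiny,
}

#[derive(Clone, Debug)]
pub struct PipelineConfig {
    /// Which decoder to run for this request (Full VAE vs. Tiny VAE).
    pub vae_mode: VaeMode,
    /// Adapter (LoRA) weight applied for this request only, 0.0..=1.0.
    pub adapter_weight: f32,
    /// Number of denoising steps.
    pub steps: usize,
    /// Classifier-free guidance scale.
    pub guidance_scale: f32,
}

impl Default for PipelineConfig {
    fn default() -> Self {
        Self {
            vae_mode: VaeMode::Tiny,
            adapter_weight: 1.0,
            steps: 20,
            guidance_scale: 4.5,
        }
    }
}
```

The Godot settings UI would then only need to serialize this struct per generation request rather than mutating global pipeline state.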
Phase 2: SANA Base Implementation (branch: feature/sana-model)
2.1 SANA DiT in Candle/Rust
- • Linear attention (O(N) vs O(N²); see the sketch below)
- • SanaLinearTransformerBlock
- • MultiHeadCrossAttention
- • GLUMBConv feedforward
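A minimal Candle sketch of ReLU linear attention, the core idea behind the SanaLinearTransformerBlock. This is deliberately simplified (no projections, no GLUMBConv, no cross-attention); the function name and the epsilon are assumptions.

```rust
use candle_core::{Result, Tensor};

/// ReLU linear attention: O(N·d²) in sequence length N instead of the
/// O(N²·d) of softmax attention. q, k, v: (batch, heads, seq, head_dim).
fn relu_linear_attention(q: &Tensor, k: &Tensor, v: &Tensor) -> Result<Tensor> {
    let q = q.relu()?; // feature map φ(Q) = ReLU(Q)
    let k = k.relu()?; // feature map φ(K) = ReLU(K)
    // Associate (K^T V) first: it is (head_dim × head_dim), so the cost of
    // the subsequent matmul with Q grows linearly in the sequence length.
    let kv = k.transpose(2, 3)?.matmul(v)?;         // (b, h, d, d)
    let num = q.matmul(&kv)?;                       // (b, h, n, d)
    // Normalizer: φ(Q) · Σ_n φ(K)_n, broadcast over the feature dimension.
    let k_sum = k.sum_keepdim(2)?;                  // (b, h, 1, d)
    let denom = q.matmul(&k_sum.transpose(2, 3)?)?; // (b, h, n, 1)
    num.broadcast_div(&denom.affine(1.0, 1e-6)?)    // avoid divide-by-zero
}
```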
2.2 DC-AE Decoder
- • 32× spatial compression (sizing example below)
- • Port from SANA reference
- • NHWC Metal optimization
~600M params → ~1.2 GB
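To make the 32× figure concrete, a quick latent-geometry check. The latent channel count here is an assumption based on the f32c32 DC-AE variant; treat it as illustrative.

```rust
/// DC-AE latent geometry for a square image at 32x spatial compression.
/// Channel count (32) is an assumption based on the f32c32 DC-AE variant.
fn dcae_latent_shape(image_px: usize) -> (usize, usize, usize) {
    let latent_hw = image_px / 32;  // e.g. 1024 -> 32
    (32, latent_hw, latent_hw)      // (channels, height, width)
}

fn main() {
    let (c, h, w) = dcae_latent_shape(1024);
    // 32x32 = 1,024 latent positions, vs 128x128 = 16,384 for an 8x VAE:
    // the DiT sees far fewer tokens at the same output resolution.
    println!("latent: {c}x{h}x{w} = {} tokens", h * w);
}
```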
2.3 Gemma-2-2B Text Encoder
- • Full BF16 initially (~4.7GB)
- • SANA pipeline integration
- • HuggingFace support (weight-loading sketch below)
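A hedged sketch of pulling the Gemma-2-2B checkpoint from the Hugging Face Hub and memory-mapping it for Candle in BF16. It assumes the hf-hub and anyhow crates; the repo id and shard filenames are assumptions (the real checkpoint may be sharded differently or gated).

```rust
use candle_core::{DType, Device};
use candle_nn::VarBuilder;

/// Download (or reuse cached) Gemma-2-2B safetensors and mmap them for Candle.
/// Repo id and shard filenames are assumptions, not verified against the Hub.
fn load_gemma_weights(device: &Device) -> anyhow::Result<VarBuilder<'static>> {
    let api = hf_hub::api::sync::Api::new()?;
    let repo = api.model("google/gemma-2-2b".to_string());
    let shards = vec![
        repo.get("model-00001-of-00002.safetensors")?,
        repo.get("model-00002-of-00002.safetensors")?,
    ];
    // mmap keeps the weights off the Rust heap; BF16 matches the Phase 2 plan.
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&shards, DType::BF16, device) }?;
    Ok(vb)
}
```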
2.4 SANA Full Pipeline
- • End-to-end inference
- • Parity tests vs Python (check sketch below)
- • Basic generation verified
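One way to run the parity tests: dump intermediate activations from the Python reference into a safetensors file and compare them against the Candle forward pass. A minimal sketch; the fixture layout, key naming, and tolerance are assumptions.

```rust
use candle_core::{DType, Device, Result, Tensor};

/// Compare a Candle output tensor against a reference dumped from Python.
/// Returns true when the MSE is below `tol`.
fn check_parity(rust_out: &Tensor, fixture_path: &str, key: &str, tol: f64) -> Result<bool> {
    let fixtures = candle_core::safetensors::load(fixture_path, &Device::Cpu)?;
    let reference = fixtures
        .get(key)
        .unwrap_or_else(|| panic!("missing reference tensor {key}"))
        .to_dtype(DType::F32)?;
    let ours = rust_out.to_device(&Device::Cpu)?.to_dtype(DType::F32)?;
    let mse = ours.sub(&reference)?.sqr()?.mean_all()?.to_scalar::<f32>()? as f64;
    Ok(mse < tol)
}
```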
Phase 3: Benchmarking & Comparison (Decision Point)
Three Competing Models
- PixArt-Alpha + DistillT5 + Tiny VAE
- PixArt-Sigma + DistillT5 + Tiny VAE
- SANA-0.6B/1.6B + Gemma-2-2B + DC-AE
3.1 Benchmark Suite
- • ms/step on M3 Max (timing harness sketch below)
- • ms/step on iPad M4 16GB
- • Memory footprint
- • Quality metrics (FID, CLIP)
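For the ms/step numbers, a small timing harness is enough; the sketch below is generic over a step closure and is an assumption about how the benchmark will be wired, not existing code.

```rust
use std::time::Instant;

/// Time a single-step closure over `steps` iterations and report the mean ms.
/// On Metal, the closure should synchronize (e.g. copy its output back to
/// CPU) so completed work is measured, not just command-buffer enqueues.
fn ms_per_step<F: FnMut() -> candle_core::Result<()>>(
    mut step: F,
    warmup: usize,
    steps: usize,
) -> candle_core::Result<f64> {
    for _ in 0..warmup {
        step()?; // let shader compilation and caches settle
    }
    let start = Instant::now();
    for _ in 0..steps {
        step()?;
    }
    Ok(start.elapsed().as_secs_f64() * 1000.0 / steps as f64)
}
```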
3.2 LoRA Training Comparison
- • AIBG-style LoRA on PixArt-Sigma
- • AIBG-style LoRA on SANA-0.6B
- • Compare quality & speed
Decision point: SANA or PixArt?
Phase 4: SANA Miniaturization (branch: feature/sana-optimized)
Parallel Workstreams
4A. DistillGemma (longest pole: 1-2 weeks)
- • Gemma-2-2B → 270M via the DistillT5 approach
- • ~540 MB (BF16) → ~135 MB (W4A16)
4B. SVDQuant DiT (2-3 days)
- • W4A4 SANA DiT, ported from Nunchaku
- • ~370 MB (0.6B) / ~990 MB (1.6B)
4C. Hybrid VAE
- • DC-AE strategy: keep full quality
- • ~1.2 GB
4.1 SANA-Optimized Integration
- • DistillGemma + SVDQuant DiT + DC-AE
- • Target: ~2GB total memory (sizing check below)
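A back-of-the-envelope check that the Phase 4 pieces fit the ~2GB budget. This counts weight bytes only; the per-component figures above also include quantization scales and the SVDQuant low-rank branch, so treat it as a sanity check, not an exact accounting.

```rust
/// Rough weight-only sizing: params (millions) * bits-per-weight -> MB.
fn weight_mb(params_m: f64, bits_per_weight: f64) -> f64 {
    params_m * 1e6 * bits_per_weight / 8.0 / 1e6
}

fn main() {
    let dit = weight_mb(600.0, 4.0);   // SANA-0.6B DiT at W4 -> ~300 MB
    let text = weight_mb(270.0, 4.0);  // DistillGemma at W4  -> ~135 MB
    let dcae = weight_mb(600.0, 16.0); // DC-AE in BF16       -> ~1200 MB
    let runtime = 300.0;               // activations and buffers (estimate)
    println!("total ≈ {:.0} MB", dit + text + dcae + runtime); // ≈ 1935 MB
}
```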
Phase 5: SANA DMD & Production LoRA (Final Phase)
5.1 SANA LoRA Training
- • Train AIBG LoRA on SANA
- • Verify LoRA + SVDQuant merge
- • SVDQuant low-rank branch absorbs the LoRA (merge sketch below)
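Why the low-rank branch can absorb a LoRA without re-quantizing: SVDQuant stores each weight as W ≈ Q + L1·L2 (a 4-bit residual Q plus a high-precision low-rank pair), and a LoRA update B·A can simply be stacked onto that pair, raising its rank. A hedged sketch; the function and shape conventions are assumptions.

```rust
use candle_core::{Result, Tensor, D};

/// Absorb a LoRA update into SVDQuant's low-rank branch:
///   W ≈ Q + L1·L2                 (SVDQuant decomposition)
///   W' = W + B·A                  (LoRA update)
///      ≈ Q + [L1 | B]·[L2 ; A]    (stack factors, rank r -> r + r_lora)
/// l1: (out, r), l2: (r, in); lora_b: (out, r_lora), lora_a: (r_lora, in).
fn absorb_lora(
    l1: &Tensor,
    l2: &Tensor,
    lora_b: &Tensor,
    lora_a: &Tensor,
) -> Result<(Tensor, Tensor)> {
    let l1_new = Tensor::cat(&[l1, lora_b], D::Minus1)?; // (out, r + r_lora)
    let l2_new = Tensor::cat(&[l2, lora_a], 0)?;         // (r + r_lora, in)
    Ok((l1_new, l2_new))
}
```

The 4-bit residual Q never changes, which is what makes swapping or merging a trained LoRA cheap at load time.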
5.2 DMD Distillation
- • Distill SANA + LoRA to 4-step
- • DMD2 approach
- • Single-step for preview
5.3 Production Pipeline
- • SVDQuant + DistillGemma + DMD
- • 4-step inference (sampling loop sketch below)
- • ~2GB, <500ms
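A minimal sketch of what the 4-step inference loop could look like for a DMD-distilled model: each step predicts the clean latent, then re-noises toward the next timestep. The trait, the timestep schedule, and the linear re-noising rule are all assumptions standing in for the real scheduler.

```rust
use candle_core::{Result, Tensor};

/// Hypothetical interface to the distilled SANA DiT (names are assumptions).
trait DistilledDit {
    /// Predict the fully denoised latent x0 from x_t at timestep t.
    fn predict_x0(&self, x_t: &Tensor, t: f64, text_emb: &Tensor) -> Result<Tensor>;
}

/// Few-step sampling in the spirit of DMD2: predict x0, then re-noise to the
/// next (lower) timestep; the final step returns the clean prediction.
fn sample<M: DistilledDit>(model: &M, noise: &Tensor, text_emb: &Tensor) -> Result<Tensor> {
    let timesteps = [1.0, 0.75, 0.5, 0.25];
    let mut x = noise.clone();
    for (i, &t) in timesteps.iter().enumerate() {
        let x0 = model.predict_x0(&x, t, text_emb)?;
        match timesteps.get(i + 1) {
            Some(&t_next) => {
                let fresh = x.randn_like(0.0, 1.0)?;
                // x_{t_next} = (1 - t_next) * x0 + t_next * fresh_noise
                x = (x0.affine(1.0 - t_next, 0.0)? + fresh.affine(t_next, 0.0)?)?;
            }
            None => x = x0,
        }
    }
    Ok(x)
}
```

Single-step preview would call `predict_x0` once at t = 1.0 and decode directly.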
Memory Budget Progression
Phase 2: Full SANA (iPad M4 16GB)
- DiT 1.6B (BF16): 3.3 GB
- Gemma-2-2B (BF16): 4.7 GB
- DC-AE (BF16): 1.2 GB
- Runtime: ~2 GB
- Total: ~11 GB
Phase 4: Optimized SANA (iPhone 15 Pro 8GB)
- DiT 0.6B (W4A4): ~370 MB
- DistillGemma (W4): ~135 MB
- DC-AE (BF16): ~1.2 GB
- Runtime: ~300 MB
- Total: ~2 GB
Key Risks & Mitigations
- ⚠ DistillGemma quality loss: benchmark before committing; fallback is W4A16 full Gemma (~1.2GB)
- ⚠ SANA LoRA quality vs PixArt: compare early in Phase 3; keep PixArt as fallback
- ⚠ SVDQuant Metal port complexity: start with BF16, add quantization later
- ⚠ DMD training instability: use the proven DMD2 recipe; 4-step is more stable
Production Target
- Memory: ~2GB
- Generation: <500ms
- Steps: 4-step