Target Device
- iPhone 15 Pro (8GB RAM)
- ~2GB total memory footprint
Timeline Overview
- Phase 1: 1-2 days
- Phase 2: 1-2 weeks
- Phase 3: 1 week
- Phase 4: 2-3 weeks
- Phase 5: 1-2 weeks
- Total: 6-10 weeks
Phase 1: Complete PixArt Branch (Current Phase)
1.1 PixArt Alpha/Sigma Feature Complete
- ○ Pipeline Config API (Phase 4; see the config sketch below)
- ✓ VAE mode selection (Full/Tiny)
- ○ Adapter weight per-request
- ○ Godot settings UI
- ○ Merge to main ai-engine branch
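A possible shape for the per-request pipeline configuration covering VAE mode selection and per-request adapter weight. This is a minimal sketch; all type and field names are assumptions, not the existing ai-engine API.

```rust
/// Hypothetical per-request pipeline configuration; names are assumptions.
#[derive(Clone, Debug)]
pub enum VaeMode {
    Full,
    Tiny,
}

#[derive(Clone, Debug)]
pub struct PipelineConfig {
    /// Which decoder to run for this request (Full VAE vs. Tiny VAE).
    pub vae_mode: VaeMode,
    /// Adapter (LoRA) weight applied for this request only, 0.0..=1.0.
    pub adapter_weight: f32,
    /// Number of denoising steps.
    pub steps: usize,
    /// Classifier-free guidance scale.
    pub guidance_scale: f32,
}

impl Default for PipelineConfig {
    fn default() -> Self {
        Self {
            vae_mode: VaeMode::Tiny,
            adapter_weight: 1.0,
            steps: 20,
            guidance_scale: 4.5,
        }
    }
}
```

The Godot settings UI would then only need to serialize this struct per generation request rather than mutating global pipeline state.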
Phase 2: SANA Base Implementation (branch: feature/sana-model)
2.1 SANA DiT in Candle/Rust
- • Linear attention (O(N) vs O(N²); see the sketch below)
- • SanaLinearTransformerBlock
- • MultiHeadCrossAttention
- • GLUMBConv feedforward
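A minimal Candle sketch of ReLU linear attention, the core idea behind the SanaLinearTransformerBlock. This is deliberately simplified (no projections, no GLUMBConv, no cross-attention); the function name and the epsilon are assumptions.

```rust
use candle_core::{Result, Tensor};

/// ReLU linear attention: O(N·d²) in sequence length N instead of the
/// O(N²·d) of softmax attention. q, k, v: (batch, heads, seq, head_dim).
fn relu_linear_attention(q: &Tensor, k: &Tensor, v: &Tensor) -> Result<Tensor> {
    let q = q.relu()?; // feature map φ(Q) = ReLU(Q)
    let k = k.relu()?; // feature map φ(K) = ReLU(K)
    // Associate (K^T V) first: it is (head_dim × head_dim), so the cost of
    // the subsequent matmul with Q grows linearly in the sequence length.
    let kv = k.transpose(2, 3)?.matmul(v)?;         // (b, h, d, d)
    let num = q.matmul(&kv)?;                       // (b, h, n, d)
    // Normalizer: φ(Q) · Σ_n φ(K)_n, broadcast over the feature dimension.
    let k_sum = k.sum_keepdim(2)?;                  // (b, h, 1, d)
    let denom = q.matmul(&k_sum.transpose(2, 3)?)?; // (b, h, n, 1)
    num.broadcast_div(&denom.affine(1.0, 1e-6)?)    // avoid divide-by-zero
}
```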
2.2 DC-AE Decoder
- • 32× spatial compression (sizing example below)
- • Port from SANA reference
- • NHWC Metal optimization
~600M params → ~1.2 GB
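To make the 32× figure concrete, a quick latent-geometry check. The latent channel count here is an assumption based on the f32c32 DC-AE variant; treat it as illustrative.

```rust
/// DC-AE latent geometry for a square image at 32x spatial compression.
/// Channel count (32) is an assumption based on the f32c32 DC-AE variant.
fn dcae_latent_shape(image_px: usize) -> (usize, usize, usize) {
    let latent_hw = image_px / 32;  // e.g. 1024 -> 32
    (32, latent_hw, latent_hw)      // (channels, height, width)
}

fn main() {
    let (c, h, w) = dcae_latent_shape(1024);
    // 32x32 = 1,024 latent positions, vs 128x128 = 16,384 for an 8x VAE:
    // the DiT sees far fewer tokens at the same output resolution.
    println!("latent: {c}x{h}x{w} = {} tokens", h * w);
}
```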
2.3 Gemma-2-2B Text Encoder
- • Full BF16 initially (~4.7GB)
- • SANA pipeline integration
- • HuggingFace support (weight-loading sketch below)
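A hedged sketch of pulling the Gemma-2-2B checkpoint from the Hugging Face Hub and memory-mapping it for Candle in BF16. It assumes the hf-hub and anyhow crates; the repo id and shard filenames are assumptions (the real checkpoint may be sharded differently or gated).

```rust
use candle_core::{DType, Device};
use candle_nn::VarBuilder;

/// Download (or reuse cached) Gemma-2-2B safetensors and mmap them for Candle.
/// Repo id and shard filenames are assumptions, not verified against the Hub.
fn load_gemma_weights(device: &Device) -> anyhow::Result<VarBuilder<'static>> {
    let api = hf_hub::api::sync::Api::new()?;
    let repo = api.model("google/gemma-2-2b".to_string());
    let shards = vec![
        repo.get("model-00001-of-00002.safetensors")?,
        repo.get("model-00002-of-00002.safetensors")?,
    ];
    // mmap keeps the weights off the Rust heap; BF16 matches the Phase 2 plan.
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&shards, DType::BF16, device) }?;
    Ok(vb)
}
```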
2.4 SANA Full Pipeline
- • End-to-end inference
- • Parity tests vs Python (check sketch below)
- • Basic generation verified
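One way to run the parity tests: dump intermediate activations from the Python reference into a safetensors file and compare them against the Candle forward pass. A minimal sketch; the fixture layout, key naming, and tolerance are assumptions.

```rust
use candle_core::{DType, Device, Result, Tensor};

/// Compare a Candle output tensor against a reference dumped from Python.
/// Returns true when the MSE is below `tol`.
fn check_parity(rust_out: &Tensor, fixture_path: &str, key: &str, tol: f64) -> Result<bool> {
    let fixtures = candle_core::safetensors::load(fixture_path, &Device::Cpu)?;
    let reference = fixtures
        .get(key)
        .unwrap_or_else(|| panic!("missing reference tensor {key}"))
        .to_dtype(DType::F32)?;
    let ours = rust_out.to_device(&Device::Cpu)?.to_dtype(DType::F32)?;
    let mse = ours.sub(&reference)?.sqr()?.mean_all()?.to_scalar::<f32>()? as f64;
    Ok(mse < tol)
}
```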
Phase 3: Benchmarking & Comparison (Decision Point)
Three Competing Models
- PixArt-Alpha + DistillT5 + Tiny VAE
- PixArt-Sigma + DistillT5 + Tiny VAE
- SANA-0.6B/1.6B + Gemma-2-2B + DC-AE
3.1 Benchmark Suite
- • ms/step on M3 Max (timing harness sketch below)
- • ms/step on iPad M4 16GB
- • Memory footprint
- • Quality metrics (FID, CLIP)
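For the ms/step numbers, a small timing harness is enough; the sketch below is generic over a step closure and is an assumption about how the benchmark will be wired, not existing code.

```rust
use std::time::Instant;

/// Time a single-step closure over `steps` iterations and report the mean ms.
/// On Metal, the closure should synchronize (e.g. copy its output back to
/// CPU) so completed work is measured, not just command-buffer enqueues.
fn ms_per_step<F: FnMut() -> candle_core::Result<()>>(
    mut step: F,
    warmup: usize,
    steps: usize,
) -> candle_core::Result<f64> {
    for _ in 0..warmup {
        step()?; // let shader compilation and caches settle
    }
    let start = Instant::now();
    for _ in 0..steps {
        step()?;
    }
    Ok(start.elapsed().as_secs_f64() * 1000.0 / steps as f64)
}
```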
3.2 LoRA Training Comparison
- • AIBG-style LoRA on PixArt-Sigma
- • AIBG-style LoRA on SANA-0.6B
- • Compare quality & speed
Decision point: SANA or PixArt?
Phase 4: SANA Miniaturization (branch: feature/sana-optimized)
Parallel Workstreams
4A. DistillGemma (longest pole: 1-2 weeks)
- • Gemma-2-2B → 270M via the DistillT5 approach
- • ~540 MB (BF16) → ~135 MB (W4A16)
4B. SVDQuant DiT (2-3 days)
- • W4A4 SANA DiT, ported from Nunchaku
- • ~370 MB (0.6B) / ~990 MB (1.6B)
4C. Hybrid VAE
- • DC-AE strategy: keep full quality
- • ~1.2 GB
4.1 SANA-Optimized Integration
- • DistillGemma + SVDQuant DiT + DC-AE
- • Target: ~2GB total memory (sizing check below)
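A back-of-the-envelope check that the Phase 4 pieces fit the ~2GB budget. This counts weight bytes only; the per-component figures above also include quantization scales and the SVDQuant low-rank branch, so treat it as a sanity check, not an exact accounting.

```rust
/// Rough weight-only sizing: params (millions) * bits-per-weight -> MB.
fn weight_mb(params_m: f64, bits_per_weight: f64) -> f64 {
    params_m * 1e6 * bits_per_weight / 8.0 / 1e6
}

fn main() {
    let dit = weight_mb(600.0, 4.0);   // SANA-0.6B DiT at W4 -> ~300 MB
    let text = weight_mb(270.0, 4.0);  // DistillGemma at W4  -> ~135 MB
    let dcae = weight_mb(600.0, 16.0); // DC-AE in BF16       -> ~1200 MB
    let runtime = 300.0;               // activations and buffers (estimate)
    println!("total ≈ {:.0} MB", dit + text + dcae + runtime); // ≈ 1935 MB
}
```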
Phase 5: SANA DMD & Production LoRA (Final Phase)
5.1 SANA LoRA Training
- • Train AIBG LoRA on SANA
- • Verify LoRA + SVDQuant merge
- • SVDQuant low-rank branch absorbs the LoRA (merge sketch below)
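Why the low-rank branch can absorb a LoRA without re-quantizing: SVDQuant stores each weight as W ≈ Q + L1·L2 (a 4-bit residual Q plus a high-precision low-rank pair), and a LoRA update B·A can simply be stacked onto that pair, raising its rank. A hedged sketch; the function and shape conventions are assumptions.

```rust
use candle_core::{Result, Tensor, D};

/// Absorb a LoRA update into SVDQuant's low-rank branch:
///   W ≈ Q + L1·L2                 (SVDQuant decomposition)
///   W' = W + B·A                  (LoRA update)
///      ≈ Q + [L1 | B]·[L2 ; A]    (stack factors, rank r -> r + r_lora)
/// l1: (out, r), l2: (r, in); lora_b: (out, r_lora), lora_a: (r_lora, in).
fn absorb_lora(
    l1: &Tensor,
    l2: &Tensor,
    lora_b: &Tensor,
    lora_a: &Tensor,
) -> Result<(Tensor, Tensor)> {
    let l1_new = Tensor::cat(&[l1, lora_b], D::Minus1)?; // (out, r + r_lora)
    let l2_new = Tensor::cat(&[l2, lora_a], 0)?;         // (r + r_lora, in)
    Ok((l1_new, l2_new))
}
```

The 4-bit residual Q never changes, which is what makes swapping or merging a trained LoRA cheap at load time.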
5.2 DMD Distillation
- • Distill SANA + LoRA to 4-step
- • DMD2 approach
- • Single-step for preview
5.3 Production Pipeline
- • SVDQuant + DistillGemma + DMD
- • 4-step inference (sampling loop sketch below)
- • ~2GB, <500ms
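A minimal sketch of what the 4-step inference loop could look like for a DMD-distilled model: each step predicts the clean latent, then re-noises toward the next timestep. The trait, the timestep schedule, and the linear re-noising rule are all assumptions standing in for the real scheduler.

```rust
use candle_core::{Result, Tensor};

/// Hypothetical interface to the distilled SANA DiT (names are assumptions).
trait DistilledDit {
    /// Predict the fully denoised latent x0 from x_t at timestep t.
    fn predict_x0(&self, x_t: &Tensor, t: f64, text_emb: &Tensor) -> Result<Tensor>;
}

/// Few-step sampling in the spirit of DMD2: predict x0, then re-noise to the
/// next (lower) timestep; the final step returns the clean prediction.
fn sample<M: DistilledDit>(model: &M, noise: &Tensor, text_emb: &Tensor) -> Result<Tensor> {
    let timesteps = [1.0, 0.75, 0.5, 0.25];
    let mut x = noise.clone();
    for (i, &t) in timesteps.iter().enumerate() {
        let x0 = model.predict_x0(&x, t, text_emb)?;
        match timesteps.get(i + 1) {
            Some(&t_next) => {
                let fresh = x.randn_like(0.0, 1.0)?;
                // x_{t_next} = (1 - t_next) * x0 + t_next * fresh_noise
                x = (x0.affine(1.0 - t_next, 0.0)? + fresh.affine(t_next, 0.0)?)?;
            }
            None => x = x0,
        }
    }
    Ok(x)
}
```

Single-step preview would call `predict_x0` once at t = 1.0 and decode directly.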
Memory Budget Progression
Phase 2: Full SANA (iPad M4 16GB)
- DiT 1.6B (BF16): 3.3 GB
- Gemma-2-2B (BF16): 4.7 GB
- DC-AE (BF16): 1.2 GB
- Runtime: ~2 GB
- Total: ~11 GB
Phase 4: Optimized SANA (iPhone 15 Pro 8GB)
- DiT 0.6B (W4A4): ~370 MB
- DistillGemma (W4): ~135 MB
- DC-AE (BF16): ~1.2 GB
- Runtime: ~300 MB
- Total: ~2 GB
Key Risks & Mitigations
- ⚠ DistillGemma quality loss: benchmark before committing; fallback is W4A16 full Gemma (~1.2GB)
- ⚠ SANA LoRA quality vs PixArt: compare early in Phase 3; keep PixArt as fallback
- ⚠ SVDQuant Metal port complexity: start with BF16, add quantization later
- ⚠ DMD training instability: use the proven DMD2 recipe; 4-step is more stable
Production Target
- Memory: ~2GB
- Generation: <500ms
- Steps: 4-step