Players make their own creatures: the cards, the decks, and the creature images themselves. This brief covers the image engine behind that: what we are building, why we chose what we chose, and how far we are from shipping it.
A creature, in our style, made on the phone in 4 to 6 seconds.
Concretely: a style-tuned image model, compressed under 2.5 GB of memory, distilled from 20 generation steps down to about 1. This combination does not exist in shipped products today. The teams working on this class of problem are at the scale of Tencent and TikTok.
Done: the model runs on-device, the compression pipeline is proven, the training dataset is cached, and the engine has a tested slot waiting for the fast model. In progress: the distillation run itself, live now on an 8x H100 machine, 3 to 4 days, results expected by end of week. Remaining after that: quality check at 1, 2, and 4 steps, the style survival test, re-applying compression to the new model, and a small game-integration PR.
Total memory while generating. iOS terminates apps that exceed their memory limit.
The core model normally runs 20 times per image. That count is the speed ceiling.
Our creatures, our art style. The style stays swappable as we evolve it.
Describe a creature, get a high-quality on-style image in 4 to 6 seconds, fully on-device. This is the entire current focus.
Same creature, new pose or expression. Built later from synthetic edit pairs our video model generates for us.
Turn a real photo into a creature in our style. Furthest out, likely post-launch.
Stretch goals only get engine work now if that work is also useful today. The video model is the example: it is already merged into the engine, it will produce the edit-pair dataset later, and it can make creature video ads on our own hardware now.
2.52 GB measured on the current base model. The distilled model comes back uncompressed, so its first pass on device will peak around 2.8 to 3 GB, likely above 3 at first. Initial test hardware is recent iPhones, iPads, and Macs. The compression pipeline that brought the base model down to 2.52 GB already exists and gets re-applied to the new model, with under 2.5 GB as the target.
Engine ready, dataset cached, run in progress. The full training dataset (prompts, text embeddings at three precisions, cached teacher outputs, about 1 TB) is done. The distillation run is live now: 3 to 4 days on an 8x H100 machine, $3k to $4k in credits. Results expected by end of week.
Style works on the base model. Our best LoRA results to date, running on the compressed engine. The De-Turbo adapter (the mechanism that should let styles be trained onto the distilled model) is training inside the live run right now. Whether it worked will be known when the run finishes.
Our rule: we do not do research. We need a model that fits the phone and has proven, working code for everything we do to it (fine-tune, distill, quantize). If any of those steps lacks working code, that step becomes a research project with unknown cost and timeline.
| Model | 🪶 Memory | ⚡ Compute | 🎨 Style | Verdict |
|---|---|---|---|---|
| UNet models (SD 1.5 / SD 2) | ❌ breaks when quantized | ❌ old architecture, slow on iPhone | ✅ | Eliminated. Pre-transformer architecture with poor mobile performance. |
| PixArt α / Σ (our first bet) | ✅ quantizes well | ❌ no good distillation code exists | ✅ fine-tuned well | The only available distillation was low quality, broke fine-tuning afterward, and required about a million synthetic images with the style baked in up front. Abandoned. |
| Anima (built on NVIDIA Cosmos-Predict2) | ✅ 4-bit measured near-lossless | ✅ NVIDIA's rCM, validated on this exact model | ✅ our best LoRAs yet | Selected. Working, validated code exists for all three transformations. |
Anima is an animation-style tune of NVIDIA's Cosmos-Predict2. NVIDIA's own distillation method (rCM) was validated on Cosmos-Predict2 in their paper, outperforming the previous best method. It needs only text prompts as input data, no million-image dataset. Its De-Turbo mechanism allows fine-tuning styles after distillation. That changes the cost structure: instead of $2,000 to $5,000 per style change, we distill once and swap LoRA styles on top at low cost.
Makes the model.
Runs the model.
Makes it fast. Our fork of the tensor library.
The contract between them: whatever training produces, the engine must reproduce. Every handoff is guarded by a parity test that verifies training outputs and on-device outputs match. Downstream of all three sits AI Battleground, the game itself, which consumes the engine and gets a small integration PR (covered in the plan below).
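A parity test of the kind described above can be sketched in a few lines. This is an illustration only, not the project's actual test: the function names, the choice of cosine similarity plus maximum absolute difference as metrics, and both thresholds are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def check_parity(reference, candidate, min_cosine=0.9999, max_diff=1e-2):
    """Compare a training-side output with an engine-side output.

    Both thresholds are hypothetical placeholders, not the project's values.
    Returns (passed, metrics).
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    cos = cosine_similarity(reference, candidate)
    diff = max_abs_diff(reference, candidate)
    passed = cos >= min_cosine and diff <= max_diff
    return passed, {"cosine": cos, "max_abs_diff": diff}


# Usage: flattened tensors from both sides of a handoff.
reference = [0.1, -0.5, 0.25, 1.0]
candidate = [v + 1e-4 for v in reference]
passed, metrics = check_parity(reference, candidate)
```

Using two metrics together is a deliberate choice in this sketch: cosine similarity catches direction drift across the whole tensor, while the maximum difference catches a single badly wrong element that an average would hide.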
The full pipeline is live in the engine and matches the reference implementation.
The transformer compressed to 4-bit with no measurable quality loss, the text encoder to 8-bit, plus a round of GPU kernel work. Peak memory fell from 5.8 GB to 2.52 GB on the base model. A single generation step runs in about a second on a Mac.
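The effect of bit width on weight memory is simple arithmetic, sketched here. The parameter count in the example is a round illustrative number, not Anima's actual size, and the estimate covers weights only: activations, the text encoder, and the per-block scales that quantized formats store all add to the measured peak.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Illustrative only: a hypothetical 1-billion-parameter network.
N_PARAMS = 1_000_000_000
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gigabytes(N_PARAMS, bits):.2f} GB")
```

Halving the bit width halves the weight footprint, which is why moving the largest component to 4-bit accounts for most of a reduction of this kind.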
The 1, 2, and 4-step sampler, the loader, a one-command CLI, and a demo toggle are built and verified against the training code. Any checkpoint from the run, including partial mid-run ones, can be loaded and tested in the engine immediately.
100k to 250k prompts, their text embeddings precomputed at three precisions (F16, BF16, Q8_0), and full 20-step teacher outputs cached for every prompt. About 1 TB of prepared data. Re-generating it if the teacher ever changes costs $2k to $3k in compute.
Live now. 3 to 4 days on an 8x H100 machine, $3k to $4k in credits. Results expected by end of week. The De-Turbo adapter (the style-recovery mechanism) is training inside the same run. Intermediate checkpoints get pulled and tested in the engine as the run progresses.
The game already does on-device image generation with the older PixArt Sigma model at 20 steps, so the integration surface exists. Needed: the latest engine version, one small configuration PR, and the per-platform binary separation work. Being prepped now, during the run. It can land with the non-distilled model first and swap to the distilled one when ready.
When the run finishes: measure quality at 1, 2, and 4 steps, then test whether a style LoRA trains cleanly through De-Turbo. If the run is far enough along, the LoRA test can start ahead of schedule on a mid-run checkpoint.
The student comes back uncompressed, so first-pass peak is around 2.8 to 3 GB, likely above 3 initially. The existing 4-bit pipeline gets re-applied to bring it back down toward the 2.5 GB target. The further push to 2.0 GB (which would cover the 4 GB iPhone 11/12/13) is a parallel project.
1 · Style versus distillation. Aggressive distillation can narrow a model's output distribution enough that style LoRAs stop working on it. Earlier distillation methods we tried (LADD) failed exactly this way. rCM preserves output diversity better than alternatives, and De-Turbo exists specifically to address this, but neither is verified on our model until the run finishes.
2 · One-step quality. The paper demonstrates quality at 4 steps. One step is the stretch target. The engine supports 1, 2, and 4 selectable, so we ship whichever step count meets the quality bar.
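The step-count decision can be written down as a small selection rule. This sketch assumes that fewer steps is always preferred when quality allows, and that quality is summarized as one score per step count where higher is better; the metric, the scores, and the bar in the example are all hypothetical.

```python
def pick_step_count(scores, quality_bar, candidates=(1, 2, 4)):
    """Return the smallest step count whose score meets the bar.

    `scores` maps step count -> quality score (higher is better).
    Returns None when no candidate passes.
    """
    for steps in sorted(candidates):
        score = scores.get(steps)
        if score is not None and score >= quality_bar:
            return steps
    return None


# Hypothetical scores, for illustration only.
example_scores = {1: 0.61, 2: 0.82, 4: 0.90}
chosen = pick_step_count(example_scores, quality_bar=0.80)  # chosen == 2
```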
The training stack vendors NVIDIA's rCM repository (pinned at a known revision) with an Anima-specific network, config, and conditioner on top. rCM is a consistency-distillation method (arXiv 2510.08431): a continuous-time consistency objective in the TrigFlow parameterization, which preserves fine detail and output diversity, plus a DMD-style distribution-matching term for sharpness. One trained student supports 1, 2, and 4 step generation; the step count is chosen at sampling time, not trained separately.
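To make "the step count is chosen at sampling time" concrete, here is a minimal sketch of generic multi-step consistency sampling in the TrigFlow form, where a noisy sample is written as x_t = cos(t)·x_0 + sin(t)·z with t between 0 and π/2. It is not rCM's sampler: the `student` callable, the intermediate timesteps, and the unit data scale are assumptions made for illustration.

```python
import math
import random

# Hypothetical timestep schedules (radians, decreasing from pure noise at pi/2).
SCHEDULES = {
    1: [math.pi / 2],
    2: [math.pi / 2, 1.1],
    4: [math.pi / 2, 1.3, 1.0, 0.6],
}


def sample(student, dim, steps, rng):
    """Run one student for 1, 2, or 4 steps.

    `student(x, t)` is assumed to return a clean-sample estimate x_0
    given a noisy input x at time t. Data scale is assumed to be 1.
    """
    timesteps = SCHEDULES[steps]
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # pure noise at t = pi/2
    x0 = None
    for i, t in enumerate(timesteps):
        if i > 0:
            # Re-noise the previous estimate to the next, lower noise level.
            noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x = [math.cos(t) * c + math.sin(t) * n for c, n in zip(x0, noise)]
        x0 = student(x, t)
    return x0


# Usage with a stand-in student that ignores its input.
def toy_student(x, t):
    return [0.0 for _ in x]


image = sample(toy_student, dim=4, steps=2, rng=random.Random(0))
```

The same weights serve every schedule; only the list of timesteps changes, which is what lets one checkpoint be evaluated at 1, 2, and 4 steps.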
Almost everything in the run is precomputed once and frozen:
The standard recipe caches embeddings at full precision. We cached them at F16, BF16, and Q8_0. The Q8_0 set matters most: the student is trained directly on the quantized encoder's outputs, so its quantization error is part of what the student learns. The moment distillation finishes, the 8-bit text encoder is already supported, no second training pass needed. This is also why the engine had to prove it reproduces that encoder exactly (verified to four nines on Mac, NVIDIA, and CPU).
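For readers unfamiliar with Q8_0, the sketch below shows what the format does to a run of values: each block of 32 shares one scale and stores 32 signed 8-bit integers. It illustrates the rounding error the student ends up learning around, not the encoder pipeline itself. The function names are made up for the example, and the scale is kept as a Python float here, whereas the real format stores it as a 16-bit float.

```python
BLOCK = 32  # Q8_0 groups values into blocks of 32 that share one scale


def quantize_q8_0(values):
    """Quantize a flat list into (scale, int8 values) blocks."""
    if len(values) % BLOCK != 0:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0  # largest magnitude maps to +/-127
        if scale == 0.0:
            quants = [0] * BLOCK
        else:
            quants = [round(v / scale) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Rebuild approximate floats from the quantized blocks."""
    return [scale * q for scale, quants in blocks for q in quants]


# Usage: round-trip 64 values and measure the worst-case error.
original = [((i * 37) % 101 - 50) / 10.0 for i in range(64)]
restored = dequantize_q8_0(quantize_q8_0(original))
worst_error = max(abs(a - b) for a, b in zip(original, restored))
```

Each restored value is off by at most half a scale step for its block. Training on cached outputs of the quantized encoder means the student sees exactly this kind of perturbation from the start.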
8x H100 machines are scarce. Scheduling one is a real constraint, which is part of why the pipeline was smoke-tested exhaustively before committing: a failed run burns both the credits and the reservation.
Distillation compresses the model's output distribution, and a heavily compressed model is hostile to fine-tuning: training a style LoRA directly on it degrades into mode collapse. The alternative, fine-tuning the base model and hoping the LoRA survives distillation, has historically not worked for us.
The community workaround, which rCM builds in as De-Turbo:
The reason this can work: the style LoRA and the distillation modify different weights at different magnitudes, so they are close to orthogonal. Training against the de-distilled model keeps the LoRA in a space the fast model still responds to.
The distillation only works if every component behaves identically in training (Python) and in deployment (candle). Each was verified individually before spending on the run:
Many of these details have failed silently in past efforts. The validation exists to give the paid run the best possible chance of working the first time.
Dataset cached. Engine ready. The distillation is running now.
Results by end of week. Checkpoints get tested in the engine as the run progresses, the game-integration PR is being prepped in parallel, and the style question gets answered by the same run via De-Turbo. After that: re-compress, quality gate, ship.