Atelico.Studio
Engineering Brief · Anima On-Device · June 2026

Creatures, generated on the phone.

Players make their own creatures: the cards, the decks, and the creature images themselves. This brief covers the image engine behind that: what we are building, why we chose what we chose, and how far we are from shipping it.

The goal

A creature, in our style, made on the phone in 4 to 6 seconds.

Concretely: a style-tuned image model, compressed under 2.5 GB of memory, distilled from 20 generation steps down to about 1. This combination does not exist in shipped products today. The teams working on this class of problem are at the scale of Tencent and TikTok.

Progress toward Quick Create on device: ~75%

Done: the model runs on-device, the compression pipeline is proven, the training dataset is cached, and the engine has a tested slot waiting for the fast model. In progress: the distillation run itself, live now on an 8x H100 machine, 3 to 4 days, results expected by end of week. Remaining after that: quality check at 1, 2, and 4 steps, the style survival test, re-applying compression to the new model, and a small game-integration PR.

Three pillars

Every decision is evaluated against these three targets.

🪶

Low Memory

2 to 2.5 GB

Total memory while generating. iOS terminates apps that exceed their memory limit.

How: quantization

⚡

Low Compute

20 steps → 1

The core model normally runs 20 times per image. That count sets the floor on generation time.

How: distillation

🎨

Quality & Style

Our look

Our creatures, our art style. The style stays swappable as we evolve it.

How: LoRA fine-tuning

The product

Creature creation is the core of the game.

1 · Quick Create (Now)

Describe a creature, get a high-quality on-style image in 4 to 6 seconds, fully on-device. This is the entire current focus.

2 · Edit the creature (Stretch)

Same creature, new pose or expression. Built later from synthetic edit pairs our video model generates for us.

3 · Photo to creature (Stretch)

Turn a real photo into a creature in our style. Furthest out, likely post-launch.

The dual-purpose rule

Stretch goals only get engine work now if that work is also useful today. The video model is the example: it is already merged into the engine, it will produce the edit-pair dataset later, and it can make creature video ads on our own hardware now.

How close are we

Status by pillar.

🪶 Low Memory ~85%

2.52 GB measured on the current base model. The distilled model comes back uncompressed, so its first pass on device will peak around 2.8 to 3 GB, and likely above 3 GB at first. For that reason, initial test hardware is recent iPhones, iPads, and Macs. The compression pipeline that gets back under 2.5 GB already exists and will be re-applied to the new model.

⚡ Low Compute ~75%

Engine ready, dataset cached, run in progress. The full training dataset (prompts, text embeddings at three precisions, cached teacher outputs, about 1 TB) is done. The distillation run is live now: 3 to 4 days on an 8x H100 machine, $3k to $4k in credits. Results expected by end of week.

🎨 Quality & Style ~50%

Style works on the base model: these are our best LoRA results to date, and they run on the compressed engine. The De-Turbo adapter (the mechanism that should let styles be trained onto the distilled model) is training inside the live run right now. Whether it worked will be known when the run finishes.

Pipeline

What actually runs.

1 · Text encoder: Qwen3, 0.6B parameters. Reads the prompt once. Already compressed to 8-bit and shipped.
2 · Image transformer, ×N: the hot loop. Cosmos-Predict2, 2B parameters. Runs N times per image. N is what distillation shrinks: 20 down to 1.
3 · Decoder (VAE): turns the result into pixels. Runs once, kept full precision.
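
The call structure of the three stages can be sketched in Python. The stage functions here are hypothetical stand-ins, not the engine's API, and the arithmetic inside them is placeholder. The only point is that the transformer is the one stage whose call count scales with N.

```python
from collections import Counter

calls = Counter()  # how many times each stage ran

def encode_text(prompt):
    # Stage 1: hypothetical stand-in for the Qwen3 text encoder.
    calls["text_encoder"] += 1
    return [float(len(prompt))]

def transformer_step(latent, embedding, step):
    # Stage 2: hypothetical stand-in for one Cosmos-Predict2 pass.
    calls["transformer"] += 1
    return [x * 0.5 + embedding[0] * 0.01 for x in latent]

def decode(latent):
    # Stage 3: hypothetical stand-in for the VAE decoder.
    calls["decoder"] += 1
    return latent

def generate(prompt, n_steps):
    embedding = encode_text(prompt)      # runs once per image
    latent = [1.0, -1.0, 0.5]            # placeholder starting noise
    for step in range(n_steps):          # the hot loop: N passes
        latent = transformer_step(latent, embedding, step)
    return decode(latent)                # runs once per image

def stage_calls(n_steps):
    """Run one generation and report how often each stage was called."""
    calls.clear()
    generate("a moss-covered fox", n_steps)
    return dict(calls)

print(stage_calls(20))  # {'text_encoder': 1, 'transformer': 20, 'decoder': 1}
print(stage_calls(1))   # {'text_encoder': 1, 'transformer': 1, 'decoder': 1}
```

Going from 20 steps to 1 removes 19 of the 20 transformer passes and leaves the encoder and decoder cost unchanged.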

Decision record

How we picked the model.

Our rule: we do not do research. We need a model that fits the phone and has proven, working code for everything we do to it (fine-tune, distill, quantize). If any of those steps lacks working code, that step becomes a research project with unknown cost and timeline.

UNet models (SD 1.5 / SD 2)

  • 🪶 Memory: breaks when quantized
  • ⚡ Compute: old architecture, slow on iPhone
  • Verdict: Eliminated. Pre-transformer architecture with poor mobile performance.

PixArt α / Σ (our first bet)

  • 🪶 Memory: quantizes well
  • ⚡ Compute: no good distillation code exists
  • 🎨 Style: fine-tuned well
  • Verdict: Abandoned. The only available distillation was low quality, broke fine-tuning afterward, and required about a million synthetic images with the style baked in up front.

Anima (built on NVIDIA Cosmos-Predict2)

  • 🪶 Memory: 4-bit measured near-lossless
  • ⚡ Compute: NVIDIA's rCM, validated on this exact model
  • 🎨 Style: our best LoRAs yet
  • Verdict: Selected. Working, validated code exists for all three transformations.

Why Anima

Anima is an animation-style tune of NVIDIA's Cosmos-Predict2. NVIDIA's own distillation method (rCM) was validated on Cosmos-Predict2 in their paper, where it outperformed the previous best method. rCM needs only text prompts as input data, not a million-image dataset, and its De-Turbo mechanism allows fine-tuning styles after distillation. That changes the cost structure: instead of paying $2,000 to $5,000 per style change, we distill once and swap LoRA styles on top at low cost.

Coordination

Where the work lives.

vertex-ai-training

Makes the model.

  • Style LoRA fine-tuning
  • Teacher dataset generation
  • The rCM distillation run
  • Reference outputs the engine must match

ai-engine

Runs the model.

  • Loading, compressed checkpoints
  • The samplers (20-step and few-step)
  • Parity tests against training
  • CLI, demos, game-engine surfaces

candle

Makes it fast. Our fork of the tensor library.

  • Custom 4-bit and 8-bit GPU kernels
  • Fused operations
  • Per-weight precision control

The contract between them: whatever training produces, the engine must reproduce. Every handoff is guarded by a parity test that verifies training outputs and on-device outputs match. Downstream of all three sits AI Battleground, the game itself, which consumes the engine and gets a small integration PR (covered in the plan below).

The plan

Stages, with current status.

✅ Shipped

Anima runs on-device

The full pipeline is live in the engine and matches the reference implementation.

✅ Shipped

Compressed and made fast 🪶 Memory

The transformer compressed to 4-bit with no measurable quality loss, the text encoder to 8-bit, plus a round of GPU kernel work. Peak memory fell from 5.8 GB to 2.52 GB on the base model. A single generation step runs in about a second on a Mac.

✅ Ready

Engine slot for the fast model ⚡ Compute

The 1, 2, and 4-step sampler, the loader, a one-command CLI, and a demo toggle are built and verified against the training code. Any checkpoint from the run, including partial mid-run ones, can be loaded and tested in the engine immediately.

✅ Done

Training dataset cached ⚡ Compute

100k to 250k prompts, their text embeddings precomputed at three precisions (F16, BF16, Q8_0), and full 20-step teacher outputs cached for every prompt. About 1 TB of prepared data. Re-generating it if the teacher ever changes costs $2k to $3k in compute.

🔄 Running

The distillation run ⚡ Compute 🎨 Style

Live now. 3 to 4 days on an 8x H100 machine, $3k to $4k in credits. Results expected by end of week. The De-Turbo adapter (the style-recovery mechanism) is training inside the same run. Intermediate checkpoints get pulled and tested in the engine as the run progresses.

🔄 Prepping

Game integration (AI Battleground)

The game already does on-device image generation with the older PixArt Sigma model at 20 steps, so the integration surface exists. Needed: the latest engine version, one small configuration PR, and the per-platform binary separation work. Being prepped now, during the run. It can land with the non-distilled model first and swap to the distilled one when ready.

⬜ Next

Quality gate and style test 🎨 Style

When the run finishes: measure quality at 1, 2, and 4 steps, then test whether a style LoRA trains cleanly through De-Turbo. If the run is far enough along, the LoRA test can start ahead of schedule on a mid-run checkpoint.

⬜ Next

Re-compress the distilled model 🪶 Memory

The student comes back uncompressed, so first-pass peak is around 2.8 to 3 GB, likely above 3 initially. The existing 4-bit pipeline gets re-applied to bring it back under 2.5 GB. The further push to 2.0 GB (covers 4 GB iPhone 11/12/13) is a parallel project.

Risks

The two main risks

1 · Style versus distillation. Aggressive distillation can narrow a model's output distribution enough that style LoRAs stop working on it. Earlier distillation methods we tried (LADD) failed exactly this way. rCM preserves output diversity better than alternatives, and De-Turbo exists specifically to address this, but neither is verified on our model until the run finishes.

2 · One-step quality. The paper demonstrates quality at 4 steps. One step is the stretch target. The engine supports 1, 2, and 4 selectable, so we ship whichever step count meets the quality bar.

Deep dive

Technical detail, by pillar.

⚡ Compute   What an rCM distillation run consists of

The recipe

The training stack vendors NVIDIA's rCM repository (pinned at a known revision) with an Anima-specific network, config, and conditioner on top. rCM is a consistency-distillation method (arXiv 2510.08431): a continuous-time consistency objective in the TrigFlow parameterization, which preserves fine detail and output diversity, plus a DMD-style distribution-matching term for sharpness. One trained student supports 1, 2, and 4 step generation; the step count is chosen at sampling time, not trained separately.
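
A rough illustration of how one student can serve several step counts, written as a generic few-step consistency sampler in the TrigFlow form. This is a sketch under stated assumptions, not the rCM repository's sampler: the timestep schedules are hypothetical, the data standard deviation is taken as 1, and `toy_network` is an exact closed-form stand-in for a trained student on a one-point toy dataset.

```python
import math
import random

MU = 0.7  # toy dataset: every sample equals MU (data std taken as 1)

def toy_network(x_t, t):
    """Exact F(x_t, t) for the point-mass toy; stands in for the trained student."""
    noise = (x_t - math.cos(t) * MU) / math.sin(t)   # noise implied by x_t
    return math.cos(t) * noise - math.sin(t) * MU    # velocity d x_t / d t

def consistency_fn(x_t, t):
    # TrigFlow form with sigma_d = 1: f(x_t, t) = cos(t) * x_t - sin(t) * F(x_t, t)
    return math.cos(t) * x_t - math.sin(t) * toy_network(x_t, t)

def sample(timesteps, noises):
    """Few-step sampling: jump to a clean estimate, then re-noise to the next time.

    timesteps: decreasing times starting at pi/2 (pure noise).
    noises: one injected noise draw per step, supplied by the caller.
    """
    x = noises[0]                 # at t = pi/2 the sample is pure noise
    x0 = x
    for i, t in enumerate(timesteps):
        x0 = consistency_fn(x, t)
        if i + 1 < len(timesteps):
            t_next = timesteps[i + 1]
            x = math.cos(t_next) * x0 + math.sin(t_next) * noises[i + 1]
    return x0

# Hypothetical schedules; the step count is picked here, at sampling time.
SCHEDULES = {
    1: [math.pi / 2],
    2: [math.pi / 2, 1.1],
    4: [math.pi / 2, 1.3, 1.0, 0.6],
}

rng = random.Random(0)
injected = [rng.gauss(0.0, 1.0) for _ in range(4)]
results = {n: sample(ts, injected[:n]) for n, ts in SCHEDULES.items()}
print(results)  # each value equals MU up to floating-point error
```

The same function and the same "weights" are used for all three schedules; only the list of timesteps changes. Passing the noise in explicitly is also what makes a cross-runtime comparison of stochastic samplers possible.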

The dataset, and why it is mostly cache

Almost everything in the run is precomputed once and frozen:

  • 100k to 250k text prompts. Text is the only external input. No image dataset is required, which is a large part of why this method was chosen over the PixArt path (which demanded about a million synthetic images).
  • Text embeddings, cached. Every prompt is encoded once by the Qwen3 text encoder and stored.
  • Teacher outputs, cached. The teacher (the 20-step base Anima) is frozen for the whole run, so its full 20-step trajectory per prompt is computed once and stored. Everything except the final VAE decode is cached.
  • Total cache: about 1 TB. Training then only exercises the student, which learns to reach the cached teacher outcome in 1, 2, or 4 steps.
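
For a sense of scale, the figures above imply an average cache footprint per prompt. A small sketch, assuming "about 1 TB" means decimal terabytes and that the total covers everything cached per prompt (embeddings at all precisions plus the teacher trajectory):

```python
TOTAL_BYTES = 1e12                    # "about 1 TB", decimal (assumption)
PROMPT_COUNTS = (100_000, 250_000)    # range quoted in the brief

def mb_per_prompt(total_bytes, prompts):
    """Average cached megabytes per prompt."""
    return total_bytes / prompts / 1e6

for n in PROMPT_COUNTS:
    print(f"{n:>7} prompts -> {mb_per_prompt(TOTAL_BYTES, n):.0f} MB per prompt")
```

That puts the average at roughly 4 to 10 MB per prompt, depending on where in the 100k to 250k range the final prompt count lands.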

Our deviation: embeddings at three precisions

The standard recipe caches embeddings at full precision. We cached them at F16, BF16, and Q8_0. The Q8_0 set matters most: the student is trained directly on the quantized encoder's outputs, so the encoder's quantization error is part of what the student learns. The moment distillation finishes, the 8-bit text encoder is already supported, with no second training pass needed. This is also why the engine had to prove it reproduces that encoder exactly (verified to four nines on Mac, NVIDIA, and CPU).
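
To make the size of that quantization error concrete, here is a minimal sketch of a Q8_0-style round trip: blocks of 32 values, each stored as int8 codes plus one half-precision scale. It illustrates the block format on an arbitrary test vector and is not the engine's or GGML's code; the reference implementation rounds half away from zero, while Python's `round` rounds half to even.

```python
import math
import struct

BLOCK = 32  # values per Q8_0 block

def to_f16(x):
    """Round a float to half precision, the width of the stored block scale."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def q8_0_quantize(values):
    """Split values into blocks of 32; return (scale, int8 codes) per block."""
    assert len(values) % BLOCK == 0
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        scale = to_f16(max(abs(v) for v in chunk) / 127.0)
        if scale == 0.0:
            codes = [0] * BLOCK
        else:
            # Clamp so half-precision rounding of the scale cannot overflow int8.
            codes = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, codes))
    return blocks

def q8_0_dequantize(blocks):
    """Rebuild approximate floats: value = scale * code."""
    return [scale * c for scale, codes in blocks for c in codes]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

original = [math.sin(0.37 * i) for i in range(64)]   # arbitrary test vector
blocks = q8_0_quantize(original)
restored = q8_0_dequantize(blocks)
max_error = max(abs(o - r) for o, r in zip(original, restored))
print(f"cosine {cosine(original, restored):.6f}, max abs error {max_error:.5f}")
```

The per-value error is bounded by half a quantization step, which is small but not zero. Training the student on outputs that already contain this error is what removes the need for a later adaptation pass.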

Cost and logistics

$2-3k · teacher re-cache, if the teacher ever changes
$3-4k · the distillation run itself, in credits
3-4 days · on one 8x H100 machine
~1 TB · cached dataset feeding the run

8x H100 machines are scarce. Scheduling one is a real constraint, which is part of why the pipeline was smoke-tested exhaustively before committing: a failed run burns both the credits and the reservation.

Status

  • The run is live now. Results expected by end of week.
  • Intermediate checkpoints can be pulled at any point during the run and loaded directly into the engine's prepared slot. Quality can be watched improving day by day, inside the actual deployment runtime rather than only in training-side eval.

🎨 Style   De-Turbo: fine-tuning a model that has been distilled

The problem

Distillation compresses the model's output distribution, and a heavily compressed model is hostile to fine-tuning: training a style LoRA directly on it degrades into mode collapse. The alternative, fine-tuning the base model and hoping the LoRA survives distillation, has historically not worked for us.

The de-distiller pattern

The community workaround, which rCM builds in as De-Turbo:

  1. Distill the model down to 1 step.
  2. Train a reverse adapter on top of it that undoes the distillation, so with the adapter attached the model behaves like the slow 20-step model again.
  3. Fine-tune the style LoRA against that reversed model, where training is stable.
  4. At runtime, drop the reverse adapter and keep the style LoRA on the fast model.

The reason this can work: the style LoRA and the distillation modify different weights at different magnitudes, so they are close to orthogonal. Training against the de-distilled model keeps the LoRA in a space the fast model still responds to.
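
The four steps can be sketched on a single toy weight matrix. This assumes both the reverse adapter and the style LoRA are additive low-rank updates; the brief only calls De-Turbo an "adapter", so that form is an assumption, and every matrix below is hypothetical. The two toy adapters are constructed to touch different entries, which is the idealized version of "close to orthogonal".

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(*mats):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def lora_delta(down, up, scale=1.0):
    # Low-rank update: (out x r) @ (r x in), scaled.
    return [[scale * v for v in row] for row in matmul(up, down)]

def frobenius_inner(a, b):
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical 2x2 layer of the distilled (fast) student.
w_student = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical rank-1 adapters.
deturbo = lora_delta(down=[[0.5, 0.0]], up=[[1.0], [0.0]])   # step 2: reverse adapter
style = lora_delta(down=[[0.0, 0.2]], up=[[0.0], [1.0]])     # step 3: style LoRA

w_train = add(w_student, deturbo, style)   # style is trained with the reverse adapter attached
w_runtime = add(w_student, style)          # step 4: reverse adapter dropped, style kept

print("training weights:", w_train)
print("runtime weights: ", w_runtime)
print("adapter overlap: ", frobenius_inner(deturbo, style))
```

When the overlap between the two updates is near zero, removing the reverse adapter leaves the style update intact. Whether the real adapters behave this way on our model is exactly what the run has to show.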

Status and stakes

  • Our De-Turbo adapter trains inside the live distillation run, alongside the student. Getting it stable took several fixes this past week: gradient ownership between the student and the adapter, checkpoint save and resume, and a state bug in activation checkpointing.
  • Whether it worked will be known when the run finishes. If the run is far enough along, a LoRA fine-tune through De-Turbo can start early on a mid-run checkpoint.
  • If it works: one distillation, then cheap style training forever after. This is the high-value outcome.
  • If it fails: fallback is baking the style into the teacher before distilling, which means re-caching the teacher data ($2k to $3k) plus a new run ($3k to $4k) for every future art-direction change.
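
The fallback cost scales linearly with the number of art-direction changes. A small sketch using only the ranges quoted in this list; the De-Turbo path is not modeled because the brief does not put a number on per-style LoRA training.

```python
RECACHE_K = (2, 3)   # teacher re-cache, in $k (low, high), from the brief
RUN_K = (3, 4)       # distillation run, in $k (low, high), from the brief

def fallback_cost_k(style_changes):
    """(low, high) total in $k if every style change needs a re-cache plus a new run."""
    low = RECACHE_K[0] + RUN_K[0]
    high = RECACHE_K[1] + RUN_K[1]
    return (style_changes * low, style_changes * high)

print(fallback_cost_k(1))  # (5, 7)
print(fallback_cost_k(4))  # (20, 28)
```

Each style change on the fallback path therefore costs $5k to $7k in compute, before counting the 3 to 4 days of machine time per run.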

🪶 Memory   The ledger, and what the distilled model changes

Where the base model stands today

  • Image transformer to 4-bit (MLX-affine, group 32): 4.0 GB to 1.2 GB, quality cosine 0.999873 against full precision. Custom GPU kernel in our candle fork.
  • Text encoder to 8-bit (GGML Q8_0): 1.1 GB to 0.6 GB. Quantization scope matched to training exactly, all 259 tensors.
  • Decoder stays full precision. It is quality-sensitive and small (0.25 GB).
  • Checkpoints are pre-packed offline so peak memory equals steady-state memory. iOS terminates on peak, so this is required.
  • Measured on real hardware: 2.52 GB steady, 2.79 GB during load.
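
The ledger can be sanity-checked from parameter counts. This is a sketch under stated assumptions: decimal gigabytes, a 16-bit scale and 16-bit bias per 4-bit group of 32, one 16-bit scale per Q8_0 block of 32, and every weight quantized. Attributing the gap between summed weights and the measured total to activations, buffers, and runtime overhead is also an assumption.

```python
GB = 1e9  # decimal gigabytes (assumption; the brief does not say GB vs GiB)

def quantized_gb(params, bits, group_size, overhead_bits_per_group):
    """Estimated size of group-quantized weights, in GB."""
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return params * bits_per_weight / 8 / GB

# Transformer: 2B params, 4-bit codes, group 32, 16-bit scale + 16-bit bias per group.
transformer_gb = quantized_gb(2e9, 4, 32, 32)     # 5 bits per weight
# Text encoder: 0.6B params, 8-bit codes, one 16-bit scale per block of 32.
encoder_gb = quantized_gb(0.6e9, 8, 32, 16)       # 8.5 bits per weight

# Ledger using the sizes quoted in the brief.
weights_gb = 1.2 + 0.6 + 0.25     # transformer + text encoder + decoder
measured_steady_gb = 2.52
other_gb = measured_steady_gb - weights_gb        # everything that is not weights

print(f"transformer estimate:  {transformer_gb:.2f} GB")
print(f"text encoder estimate: {encoder_gb:.2f} GB")
print(f"weights total: {weights_gb:.2f} GB, everything else: {other_gb:.2f} GB")
```

The estimates (about 1.25 GB and 0.64 GB) line up with the quoted 1.2 GB and 0.6 GB, and the three weight blocks sum to about 2.05 GB, leaving roughly 0.47 GB of the measured 2.52 GB for everything else.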

What changes when the student arrives

  • The distilled student comes back at full precision. Running it as-is puts first-pass peak around 2.8 to 3 GB, likely above 3 initially. Initial test hardware is therefore recent iPhones, iPads, and Macs.
  • The 4-bit compression pipeline already exists as offline tooling and gets re-applied to the student to come back under 2.5 GB. The text encoder needs nothing: the student was trained on the 8-bit encoder's outputs from the start.
  • If a deeper text-encoder quantization is wanted later, the embeddings get re-cached at the new precision and the distillation re-run against them. The three-precision cache was built to make exactly this kind of iteration cheap.

Later

  • The push from 2.5 to 2.0 GB (ternary text encoder retrain plus a 3-bit transformer) covers 4 GB iPhone 11/12/13. Planned parallel project, does not block shipping.

🔬   Parity and smoke validation across Python and the engine

Cross-runtime checks completed before the run

The distillation only works if every component behaves identically in training (Python) and in deployment (candle). Each was verified individually before spending on the run:

  • Text encoder: runs in Python and in candle, outputs verified to match to four nines (worst-row cosine 0.999959) on Mac, NVIDIA, and CPU.
  • Image transformer: runs in Python and in candle, verified at parity.
  • Cached embeddings: dequantized and manually validated by generating images from them across multiple seeds and comparing results between the Python path, the candle path, and the teacher pipeline. All at parity.
  • The student sampler: the engine's 1, 2, and 4 step sampling loop matches the training repository's sampler at a similarity of 0.999999, tested with identical injected noise on both sides.

Methodology

  • Training dumps reference fixtures (exact inputs, exact expected outputs). Engine tests replay those inputs and gate on worst-case similarity, measured on real hardware.
  • Gates are not lowered to make a failing test pass. A miss is treated as a bug. This has caught several: a lossy GPU kernel path that only failed on short prompts, a library-version difference that silently changed a model constant, a wrong activation function.
  • Stochastic samplers are compared with identical injected noise on both sides, not by assuming random number generators match across languages.
  • Parity tests load the actual shipped checkpoints. The deployed path is the tested path.
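
A minimal sketch of what such a gate looks like, assuming the similarity metric is per-row cosine and the gate is "four nines". The function names and the tiny fixtures are hypothetical; real fixtures would be the tensors dumped by training.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)   # assumes neither row is all zeros

def worst_row_cosine(reference, actual):
    """Lowest per-row cosine similarity between two equally shaped outputs."""
    assert len(reference) == len(actual)
    return min(cosine(r, a) for r, a in zip(reference, actual))

def parity_gate(reference, actual, gate=0.9999):
    """Return (passed, worst similarity). The gate is fixed, never tuned to pass."""
    worst = worst_row_cosine(reference, actual)
    return worst >= gate, worst

# Hypothetical fixtures: reference from training, two candidate engine outputs.
reference = [[1.0, 2.0, 3.0], [0.5, -0.5, 0.25]]
close = [[1.0001, 2.0, 2.9999], [0.5, -0.5001, 0.25]]
drifted = [[1.0, 2.0, 3.0], [0.5, -0.4, 0.25]]

print(parity_gate(reference, close))
print(parity_gate(reference, drifted))
```

Gating on the worst row rather than the mean is what lets one bad row fail the test, which is how a defect limited to short prompts can surface instead of averaging away.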

Many of these details have failed silently in past efforts. The validation exists to give the paid run the best possible chance of working the first time.

🎮   Game integration (AI Battleground)

Why it is low-lift

  • The game already runs on-device image generation through the engine, using the older PixArt Sigma model at 20 steps. The integration surface, asset flow, and generation UI all exist. The current model is just slow.
  • The swap is: latest engine version, plus one small configuration PR (default model and step count). It can land with the non-distilled Anima first and switch to the distilled student when it passes the quality gate.

What it folds in

  • The per-platform binary separation work (splitting engine binaries by target platform) gets integrated as part of this PR, so each platform ships the right binary.
  • Being prepped now, while the distillation run is going, so the two finish around the same time.

Bottom line

Dataset cached. Engine ready. The distillation is running now.

Results by end of week. Checkpoints get tested in the engine as the run progresses, the game-integration PR is being prepped in parallel, and the style question gets answered by the same run via De-Turbo. After that: re-compress, quality gate, ship.