Multiplayer Resilience Audit

What We Found in the Code

GDC multiplayer worked for most players, but not all. We audited the networking stack to understand why some sessions failed. Here's what the code says and what we plan to do about it.

Date: 2025-03-18
Scope: ~2,300 lines / 15 files / 3 server DOs
Fixes: A through E / 6 PRs

Issues Found in Audit

We audited the client networking layer, battle system, and Cloudflare backend after GDC. We found five areas where the code lacks resilience, plus a sixth issue (creature saves) that is still under investigation. We suspect these contributed to the multiplayer issues some players hit on certain networks and devices.

CONNECTIVITY

Can't Connect on Some Networks

On some Wi-Fi networks and mobile connections, players couldn't connect at all. 15-second timeout, then nothing.

In the code: Only one STUN server, zero relay servers. If either player is behind a restrictive firewall or NAT, there is no alternative path.

RECOVERY

No Recovery From Dropped Connections

A brief network interruption permanently killed the session. Players had to restart.

In the code: The connection state machine treats every failure as permanent. No "reconnecting" state, no retry logic, no heartbeat to detect drops.

BATTLE

Battle Freezes on Disconnect

If a player disconnected mid-battle, the other player's game could freeze with no way out.

In the code: The battle system doesn't listen for disconnect events. It waits indefinitely for the opponent's next action.

LIFECYCLE

Phone Calls Kill Multiplayer

Receiving a call, checking a text, or locking the screen silently destroyed the connection.

In the code: No app lifecycle event handling. When the OS pauses the app, connections die and nothing tries to restore them on resume.

SYNC

Creatures Don't Appear

In some matches, creatures didn't show up for one or both players.

In the code: Race condition. The game can advance before creature data finishes syncing. A comment reads # very finicky :(

AUTH / SAVE

Creatures Fail to Save

After creating a creature, some players saw an unexplained "token expired" error and lost their creature. This seemed to happen more often on specific devices.

In the code: Likely related to token refresh timing or lifecycle events invalidating the auth session. Under investigation as part of this initiative.

Fix Roadmap

Five fixes (A through E), broken into six PRs ordered by dependency. Fixes A, B, and E can run in parallel. Each downstream PR starts after its dependencies land.

CAN START IN PARALLEL
FIX A / #674 / DRAFT / WebSocket Relay / Connectivity / Unblocks C, D.2
FIX B / #675 / DRAFT / State Machine Hardening / Recovery / Unblocks C, D
FIX E / PLANNED / Creature Sync Fix / Sync / Independent

FIX D.1 / PLANNED / Battle Disconnect L1 / Battle / Needs Fix B
FIX C / PLANNED / Lifecycle Observer / Lifecycle / Needs Fix A + B
FIX D.2 / PLANNED / Battle Reconnection / Battle (full) / Needs C + D.1

Each entry lists what it needs or unblocks. Fixes A, B, and E ship first with no entanglement.
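The ordering above can be checked mechanically. Here is a minimal sketch using Python's standard graphlib, with edges copied from the roadmap. The dictionary keys are just labels for the fixes, not identifiers from the codebase.

```python
from graphlib import TopologicalSorter

# Each fix maps to the fixes that must land before it (taken from the roadmap).
DEPENDS_ON = {
    "A": set(),
    "B": set(),
    "E": set(),
    "D.1": {"B"},
    "C": {"A", "B"},
    "D.2": {"C", "D.1"},
}


def merge_waves(depends_on):
    """Group fixes into waves; everything inside one wave can proceed in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


WAVES = merge_waves(DEPENDS_ON)
```

With these edges, the first wave is A, B, and E, the second is C and D.1, and D.2 lands last.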

Progress

WebSocket Relay Transport

Adds a WebSocket relay so players can connect even on networks where peer-to-peer fails. All game traffic routes through a lightweight Cloudflare relay server. Same data sync, zero changes to the battle system or game logic. Includes a new Durable Object, a new connector, a packet-based sync worker, and two test scripts (behavioral + integration).

Fix A / Connectivity / No dependencies
Draft
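To make the relay idea concrete, here is a rough in-memory sketch of what a relay room does: it forwards each packet to every other peer in the room. The class and method names are hypothetical. The real relay is a Cloudflare Durable Object speaking WebSocket, and nothing here reflects its actual wire format.

```python
from typing import Callable, Dict


class RelayRoom:
    """In-memory stand-in for one relay room (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # peer_id -> callback that delivers bytes to that peer
        self._peers: Dict[str, Callable[[bytes], None]] = {}

    def join(self, peer_id: str, send: Callable[[bytes], None]) -> None:
        self._peers[peer_id] = send

    def leave(self, peer_id: str) -> None:
        self._peers.pop(peer_id, None)

    def relay(self, sender_id: str, packet: bytes) -> int:
        """Forward a packet to every peer except the sender; return the delivery count."""
        delivered = 0
        for peer_id, send in list(self._peers.items()):
            if peer_id != sender_id:
                send(packet)
                delivered += 1
        return delivered
```

Because the room only forwards opaque bytes, the game logic on either side does not need to know whether a packet arrived peer-to-peer or through the relay.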

Connection State Machine: Recovery States

The connection state machine can now represent "connection lost, trying to recover" instead of only "connected" or "permanently failed." Adds CONNECTION_LOST and RECONNECTING states with exponential backoff retry. Nothing changes in production yet because no connector triggers it; it's opt-in. Also adds a headless CLI test runner for all networking tests.

Fix B / Recovery + Foundation / No dependencies
Draft
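A minimal sketch of the recovery states with exponential backoff. CONNECTION_LOST and RECONNECTING are the state names from this PR. CONNECTED, FAILED, the method names, and the delay values (0.5 s base, 8 s cap, 5 attempts) are assumptions chosen for illustration.

```python
from enum import Enum, auto
from typing import Optional


class ConnState(Enum):
    CONNECTED = auto()        # assumed name
    CONNECTION_LOST = auto()  # from the PR
    RECONNECTING = auto()     # from the PR
    FAILED = auto()           # assumed name for the permanent-failure state


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Delay doubles with each attempt and is clamped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


class Connection:
    def __init__(self, max_attempts: int = 5) -> None:
        self.state = ConnState.CONNECTED
        self.attempt = 0
        self.max_attempts = max_attempts

    def on_drop(self) -> None:
        """A drop is recoverable: move to CONNECTION_LOST rather than FAILED."""
        if self.state is ConnState.CONNECTED:
            self.state = ConnState.CONNECTION_LOST
            self.attempt = 0

    def next_retry_delay(self) -> Optional[float]:
        """Enter RECONNECTING and return seconds to wait, or None once retries run out."""
        if self.state not in (ConnState.CONNECTION_LOST, ConnState.RECONNECTING):
            return None
        if self.attempt >= self.max_attempts:
            self.state = ConnState.FAILED
            return None
        self.state = ConnState.RECONNECTING
        delay = backoff_delay(self.attempt)
        self.attempt += 1
        return delay

    def on_reconnected(self) -> None:
        if self.state is ConnState.RECONNECTING:
            self.state = ConnState.CONNECTED
            self.attempt = 0
```

Only the final exhausted retry reaches FAILED, so a brief interruption no longer ends the session by itself.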

Creature Sync Race Fix

Replace the fragile manual flush pattern with a proper sync gate. The game won't advance past the loading phase until all players have confirmed their creature data is fully synced. Fixes the invisible-creatures bug.

Fix E / Sync / Independent, no dependencies
Planned
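One way to picture the sync gate: the loading phase stays closed until every expected player has confirmed. This is a sketch only. SyncGate and its methods are hypothetical names, and the planned fix may be structured differently.

```python
from typing import Iterable, List


class SyncGate:
    """Opens only after every expected player confirms their creature data is synced."""

    def __init__(self, player_ids: Iterable[str]) -> None:
        self._expected = frozenset(player_ids)
        self._confirmed = set()

    def confirm(self, player_id: str) -> bool:
        """Record one player's confirmation; return True once the gate is open."""
        if player_id not in self._expected:
            raise KeyError(f"unknown player: {player_id}")
        self._confirmed.add(player_id)
        return self.is_open

    @property
    def is_open(self) -> bool:
        return self._confirmed == self._expected

    def pending(self) -> List[str]:
        """Players we are still waiting on, useful for a loading-screen message."""
        return sorted(self._expected - self._confirmed)
```

The game would check `is_open` before leaving the loading phase, instead of relying on the timing of a manual flush.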

Battle Disconnect Handling (Level 1)

Wire disconnect awareness into the battle system. If a player drops mid-battle, show an overlay, apply a timeout on their actions, and let the remaining player exit cleanly with a win. No reconnection yet, just graceful handling.

Fix D.1 / Battle / Depends on #675
Planned
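The Level 1 behavior reduces to a small decision the battle loop can make each frame, instead of waiting indefinitely. A sketch under assumptions: the function name, the outcome names, and the 30-second timeout are illustrative, not values from the codebase.

```python
from enum import Enum
from typing import Optional


class TurnOutcome(Enum):
    WAIT_FOR_ACTION = "wait_for_action"      # opponent connected, keep waiting
    SHOW_OVERLAY = "show_overlay"            # opponent dropped, timeout still running
    WIN_BY_DISCONNECT = "win_by_disconnect"  # timeout elapsed, remaining player wins


def check_opponent(
    disconnected_at: Optional[float],
    now: float,
    action_timeout: float = 30.0,  # assumed value
) -> TurnOutcome:
    """Decide what the remaining player should see while waiting on the opponent."""
    if disconnected_at is None:
        return TurnOutcome.WAIT_FOR_ACTION
    if now - disconnected_at < action_timeout:
        return TurnOutcome.SHOW_OVERLAY
    return TurnOutcome.WIN_BY_DISCONNECT
```

Here `disconnected_at` would be set by the disconnect event the battle system currently ignores.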

Lifecycle Observer

Listen for app pause/resume events on mobile. When the player backgrounds the app, mark the connection as suspended instead of tearing it down. On resume, attempt a silent reconnect through the WebSocket relay. If too much time passed, request a full re-sync from the host.

Fix C / Lifecycle / Depends on #674 + #675
Planned
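The pause/resume handling can be sketched as a small observer that records when the app was suspended and picks a recovery path on resume. All names here are hypothetical, and the 30-second threshold for "too much time passed" is an assumption.

```python
from enum import Enum
from typing import Optional


class ResumeAction(Enum):
    NONE = "none"                          # app was never paused
    SILENT_RECONNECT = "silent_reconnect"  # short absence: reconnect via the relay
    FULL_RESYNC = "full_resync"            # long absence: reconnect, then re-sync from host


class LifecycleObserver:
    def __init__(self, resync_after: float = 30.0) -> None:  # assumed threshold
        self.resync_after = resync_after
        self.suspended_at: Optional[float] = None

    def on_pause(self, now: float) -> None:
        """Mark the connection as suspended instead of tearing it down."""
        if self.suspended_at is None:
            self.suspended_at = now

    def on_resume(self, now: float) -> ResumeAction:
        """Pick a recovery path based on how long the app was in the background."""
        if self.suspended_at is None:
            return ResumeAction.NONE
        away = now - self.suspended_at
        self.suspended_at = None
        if away <= self.resync_after:
            return ResumeAction.SILENT_RECONNECT
        return ResumeAction.FULL_RESYNC
```

On a real device, `on_pause` and `on_resume` would be driven by the platform's app lifecycle notifications.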

Battle Reconnection (Level 2)

Build on the disconnect overlay and lifecycle observer to support actual reconnection during battle. Pause the game while reconnecting, re-sync state on success, fall back to forfeit if the reconnection window expires.

Fix D.2 / Battle (full) / Depends on Fix C + D.1
Planned
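A rough sketch of the Level 2 flow: pause on loss, re-sync and resume if the peer returns in time, forfeit otherwise. The class, the phase names, and the 20-second reconnection window are assumptions. The re-sync itself is reduced to a counter as a placeholder.

```python
from enum import Enum
from typing import Optional


class BattlePhase(Enum):
    ACTIVE = "active"
    PAUSED = "paused"    # waiting for the dropped player to come back
    FORFEIT = "forfeit"  # reconnection window expired


class BattleSession:
    def __init__(self, reconnect_window: float = 20.0) -> None:  # assumed window
        self.reconnect_window = reconnect_window
        self.phase = BattlePhase.ACTIVE
        self.paused_at: Optional[float] = None
        self.resync_count = 0  # placeholder for the real state re-sync

    def on_peer_lost(self, now: float) -> None:
        if self.phase is BattlePhase.ACTIVE:
            self.phase = BattlePhase.PAUSED
            self.paused_at = now

    def tick(self, now: float) -> BattlePhase:
        """Call periodically; falls back to forfeit once the window has expired."""
        if self.phase is BattlePhase.PAUSED and now - self.paused_at > self.reconnect_window:
            self.phase = BattlePhase.FORFEIT
        return self.phase

    def on_peer_back(self, now: float) -> bool:
        """Resume the battle if the peer returned inside the window."""
        if self.tick(now) is not BattlePhase.PAUSED:
            return False
        self.resync_count += 1
        self.phase = BattlePhase.ACTIVE
        self.paused_at = None
        return True
```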