GDC multiplayer worked for most players. We audited the networking stack to understand why it failed for the rest. Here's what the code says and what we plan to do about it.
We audited the client networking layer, battle system, and Cloudflare backend after GDC. We found five areas where the code lacks resilience. We suspect these contributed to the multiplayer issues some players hit on certain networks and devices.
On some Wi-Fi networks and mobile connections, players couldn't connect at all. 15-second timeout, then nothing.
In the code: Only one STUN server, zero relay servers. If either player is behind a restrictive firewall or NAT, there is no alternative path.
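To make the missing fallback concrete, here is a minimal sketch of a server list with more than one STUN entry and a relay entry, plus a check that a relay path exists. It assumes the client takes a WebRTC-style `iceServers` configuration, which this write-up does not confirm. Every hostname, port, and credential is a placeholder, and `count_servers` and `has_fallback_path` are hypothetical helpers.

```python
# Hypothetical configuration: all URLs and credentials below are placeholders.
ICE_CONFIG = {
    "iceServers": [
        # STUN only helps a peer discover its public address; it carries no traffic.
        {"urls": ["stun:stun1.example.org:3478", "stun:stun2.example.org:3478"]},
        # TURN relays traffic when no direct peer-to-peer path can be found.
        {
            "urls": [
                "turn:turn.example.org:3478?transport=udp",
                "turn:turn.example.org:443?transport=tcp",
            ],
            "username": "PLACEHOLDER_USER",
            "credential": "PLACEHOLDER_SECRET",
        },
    ]
}


def count_servers(config, scheme):
    """Count configured URLs that use the given scheme, e.g. 'stun' or 'turn'."""
    prefix = scheme + ":"
    return sum(
        1
        for server in config["iceServers"]
        for url in server["urls"]
        if url.startswith(prefix)
    )


def has_fallback_path(config):
    """True when at least one relay (turn: or turns:) URL is configured."""
    return count_servers(config, "turn") + count_servers(config, "turns") > 0
```

A check like `has_fallback_path` could run at startup or in a test, so a STUN-only configuration is caught before it ships.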
A brief network interruption permanently killed the session. Players had to restart.
In the code: The connection state machine treats every failure as permanent. No "reconnecting" state, no retry logic, no heartbeat to detect drops.
If a player disconnected mid-battle, the other player's game could freeze with no way out.
In the code: The battle system doesn't listen for disconnect events. It waits indefinitely for the opponent's next action.
Receiving a call, checking a text, or locking the screen silently destroyed the connection.
In the code: No app lifecycle event handling. When the OS pauses the app, connections die and nothing tries to restore them on resume.
In some matches, creatures didn't show up for one or both players.
In the code: Race condition. The game can advance before creature data finishes syncing. A comment in the code reads "# very finicky :(".
After creating a creature, some players saw a mysterious "token expired" error and lost their creature. It seemed to happen more often on specific devices.
In the code: Likely related to token refresh timing or lifecycle events invalidating the auth session. Under investigation as part of this initiative.
Five fixes (A through E), broken into six PRs ordered by dependency. The top row can run in parallel. Each downstream PR starts after its dependencies land.
[Diagram: PR dependency graph. Arrows show dependencies. The top row ships first with no entanglement.]
Adds a WebSocket relay so players can connect even on networks where peer-to-peer fails. All game traffic routes through a lightweight Cloudflare relay server. Same data sync, zero changes to the battle system or game logic. Includes a new Durable Object, a new connector, a packet-based sync worker, and two test scripts (behavioral + integration).
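The relay idea can be sketched without any network code: a room holds the connected peers and forwards each packet to everyone except the sender. This is an in-memory illustration only, not the actual Durable Object or connector. `RelayRoom` and its methods are hypothetical names.

```python
class RelayRoom:
    """In-memory stand-in for one relay room (hypothetical, not the real server)."""

    def __init__(self):
        # peer_id -> callback that delivers (sender_id, packet) to that peer
        self._peers = {}

    def join(self, peer_id, deliver):
        self._peers[peer_id] = deliver

    def leave(self, peer_id):
        self._peers.pop(peer_id, None)

    def relay(self, sender_id, packet):
        """Forward the packet to every other peer; return how many received it."""
        if sender_id not in self._peers:
            raise KeyError(f"unknown sender: {sender_id}")
        delivered = 0
        for peer_id, deliver in list(self._peers.items()):
            if peer_id != sender_id:
                deliver(sender_id, packet)
                delivered += 1
        return delivered


# Usage: two players exchange one packet through the room.
host_inbox, guest_inbox = [], []
room = RelayRoom()
room.join("host", lambda sender, pkt: host_inbox.append((sender, pkt)))
room.join("guest", lambda sender, pkt: guest_inbox.append((sender, pkt)))
room.relay("host", b"creature-sync")
```

Because the room only forwards opaque packets, the battle system and game logic on either end do not need to know whether the bytes arrived peer-to-peer or through the relay.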
The connection state machine can now represent "connection lost, trying to recover" instead of only "connected" or "permanently failed." Adds CONNECTION_LOST and RECONNECTING states with exponential backoff retry. Nothing changes in production yet because no connector triggers the new states; they are opt-in. Also adds a headless CLI test runner for all networking tests.
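A minimal sketch of how the two new states and the backoff could fit together. Only CONNECTION_LOST and RECONNECTING come from the plan. `FAILED`, the method names, the 0.5 s base, the 8 s cap, and the attempt limit are all assumptions chosen for illustration.

```python
from enum import Enum, auto


class ConnState(Enum):
    CONNECTED = auto()
    CONNECTION_LOST = auto()  # named in the plan
    RECONNECTING = auto()     # named in the plan
    FAILED = auto()           # hypothetical name for "permanently failed"


def backoff_delay(attempt, base_s=0.5, cap_s=8.0):
    """Delay before retry number `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap_s, base_s * (2 ** attempt))


class ConnectionStateMachine:
    def __init__(self, max_attempts=5):
        self.state = ConnState.CONNECTED
        self.max_attempts = max_attempts
        self.attempt = 0

    def on_drop(self):
        # A drop is no longer terminal; it only starts the recovery path.
        if self.state is ConnState.CONNECTED:
            self.state = ConnState.CONNECTION_LOST
            self.attempt = 0

    def next_retry(self):
        """Return seconds to wait before the next attempt, or None if none is due."""
        if self.state not in (ConnState.CONNECTION_LOST, ConnState.RECONNECTING):
            return None
        if self.attempt >= self.max_attempts:
            self.state = ConnState.FAILED
            return None
        delay = backoff_delay(self.attempt)
        self.attempt += 1
        self.state = ConnState.RECONNECTING
        return delay

    def on_reconnected(self):
        if self.state is ConnState.RECONNECTING:
            self.state = ConnState.CONNECTED
            self.attempt = 0
```

A real client would likely add random jitter to each delay so that many players dropped at once do not retry in lockstep. It is left out here to keep the sketch deterministic.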
Replace the fragile manual flush pattern with a proper sync gate. The game won't advance past the loading phase until all players have confirmed their creature data is fully synced. Fixes the invisible-creatures bug.
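One way to picture the sync gate: track which players have not yet confirmed, and open only when that set is empty. This is a sketch of the general pattern, not the PR's implementation. `SyncGate`, `confirm`, and `is_open` are hypothetical names.

```python
class SyncGate:
    """Stays closed until every expected player has confirmed a full sync."""

    def __init__(self, player_ids):
        self._pending = set(player_ids)

    @property
    def is_open(self):
        return not self._pending

    def confirm(self, player_id):
        """Record one player's confirmation; return True once the gate is open."""
        # discard() ignores duplicate or unknown confirmations instead of raising.
        self._pending.discard(player_id)
        return self.is_open

    def waiting_on(self):
        """Players that have not confirmed yet, useful for logs or a loading UI."""
        return sorted(self._pending)
```

The loading phase would then advance only when `is_open` is true, which removes the window where one client moves on before the other's creature data has arrived.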
Wire disconnect awareness into the battle system. If a player drops mid-battle, show an overlay, apply a timeout on their actions, and let the remaining player exit cleanly with a win. No reconnection yet, just graceful handling.
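The action timeout is the piece that replaces the indefinite wait. A minimal sketch, assuming opponent actions arrive on an async queue: `wait_for_action`, `run_turn`, the result strings, and the 30-second default are hypothetical. The real battle system may be structured quite differently.

```python
import asyncio


async def wait_for_action(actions, timeout_s):
    """Return the opponent's next action, or None if none arrives in time."""
    try:
        return await asyncio.wait_for(actions.get(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def run_turn(actions, timeout_s=30.0):
    """Resolve one opponent turn without ever blocking forever."""
    action = await wait_for_action(actions, timeout_s)
    if action is None:
        # This is where the overlay and the clean exit with a win would hook in.
        return "opponent_timed_out"
    return f"apply:{action}"


async def demo():
    actions = asyncio.Queue()
    await actions.put("attack")
    first = await run_turn(actions, timeout_s=0.05)   # an action is waiting
    second = await run_turn(actions, timeout_s=0.05)  # queue is empty, so it times out
    return first, second
```

Running `asyncio.run(demo())` exercises both branches: the first turn applies the queued action and the second times out.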
Listen for app pause/resume events on mobile. When the player backgrounds the app, mark the connection as suspended instead of tearing it down. On resume, attempt a silent reconnect through the WebSocket relay. If too much time passed, request a full re-sync from the host.
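The pause/resume decision can be sketched as a small observer that records when the app was suspended and picks a recovery path on resume. The 30-second threshold, the class name, and the returned strings are assumptions. The plan only says "if too much time passed."

```python
import time


class LifecycleObserver:
    """Decides how to recover a connection after the app was backgrounded."""

    def __init__(self, resync_after_s=30.0, clock=time.monotonic):
        self.resync_after_s = resync_after_s
        self._clock = clock  # injectable so tests can control time
        self._suspended_at = None

    def on_pause(self):
        # Mark the connection as suspended instead of tearing it down.
        if self._suspended_at is None:
            self._suspended_at = self._clock()

    def on_resume(self):
        """Return 'noop', 'silent_reconnect', or 'full_resync'."""
        if self._suspended_at is None:
            return "noop"
        away_s = self._clock() - self._suspended_at
        self._suspended_at = None
        if away_s > self.resync_after_s:
            return "full_resync"
        return "silent_reconnect"
```

A monotonic clock is used so that the user changing the device time while the app is backgrounded cannot skew the measured gap.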
Build on the disconnect overlay and lifecycle observer to support actual reconnection during battle. Pause the game while reconnecting, re-sync state on success, and fall back to forfeit if the reconnection window expires.
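The pause, re-sync, and forfeit flow can be sketched as three phases driven by a timer. The phase names, the 20-second window, and the `BattleReconnect` class are hypothetical, and the re-sync step is passed in as a callback because its real form is not described here.

```python
class BattleReconnect:
    """Tracks whether a battle is playing, paused for reconnect, or forfeited."""

    def __init__(self, window_s=20.0):
        self.window_s = window_s
        self.phase = "playing"
        self._waited_s = 0.0

    def on_connection_lost(self):
        if self.phase == "playing":
            self.phase = "paused"
            self._waited_s = 0.0

    def tick(self, dt_s):
        """Advance the pause timer; forfeit once the window is used up."""
        if self.phase == "paused":
            self._waited_s += dt_s
            if self._waited_s >= self.window_s:
                self.phase = "forfeit"

    def on_reconnected(self, resync):
        """Run the caller's re-sync step, then resume; ignored after a forfeit."""
        if self.phase == "paused":
            resync()
            self.phase = "playing"
```

Making a forfeit terminal keeps the outcome consistent for both players: a connection that returns after the window cannot reopen a battle the other side has already left.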