2026-05-20 · infrastructure multi-node resilience

Why we run a multi-node residential mesh (and how to restore a node through an ISP filter)

The simplest sportsbook-API architecture is a single host that polls each book's public endpoint on a schedule. It works. It is also fragile in three predictable ways: the host's IP gets fingerprinted as a bot and rate-limited, the host's network goes down, and the host can only refresh as fast as its single polling loop. We run a multi-node ingest mesh because each of those failure modes hits a single-host setup eventually, and the cumulative effect is meaningful customer-visible degradation.

This is a practical writeup of how the mesh actually works in production, what it unlocks for customers, and one specific operational scenario we hit recently: restoring a node on a network whose ISP was filtering our own API domain.

What "mesh" means here

Four roles, intentionally separated:

Net effect: instead of one host hitting book X every 60 seconds, four nodes each hit book X every 60 seconds offset by 15 seconds. Effective refresh ceiling at the customer-facing API is 15 seconds, not 60.
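The offset arithmetic can be sketched in a few lines. This is a minimal illustration that assumes evenly spaced start offsets; the function names are hypothetical and are not taken from the production scheduler.

```python
def node_offsets(poll_interval_s: float, node_count: int) -> list[float]:
    """Start offset (seconds) for each node so polls are evenly staggered."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = poll_interval_s / node_count
    return [i * step for i in range(node_count)]


def effective_refresh_s(poll_interval_s: float, node_count: int) -> float:
    """Gap between consecutive polls of one book when offsets are evenly spaced."""
    return poll_interval_s / node_count


offsets = node_offsets(60, 4)
print(offsets)                     # [0.0, 15.0, 30.0, 45.0]
print(effective_refresh_s(60, 4))  # 15.0
```

If a node drops out, the remaining nodes keep their own offsets, so the gap for that book temporarily widens to 30 seconds at one point in the cycle rather than falling back to 60.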

What it actually unlocks (verified)

Measured on 2026-05-19 over a 10-minute window, post-restore, for a single sport (Bovada MLB lobby):

Via node | Prop rows / 10 min | Latest observation
Origin direct fetch (datacenter IP) | 167,774 | 17:59:35 UTC
Residential mobile (US-east) | 84,399 | 17:59:41 UTC
Cloud node (US-east) | 23,178 | 17:59:41 UTC

Three observations:

  1. The residential node contributed about 31% of the total volume on its own (84,399 of 275,351 rows). It is not a "backup" in any meaningful sense; it is a peer producer.
  2. The residential node's most-recent observation was 6 seconds fresher than the direct origin fetch in this window. Different polling cadence, different jitter, more frequent fresh captures.
  3. The cloud node's volume was lower than residential because the cloud node was already in a rate-limit-protective backoff cycle from a prior hour. The residential node's IP profile didn't trigger that.
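The shares and the freshness gap follow directly from the measured figures. The snippet below only redoes that arithmetic; the dictionary keys are labels chosen for this sketch, not identifiers from our codebase.

```python
from datetime import datetime

# Prop rows per 10 minutes, copied from the measurement above.
rows = {
    "origin_direct": 167_774,
    "residential_mobile": 84_399,
    "cloud_node": 23_178,
}

total = sum(rows.values())
shares = {name: count / total for name, count in rows.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Freshness gap between the residential node and the direct origin fetch.
fmt = "%H:%M:%S"
lead = datetime.strptime("17:59:41", fmt) - datetime.strptime("17:59:35", fmt)
print(lead.total_seconds())  # 6.0
```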

None of this is dramatic on its own. The point of the mesh is the cumulative effect over hundreds of hours of operation: fewer rate limits, fewer single-point failures, lower freshness ceiling, more representative cross-book pricing for arbitrage and +EV scanning.

The scenario we hit tonight

One of the ingest nodes (a Pixel 4a running a watchdog in Termux) went silent for ~19 hours. Two compounding problems:

  1. A Termux app-data reset on the phone wiped the runtime files (ingest.py + config.yaml). The watchdog itself survived but was spinning on a missing engine script. This was the direct cause of the silence.
  2. The host network (Spectrum residential WiFi) was filtering our API domain. Spectrum's "Security Shield" service had classified parlay-api.com as suspicious and was intercepting DNS lookups for it, so every request to the domain came back as a "Suspicious Site Blocked" HTML page. The obvious restore path (curl ... | sh from the public URL) therefore fed HTML to the shell where a script was expected, and the install failed in confusing ways.
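One way to make this failure obvious is to inspect the downloaded payload before executing it. The sketch below is a heuristic, not our installer: it assumes the install script begins with a `#!` shebang line, and the function name is hypothetical.

```python
def looks_like_shell_script(body: bytes) -> bool:
    """Heuristic: is this payload a script rather than an HTML block page?"""
    head = body.lstrip()[:512].lower()
    if b"<!doctype" in head or b"<html" in head:
        return False  # an interception page, not an installer
    # Assumption: the real installer starts with a shebang line.
    return head.startswith(b"#!")


blocked = b"<!DOCTYPE html><html><body>Suspicious Site Blocked</body></html>"
script = b"#!/bin/sh\nset -eu\necho installing\n"
print(looks_like_shell_script(blocked))  # False
print(looks_like_shell_script(script))   # True
```

Downloading to a file, checking it, and only then running it turns "the install failed in confusing ways" into a single clear error.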

The combination of "node software wiped" + "node's host network can't reach us" is the kind of failure mode you don't think about until you hit it. Standard remote-management doesn't apply: you can't SSH into a phone behind NAT, you can't push a script from your dashboard because the phone can't reach the dashboard, and the operator was at a remote location.

The solution: Tailscale bridge

The fix that worked, and that we've now productized as a permanent pattern at nodes/tailnet-bridge/:

  1. The node is on Tailscale (always; this is non-negotiable for any residential / unknown-network ingest node going forward).
  2. The origin is also on Tailscale, with tailscale serve --tcp=8765 tcp://localhost:8080 exposing the FastAPI app on the tailnet.
  3. The node's config points at the origin's tailnet IP, not the public domain: ingest_url: http://100.96.132.7:8765/v1/node/ingest.
  4. All ingest traffic goes through WireGuard. Because the config uses a tailnet IP, the node never resolves the public domain, and the tunnel is encrypted, so the ISP's classifier has no hostname to filter on. The same applies to any domain-based filter on the path: Spectrum's, AT&T's, corporate proxies, captive portals.
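A cheap guard for step 3 is to check at startup that ingest_url really is a tailnet address. This sketch relies on Tailscale assigning IPv4 addresses from the CGNAT range 100.64.0.0/10; the function name and the public-domain URL in the example are illustrative assumptions.

```python
import ipaddress
from urllib.parse import urlsplit

# Tailscale hands out node IPv4 addresses from the CGNAT range.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_url(url: str) -> bool:
    """True if the URL's host is a literal IPv4 address in the tailnet range."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, so a DNS lookup would be required
    return addr.version == 4 and addr in TAILNET_V4


print(is_tailnet_url("http://100.96.132.7:8765/v1/node/ingest"))  # True
print(is_tailnet_url("https://parlay-api.com/v1/node/ingest"))    # False
```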

For node-restoration when a fresh install is needed but the public install URL is filtered, an operator on the same tailnet runs nodes/tailnet-bridge/setup_operator_relay.sh, which hosts the install bundle on the operator's tailnet IP. The node runs one command, the install bundle pulls from the operator over WireGuard, the watchdog starts, and from that point forward all ingest traffic goes directly to the origin's tailnet IP without the operator in the path.
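The relay idea can be sketched as a small file server bound to a single interface. This is not the contents of setup_operator_relay.sh; it is an illustration using Python's standard library, with a hypothetical function name and a loopback address standing in for the operator's tailnet IP.

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request


def start_bundle_server(directory: str, bind_ip: str, port: int = 0):
    """Serve `directory` over HTTP on one interface only; returns the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((bind_ip, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Loopback demo; a real relay would bind to the operator's tailnet IP instead.
with tempfile.TemporaryDirectory() as bundle_dir:
    pathlib.Path(bundle_dir, "install.sh").write_text("#!/bin/sh\necho ok\n")
    server = start_bundle_server(bundle_dir, "127.0.0.1")
    port = server.server_address[1]
    # Bypass any configured HTTP proxy so the request stays local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://127.0.0.1:{port}/install.sh", timeout=5) as resp:
        fetched = resp.read().decode()
    server.shutdown()
    server.server_close()

print(fetched.startswith("#!/bin/sh"))  # True
```

Binding to the tailnet address rather than 0.0.0.0 keeps the bundle unreachable from the local LAN, which matters when the operator is on an untrusted network.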

Production-tested 2026-05-19 and validated end-to-end during the incident. Total operator time from "node has been silent 19h" to "node posting again" was about 90 minutes, most of it spent diagnosing an unrelated phone-keyboard autocorrect issue. A future operator hitting the same scenario should be able to do it in three minutes flat using the runbook at docs/runbooks/2026-05-19-pixel-restoration-via-tailnet.md.

What this means for customers

Three concrete benefits:

  1. Lower observed-latency ceiling on cross-book scans. When several nodes poll each book on offset cadences, customers querying our /v1/sports/{sport}/arbitrage or /v1/sports/{sport}/ev endpoints see edges sooner than any single-source operator could surface them.
  2. Better resilience to upstream rate-limits. If one node gets soft-banned by a book (the book starts returning empty arrays or 429s for that IP), other nodes keep pulling. Customers don't see the gap.
  3. More authentic cross-book consensus. A datacenter-IP fetch and a residential-IP fetch sometimes see different prices because the book serves slightly different content based on requester profile. Mesh fetching gives us both views, surfaces the difference, and lets +EV computations use the more representative one per market.
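The soft-ban handling in the second benefit can be sketched as a per-node health check. This is a simplified illustration with hypothetical node names and function names; in particular it treats any empty array as a ban signal, which a real implementation would need to distinguish from a book that genuinely has no open markets.

```python
def is_soft_banned(status_code: int, payload: object) -> bool:
    """Treat HTTP 429, or a 200 with an empty array, as a soft-ban signal."""
    if status_code == 429:
        return True
    return status_code == 200 and payload == []


def healthy_nodes(last_responses: dict[str, tuple[int, object]]) -> list[str]:
    """Names of nodes whose latest response does not look soft-banned."""
    return [
        node
        for node, (status, payload) in last_responses.items()
        if not is_soft_banned(status, payload)
    ]


latest = {
    "cloud-us-east": (429, None),
    "residential-us-east": (200, [{"market": "moneyline"}]),
    "origin": (200, []),
}
print(healthy_nodes(latest))  # ['residential-us-east']
```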

What we are NOT claiming

Build on it

If you're building a sportsbook-data product and want to do something similar:
