Release v0.64.3
May 02, 2026
Hardened the platform against fleet-wide reconnect storms and fixed a Windows agent crash loop.
Improved
- Per-organization rate limit on agent endpoints so a single misbehaving fleet can no longer overwhelm the API. Default budget is generous enough that no normal MSP will ever notice it; tunable per environment.
- Agent now honors the server's Retry-After header on 429 and 503 responses, so when the API tells the fleet to back off, the fleet actually backs off instead of running its own retry schedule.
- Tighter limits on agent log shipments: smaller batch sizes and a hard cap on request body size keep one chatty agent from drowning the log ingest path.
- Slower agent and watchdog restart cadence on Linux (30s and 15s, respectively) prevents a network blip from triggering a thundering herd of reconnects across the fleet.
- Postgres connection pool size raised to 30 connections so heartbeat storms no longer cascade into 504 errors.
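For readers who want a picture of what honoring Retry-After involves, here is a minimal sketch in Python. It is not the agent's actual code: the function name `retry_delay_seconds`, the `fallback` parameter, and the 300-second ceiling are all assumptions made for illustration. It handles both forms the header can take (a number of seconds or an HTTP date) and falls back to the caller's own schedule when the header is missing or unparseable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay_seconds(
    status: int,
    headers: Mapping[str, str],
    fallback: float,
    now: Optional[datetime] = None,
    max_delay: float = 300.0,  # hypothetical ceiling, not a documented value
) -> float:
    """Return how long to wait before retrying, preferring the server's hint."""
    # Only 429 and 503 responses are treated as carrying a back-off hint.
    if status not in (429, 503):
        return fallback

    # Header names are case-insensitive, so match on the lowercased key.
    value = next(
        (v for k, v in headers.items() if k.lower() == "retry-after"), None
    )
    if value is None:
        return fallback

    value = value.strip()
    if value.isascii() and value.isdigit():
        # Form 1: delay in whole seconds, e.g. "120".
        delay = float(value)
    else:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return fallback
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        current = now or datetime.now(timezone.utc)
        delay = (when - current).total_seconds()

    # Clamp so a date in the past or an absurdly large value stays sane.
    return min(max(delay, 0.0), max_delay)
```

A caller would sleep for the returned value before the next attempt, ideally with a little random jitter added so that agents given the same hint do not all return at the same instant.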
Fixed
- Windows user-helper Scheduled Task was crash-looping on multiple customer tenants with an auth rejection error. The helper now starts with the correct role and a regression test prevents future drift.
A focused resilience release. The biggest piece is a three-part defense against the kind of correlated reconnect storms that can take down the API when a network blip or bad config push affects a large fleet at once: per-organization rate limits, Retry-After awareness on the agent side, and slower service restart cadence. None of this is visible during normal operation; it just means the worst case stays bounded.
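To illustrate the server side of that defense, the following is a rough sketch of a per-organization rate limiter built on a token bucket. The release does not say which algorithm the platform uses, so the token bucket is an assumption, as are the class name `PerOrgRateLimiter`, its parameters, and the numbers in the usage below. The point it shows is that each organization draws from its own budget, and that a rejected request comes with a Retry-After value the agent can honor.

```python
import math
import time
from typing import Callable, Dict, Tuple


class PerOrgRateLimiter:
    """Token bucket per organization (illustrative, in-memory only)."""

    def __init__(
        self,
        capacity: float,
        refill_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        # org_id -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def check(self, org_id: str) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for one request."""
        now = self.clock()
        tokens, updated = self._buckets.get(org_id, (self.capacity, now))

        # Refill based on elapsed time, never exceeding the bucket capacity.
        tokens = min(self.capacity, tokens + (now - updated) * self.refill_per_sec)

        if tokens >= 1.0:
            self._buckets[org_id] = (tokens - 1.0, now)
            return True, 0

        # Out of budget: report how long until one full token is available,
        # which the API could send back as the Retry-After value on a 429.
        self._buckets[org_id] = (tokens, now)
        wait = math.ceil((1.0 - tokens) / self.refill_per_sec)
        return False, wait
```

With a burst capacity of 2 and a refill of one request per second, an organization that sends three requests at once gets two through and is told to wait one second for the third, while a different organization is unaffected. A production limiter would also need shared state across API instances and eviction of idle buckets, which this sketch leaves out.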
The Windows fix is more user-visible: a Scheduled Task running under the standard Users group was crash-looping with an auth rejection on tenants such as nexusitsys and Revenant Global. That’s resolved, with a regression test in place so it stays resolved.