Unexpected networking issues knocked our market data services off the production load balancer. As a result, calls to the core application's "rates" endpoint timed out, which caused trades to fail.
These timeouts caused a backlog of requests to build up in the core application's containers. The backlog quickly overwhelmed the containers, crashing them and triggering replacements. Each replacement was then overwhelmed and crashed in turn, continuing the cycle.
Fixes:
Set a short timeout on calls from the core application to our market data services so that slow or unresponsive calls cannot tie up threads for extended periods.
Improve image caching and tune CPU resources to reduce the risk of a cascading failure.
In the long term, we will continue moving functionality out of the core application and into microservices so that future failures won't impact unrelated parts of the application.
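For readers curious about the timeout fix, here is a minimal sketch of the idea, assuming a Python client. It is an illustration only, not our production code: the URL, the 0.5-second value, and the names `fetch_rates` and `RatesUnavailable` are hypothetical.

```python
import urllib.request

# Hypothetical values: the real endpoint and timeout are not stated in this report.
RATES_URL = "http://market-data.internal/rates"
RATES_TIMEOUT_SECONDS = 0.5


class RatesUnavailable(Exception):
    """Raised when the market data service cannot be reached in time."""


def fetch_rates(url=RATES_URL, timeout=RATES_TIMEOUT_SECONDS):
    """Return the raw response body, failing fast instead of holding a thread.

    The timeout bounds each blocking socket operation (connect, read),
    not the total duration of the request.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError and socket timeouts are both subclasses of OSError.
        raise RatesUnavailable(f"market data call failed: {exc}") from exc
```

With a bound like this, a caller can catch `RatesUnavailable` and reject the request promptly, so a stalled market data service releases the calling thread after a fraction of a second rather than holding it while a backlog builds.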
Thank you for your patience and understanding while we worked to fix this issue.