Overload and Cascading Failures
Definition
- A cascading failure is a failure that grows over time as a result of positive (exacerbating) feedback.
Resource Overload
- Overloaded CPU
- too many requests
- excessively long queue lengths
- thread starvation
- Overloaded Memory
- dying tasks (e.g. killed because resources are exhausted)
- increased rate of garbage collection, resulting in increased CPU usage
- reduction in cache hit rates
- Not enough file descriptors
- Not enough capacity during maintenance or a rolling update
Triggering Conditions for Cascading Failures
- Process Death
- Process Updates
- New Rollouts
- Organic Growth
- Planned Changes, Drains or Turndowns
- Request Profile Changes
Preventive Actions
Client
- Client-Side Throttling
- server-side throttling is not always effective, because for some queries the resources spent just handling and rejecting the connection are the main part of the cost
- reject probability = max(0, (requests - K * accepts) / (requests + 1)) (see the sketch after this list)
- Keep control on retries
- per-request retry budget
- per-client retry budget
- prevent hierarchical retries
- have backoff and jitter (see the sketch after this list)
- use clear response codes and consider how different failures should be handled
- Set timeouts
- the bigger the timeout, the more resources are spent on handling in-flight connections
- the smaller the timeout, the higher the probability of falsely timing out a legitimate request
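A minimal sketch of how the reject-probability formula above could drive client-side throttling (the class name, k = 2, and cumulative counters are illustrative assumptions; real implementations track requests and accepts over a sliding time window):

```python
import random


class AdaptiveThrottler:
    """Client-side throttling driven by
    reject probability = max(0, (requests - K * accepts) / (requests + 1))."""

    def __init__(self, k: float = 2.0):
        self.k = k
        self.requests = 0  # requests attempted by the application
        self.accepts = 0   # requests the backend actually accepted

    def reject_probability(self) -> float:
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def should_send(self) -> bool:
        """Decide locally whether to even send the request."""
        self.requests += 1
        # Rejecting here costs the overloaded backend nothing.
        return random.random() >= self.reject_probability()

    def on_response(self, accepted: bool) -> None:
        """Report whether the backend accepted the request (vs. rejected it as overloaded)."""
        if accepted:
            self.accepts += 1
```

With k = 2 the client tolerates roughly twice as many attempts as the backend is currently accepting before it starts rejecting locally; lowering k sheds load more aggressively.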
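And a sketch of a per-request retry budget combined with exponential backoff and full jitter (the function name, attempt limit, and delays are assumptions, not a real library API):

```python
import random
import time


def call_with_retries(do_request, max_attempts: int = 3,
                      base_delay: float = 0.1, max_delay: float = 2.0):
    """Retry a callable under a per-request budget with exponential backoff and jitter."""
    for attempt in range(max_attempts):        # per-request retry budget
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # budget exhausted: propagate the failure
            # Exponential backoff capped at max_delay, with full jitter so
            # retrying clients do not synchronize into retry waves.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0.0, delay))
```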
Load Balancers
- Control the size of the queue
- RPS limits for every client
- e.g. limiting by IP can help deal with DoS (see the token-bucket sketch after this list)
- Take backend utilization into account before sending requests (e.g. load average)
- Circuit breaker
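A token-bucket sketch of per-client RPS limiting at the load balancer (keying by client IP and the rate/burst values are illustrative assumptions):

```python
import time
from collections import defaultdict


class PerClientRateLimiter:
    """Token bucket per client: refill at `rate` tokens/s, allow bursts up to `burst`."""

    def __init__(self, rate: float = 100.0, burst: float = 200.0):
        self.rate = rate
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)        # current tokens per client
        self.last_seen = defaultdict(time.monotonic)    # last refill time per client

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[client_ip]
        self.last_seen[client_ip] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(self.burst,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False   # over the per-client limit: reject before it reaches a backend
```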
Server
- Rate Limits
- Control the size of the queue
- Serve requests with degraded quality in case of overload
- Drop requests in case of overload
- it's better to serve 2k RPS and drop the 500 extra than to drop all 2.5k requests
- the best approach is to prioritize requests and, in case of overload, serve only the most important ones
- Use request termination on the server
- to prevent cases where the client is no longer waiting for the response (it timed out) but the server still performs the work
- Propagate timeouts (deadlines) from high-level services to low-level ones; see the sketch after this list
- 10 s on the load balancer
- (10 - 1) s on the first backend, after 1 s has elapsed
- (10 - 1 - 5) s on the second backend, and so on
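A deadline-propagation sketch matching the numbers above: the edge sets one absolute deadline and every hop checks what is left before doing any work (the function shape is an illustrative assumption, not a specific RPC framework):

```python
import time


def handle(deadline: float, work, downstream=None):
    """Check the remaining deadline, do local work, pass the same deadline on."""
    if time.monotonic() >= deadline:
        # The caller has given up (or is about to): don't waste resources on the work.
        raise TimeoutError("deadline exceeded before work started")
    work()                           # this hop's own processing
    if downstream is not None:
        return downstream(deadline)  # propagate the *same* absolute deadline
    return "done"


# Usage: a 10 s total budget set once at the edge; each hop sees only what remains.
# handle(time.monotonic() + 10.0,
#        work=lambda: time.sleep(1.0),                                  # 1 s here
#        downstream=lambda d: handle(d, work=lambda: time.sleep(5.0)))  # 5 s here
```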
Human Actions
- Load test the server’s capacity limits
- Test failure mode for overload
- Perform capacity planning
- Increase Resources
- Stop Health Check Failures/Deaths
- Restart Servers
- be careful not to trigger the slow startup and cold caching issue described below
- Enter Degraded Modes
- Eliminate Batch Load
- Eliminate Bad Traffic
- Drop Traffic
Misc
Slow Startup and Cold Caching
Causes
- Maintenance
- Rolling Update
- Service Restart
Prevention
- Overprovision
- Slowly increase the load to the new cluster (see the sketch below)
- Prevent recursive links between services
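One way to slowly increase the load is a slow-start weight the load balancer applies to freshly started instances (the 5-minute warmup and 10% floor are illustrative assumptions):

```python
def slow_start_weight(seconds_since_start: float, warmup_seconds: float = 300.0) -> float:
    """Fraction of its normal traffic share a freshly started instance should receive."""
    if seconds_since_start >= warmup_seconds:
        return 1.0   # fully warmed up: normal share of traffic
    # Linear ramp from 10% to 100% while caches fill and lazy initialization completes.
    return 0.1 + 0.9 * (seconds_since_start / warmup_seconds)
```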