Overload and Cascading Failures
Definition
- A cascading failure is a failure that grows over time as a result of positive (exacerbating) feedback.
Resource Overload
- Overloaded CPU
- too many requests
- excessively long queue lengths
- thread starvation
- Overloaded Memory
- dying tasks (e.g. killed because resources are exhausted)
- increased rate of garbage collection, resulting in increased CPU usage
- reduction in cache hit rates
- Not enough file descriptors
- Not enough capacity during maintenance or a rolling update
Triggering Conditions for Cascading Failures
- Process Death
- Process Updates
- New Rollouts
- Organic Growth
- Planned Changes, Drains or Turndowns
- Request Profile Changes
Preventive Actions
Client
- Client-Side Throttling
- server-side throttling is not always effective, because for some queries the resources spent just handling and rejecting the connection are the main part of the cost
- reject probability = max(0, (requests - K * accepts) / (requests + 1)) (see the sketch after this list)
- Keep control on retries
- per-request retry budget
- per-client retry budget
- prevent hierarchical retries
- have backoff and jitter (see the sketch after this list)
- use clear response codes and consider how different failures should be handled
- Set timeouts
- the bigger the timeout, the more resources are spent on handling in-flight connections
- the smaller the timeout, the higher the probability of falsely timing out a legitimate request
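A minimal sketch of how the reject-probability formula above could drive client-side throttling (the class name, k = 2, and cumulative counters are illustrative assumptions; real implementations track requests and accepts over a sliding time window):

```python
import random


class AdaptiveThrottler:
    """Client-side throttling driven by
    reject probability = max(0, (requests - K * accepts) / (requests + 1))."""

    def __init__(self, k: float = 2.0):
        self.k = k
        self.requests = 0  # requests attempted by the application
        self.accepts = 0   # requests the backend actually accepted

    def reject_probability(self) -> float:
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def should_send(self) -> bool:
        """Decide locally whether to even send the request."""
        self.requests += 1
        # Rejecting here costs the overloaded backend nothing.
        return random.random() >= self.reject_probability()

    def on_response(self, accepted: bool) -> None:
        """Report whether the backend accepted the request (vs. rejected it as overloaded)."""
        if accepted:
            self.accepts += 1
```

With k = 2 the client tolerates roughly twice as many attempts as the backend is currently accepting before it starts rejecting locally; lowering k sheds load more aggressively.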
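And a sketch of a per-request retry budget combined with exponential backoff and full jitter (the function name, attempt limit, and delays are assumptions, not a real library API):

```python
import random
import time


def call_with_retries(do_request, max_attempts: int = 3,
                      base_delay: float = 0.1, max_delay: float = 2.0):
    """Retry a callable under a per-request budget with exponential backoff and jitter."""
    for attempt in range(max_attempts):        # per-request retry budget
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # budget exhausted: propagate the failure
            # Exponential backoff capped at max_delay, with full jitter so
            # retrying clients do not synchronize into retry waves.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0.0, delay))
```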
Load Balancers
- Control the size of the queue
- RPS limits for every client
- e.g. limiting by IP can help deal with DoS (see the token-bucket sketch after this list)
- Take backend utilization into account before sending requests (e.g. load average)
- Circuit breaker
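A token-bucket sketch of per-client RPS limiting at the load balancer (keying by client IP and the rate/burst values are illustrative assumptions):

```python
import time
from collections import defaultdict


class PerClientRateLimiter:
    """Token bucket per client: refill at `rate` tokens/s, allow bursts up to `burst`."""

    def __init__(self, rate: float = 100.0, burst: float = 200.0):
        self.rate = rate
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)        # current tokens per client
        self.last_seen = defaultdict(time.monotonic)    # last refill time per client

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[client_ip]
        self.last_seen[client_ip] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(self.burst,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False   # over the per-client limit: reject before it reaches a backend
```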
Server
- Rate Limits
- Control the size of the queue
- Serve requests with degraded quality in case of overload
- Drop requests in case of overload
- it's better to serve 2k RPS and drop the 500 extra than to drop all 2.5k requests
- the best approach is to prioritize requests and, in case of overload, serve only the most important ones
- Use request termination on the server
- to prevent cases where the client is no longer waiting for the response (it timed out) but the server still performs the work
- Propagate timeouts (deadlines) from high-level services to low-level ones; see the sketch after this list
- 10 s on the load balancer
- (10 - 1) s on the first backend, after 1 s has elapsed
- (10 - 1 - 5) s on the second backend, and so on
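A deadline-propagation sketch matching the numbers above: the edge sets one absolute deadline and every hop checks what is left before doing any work (the function shape is an illustrative assumption, not a specific RPC framework):

```python
import time


def handle(deadline: float, work, downstream=None):
    """Check the remaining deadline, do local work, pass the same deadline on."""
    if time.monotonic() >= deadline:
        # The caller has given up (or is about to): don't waste resources on the work.
        raise TimeoutError("deadline exceeded before work started")
    work()                           # this hop's own processing
    if downstream is not None:
        return downstream(deadline)  # propagate the *same* absolute deadline
    return "done"


# Usage: a 10 s total budget set once at the edge; each hop sees only what remains.
# handle(time.monotonic() + 10.0,
#        work=lambda: time.sleep(1.0),                                  # 1 s here
#        downstream=lambda d: handle(d, work=lambda: time.sleep(5.0)))  # 5 s here
```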
Human Actions
- Load test the server’s capacity limits
- Test failure mode for overload
- Perform capacity planning
- Increase Resources
- Stop Health Check Failures/Deaths
- Restart Servers
- be careful not to trigger the slow startup and cold caching issue described below
- Enter Degraded Modes
- Eliminate Batch Load
- Eliminate Bad Traffic
- Drop Traffic
Misc
Slow Startup and Cold Caching
Causes
- Maintenance
- Rolling Update
- Service Restart
Prevention
- Overprovision
- Slowly increase the load to the new cluster (see the sketch below)
- Prevent recursive links between services
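One way to slowly increase the load is a slow-start weight the load balancer applies to freshly started instances (the 5-minute warmup and 10% floor are illustrative assumptions):

```python
def slow_start_weight(seconds_since_start: float, warmup_seconds: float = 300.0) -> float:
    """Fraction of its normal traffic share a freshly started instance should receive."""
    if seconds_since_start >= warmup_seconds:
        return 1.0   # fully warmed up: normal share of traffic
    # Linear ramp from 10% to 100% while caches fill and lazy initialization completes.
    return 0.1 + 0.9 * (seconds_since_start / warmup_seconds)
```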