notes

Overload and Cascading Failures

Definition

  1. A cascading failure is a failure that grows over time as a result of positive (exacerbating) feedback: the failure of one part of the system shifts load onto the remaining parts and increases their probability of failure.

Resource Overload

  1. Overloaded CPU
    • too many requests
    • excessively long queue lengths
    • thread starvation
  2. Overloaded Memory
    • dying tasks (e.g. killed because memory is exhausted)
    • increased rate of garbage collection, resulting in increased CPU usage
    • reduction in cache hit rates
  3. Not enough file descriptors
  4. Not enough capacity during maintenance or a rolling update

Triggering Conditions for Cascading Failures

  1. Process Death
  2. Process Updates
  3. New Rollouts
  4. Organic Growth
  5. Planned Changes, Drains or Turndowns
  6. Request Profile Changes

Preventive Actions

Client

  1. Client-Side Throttling
    • server-side throttling alone is not always effective, because for some queries the resources spent on accepting and handling the connection are the main cost of serving the request, so it is cheaper to reject on the client side (see the throttling sketch after this list)
    • reject probability = max(0, (requests - K * accepts) / (requests + 1)), where requests counts all attempts and accepts counts only those the backend accepted
  2. Keep retries under control
    • per-request retry budget
    • per-client retry budget
    • prevent hierarchical retries (retries at several layers multiply the load)
    • use exponential backoff and jitter (see the retry sketch after this list)
    • use clear response codes and consider how different failure types should be handled (e.g. retry only retryable errors)
  3. Set timeouts
    • the longer the timeout, the more resources are spent holding connections for requests that will ultimately fail
    • the shorter the timeout, the higher the probability of prematurely timing out requests that would have succeeded
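
A minimal Python sketch of the adaptive client-side throttling formula above. The class and method names, the window handling (plain counters instead of a sliding window), and the value of K are assumptions for illustration, not a prescribed implementation.

  import random

  class AdaptiveThrottler:
      def __init__(self, k=2.0):
          self.k = k
          self.requests = 0   # all attempts seen by this client
          self.accepts = 0    # attempts the backend actually accepted

      def reject_probability(self):
          # max(0, (requests - K * accepts) / (requests + 1))
          return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

      def should_reject_locally(self):
          # Decide before sending: reject locally with the computed probability.
          return random.random() < self.reject_probability()

      def record(self, accepted):
          self.requests += 1
          if accepted:
              self.accepts += 1

With K = 2 the client only starts rejecting locally once the backend refuses more than half of its requests; lowering K makes the client back off earlier at the cost of some false rejections. A production version would also age out old counts (e.g. keep only the last couple of minutes).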
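
A similar sketch for retries with a per-request budget, exponential backoff and full jitter; send_request, the attempt limit and the delay values are illustrative assumptions.

  import random
  import time

  def call_with_retries(send_request, max_attempts=3, base_delay=0.1, max_delay=2.0):
      # Per-request retry budget: give up after max_attempts attempts.
      for attempt in range(max_attempts):
          try:
              return send_request()
          except Exception:
              if attempt == max_attempts - 1:
                  raise  # budget exhausted, surface the error to the caller
              # Exponential backoff with full jitter: sleep a random amount
              # of time between 0 and the capped exponential delay.
              delay = min(max_delay, base_delay * (2 ** attempt))
              time.sleep(random.uniform(0, delay))

A per-client budget (e.g. retries may not exceed some fraction of all outgoing requests) and a check that the error is actually retryable would be layered on top of this.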

Load Balancers

  1. Control the size of the queue
  2. Per-client RPS limits
    • e.g. a limit per source IP can help deal with DoS attacks (see the token-bucket sketch after this list)
  3. Take backend utilization (e.g. load average) into account before sending requests
  4. Circuit breaker
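
A minimal token-bucket sketch for per-client RPS limits keyed by source IP; the rate and burst numbers and all names are illustrative assumptions.

  import time

  class TokenBucket:
      def __init__(self, rate, burst):
          self.rate = rate      # tokens added per second (the RPS limit)
          self.burst = burst    # maximum bucket size (allowed burst)
          self.tokens = burst
          self.last = time.monotonic()

      def allow(self):
          now = time.monotonic()
          # Refill proportionally to the elapsed time, capped at the burst size.
          self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
          self.last = now
          if self.tokens >= 1:
              self.tokens -= 1
              return True
          return False

  # One bucket per client key, e.g. per source IP.
  buckets = {}

  def allow_request(client_ip, rate=100.0, burst=200):
      bucket = buckets.setdefault(client_ip, TokenBucket(rate, burst))
      return bucket.allow()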

Server

  1. Rate Limits
  2. Control the size of the queue
  3. Serve requests with degraded quality in case of overload
  4. Drop requests in case of overload
    • it’s better to serve 2k rps and drop the 500 extra than to drop all 2.5k requests
    • the best approach is to prioritize requests and, in case of overload, serve only the most important ones
  5. Use request termination on the server
    • to prevent cases where the client is no longer waiting for the response (it has timed out) but the server still performs the work
  6. Propagate timeouts (deadlines) from high-level services to low-level ones (see the deadline-propagation sketch after this list)
    • 10s deadline set on the load balancer
    • (10 - 1)s on the first backend, after 1s has already been spent
    • (10 - 1 - 5)s on the second backend, after the first backend spends another 5s, and so on
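
A minimal sketch of the deadline propagation described in item 6, assuming the load balancer attaches an absolute deadline to each request and every hop forwards only what is left of it; the request format and all names are hypothetical.

  import time

  def handle_request(request, do_local_work, call_backend):
      # The incoming request carries an absolute deadline, e.g. set by the
      # load balancer as time.monotonic() + 10.0 for a 10s end-to-end budget.
      deadline = request["deadline"]

      do_local_work(request)                # e.g. ~1s of work on this hop

      budget = deadline - time.monotonic()  # time left for downstream calls
      if budget <= 0:
          # The client has already given up; don't waste downstream work.
          raise TimeoutError("deadline exceeded before the downstream call")

      # The next hop only sees the remaining budget (10s -> 9s -> 4s -> ...).
      return call_backend(request, timeout=budget)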

Human Actions

  1. Load test the server’s capacity limits
  2. Test the failure mode under overload
  3. Perform capacity planning

Immediate Steps to Address Cascading Failures

  1. Increase Resources
  2. Stop Health Check Failures/Deaths
  3. Restart Servers
    • be careful not to trigger the issues of slow startup and cold caching
  4. Enter Degraded Modes
  5. Eliminate Batch Load
  6. Eliminate Bad Traffic
  7. Drop Traffic

Misc

Slow Startup and Cold Caching

Causes

  1. Maintenance
  2. Rolling Update
  3. Service Restart

Prevention

  1. Overprovision
  2. Slowly increase the load on the new cluster
  3. Prevent recursive (cyclic) dependencies between services