
Troubleshooting

Definition

  1. Troubleshooting is the process of identifying, analyzing and solving problems, mostly in a running application.
  2. Debugging is the process of identifying, analyzing and removing bugs in a system, mostly in the application code.

Incident Response

A structured incident response includes:

  1. Monitoring
  2. Alerting
  3. Incident Response Policy
    • on-call
    • playbooks

Steps

  1. Define the issue and gather information
    • understand the current status of the system using monitoring
    • escalate if necessary
      • is the incident user-facing?
      • how fast is the error budget burning?
    • if escalated:
      • document the incident
      • assign an Incident Commander, an Operations lead and a Communications lead
      • create a communication channel
  2. Mitigate - make the system work under the current circumstances (stop the bleeding)
  3. Find the root cause.
  4. Implement / schedule a long-term fix
  5. Write a postmortem
  6. Mitigate the consequences (e.g. data loss).

Responding to minor problems

Steps

  1. Clarify the problem
  2. Find the reproduction case
  3. Mitigate - create a short-term solution
  4. Find the root cause
  5. Implement / schedule a long-term fix

Clarify the problem

  1. What are you trying to do?
  2. What steps did you follow?
  3. What was the expected result?
  4. What was the actual result?

Find the reproduction case

  1. Isolate
  2. Segment and reduce (see the sketch below)
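
A minimal sketch of reducing a failure to a single reproducible command, assuming a hypothetical HTTP endpoint and payload:

    # reduce the failing user flow to the smallest request that still fails
    curl -v -X POST https://api.example.com/v1/orders \
         -H 'Content-Type: application/json' \
         -d '{"item_id": 42}'

Then keep removing headers and payload fields until the error disappears, to pin down which part of the input triggers it.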

Dealing with intermittent issues

  1. If you can modify the running code:
    • Add more logs to understand the conditions under which the issue happens
  2. If code modification is not an option:
    • Turn on the software’s debug mode
  3. If the above two do not work:
    • Monitor the environment (see the sketch after this list)
  4. If a reset helps, the issue is most likely a software bug in resource management, because when we restart the machine:
    • we clean up memory
    • we clean up network connections
    • we clean up open file descriptors
    • we clean up caches
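
A rough sketch of the “monitor the environment” option for a suspected resource leak: sample the process’s memory and open file descriptors over time, so a leak shows up as a steady climb before each “fixed by restart” (the pid and output file are placeholders):

    PID=12345   # placeholder: pid of the suspect process
    while true; do
        rss=$(awk '/VmRSS/ {print $2}' /proc/$PID/status)   # resident memory in kB
        fds=$(ls /proc/$PID/fd | wc -l)                      # open file descriptors
        echo "$(date -Is) rss_kb=$rss open_fds=$fds"
        sleep 60
    done >> resource-usage.log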

Mitigation

Techniques

  1. Rolling back a bad software push (see the sketch after this list)
  2. “Draining” traffic away from an affected cluster/datacenter
    • Remove the broken machines from the serving pool
  3. Bringing up additional serving capacity
  4. Feature isolation
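
Hedged sketches of the first two techniques, assuming the service runs on Kubernetes (deployment and node names are placeholders):

    # roll back a bad software push to the previous revision
    kubectl rollout undo deployment/my-app

    # drain traffic away from an affected node: mark it unschedulable and
    # evict its pods so they are recreated on healthy nodes
    kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data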

Immediate Steps to Address Cascading Failures

  1. Bringing up additional serving capacity
  2. Eliminate Bad Traffic (see the sketch after this list)
  3. Eliminate Batch Load
  4. Enter Degraded Modes
  5. Restart Servers
    • be careful not to trigger issues caused by slow startup and cold caching
  6. Stop Health Check Failures/Deaths
  7. Drop Traffic
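
A minimal sketch of “Eliminate Bad Traffic” at the host level, assuming a single abusive client IP has been identified (203.0.113.50 is a placeholder address):

    # drop all packets from the misbehaving client
    iptables -A INPUT -s 203.0.113.50 -j DROP

    # remove the rule once the incident is over
    iptables -D INPUT -s 203.0.113.50 -j DROP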

Root Cause

Finding the root cause

  1. Gather Information
    • Are all users affected or only a subset of them? What do the affected users have in common?
    • What changed?
      • source code
      • configs
      • libraries the service depends on
      • external services the application depends on
    • Does it depend on the server where the app is running?
    • logs (add logs to the relevant parts if necessary)
    • monitoring
    • tracing
    • send custom requests
  2. Form a hypothesis
    • Start with the hypotheses that are simplest to check
    • Segment the problem space
      • if the number of steps is low: just go through them one by one
      • if the number of steps is high: use binary search to narrow down the problem (git bisect; see the sketch after this list)
    • simplify and reduce
    • binary-search or walk through the broken response and identify which components work and which do not
  3. Test the hypothesis
    • if possible, it’s better to test the hypothesis in a stage/dev environment instead of production
      • we won’t break something important
      • we won’t interfere with other users
    • find evidence
    • change the system and observe the expected result
  4. Fix the issue
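
A short sketch of segmenting the problem space over source-code changes with git bisect, assuming a test command exists that fails on the broken behaviour (./run-test.sh and v1.4.0 are placeholders):

    # mark the current (broken) commit and the last known good one
    git bisect start
    git bisect bad HEAD
    git bisect good v1.4.0

    # let git check out midpoint commits and run the test on each;
    # exit code 0 marks a commit good, non-zero marks it bad
    git bisect run ./run-test.sh

    # clean up once the first bad commit has been identified
    git bisect reset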

Tools

  1. Which processes consume CPU
    • top
      • load average
    • atop
      • can group by process name
  2. What a process is doing (see examples after this list):
    • strace
    • ltrace
  3. Disk load:
    • iotop
      • iowait - time spent waiting on IO events
    • iostat
    • vmstat - virtual memory stats
  4. Inspect current traffic on network interfaces
    • iftop
  5. Inspect network packets (see the example after this list)
    • tcpdump
    • wireshark
  6. Measure how long a program takes to complete
    • time
  7. Debug programs
    • gdb - for C/C++ programs
        # enable core file generation
        ulimit -c unlimited

        # run the program; a core file is written if it crashes
        ./my-program
        Segmentation fault (core dumped)

        # load the executable together with the core file to get symbol information
        gdb ./my-program -c core

        (gdb) backtrace # view the call stack
        (gdb) up        # move up the call stack by one frame
        (gdb) list      # show lines around the current one
        (gdb) print var # print variable value
      
    • pdb3 - for Python programs
        pdb3 python-script.py args
        (Pdb) next              # go to next line
        (Pdb) continue          # continue until finish or crash
        (Pdb) print(var_name)   # print variable value
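
Hedged usage examples for strace and tcpdump from the list above (the pid, interface and port are placeholders):

    # attach to a running process (and its threads/children) and print every
    # system call with a timestamp
    strace -f -tt -p 12345

    # summarise syscall counts and time spent instead of printing each call
    strace -c -p 12345

    # capture traffic on eth0 to/from port 443 into a file that can later be
    # opened in wireshark
    tcpdump -i eth0 -w capture.pcap port 443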
      

Dealing with slowness

  1. Determine and measure what “slow” means (see the measurement sketch after this list).
  2. Find the bottleneck
  3. Possible suspects:
    • overloaded CPU
    • overloaded memory
    • memory leaks
    • slow software
    • data growth (e.g. parsing files that have grown too big)
    • hardware failure - when many sectors of an HDD are corrupted it starts to perform slowly, and after that it’s only a matter of time before we start losing data
    • malicious software
  4. Profile the source code
    • to profile a Python script, use pprofile3 and kcachegrind
     pprofile3 -f callgrind -o profile.out ./my-script.py
     kcachegrind profile.out
    
  5. Fix the issue and prove the fix with the same measurement that was done at step 1.
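
A rough way to put a number on “slow” for an HTTP service, assuming a hypothetical endpoint; rerunning the same command after the fix (step 5) proves the improvement:

    # measure connection time, time to first byte and total time of one request
    curl -o /dev/null -s -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
         https://api.example.com/v1/orders

    # repeat a number of times to get a feel for the latency distribution
    for i in $(seq 1 20); do
        curl -o /dev/null -s -w '%{time_total}\n' https://api.example.com/v1/orders
    done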

Dealing with crashes

Improving troubleshooting processes

Making troubleshooting easier

  1. Building observability
    • logs
    • black box and white box monitoring
    • correlation id
  2. Clear and simple architecture and interaction between components.
  3. Track what changed (preferably available in one place; see the sketch after this list)
    • which apps were released
    • which configs were updated
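
A hedged sketch of checking “what changed” from the command line, assuming the application and its configs live in git repositories and deployments run on Kubernetes (paths and names are placeholders):

    # recent commits in the application and config repositories
    git -C ~/src/my-app log --oneline --since="6 hours ago"
    git -C ~/src/my-app-config log --oneline --since="6 hours ago"

    # recent rollouts of the deployment
    kubectl rollout history deployment/my-app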

Prepare

  1. Create an incident response policy
    • including escalation policy
    • communication channel
    • contact list
  2. Have mitigation steps prepared for as many outage types as possible.
  3. It’s good to have playbooks for every alert
  4. Train the team and review how past incidents were handled.
    • role-playing games
    • controlled emergency
    • hands-on exercises / labs

Troubleshooting Pitfalls

  1. Looking at symptoms that aren’t relevant or misunderstanding the meaning of system metrics.
  2. Misunderstanding how to change the system to test a hypothesis.
  3. Coming up with wildly improbable theories about what’s wrong, or latching onto causes of past problems.
  4. Hunting down spurious correlations that are actually coincidences or are correlated with shared causes.
  5. Correlation is not causation.