
Troubleshooting

Definition

  1. Troubleshooting is the process of identifying, analyzing and solving problems, mostly in a running application.
  2. Debugging is the process of identifying, analyzing and removing bugs in a system, mostly in the application code.

Incident Response

A structured incident response includes:

  1. Monitoring
  2. Alerting
  3. Incident Response Policy
    • on-call
    • playbooks

Steps

  1. Define the issue and gather information
    • understand the current status of the system using monitoring
    • escalate if necessary
      • is the incident user-facing?
      • how fast is the error budget burning?
    • if escalated:
      • document the incident
      • assign an Incident Commander, an Operations lead and a Communications lead
      • create a communication channel
  2. Mitigate - make the system work under the current circumstances (stop the bleeding)
  3. Find the root cause.
  4. Implement / schedule a long-term fix
  5. Write a postmortem
  6. Mitigate the consequences (e.g. data loss).

Responding to minor problems

Steps

  1. Clarify the problem
  2. Find the reproduction case
  3. Mitigate - create a short-term solution
  4. Find the root cause
  5. Implement / schedule a long-term fix

Clarify the problem

  1. What are you trying to do?
  2. What steps did you follow?
  3. What was the expected result?
  4. What was the actual result?

Find the reproduction case

  1. Isolate
  2. Segment and reduce (see the sketch below)
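
A minimal sketch of reducing a failure to a single reproducible command, assuming a hypothetical HTTP endpoint and payload:

    # reduce the failing user flow to the smallest request that still fails
    curl -v -X POST https://api.example.com/v1/orders \
         -H 'Content-Type: application/json' \
         -d '{"item_id": 42}'

Then keep removing headers and payload fields until the error disappears, to pin down which part of the input triggers it.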

Dealing with intermittent issues

  1. If you can modify the running code:
    • Add more logs to understand the conditions under which the issue happens
  2. If code modification is not an option:
    • Turn on the software’s debug mode
  3. If the above two do not work:
    • Monitor the environment (see the sketch after this list)
  4. If a reset helps, the issue is most likely a software bug in resource management, because when we restart the machine:
    • we clean up memory
    • we clean up network connections
    • we clean up open file descriptors
    • we clean up caches
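
A rough sketch of the “monitor the environment” option for a suspected resource leak: sample the process’s memory and open file descriptors over time, so a leak shows up as a steady climb before each “fixed by restart” (the pid and output file are placeholders):

    PID=12345   # placeholder: pid of the suspect process
    while true; do
        rss=$(awk '/VmRSS/ {print $2}' /proc/$PID/status)   # resident memory in kB
        fds=$(ls /proc/$PID/fd | wc -l)                      # open file descriptors
        echo "$(date -Is) rss_kb=$rss open_fds=$fds"
        sleep 60
    done >> resource-usage.log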

Mitigation

Techniques

  1. Rolling back a bad software push (see the sketch after this list)
  2. “Draining” traffic away from an affected cluster/datacenter
    • Remove the broken machines from the serving pool
  3. Bringing up additional serving capacity
  4. Feature isolation
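
Hedged sketches of the first two techniques, assuming the service runs on Kubernetes (deployment and node names are placeholders):

    # roll back a bad software push to the previous revision
    kubectl rollout undo deployment/my-app

    # drain traffic away from an affected node: mark it unschedulable and
    # evict its pods so they are recreated on healthy nodes
    kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data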

Immediate Steps to Address Cascading Failures

  1. Bringing up additional serving capacity
  2. Eliminate Bad Traffic (see the sketch after this list)
  3. Eliminate Batch Load
  4. Enter Degraded Modes
  5. Restart Servers
    • be careful not to trigger issues caused by slow startup and cold caching
  6. Stop Health Check Failures/Deaths
  7. Drop Traffic
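
A minimal sketch of “Eliminate Bad Traffic” at the host level, assuming a single abusive client IP has been identified (203.0.113.50 is a placeholder address):

    # drop all packets from the misbehaving client
    iptables -A INPUT -s 203.0.113.50 -j DROP

    # remove the rule once the incident is over
    iptables -D INPUT -s 203.0.113.50 -j DROP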

Root Cause

Finding the root cause

  1. Gather Information
    • Are all users affected or only a subset of them? What do the affected users have in common?
    • What changed?
      • source code
      • configs
      • libraries the service depends on
      • external services the application depends on
    • Does it depend on the server where the app is running?
    • logs (add logs to the relevant parts if necessary)
    • monitoring
    • tracing
    • send custom requests
  2. Form a hypothesis
    • Start with the hypotheses that are simplest to check
    • Segment the problem space
      • if the number of steps is low: just go through them one by one
      • if the number of steps is high: use binary search to narrow down the problem (git bisect; see the sketch after this list)
    • simplify and reduce
    • binary-search or walk through the broken response and identify which components work and which do not
  3. Test the hypothesis
    • if possible, it’s better to test the hypothesis in a stage/dev environment instead of production
      • we won’t break something important
      • we won’t interfere with other users
    • find evidence
    • change the system and observe the expected result
  4. Fix the issue
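
A short sketch of segmenting the problem space over source-code changes with git bisect, assuming a test command exists that fails on the broken behaviour (./run-test.sh and v1.4.0 are placeholders):

    # mark the current (broken) commit and the last known good one
    git bisect start
    git bisect bad HEAD
    git bisect good v1.4.0

    # let git check out midpoint commits and run the test on each;
    # exit code 0 marks a commit good, non-zero marks it bad
    git bisect run ./run-test.sh

    # clean up once the first bad commit has been identified
    git bisect reset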

Tools

  1. Which processes consume CPU
    • top
      • load average
    • atop
      • can group by process name
  2. What a process is doing (see examples after this list):
    • strace
    • ltrace
  3. Disk load:
    • iotop
      • iowait - time spent waiting on IO events
    • iostat
    • vmstat - virtual memory stats
  4. Inspect current traffic on network interfaces
    • iftop
  5. Inspect network packets (see the example after this list)
    • tcpdump
    • wireshark
  6. Measure how long a program takes to complete
    • time
  7. Debug programs
    • gdb - for C/C++ programs
        # enable core file generation
        ulimit -c unlimited

        # run the program; a core file is written if it crashes
        ./my-program
        Segmentation fault (core dumped)

        # load the executable together with the core file to get symbol information
        gdb ./my-program -c core

        (gdb) backtrace # view the call stack
        (gdb) up        # move up the call stack by one frame
        (gdb) list      # show lines around the current one
        (gdb) print var # print variable value
      
    • pdb3 - for Python programs
        pdb3 python-script.py args
        (Pdb) next              # go to next line
        (Pdb) continue          # continue until finish or crash
        (Pdb) print(var_name)   # print variable value
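
Hedged usage examples for strace and tcpdump from the list above (the pid, interface and port are placeholders):

    # attach to a running process (and its threads/children) and print every
    # system call with a timestamp
    strace -f -tt -p 12345

    # summarise syscall counts and time spent instead of printing each call
    strace -c -p 12345

    # capture traffic on eth0 to/from port 443 into a file that can later be
    # opened in wireshark
    tcpdump -i eth0 -w capture.pcap port 443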
      

Dealing with slowness

  1. Determine and measure what “slow” means (see the measurement sketch after this list).
  2. Find the bottleneck
  3. Possible suspects:
    • overloaded CPU
    • overloaded memory
    • memory leaks
    • slow software
    • data growth (e.g. parsing files that have grown too big)
    • hardware failure - when many sectors of an HDD are corrupted it starts to perform slowly, and after that it’s only a matter of time before we start losing data
    • malicious software
  4. Profile the source code
    • to profile a Python script, use pprofile3 and kcachegrind
     pprofile3 -f callgrind -o profile.out ./my-script.py
     kcachegrind profile.out
    
  5. Fix the issue and prove the fix with the same measurement that was done at step 1.
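
A rough way to put a number on “slow” for an HTTP service, assuming a hypothetical endpoint; rerunning the same command after the fix (step 5) proves the improvement:

    # measure connection time, time to first byte and total time of one request
    curl -o /dev/null -s -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
         https://api.example.com/v1/orders

    # repeat a number of times to get a feel for the latency distribution
    for i in $(seq 1 20); do
        curl -o /dev/null -s -w '%{time_total}\n' https://api.example.com/v1/orders
    done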

Dealing with crashes

Improving troubleshooting processes

Making troubleshooting easier

  1. Building observability
    • logs
    • black box and white box monitoring
    • correlation id
  2. Clear and simple architecture and interaction between components.
  3. Track what changed (preferably available in one place; see the sketch after this list)
    • which apps were released
    • which configs were updated
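
A hedged sketch of checking “what changed” from the command line, assuming the application and its configs live in git repositories and deployments run on Kubernetes (paths and names are placeholders):

    # recent commits in the application and config repositories
    git -C ~/src/my-app log --oneline --since="6 hours ago"
    git -C ~/src/my-app-config log --oneline --since="6 hours ago"

    # recent rollouts of the deployment
    kubectl rollout history deployment/my-app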

Prepare

  1. Create an incident response policy
    • including escalation policy
    • communication channel
    • contact list
  2. Have mitigation steps prepared for as many outage types as possible.
  3. It’s good to have playbooks for every alert
  4. Train the team and review how past incidents were handled.
    • role-playing games
    • controlled emergency
    • hands-on exercises / labs

Troubleshooting Pitfalls

  1. Looking at symptoms that aren’t relevant or misunderstanding the meaning of system metrics.
  2. Misunderstanding how to change the system to test a hypothesis.
  3. Coming up with wildly improbable theories about what’s wrong, or latching onto causes of past problems.
  4. Hunting down spurious correlations that are actually coincidences or are correlated with shared causes.
  5. Correlation is not causation.