Thursday, October 4, 2018

Troubleshooting is a lot like storytelling

Define the Problem

A problem definition begins with a comparative analysis: what, when, where, and to what extent.

What is not working

  • What is the observed behavior?
  • How is this different from the expected behavior?

When does it happen

  • When was the first occurrence? Are there multiple occurrences?
  • Any pattern? Any clear trigger event?
  • Skim log files for the same time period. 
  • Does it happen continuously or intermittently?
How is the problem cleared:
  • Is some action taken by a user?
  • Does it clear after some time?
  • Does some impacting event reset the environment (crash or reboot event)?
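Skimming logs around the first occurrence can be scripted. The following is a minimal sketch, assuming each log line starts with a fixed-width timestamp in the format shown; the function name `lines_near`, the sample log lines, and the interface name `swp1` are made up for illustration.

```python
from datetime import datetime, timedelta

def lines_near(log_lines, first_seen, window_minutes=5, fmt="%Y-%m-%d %H:%M:%S"):
    """Return log lines whose leading timestamp is within +/- window of first_seen."""
    window = timedelta(minutes=window_minutes)
    # Assumption: every timestamp has the same width as one rendered with fmt.
    width = len(first_seen.strftime(fmt))
    matches = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:width], fmt)
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if abs(stamp - first_seen) <= window:
            matches.append(line)
    return matches

# Hypothetical sample data, not taken from a real device.
sample_log = [
    "2018-10-04 09:58:12 bgp: keepalive sent to peer-1",
    "2018-10-04 10:01:47 link: swp1 carrier down",
    "2018-10-04 10:01:49 link: swp1 carrier up",
    "2018-10-04 11:30:00 ntp: clock synchronized",
]
first_seen = datetime(2018, 10, 4, 10, 2, 0)
nearby = lines_near(sample_log, first_seen)
for line in nearby:
    print(line)
```

Widening or narrowing `window_minutes` is a cheap way to check whether a suspected trigger event really lines up with each occurrence.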

Where in the life-cycle

  • Where in the object life-cycle does the problem-state show itself? 
  • Day-1 issues are usually mis-configurations, design-related, or bugs (hardware or software related)
  • Day-2 issues are usually seen after some “change” has taken place. This might be a link flap, or a memory leak over a long time. 
The system might have been deployed without ever measuring the performance (especially in failure scenarios).
  • Should it work as described as-is?
  • Is there evidence to support the advertised performance numbers?
Marketing material often omits subtle but important details. You might find that a hardware or software limitation exists, which may or may not have a workaround.

Extent

How many objects show the problem-state, out of how many total objects?
  • Are there some objects that _could_ show the problem but do not show the problem right now? Compare objects that are working versus the problem-state.
  • Certain features will influence the forwarding pipeline that a packet would follow through a network device.
How many occurrences of the problem were seen on each object?
  • A link that is flapping would usually show a similar number of up/down transitions on each side.
  • An interface configured with sub-optimal MTU might cause fragmentation in a single direction, especially if there are two exit nodes on the network (traffic could follow an asymmetric forwarding path in/out of the network).
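Comparing working objects against objects in the problem-state can be reduced to a set operation. This is a rough sketch, assuming each object is described by a flat dictionary of attributes; the attribute names and values below are invented for illustration.

```python
def distinguishing_attributes(failing, working):
    """Return (attribute, value) pairs found on every failing object and on no working object."""
    if not failing:
        return []
    common_to_failing = set.intersection(*(set(obj.items()) for obj in failing))
    seen_on_working = set().union(*(set(obj.items()) for obj in working))
    return sorted(common_to_failing - seen_on_working)

# Hypothetical inventory of interfaces; values are illustrative only.
failing = [
    {"mtu": 1500, "speed": "10G", "vlan_aware": True, "linecard": 3},
    {"mtu": 1500, "speed": "40G", "vlan_aware": True, "linecard": 3},
]
working = [
    {"mtu": 9216, "speed": "10G", "vlan_aware": True, "linecard": 1},
    {"mtu": 9216, "speed": "40G", "vlan_aware": True, "linecard": 2},
]
suspects = distinguishing_attributes(failing, working)
print(suspects)
```

Here the result points at the MTU and the linecard as the only attributes shared by all failing objects and absent from the working ones. These are candidates to investigate, not proof of cause.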

Narrow the Scope

Determine the appropriate method to isolate the problem to a direction, then to a single object, and then to a single component (hardware or software related).

Split the Difference

Imagine you are troubleshooting a connectivity problem with VM hosts that reside in a VXLAN segment.
  • VXLAN is a network virtualization overlay technology composed of the underlay network (between ingress VTEP and egress VTEP devices) and the overlay network (VM hosts at the outer edge of the network).
  • In this type of situation, I start by verifying the underlay reachability. If this fails, then it would be a waste of time to investigate the overlay network. 
  • If the underlay network is working, then move on to the overlay network. 
  • Verify connectivity between each VTEP to the locally connected host. 
  • In which direction is packet loss happening: the forward path or the return path? Look at interface counters, ACL counters, and aggregate traffic statistics. tcpdump and ERSPAN can help isolate the direction of packet loss.
Is the problem specific to data plane traffic? Or control plane, or management plane traffic?
  • To help determine whether hardware is mis-programmed, you can set an IP header option (such as Record Route) on an ICMP packet so that the router punts the packet to the CPU at each hop. Not all network vendors punt such packets to the CPU, though.
  • Is the problem specific to a type of traffic? IPv4 or IPv6? Unicast or BUM traffic? TCP, UDP or ICMP traffic?
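The direction of packet loss can often be read from counter deltas taken at both ends over the same interval. Below is a minimal sketch, assuming you have already collected transmit and receive deltas for one test flow on devices A and B; the function name and the numbers are hypothetical.

```python
def directional_loss(a_tx, b_rx, b_tx, a_rx):
    """Estimate packets lost per direction between device A and device B.

    Each argument is a counter delta measured over the same interval.
    """
    return {
        "forward (A to B)": max(a_tx - b_rx, 0),
        "return (B to A)": max(b_tx - a_rx, 0),
    }

# Hypothetical deltas: A sent 10,000 and B received all of them,
# but only 9,400 of B's 10,000 replies made it back to A.
loss = directional_loss(a_tx=10_000, b_rx=10_000, b_tx=10_000, a_rx=9_400)
for direction, dropped in loss.items():
    print(f"{direction}: {dropped} packets lost")
```

In this example the forward path is clean and the return path is dropping traffic, so the next step would be to look at the devices along the return path only.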

Bottom Up

Physical layer issues fall into this category. Imagine a cable is fully inserted, but the link fails to pass traffic.
  • Check the port is not administratively disabled (yes, we overlook the easy answers).
  • What is the hardware state? The switch ASIC must first recognize and program the link.
  • If the ASIC has correctly programmed the type of the link (and recognized the transceiver), then what is the software state?
  • Assuming optical fiber is in play, what are the light levels from each termination point? The transmit signal on one side corresponds to the receive signal at the remote end. If the signal is too weak, is there a patch panel, or any intermediate transponder equipment? Check the signal at each point where the cable is terminated.
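Checking the signal at each termination point amounts to computing the loss per segment. The sketch below assumes you have taken power readings in dBm at each point, ordered from the transmitter to the receiver; the point names and readings are invented for illustration.

```python
def segment_losses(readings_dbm):
    """Given (point_name, power_dBm) readings ordered from transmitter to receiver,
    return the loss in dB for each consecutive segment."""
    losses = []
    for (near, p_near), (far, p_far) in zip(readings_dbm, readings_dbm[1:]):
        losses.append((f"{near} -> {far}", round(p_near - p_far, 2)))
    return losses

# Hypothetical readings along one direction of a fiber path.
readings = [
    ("switch-A tx", -2.0),
    ("patch panel 1", -2.6),
    ("patch panel 2", -9.1),
    ("switch-B rx", -9.5),
]
losses = segment_losses(readings)
worst = max(losses, key=lambda item: item[1])
for name, loss_db in losses:
    print(f"{name}: {loss_db} dB")
print("largest drop:", worst[0])
```

The segment with the largest drop is where to inspect connectors, patch cords, or intermediate equipment first. Acceptable loss depends on the optic and the fiber, so compare against the transceiver's documented receive range rather than a fixed number.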

Top Down

Performance issues fall into this category.
Use packet captures to help tell the story of how the system is actually working.
Often we make assumptions, which are sometimes false:
  • The environment changed since the last release.
  • A new variable was introduced into the environment.
  • We’re operating on information that was not carefully validated.
Use a traffic generator if necessary. iperf, nuttcp, and mz are just some of the open-source tools. Be careful: each tool is better suited to particular traffic characteristics. Get involved in the community and help make the tools better.

Verify the Hypothesis

Do not skip the verification process — you are here because there is a complex problem in front of you. Problems have a tendency to return if you do not reveal the underlying cause.
A hypothesis is similar to storytelling, where you expect to find a headline and supporting evidence.
  • At each step of the process, document the action taken and the outcome.
  • Keep asking yourself, “does this finding / deviation explain the problem-state?” If not, then rule it out as a possible cause. Move to the next item on the list. 
  • Assumptions can be dangerous if not verified. Multicast traffic may be treated differently than unicast traffic at certain points in the forwarding path. 
  • Start by testing the most probable cause. If that would require a considerable amount of resources (time or money), such as sending a field engineer to a remote, unmanned site, then first try to eliminate the low-hanging fruit (something that can be tested quickly to either bolster your hypothesis or rule it out).
  • Keep an open mind when you approach a problem — be willing to broaden your search.
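One way to decide what to test first is to write the candidate causes down with a rough probability and a rough cost. This is only a sketch of that idea: the `Hypothesis` fields, the one-hour threshold, and the example entries are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    probability: float  # rough estimate between 0 and 1
    cost_hours: float   # estimated effort needed to test it

def plan_order(hypotheses, cheap_threshold_hours=1.0):
    """Order cheap checks first (most probable first), then costly ones (most probable first)."""
    cheap = [h for h in hypotheses if h.cost_hours <= cheap_threshold_hours]
    costly = [h for h in hypotheses if h.cost_hours > cheap_threshold_hours]

    def by_probability(h):
        return -h.probability

    return sorted(cheap, key=by_probability) + sorted(costly, key=by_probability)

# Hypothetical candidates; the estimates are guesses, not measurements.
candidates = [
    Hypothesis("failed optic at remote site", 0.6, 8.0),
    Hypothesis("port administratively disabled", 0.1, 0.1),
    Hypothesis("MTU mismatch on uplink", 0.3, 0.5),
]
plan = [h.name for h in plan_order(candidates)]
for step, name in enumerate(plan, start=1):
    print(step, name)
```

The most probable cause here is also the most expensive to test, so the quick checks run first and the site visit happens only if they do not explain the problem-state.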

Environment

It’s best to troubleshoot the actual problem-state in a live environment if at all possible. However:
  • Sometimes it is not possible to leave the system in the problem-state for a long time, and it must be recovered to a normal, working state. 
  • Hopefully you gleaned enough data points from the problem-state to attempt to re-create it in a lab.
  • Often you do not need a scaled setup, and it can be reduced to a small number of devices (physical or virtualized lab environment). 
  • Traffic generators can introduce new problems. If you are testing with a uni-directional traffic flow, the circumstances differ from most production traffic flows, which are bi-directional.

The Power of a Team

Working closely with others to solve a problem benefits everyone. Together you can:
  • spot gaps or flaws in your story;
  • learn a new way to approach a problem (that saves you time);
  • improve the fix to be more efficient.
Successful people are good communicators. Surround yourself with people who exemplify the characteristics you wish to learn.
