People often come to me with a technical escalation that is not actually difficult to resolve, just technically misunderstood.
Troubleshooting books teach us that there are a few approaches we could take:
- Bottom-Up the OSI layers – starts at L1, moving up to L7
- Top-Down the OSI layers – starts at L7, moving down to L1
- Divide and Conquer – the starting point is determined, based on knowledge and experience
In my opinion, though, the first two are mostly theoretical. In the real world, it is the third method that gets us out of the woods. OK, there are indeed many cases where troubleshooting starts at Layer 1, but that decision is still made based on experience and not because we chose the first method; starting at L1 is just a coincidence.
So here is the thing:
Let me give you a few examples:
- Someone comes to you and tells you that a specific user can’t connect to a website. You do a first check and, using telnet, you confirm successful connectivity to the hostname, on port 80. So you then know that L3 (routing), L4 (transport) and DNS resolution are working fine … at least from your machine. Could there be a local firewall? Could there be a blocked user? etc. I’ll let you think of what other questions you could ask.
- You get an alarm which shows loss of an OSPF neighbour. You connect to the router, check the logs and notice that an interface has gone down. So now you know you need to find out what caused that event, and you are likely looking at an L1/L2 issue. But what if you are running OSPF over a GRE tunnel? I’ll let you think!
- A NOC engineer escalates to you a reachability issue – he can’t ping a server over the DC link. You do a traceroute and notice that the packet is going back and forth between two specific L3 hops. So you are now thinking … “L3 loop!”.
- A user tells you she can’t ping between two servers; she is asking you to check the default gateway. You look at the two IP addresses and knowing the network, you figure they are on the same network! But you also know that, since the two servers are on the same network, the default gateway has nothing to do with the issue. So at least you rule that out very, very quickly!
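Some of these first checks are easy to script. The following is a minimal Python sketch, not a definitive tool: the function names `tcp_port_open`, `find_loop` and `same_subnet` are made up for illustration, the hop list is assumed to have already been collected from a traceroute, and the prefix length is assumed to be known from the servers' configuration.

```python
from __future__ import annotations

import ipaddress
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Rough equivalent of `telnet host port`.

    Success implies name resolution, routing and the TCP handshake all
    worked from the machine running the check (hypothetical helper).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, unreachable and timeout
        return False


def find_loop(hops: list[str]) -> tuple[str, str] | None:
    """Return the pair of hops a trace bounces between, if any.

    A loop between routers A and B shows up as A, B, A, B in the hop list.
    """
    for i in range(len(hops) - 3):
        a, b = hops[i], hops[i + 1]
        if a != b and hops[i + 2] == a and hops[i + 3] == b:
            return (a, b)
    return None


def same_subnet(ip_a: str, ip_b: str, prefix_len: int) -> bool:
    """True if both addresses fall in the same network for this prefix length.

    If they do, traffic between them never touches the default gateway.
    """
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b


if __name__ == "__main__":
    # Example addresses only (documentation and private ranges).
    print(same_subnet("192.0.2.10", "192.0.2.200", 24))
    print(find_loop(["10.0.0.1", "10.0.1.1", "10.0.2.1", "10.0.1.1", "10.0.2.1"]))
```

None of this replaces understanding the network; it only speeds up ruling layers in or out, which is what picking the starting point is about.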
Q: So how do you then become a good troubleshooter?
A: You just need to dedicate time towards understanding how things actually work!
Rafael A. Couto Cabral • LinkedIn Profile
Cisco | F5 | VMware Certified • PRINCE2 Practitioner
Originally posted 2017-10-15 03:15:46.