Cisco (Networking)

One of the truly fascinating things about networking is how much of it ‘just works’. There are so many low-level pieces of the network stack that you don’t really have to know (although you should) to be an expert at something like OSPF, BGP, or any other higher-level protocol. One of the pieces that often gets overlooked is MTU (Maximum Transmission Unit), MSS (Maximum Segment Size), and all of the fun stuff that comes along with them. So let’s start with the basics…

image
Here’s your average-looking IP packet encapsulated in an Ethernet header. For the sake of conversation, I’ll assume going forward that we’re referring to TCP only, but I did put the UDP header length in there just for reference. So a standard IP packet is 1500 bytes long. There are 20 bytes for the IP header and 20 bytes for the TCP header, leaving 1460 bytes for the data payload. This does not include the 18 bytes of Ethernet headers and FCS that surround the IP packet.

When we look at this frame layout, we can further categorize components of the frame by MTU and MSS…

image
The MTU is defined as the maximum length of data that can be transmitted by a protocol in one instance. What makes this slightly confusing is that the MTU does NOT include the Ethernet headers required to transmit the packet on Ethernet; it only includes the IP packet itself. That is, with regard to MTU, we always assume the Ethernet interface is already taking into account the 18 bytes of Ethernet headers (sometimes we don’t even see the FCS, so it’s 1514 rather than 1518). So this is pretty straightforward, nothing too exciting here. The majority of networks assume an MTU of 1500 bytes and everything works as expected.
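A quick way to verify the MTU along a path from a Cisco router is an extended ping with a full-size payload and the DF bit set (just a sketch; the address here is a placeholder):

! 1500-byte datagram with the DF bit set – it fails instead of fragmenting if a hop has a smaller MTU
ping 10.1.1.2 size 1500 df-bit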

MSS is slightly different because it is determined by each end of the TCP connection. During session setup a device can specify the MSS it wants to use in its SYN packet. However, the devices technically do not need to agree on an MSS, and each device can use a different MSS value. The only requirement is that the sending device must limit the size of the data it sends to the MSS it receives from its peer. MSS is generally 40 bytes less than MTU to accommodate a 20-byte IP header and a 20-byte TCP header.

Note: If you’re using Cisco routers as your endpoints like I am in my lab, you should be aware that in some code versions TCP sessions generated or terminated on the router use different MSS values. For local connections (same subnet) the default MSS is 1460. For routed connections, the default MSS is 536. To make things more ‘normal’ I set the router MSS to 1500 with ‘ip tcp mss 1500’ in global config mode.

So this all makes good sense, right? The devices should be smart enough to know, based on their MTU, what the largest TCP segment they can send should be. Let’s look at a simple lab to prove this works…

image
So let’s generate a telnet session from Device A to Device B and see what the packets that hit the wire look like…

image
If we look at the TCP header in one of these packets, we should see the MSS is set to 1460…

image
Now if I set the MTU of the interface to 1000, we should see an MSS of 960…
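For reference, that’s just a one-liner under the interface (a sketch; the interface name is assumed):

interface FastEthernet0/0
 mtu 1000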

image
Right, so things are working as we expect. So let’s take a moment to talk about the difference between MTU and ‘IP MTU’ on a Cisco router. You might have noticed that under interface configuration you can set either. The difference is simple but sometimes not easy to grasp. MTU sets the physical interface MTU, that is, the max packet size supported by the interface. IP MTU sets the max size of an IP packet. So there, clear enough for you? The problem here is that we’re all used to working with just IP packets; however, that’s not always the case. So the real difference is between setting the max MTU for any protocol on the interface and setting the max MTU for IP specifically. I like to think of MTU as being the hardware MTU and the IP MTU as being the IP packet MTU. By default these values are the same, so IP MTU never shows up in the config. Also, and this should be obvious, the IP MTU has to be equal to or less than the MTU.

So let’s make sure we’re on the same page.  Let’s lower the MTU on the interface to 1200 and try setting the IP MTU to a higher value…

image
It complains; this is expected and makes total sense. Now let’s set the IP MTU to 1100…

image
We can see that the running config now shows both the MTU and the IP MTU since they are now different values. A ‘show int fa0/0’ will show the interface (hardware) MTU of 1200 and a ‘show ip interface fa0/0’ will show the IP MTU of 1100 in this case.
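In case the screenshots are hard to read, the resulting interface config looks something like this:

interface FastEthernet0/0
 ! hardware (interface) MTU
 mtu 1200
 ! IP MTU – has to be equal to or less than the interface MTU
 ip mtu 1100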

So now let’s talk about why any of this matters. If you consider a network where each link has an IP MTU of 1500, you only ever need to worry about this if you’re doing tunneling (or any other encapsulation).

Note: MPLS is a whole different animal; I’m not covering that in this post, but I’d argue that it’s a tunnel all the same.

The most common type of tunnel we see is a GRE tunnel.  Adding a GRE header on a packet makes the frame format look like this…

image
So now we have an additional 24 bytes of headers: another 20 for the outer IP header (tunnel source and destination) and 4 for the GRE header itself. So what does this mean? Our MTU can’t increase, since we’re already using the max IP MTU of 1500 to match our hardware MTU. What happens when we try to send traffic now? Let’s modify the lab slightly so we can use a GRE tunnel more effectively…
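Before we get to the lab, here’s the quick math on why those extra 24 bytes matter if the hosts keep building 1500-byte packets:

1460 (TCP payload) + 20 (TCP) + 20 (inner IP) = 1500 bytes
1500 + 4 (GRE) + 20 (outer IP) = 1524 bytes, which is bigger than the 1500-byte MTU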

image
So now we’re going to have a user traversing a couple of network segments to download a file via HTTP from my home’s super computer (luckily for me Visio had an exact stencil of what my super computer looks like). The catch here is that we’re going to have a GRE tunnel running from Device A to Device B which the traffic will ride within. Pretty straightforward. However, how will this all work considering what we just talked about above? Adding in a GRE header as well as the outer (GRE) IP header is going to add an additional 24 bytes to the frame that the client and server won’t know about. Should this work? Let’s check it out and see…
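For reference, the tunnel itself is nothing fancy; just a plain GRE tunnel between Device A and Device B, along these lines (a sketch, the addressing here is made up):

! Device A side; Device B mirrors this with the source/destination swapped
interface Tunnel0
 ip address 10.99.99.1 255.255.255.252
 tunnel source FastEthernet0/0
 tunnel destination 192.0.2.2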

image
You might have to blow that up to see it better, but to make a long story short, it does work.  But how or why?  Let’s walk through a couple of the packets to see what’s happening…

image
The above packet shows the initial TCP SYN.  We can see it comes from the client and is destined to the server.  We also see that it has some TCP options set which include the TCP MSS which the client is setting to 1460.  Recall that TCP MSS is generally dictated by taking the MTU and subtracting 20 bytes for the IP header and 20 bytes for the TCP header.  So this looks right so far.  Let’s look at the SYN/ACK from the server…

image
So same deal here – the server thinks that its MSS should be 1460. Once the TCP session is established, the client can issue its HTTP GET, which we see in packet 336. Then we see a bunch of ‘TCP Previous segment not captured’ frames followed by one really interesting one (frame 342)…

image
In packet 342 – the router kicks back an ICMP unreachable message telling the server that ‘Fragmentation needed’. If we look at that packet in more detail, we see some more interesting information…

image

Ah ha! The router is telling the server that the max MTU it can use is 1476 (1500 minus 24 for the GRE/IP headers). The server sees this data and caches it. We can see this on the server (Linux in this case) by issuing the following command…

image
So now the server knows to use an MTU of 1476 when talking to this client. With the MTU now being 1476, the math sorts out like this…

image
So using a new MTU of 1476 forces the TCP payload (MSS) down to a lower number to fit all of the headers in, making the total frame size 1500 (excluding Ethernet framing)…

image
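In case those images are hard to read, the math is simply:

1500 (interface MTU) - 20 (outer IP) - 4 (GRE) = 1476 (tunnel MTU)
1476 - 20 (inner IP) - 20 (TCP) = 1436 (MSS)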

So in our example so far, the end devices have been the ones sorting out the issue caused by the GRE tunnel. However, it’s not always best practice to let the endpoints do this because PMTU discovery relies on ICMP. For instance, I can very easily break this connection by disabling ICMP unreachable messages on the interface facing the server.

Note: If you’re testing this out yourself, make sure you clear out the server’s route cache. As mentioned before, it will cache the lower MTU for a period of time so it doesn’t need to do PMTU discovery constantly. Command – ‘ip route flush cache’.

So let’s disable ICMP unreachables on the interface…

image
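The command in the screenshot is just the interface-level ‘no ip unreachables’ (the interface name below is assumed):

interface FastEthernet0/1
 no ip unreachables
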
And now try to download the file again…

image
It fails. Without the router telling the server there’s an issue, we can’t pass traffic. There are a couple of other ways to take care of this. One would be to let the routers fragment the traffic by clearing the DF bit on it. I think that’s a horrible idea, so I won’t even take the time to discuss it. The second option is to use what’s referred to as ‘TCP clamping’.

TCP clamping involves having the router rewrite the TCP MSS option in the SYN and SYN/ACK to another value. So in our case, we can tell the router to adjust the MSS down to 1436 to accommodate the tunnel. Let’s configure it on the router interface that we also disabled ‘ip unreachables’ on…

image
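The IOS knob for this is the interface-level ‘ip tcp adjust-mss’ command, so the config in the screenshot amounts to something like this (interface name assumed):

interface FastEthernet0/1
 no ip unreachables
 ip tcp adjust-mss 1436
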
Now let’s try the traffic again and see what happens…

image
It’s working! Now let’s look at what hits the wire…

So the initial packet coming off the client has an MSS of 1460 set…

image
However, if we look at the packet again after it’s traversed the router and is inside the GRE tunnel, we can see the MSS is now what we set it to…

image
I flagged the IP ID as well so you can see we’re looking at the same packet, just analyzed at different points in the network.

As you can see, the ‘middle’ of the network has very little to do with MTU and MSS.  Like most network things, you need to ensure that you’ve accounted for any additional headers at the edge so devices in the middle don’t need to respond or do crazy things.

So to solidify this point, let’s look at one last example to make sure we’re all on the same page. Take this lab topology as an example…

image
Here we’re running the traffic over GRE tunnels as it traverses the segment between Cloud-1 and Cloud-2. If we take a look at a packet capture taken in between the two cloud routers, we should see a lot of headers…

image
Yep – So we’re doing GRE in GRE there.  So now let’s try our HTTP download again and see what happens…

image
Yeah – so we’re fragmenting. The interesting thing here is that most hosts set the DF (don’t fragment) bit in their IP packets. However, this does NOT carry over into the additional IP headers generated by the router; we can see that here…

image
So here’s the breakdown of what’s happening now…

1 – Client initiates a TCP session with the Super Computer, setting the MSS to 1460 by default. Device A changes the MSS to 1436 before the traffic enters the tunnel.
2 – Server replies in its SYN/ACK with an MSS of 1460. Device B changes the MSS to 1436 before the traffic enters the tunnel.
3 – The TCP session is set up; each side thinks the other side has an MSS of 1436.
4 – When data starts flowing, the packet size is exactly 1500 bytes by the time it reaches the Cloud-2 router. Cloud-2 knows that it has to put an additional 24 bytes of headers on the packet, which puts it over the MTU of its interface facing Cloud-1. Its only option at this point is to fragment the traffic.
5 – Cloud-1 receives two packets which happen to be fragments, but it doesn’t much care (or know) and happily forwards them along.
6 – Device A receives the fragmented packets and reassembles them into a single packet.

The math all works out to support this. We can see a 1514-byte packet and an 82-byte packet being sent. The 1514-byte packet is the max that Cloud-2 can send, so it sends the rest of the data in a second packet. The second packet consists of an IP header (20 bytes), a GRE header (4 bytes), an inner IP header (20 bytes), and 24 bytes of payload (our overflow from packet 1). Added together that gives you 68 bytes, and adding the 14 bytes of Ethernet header gives you 82 bytes total.

So how do we fix this? Same thing we did before, just lower the edge MSS by another 24 bytes to accommodate the additional headers that Cloud-1 and Cloud-2 are adding for their GRE tunnel.
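In other words, the clamp on the edge routers comes down by another 24 bytes (a sketch; 1436 minus 24):

interface FastEthernet0/1
 ip tcp adjust-mss 1412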

So there you have it. That was a nice refresher for me. Hope you enjoyed as well!

CCIE on hold

As many of you know, I’ve been studying for my CCIE lab for the last year. I passed the written exam last March and eagerly started studying for the lab. I was WAY more excited to study for the lab than I was for the written. Studying went well and I felt like I was doing a good job of dedicating a set amount of time towards studying despite having our first child and moving during that time period.

I took my first lab attempt on June 11th and I felt like a million bucks when I left. I felt great about the whole thing. I wrapped up troubleshooting with 30 minutes to spare and started in on config early. I read the entire lab and felt comfortable with what needed to be configured, so I jumped in right away. I ran into a couple of snags that slowed me down a little bit in the L2 section, but by the time we got to lunch I was well into L3.

I remember getting back from lunch, looking at the clock, and thinking that I had so much time left that I wasn’t worried. Then I made a mistake which forced me to go back and redo some of the config. This was a HUGE time waster. I probably lost an hour total. Despite this, I managed to finish the config in the allotted time but ran out of time while doing my double check. Regardless, I thought for sure I had passed. Since I took it on a Friday, I had to wait until Sunday for my results. Turned out I had passed TS and failed config. I was shocked. I literally didn’t know why or what went wrong. I spent the next week trying to figure out what might have caused me to fail.

I then joined the ranks of people checking the CCIE portal page 4 times a day for another open lab seat. While on a work trip to CA I happened to spot an open seat and quickly worked to change my schedule to accommodate it. I spent the next 3 weeks deep in study. I had sort of convinced myself that I had made an IP mistake in the first attempt – used the wrong IPs for a section or something along those lines.

Flew back to San Jose on May 13th and took my second attempt on May 14th. The TS did not go well. I’m not sure how to explain it, but the lab itself just seemed different. The general feel of the lab was different and the tickets seemed more vague to me. I almost felt like I was taking a totally different kind of test. On top of that, the proctor thought it would be a good idea to spend the first hour loudly making travel plans on her cell phone. Yes, I realize I could have brought ear plugs, but I certainly didn’t think I’d need them because of the proctor. I ran out of time in TS knowing that I had failed it. I went into config and this time I finished it with 2 hours left. I verified the crap out of everything I could and went back through the lab 2 or 3 times. I left confident that I had failed overall but passed config this time. I got my results a couple of hours later and I had failed both sections.

I’m not sure how to summarize my feelings, but discouraged was certainly at the top of the list at that point. It would be different if I knew what I was doing wrong. The thing is, they don’t really tell you. I’m obviously making a mistake in config that’s causing me to fail, but I can’t for the life of me figure out what it is. Since I’m past the v4 window, my only option is to study for v5.

So that being said, I’m putting the CCIE on hold at least until winter. We had our first child back in September and my study schedule didn’t leave me much time for her and my wife. So I’ll be taking at least the summer and fall off, and at that point I’ll decide if I want to go after the v5 exam.

In the meantime, I’m looking forward to getting back into blogging and looking at some of the stuff that’s been on my to-do list since I started lab study.

So I’ve been doing a lot of CCIE lab study lately and it’s become rather apparent that the easiest way to verify full IPv4 unicast connectivity is to use TCL scripts on the routers. It’s really pretty easy. I generally crank the scripts out in Notepad and then paste them into the router. One might look like this…

foreach address {
172.90.134.1
172.90.134.3
172.90.107.1 } {ping $address}

Then all you need to do is enter the TCL shell by typing ‘tclsh’ at the enable prompt.  Then just paste your script in…

image

Typing ‘exit’ drops you back out of the TCL shell and back to the normal enable prompt. I initially found that I was having issues with the scripts because of a lack of spaces. You NEED a space after the variable on the first line (address) and one in between the closing and opening brackets separating the last IP and the ping command (…107.1 } {ping…).

While this appears to work well on the routers, the switches (at least the ones for the v4 lab) don’t appear to have the TCL shell. Must find another solution for those… Edit – Apparently you can use macros on the switch; see the example below…

image
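I haven’t labbed this up extensively, but the macro approach looks roughly like this (a sketch; the macro name is made up and the addresses are the same ones from the TCL example):

! define the macro in global config mode, ending the definition with @
macro name PING_TEST
do ping 172.90.134.1
do ping 172.90.134.3
do ping 172.90.107.1
@
! then run it, still in global config mode
macro global apply PING_TEST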
