The basics – MTU, MSS, GRE, and PMTU

One of the truly fascinating things about networking is how much of it ‘just works’.  There are so many low-level pieces of a network stack that you don’t really have to know (although you should) to be an expert at something like OSPF, BGP, or any other higher-level networking protocol.  One of the pieces that often gets overlooked is MTU (Maximum Transmission Unit), MSS (Maximum Segment Size), and all of the fun stuff that comes along with them.  So let’s start with the basics…

image
Here’s your average looking IP packet encapsulated in an Ethernet Header.  For the sake of conversation, I’ll assume going forward that we are referring to TCP only but I did put the UDP header length in there just for reference.  So a standard IP packet is 1500 bytes long.  There’s 20 bytes for the IP header, 20 bytes for the TCP header, leaving 1460 bytes for the data payload.  This does not include the 18 bytes of Ethernet headers and FCS that surround the IP packet.
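The byte accounting above is easy to sanity-check with a few lines of Python (the constant names here are my own, purely for illustration):

```python
# Byte accounting for a standard full-size TCP/IP packet on Ethernet.
IP_HDR = 20        # IPv4 header, no options
TCP_HDR = 20       # TCP header, no options
ETH_OVERHEAD = 18  # 14-byte Ethernet header + 4-byte FCS

mtu = 1500
payload = mtu - IP_HDR - TCP_HDR
print(payload)             # 1460 bytes left for TCP data
print(mtu + ETH_OVERHEAD)  # 1518 bytes total on the wire (1514 if the FCS isn't captured)
```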

When we look at this frame layout, we can further categorize components of the frame by MTU and MSS…

image
The MTU is defined as the maximum length of data that can be transmitted by a protocol in one instance.  What makes this slightly confusing is that the MTU does NOT include the Ethernet headers required to transmit the packet on Ethernet, it only includes the IP packet information.  That is, with regards to MTU, we always assume the Ethernet interface is already taking into account the 18 bytes of Ethernet headers (sometimes we don’t even see the FCS so it’s 1514, not 1518).  So this is pretty straightforward, nothing too exciting here.  The majority of networks assume an MTU of 1500 bytes and everything works as expected.

MSS is slightly different because it is determined by each end of the TCP connection.  During session setup a device can specify the MSS it wants to use in its SYN packet.  However, the devices technically do not need to agree on an MSS and each device can use a different MSS value.  The only requirement is that the sending device must limit the size of the data it sends to the MSS it receives from its peer.  MSS is generally 40 bytes less than MTU to accommodate a 20 byte IP header and a 20 byte TCP header.
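To make that asymmetry concrete, here’s a small Python sketch (my own toy model, not a real TCP stack) of how each side derives its advertised MSS and honors its peer’s:

```python
def advertised_mss(mtu, ip_hdr=20, tcp_hdr=20):
    """MSS a host typically advertises in its SYN: MTU minus the IP and TCP headers."""
    return mtu - ip_hdr - tcp_hdr

def segment_sizes(data_len, peer_mss):
    """A sender caps each segment at the MSS it *received* from its peer."""
    sizes = []
    while data_len > 0:
        sizes.append(min(data_len, peer_mss))
        data_len -= sizes[-1]
    return sizes

client_mss = advertised_mss(1500)  # 1460
server_mss = advertised_mss(1000)  # 960 -- e.g. a host behind a smaller MTU
# The two sides never have to agree; the client just honors the server's value:
print(segment_sizes(2000, server_mss))  # [960, 960, 80]
```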

Note: If you’re using Cisco routers as your end points like I am in my lab, you should be aware that in some code versions TCP sessions generated or terminated on a router use different MSS values.  For local connections (same subnet) the default MSS is 1460.  For routed connections, the default MSS is 536.  To make things more ‘normal’ I set the router MSS to 1500 with ‘ip tcp mss 1500’ in global config mode.

So this all makes good sense right?  The devices should be smart enough to know based on their MTU what the largest TCP segment they can send should be.  Let’s look at a simple lab to prove this works…

image
So let’s generate a telnet session from Device A to Device B and see what the packets that hit the wire look like…

image
If we look at the TCP header in one of these packets, we should see the MSS is set to 1460…

image
Now if I set the MTU of the interface to 1000, we should see a MSS of 960…

image
Right, so things are working as we expect.  So let’s take a moment to talk about the difference between MTU and ‘IP MTU’ on a Cisco router.  You might have noticed that under interface configuration you can set either.  The difference is simple but sometimes not as easy to understand.  MTU sets the physical interface MTU, that is, the max packet size supported by the interface.  IP MTU sets the max size of an IP packet.  So there, clear enough for you?  The problem here is that we’re all used to working with just IP packets; however, that’s not always the case.  So the real difference is between setting the max MTU for any protocol on the interface and setting the max MTU for IP specifically.  I like to think of MTU as being the hardware MTU and the IP MTU as being the IP packet MTU.  By default these values are the same, so IP MTU never shows up in the config.  Also – this should be obvious, but the IP MTU has to be equal to or less than the MTU.

So let’s make sure we’re on the same page.  Let’s lower the MTU on the interface to 1200 and try setting the IP MTU to a higher value…

image
It complains, this is expected and makes total sense.  Now let’s set the IP MTU to 1100..

image
We can see that the running config now shows the MTU and IP MTU since they are different values.  A ‘show int fa0/0’ will show the interface (hardware) MTU of 1200 and a ‘show ip interface fa0/0’ will show the IP MTU of 1100 in this case.

So now let’s talk about why any of this matters.  If you consider a network where each link has an IP MTU of 1500, you only ever need to worry about this if you’re doing tunneling (or any other encap).

Note: MPLS is a whole different animal, I’m not covering that in this post but I’d argue that it’s a tunnel all the same. 

The most common type of tunnel we see is a GRE tunnel.  Adding a GRE header on a packet makes the frame format look like this…

image
So now we have an additional 24 bytes of headers.  Another 20 for the outer IP packet (tunnel source and destination) and 4 for the GRE header itself.  So what does this mean?  Our MTU can’t increase since we’re using the max IP MTU of 1500 to match our hardware MTU.  What happens when we try and send traffic now?  Let’s modify the lab slightly so we can use a GRE tunnel more effectively…
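The tunnel math works out like this (a quick sketch of my own; the 24-byte figure assumes a basic GRE header with no optional fields):

```python
GRE_OVERHEAD = 20 + 4   # 20-byte outer IP header + 4-byte basic GRE header

link_mtu = 1500
tunnel_mtu = link_mtu - GRE_OVERHEAD
print(tunnel_mtu)        # 1476 -- the largest inner IP packet that fits
print(tunnel_mtu - 40)   # 1436 -- the TCP MSS that actually fits through the tunnel
```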

image
So now we’re going to have a user traversing a couple of network segments to download a file via HTTP on my home’s super computer (luckily for me Visio had an exact stencil of what my super computer looks like).  The catch here is that we’re going to have a GRE tunnel that goes from device A to device B which the traffic will ride within.  Pretty straightforward, right?  However, how will this all work considering what we just talked about above?  Adding in a GRE header as well as the outer (GRE) IP header is going to add an additional 24 bytes to the frame that the client and server won’t know about.  Should this work?  Let’s check it out and see…

image
You might have to blow that up to see it better, but to make a long story short, it does work.  But how or why?  Let’s walk through a couple of the packets to see what’s happening…

image
The above packet shows the initial TCP SYN.  We can see it comes from the client and is destined to the server.  We also see that it has some TCP options set which include the TCP MSS which the client is setting to 1460.  Recall that TCP MSS is generally dictated by taking the MTU and subtracting 20 bytes for the IP header and 20 bytes for the TCP header.  So this looks right so far.  Let’s look at the SYN/ACK from the server…

image
So same deal here – the server thinks that its MSS should be 1460.  Once the TCP session is established, the client can issue its HTTP GET which we see in packet 336.  Then we see a bunch of ‘TCP Previous segment not captured’ frames followed by one really interesting one (frame 342)…

image
In packet 342 – the router kicks back an ICMP unreachable message telling the sender that ‘Fragmentation needed’.  If we look at that packet in more detail, we see some more interesting information…

image

Ah ha!  The router is telling the server that the max MTU it can use is 1476 (1500 minus 24 for GRE/IP).  The server sees this data and caches it.  We can see this on the server (Linux in this case) by issuing the following command…

image
So now the server knows to use an MTU of 1476 when talking to this client.  With the MTU now being 1476 our math now sorts out like this…

image
So using a new MTU of 1476 forces the TCP payload (MSS) down to a lower number to fit all of the headers in, making the total frame size 1500 (excluding Ethernet framing)…

image


So in our example so far, the end devices have been the ones sorting out the issue caused by the GRE tunnel.  However, it’s not always best practice to let the endpoints do this because PMTU discovery relies on ICMP.  For instance, I can very easily break this connection by disabling ICMP unreachable messages on the interface facing the server.

Note: If you’re testing this out yourself make sure you clear out the server’s route cache.  As mentioned before, it will cache the lower MTU for a period of time so it doesn’t need to do PMTU discovery constantly.  Command – ‘ip route flush cache’.

So let’s disable ICMP unreachables on the interface…

image
And now try to download the file again…

image
It fails.  Without the router telling the server there’s an issue, we can’t pass traffic.  There are a couple other ways to take care of this.  One would be to let the routers fragment the traffic by clearing the DF bit on the traffic.  I think that’s a horrible idea so I won’t even take the time to discuss it.  The second option is to use what’s referred to as ‘TCP clamping’.

TCP clamping involves having the router rewrite the TCP MSS option in the SYN and SYN/ACK to another value.  So in our case, we can tell the router to adjust the MSS down to 1436 to accommodate the tunnel.  Let’s configure it on our router interface that we also disabled ‘ip unreachables’ on…
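Conceptually, the router is just rewriting two bytes inside the TCP options of the SYN and SYN/ACK. Here’s a toy Python sketch of that rewrite (my own illustration of the idea – a real device also fixes up the TCP checksum after the change):

```python
import struct

def clamp_mss(tcp_options: bytes, max_mss: int) -> bytes:
    """Walk the TCP options in a SYN and rewrite any MSS option (kind 2)
    that exceeds max_mss -- the same idea as MSS clamping on a router."""
    out = bytearray(tcp_options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == 0:            # End of option list
            break
        if kind == 1:            # NOP: single byte, no length field
            i += 1
            continue
        length = out[i + 1]
        if kind == 2:            # MSS option: kind=2, len=4, 16-bit value
            (mss,) = struct.unpack_from("!H", out, i + 2)
            if mss > max_mss:
                struct.pack_into("!H", out, i + 2, max_mss)
        i += length
    return bytes(out)

syn_opts = struct.pack("!BBH", 2, 4, 1460)  # client advertises 1460
print(clamp_mss(syn_opts, 1436).hex())      # 0204059c -- value rewritten to 1436 (0x059c)
```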

image
Now let’s try the traffic again and see what happens…

image
It’s working!  Now let’s look at what hits the wire…

So the initial packet coming off the client has an MSS of 1460 set…

image
However, if we look at the packet again after it’s traversed the router and is in the GRE tunnel, we can see the MSS is now what we set it to…

image
I flagged the IP ID as well so you can see we’re looking at the same packet.  We’re just analyzing the same packet at different points in the network.

As you can see, the ‘middle’ of the network has very little to do with MTU and MSS.  Like most network things, you need to ensure that you’ve accounted for any additional headers at the edge so devices in the middle don’t need to respond or do crazy things.

So to solidify this point, let’s look at one last example to make sure we’re all on the same page.  Take this lab topology as an example…

image
Here we’re running the traffic over GRE tunnels as the traffic traverses the segment between cloud-1 and cloud-2.  If we take a look at a packet capture taken in between the two cloud routers we should see a lot of headers…

image
Yep – So we’re doing GRE in GRE there.  So now let’s try our HTTP download again and see what happens…

image
Yeah – So we’re fragmenting.  The interesting thing here is that most hosts set the DF (don’t fragment) bit in their IP packet.  However, this does NOT carry over into the additional IP headers generated by the router, we can see that here…

image
So here’s the breakdown of what’s happening now…

1 – Client initiates TCP session with Super Computer, sets MSS to 1460 by default.  Device A changes MSS to 1436 before the traffic enters the tunnel.
2 – Server replies in his SYN/ACK with an MSS of 1460.  Device B changes the MSS to 1436 before the traffic enters the tunnel.
3 – The TCP session is setup, each side thinks the other side has an MSS of 1436.
4 – When data starts flowing the packet size is exactly 1500 bytes when it reaches the Cloud-2 router.  Cloud-2 knows that it has to put an additional 24 bytes of headers on the packet which puts it over the MTU for its interface facing Cloud-1.  Its only option at this point is to fragment the traffic.
5 – Cloud-1 receives two packets which happen to be fragmented but it doesn’t much care (or know) and happily forwards them along.
6 – Device A receives the fragmented packets and reassembles them into a single packet.

The math all works out to support this.  We can see a 1514 byte packet and an 82 byte packet being sent.  The 1514 byte packet is the max that Cloud-2 can send so it sends the rest of the data in a second packet.  The second packet consists of an IP header (20 bytes), a GRE header (4 bytes), an inner IP header (20 bytes), and 24 bytes of payload (our overflow from packet 1).  Added all together that gives you 68, and you can add in 14 bytes for the Ethernet header giving you 82 bytes total.
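That accounting can be double-checked in a few lines of Python (sizes in bytes; the variable names are mine):

```python
ETH, IP, GRE = 14, 20, 4
link_mtu = 1500
overhead = IP + GRE          # 24 bytes Cloud-2 must add for its own GRE tunnel

# Cloud-2 splits the 1500-byte inner packet so each piece fits after encapsulation:
frag1 = link_mtu - overhead  # 1476: 20B IP header + 1456B of data
frag2 = IP + (1500 - IP) - (frag1 - IP)  # 20B IP header + the 24B overflow = 44

print(frag1 + overhead + ETH)  # 1514 -- first frame on the wire
print(frag2 + overhead + ETH)  # 82   -- second frame on the wire
```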

So how do we fix this?  Same thing we did before, just lower the edge MSS by another 24 bytes to accommodate the additional headers that cloud-1 and cloud-2 are using for their GRE tunnel.
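In other words, the edge MSS just has to budget for every layer of encapsulation along the path; a quick sketch:

```python
base_mss = 1500 - 40         # 1460: plain TCP over a 1500-byte MTU
one_tunnel = base_mss - 24   # 1436: allows for Device A/B's GRE tunnel
two_tunnels = one_tunnel - 24
print(two_tunnels)           # 1412: also allows for the cloud routers' GRE tunnel
```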

So there you have it. That was a nice refresher for me. Hope you enjoyed as well!

27 thoughts on “The basics – MTU, MSS, GRE, and PMTU”

  1. David

    Thank you for the detailed explanation – I look forward to many more of the same!

    A tiny point – your router diagram uses /32 subnets – I think you meant to use /30.

  2. Jai

    One of the most detailed articles on this important but easily overlooked topic. THANKS A TON!!!

    Spotted one error though in here – “To make things more ‘normal’ I set the router MSS to 1500 with ‘ip tcp mss 1500’ in global config mode. “, I think you meant ip tcp mtu 1500?

    Thanks!!

    1. Jon Langemak Post author

      Thanks for reading! In this case, I was talking about specifically the MSS command since I was talking about traffic generated from the router. Does that make sense?

  3. Al

    Great Explanation with Superb diagrams!!!

    A question regarding the following: “6 – Device A receives the fragmented packets and reassembles them into a single packet”

    Shouldn’t Device A receive a single packet since the Cloud GRE headers are stripped off once the pkt exits the Cloud environment?

    Also, you mean Cloud-1 in “point 5” ?!?

  4. G das

You really explain things very practically.  Please share your knowledge globally – it helps a lot of people.

  5. vishal Mandaliya

    Thank you very much for such easy and perfect explanation of MTU and MSS.

I have one doubt: what if the server is on the WAN and we try to access it from the LAN and it is not accessible, but if we change the MSS/MTU value it becomes accessible – what could be the possible reason for that?

  6. Nikhil

    A great article indeed.

    “Instead of adjusting the MTU and rely on ICMP error messages, we can simply adjust the MSS value”- Out of the box solution 🙂

  7. axelerator

    Can you expand on why you would NEED to set an ip mtu statement on an interface?

    Wouldn’t the ip tcp adjust mss value be enough to handle all tcp flows by specifying the max payload, thus coming to the specified IP MTU automatically?

I’ve seen articles detailing typical configurations where the ip mtu is filled in with an extra 40 bytes for the headers (along with tcp adjust), but it seems kinda “redundant” to me?

  8. Eugene

    Great and helpful article!!

Quick question about MSS and MTU.
When I hit a fragmentation issue, I normally changed the MTU instead of the MSS.
But when I hit a fragmentation issue (with PMTU not allowed) on an IPsec tunnel (site to site), I needed to change the MSS value using the following command to fix the issue.

    == command for iptable ==
iptables -t filter -I FORWARD 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1440

I thought that if I changed the MTU from 1500 (default) to 1400, the MSS would also be reduced.
But changing the MTU did not fix this issue; it was only fixed after changing the MSS value.

    What is different changing the MSS instead of the MTU?

    1. Jon Langemak Post author

So you’re saying that modifying the MTU of the interface didn’t help? But lowering the MSS with netfilter rules helped? That’s strange. MTU is sort of like the global setting. Many times if we don’t have the expertise or the ability to change the MSS we end up doing a quick fix by lowering the MTU.

How were you changing the MTU? Was it on the server itself?

Also – as a quick FYI, it looks like you’re changing the MSS in netfilter. We often find it’s easier to change it as part of the route configuration. That saves quite a bit of time and complication.

  9. clininja

Hi, great article!

    But Can you expand on why you would NEED to set an ip mtu statement on an interface?

    Wouldn’t the ip tcp adjust mss value be enough to handle all tcp flows by specifying the max payload, thus coming to the specified IP MTU automatically?

    Don’t get why the ip mtu statement needed on tunnel interface
    Thanks

  10. satya

    A question regarding the following: “6 – Device A receives the fragmented packets and reassembles them into a single packet”

    Shouldn’t Device A receive a single packet since the Cloud GRE headers are stripped off once the pkt exits the Cloud environment(Cloud-1 exist interface towards Device-A)?

    Also, you mean Cloud-1 in “point 5” ?!?

  11. metriXc

    Thank you for your exhaustive explanation. Really informative and interesting.
    I have one question:

    In the breakdown section you mentioned in point 1 that Device A changes the MSS to 1436. Does that mean you also used the command that was described for Router B on Router A (MSS to 1436)?
    In the breakdown step 2 you said the device replies with MSS1460. I thought the Webserver replies with a value equal or less than the advertised MSS from point 1 (1436).

    It would be great if you could clarify the issue.
    Thank you!

  12. metriXc

    Great article with lots of interesting information. I really like the depth of this article.
I stumbled over a couple of points.
    Here are the questions:

    In the breakdown section you mention in Point 1 that Device A changes the MSS to 1436 before the traffic enters the tunnel. In the description above you say: “Let’s configure it on our router interface that we also disabled ‘ip unreachables’ on…” Above that was a screenshot of Router B. If I am not mistaken, the MSS adjust 1436 should be configured on both Router A and B, right?

In the breakdown section at Point 2 you say that the Server replies in his SYN/ACK with an MSS of 1460. I read that if the router has the MSS 1436 configured on its interface it looks for SYN packets which are either ingress or egress on that interface and automatically changes the MSS to its configured value. Therefore from my understanding, the router (in this case Router B) changes the MSS of the incoming SYN packet to an MSS of 1436 or lower, so the SYN/ACK packet should be sent out with an MSS of <=1436 and not 1460.

    I am looking forward to your reply!
    Thank you.

  13. Raj

Really nice article and well laid out, I went through it a few times…

I have just a generic question, if you could drop me a line. I have a small lab with two servers running VMs, both connected to a standard Netgear router (UI access only).

I really wanted to try some of these things myself. Do you mind suggesting any good Cisco switches/routers so I can work on the consoles more than on a UI? I kept searching on Google and, not having your experience, am not sure what would be cost-effective yet good to buy for a home lab setup.

    Thanks…

