Hi folks! Long time no talk : ) Life has been incredibly busy for me over the last few months so I’ll apologize in advance for the lack of posts. However – I’m aiming to get back on the horse so please stay tuned!
With that out of the way – I wanted to spend some time in this post talking about the command line tool found on Linux systems called tc. We’ve talked about tc before when we discussed creating some network/traffic simulated topologies, and it worked awesome for that use case. If you recall from that earlier post, tc is short for Traffic Control and allows users to configure qdiscs. A qdisc is short for Queuing Discipline. I like to think of it as manipulating the Linux kernel’s packet scheduler.
Note: tc is traditionally part of the iproute2 toolset, which I’m pretty sure (but not positive) is included in most base Linux distros these days.
When tc comes up – it’s easy to immediately start thinking about QoS, queuing, and packet (traffic) control. And while some of the actions available to you when using tc seem obvious, or at least fit within the mindset of queue disciplines (the drop action comes to mind here), you might be surprised to learn that tc can actually do much more. For instance, it can perform encapsulation and modify packet/frame headers. But – rather than keep rambling, let’s jump into a basic example. Per usual – let’s start with what our lab will look like…
To start with, we’ll begin with a single host (it’s really just a VM), test_host_1. The host is supporting two different network namespaces called NS1 and NS2. They are connected back to the main or default network namespace through the use of VETH pairs. One end of each VETH pair resides in the network namespace while the other end resides in the default or main namespace of the host. In addition, the host has a single NIC, ens6, which we won’t be using much in this post at all.
Note: I’m assuming any and all test hosts used in this lab are newly built hosts (I’m using Ubuntu 18) with no underlying network configuration besides that of ens6.
So let’s go ahead and do the base namespace configuration…
ip netns add ns1
ip link add ns1_veth_ns type veth peer name ns1_veth_float
ip link set ns1_veth_ns netns ns1
ip link set dev ns1_veth_float up
ip netns exec ns1 ip addr add 192.168.10.1/24 dev ns1_veth_ns
ip netns exec ns1 ip link set ns1_veth_ns address 14:ec:d4:01:f1:2b
ip netns exec ns1 ip link set ns1_veth_ns up
ip netns exec ns1 ip link set lo up
ip netns add ns2
ip link add ns2_veth_ns type veth peer name ns2_veth_float
ip link set ns2_veth_ns netns ns2
ip link set dev ns2_veth_float up
ip netns exec ns2 ip addr add 192.168.10.2/24 dev ns2_veth_ns
ip netns exec ns2 ip link set ns2_veth_ns address 84:12:5d:2f:d2:4c
ip netns exec ns2 ip link set ns2_veth_ns up
ip netns exec ns2 ip link set lo up
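The two blocks above differ only in the namespace name, the address, and the MAC, so they can be generated from one helper. Here is a minimal sketch of that idea. The function name ns_setup_cmds is my own invention (it is not part of iproute2), and it only prints the commands rather than running them, so you can review the output first.

```shell
# Hypothetical helper (not part of iproute2): print the commands that build
# one namespace and its VETH pair. Nothing is executed here.
# Arguments: <namespace> <address/prefix> <MAC for the namespace side>
ns_setup_cmds() {
    ns=$1
    addr=$2
    mac=$3
    cat <<EOF
ip netns add ${ns}
ip link add ${ns}_veth_ns type veth peer name ${ns}_veth_float
ip link set ${ns}_veth_ns netns ${ns}
ip link set dev ${ns}_veth_float up
ip netns exec ${ns} ip addr add ${addr} dev ${ns}_veth_ns
ip netns exec ${ns} ip link set ${ns}_veth_ns address ${mac}
ip netns exec ${ns} ip link set ${ns}_veth_ns up
ip netns exec ${ns} ip link set lo up
EOF
}

# Print the same commands used above for both namespaces.
ns_setup_cmds ns1 192.168.10.1/24 14:ec:d4:01:f1:2b
ns_setup_cmds ns2 192.168.10.2/24 84:12:5d:2f:d2:4c
```

If the printed commands look right, piping the output to sh as root should apply them. Printing first is just a cautious design choice, since these commands change the host's network configuration.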
So at this point, each NS is up and alive and should have the correct VETH pair configuration. But as you may have noticed, they aren’t able to talk to each other. Why is this?
root@test_host_1:~# ip netns exec ns1 ping 192.168.10.2 -c 3
PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data.
From 192.168.10.1 icmp_seq=1 Destination Host Unreachable
From 192.168.10.1 icmp_seq=2 Destination Host Unreachable
From 192.168.10.1 icmp_seq=3 Destination Host Unreachable
--- 192.168.10.2 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2038ms
pipe 3
root@test_host_1:~#
The answer is pretty simple – we haven’t done anything with the other ends of the VETH pairs that live in the default network namespace. They are, as I like to say, “floating” at this point. Typically, with Docker or any other containerization technology, these floating ends would plug into a bridge. The bridge would allow all the things we need (MAC learning, ARP, etc.) to happen so that the two NS could talk. The problem right now is that the ARP request packets show up on the floating side of the VETH pair and just get ditched because there’s nowhere else for them to go. L2 broadcast frames like this would typically be flooded within a given broadcast domain, but here the broadcast domain begins and ends on each side of the VETH pair. If the floating side were plugged into a bridge, that bridge would propagate the ARP broadcasts to the other VETH pairs and everything would work out just fine.
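For comparison only, here is roughly what the bridge approach described above would look like. This is a sketch that we will not apply in this lab. The bridge name br0 and the function name bridge_cmds are assumptions of mine, and the function just prints the commands.

```shell
# Sketch only: print the commands a bridge-based setup would use instead of tc.
# "br0" is a hypothetical bridge name; these commands are NOT applied in this lab.
# Arguments: <bridge name> <interface> [<interface> ...]
bridge_cmds() {
    br=$1
    shift
    echo "ip link add name ${br} type bridge"
    echo "ip link set dev ${br} up"
    for dev in "$@"; do
        # Attaching an interface to the bridge puts it in the bridge's broadcast domain.
        echo "ip link set dev ${dev} master ${br}"
    done
}

bridge_cmds br0 ns1_veth_float ns2_veth_float
```

With both floating ends attached to one bridge, the bridge handles flooding and MAC learning on its own, which is exactly the behavior the rest of this post rebuilds by hand.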
So how do we fix this? Well – we can actually write some tc rules that make things appear to be in the same broadcast domain. To do that, we first need to enable a qdisc or queuing discipline on the interfaces in question. In our case – that would be the floating sides of the VETH pairs. Let’s do that now on both of the interfaces just to get it out of the way…
tc qdisc add dev ns1_veth_float ingress
tc qdisc add dev ns2_veth_float ingress
This syntax is pretty straightforward – we’re adding an ingress queuing discipline to the floating sides of both VETH pairs, which are still in the default namespace. Now comes the interesting part – tc has a classifier referred to as Flower. From my understanding there was already something similar named flow, so they just called this one flower. It allows you to select traffic quite easily based on well-known header fields such as IP address, MAC address, VLAN, etc. So let’s take a look at writing some rules that do this. What’s the first thing that our two namespaces need to do in order to talk? ARP! So let’s write some rules that let that happen…
tc filter add dev ns1_veth_float protocol arp parent ffff: prio 1 \
flower \
dst_mac ff:ff:ff:ff:ff:ff \
src_mac 14:ec:d4:01:f1:2b \
action mirred egress redirect dev ns2_veth_float
Once again, I think this is fairly easy to read but let’s talk through it for the sake of making sure we’re on the same page. Here’s how the above breaks down….
Line 1 – Add a filter to the floating side of the NS1 VETH pair that matches protocol ARP. Attach it to the ingress qdisc of the interface (ffff: is the static reference to this) and make this filter rule the 1st priority.
Line 2 – For filter matching rules we’ll be using the flower classifier.
Line 3 – Match on a broadcast destination MAC address.
Line 4 – Match on the source MAC address of the NS side of the NS1 VETH pair (now you know why we set static MAC addresses when we built those).
Line 5 – If all of the above matches, redirect the traffic out of (egress) the floating side of the NS2 VETH pair.
So let’s put this rule in and see what happens!
root@test_host_1:~# tc filter add dev ns1_veth_float protocol arp parent ffff: prio 1 \
> flower \
> dst_mac ff:ff:ff:ff:ff:ff \
> src_mac 14:ec:d4:01:f1:2b \
> action mirred egress redirect dev ns2_veth_float
root@test_host_1:~#
If the rule goes in with no feedback – you’re off to the races. If you get the dreaded…
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
It means something went wrong. Make sure you added the qdisc to the interface you’re working with as that’s a requirement before writing filter rules. Then double check your syntax and try whittling the rule down until you find the problem statement. In our case though it worked! We can see the rule is applied by using this command…
root@test_host_1:~# tc -s filter show dev ns1_veth_float parent ffff:
filter protocol arp pref 1 flower chain 0
filter protocol arp pref 1 flower chain 0 handle 0x1
dst_mac ff:ff:ff:ff:ff:ff
src_mac 14:ec:d4:01:f1:2b
eth_type arp
not_in_hw
action order 1: mirred (Egress Redirect to device ns2_veth_float) stolen
index 1 ref 1 bind 1 installed 6 sec used 6 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
root@test_host_1:~#
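Since a missing ingress qdisc is a common cause of the RTNETLINK error mentioned earlier, it can be handy to check for one directly. A minimal sketch follows. The function name has_ingress_qdisc is hypothetical, and it assumes the output of tc qdisc show lists the ingress qdisc on a line beginning with "qdisc ingress ffff:", which is what I see on my host.

```shell
# Hypothetical helper: succeed if the `tc qdisc show` output on stdin
# lists an ingress qdisc (handle ffff:).
has_ingress_qdisc() {
    grep -q '^qdisc ingress ffff:'
}

# Intended use on the lab host (the interface must exist):
#   tc qdisc show dev ns1_veth_float | has_ingress_qdisc && echo "ingress qdisc present"
```

The helper reads from stdin rather than calling tc itself, so it can be tried against saved output as well as a live interface.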
Awesome – looks good. So now try a ping from NS1 to NS2 and see what we get…
root@test_host_1:~# ip netns exec ns1 ping 192.168.10.2
PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data.
No response, but if we look at the rule we should see hits…
root@test_host_1:~# tc -s filter show dev ns1_veth_float parent ffff:
filter protocol arp pref 1 flower chain 0
filter protocol arp pref 1 flower chain 0 handle 0x1
dst_mac ff:ff:ff:ff:ff:ff
src_mac 14:ec:d4:01:f1:2b
eth_type arp
not_in_hw
action order 1: mirred (Egress Redirect to device ns2_veth_float) stolen
index 1 ref 1 bind 1 installed 205 sec used 191 sec
Action statistics:
Sent 84 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
root@test_host_1:~#
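If you only care about the hit counter, it can be pulled out of that output with a little awk. This is a sketch under the assumption that the statistics line has the "Sent N bytes M pkt" shape shown above. The function name filter_hits is mine, not something tc provides.

```shell
# Hypothetical helper: read `tc -s filter show` output on stdin and print the
# packet count (the field just before "pkt") from each "Sent ..." line.
filter_hits() {
    awk '$1 == "Sent" { for (i = 2; i <= NF; i++) if ($i == "pkt") print $(i - 1) }'
}

# Sample line copied from the output above; this prints 3.
echo "Sent 84 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)" | filter_hits

# Intended use on the lab host:
#   tc -s filter show dev ns1_veth_float parent ffff: | filter_hits
```

If your version of tc prints more than one "Sent" line per action, the helper prints one count per line, so treat the result accordingly.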
And if you were doing a tcpdump on ns2_veth_float you would have seen not only the traffic arrive, but also a reply from NS2. NS2 doesn’t need any rules to generate a reply back over its VETH pair…
root@test_host_1:~# tcpdump -nne -i ns2_veth_float
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ns2_veth_float, link-type EN10MB (Ethernet), capture size 262144 bytes
01:45:58.442778 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
01:45:58.442798 84:12:5d:2f:d2:4c > 14:ec:d4:01:f1:2b, ethertype ARP (0x0806), length 42: Reply 192.168.10.2 is-at 84:12:5d:2f:d2:4c, length 28
01:45:59.473754 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
01:45:59.473763 84:12:5d:2f:d2:4c > 14:ec:d4:01:f1:2b, ethertype ARP (0x0806), length 42: Reply 192.168.10.2 is-at 84:12:5d:2f:d2:4c, length 28
01:46:00.497742 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
01:46:00.497750 84:12:5d:2f:d2:4c > 14:ec:d4:01:f1:2b, ethertype ARP (0x0806), length 42: Reply 192.168.10.2 is-at 84:12:5d:2f:d2:4c, length 28
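Alright – we’re doing great! But we need some more rules for this to work end to end at layer 3. Let’s add the same rule for the ARP broadcast on the other interface just in case the initial request comes from NS2. Remember that every interface needs its qdisc defined before you can add filters to it. We already added the ingress qdisc to ns2_veth_float earlier, so skip the first line below if you did too, since adding it a second time returns an error…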
tc qdisc add dev ns2_veth_float ingress
tc filter add dev ns2_veth_float protocol arp parent ffff: prio 1 \
flower \
dst_mac ff:ff:ff:ff:ff:ff \
src_mac 84:12:5d:2f:d2:4c \
action mirred egress redirect dev ns1_veth_float
Now – this hasn’t gotten us any further down the road to L3 connectivity, so what rule do we need to tackle next? Well – when NS1 pings NS2 we see that NS2 is generating the ARP reply, but if we look at the interface ns1_veth_float we see that the reply is never getting there…
root@test_host_1:~# tcpdump -nne -i ns1_veth_float
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ns1_veth_float, link-type EN10MB (Ethernet), capture size 262144 bytes
02:13:49.329808 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
02:13:50.353738 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
02:13:51.377736 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
02:13:52.401797 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
That’s because we have no matching tc rule! And also because we’re doing our own L2 semantics here to make this work. So we need to add rules to make the ARP reply work as well. Those might look like this…
tc filter add dev ns2_veth_float protocol arp parent ffff: prio 2 \
flower \
dst_mac 14:ec:d4:01:f1:2b \
src_mac 84:12:5d:2f:d2:4c \
action mirred egress redirect dev ns1_veth_float
tc filter add dev ns1_veth_float protocol arp parent ffff: prio 2 \
flower \
dst_mac 84:12:5d:2f:d2:4c \
src_mac 14:ec:d4:01:f1:2b \
action mirred egress redirect dev ns2_veth_float
The above rules look a lot like the first two rules we entered in, but they’re not looking for broadcasts. Rather – they’re looking to match an ARP reply which would be direct MAC to MAC communication. If we put those two rules in – we ought to see a full ARP conversation take place in our captures…
root@test_host_1:~# tcpdump -nne -i ns1_veth_float
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ns1_veth_float, link-type EN10MB (Ethernet), capture size 262144 bytes
02:16:56.062897 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
02:16:56.062913 84:12:5d:2f:d2:4c > 14:ec:d4:01:f1:2b, ethertype ARP (0x0806), length 42: Reply 192.168.10.2 is-at 84:12:5d:2f:d2:4c, length 28
02:16:56.062916 14:ec:d4:01:f1:2b > 84:12:5d:2f:d2:4c, ethertype IPv4 (0x0800), length 98: 192.168.10.1 > 192.168.10.2: ICMP echo request, id 2872, seq 1, length 64
Awesome! So on the first line we see the ARP request, followed by the ARP reply, and then the actual ICMP echo request, since NS1 now has a valid ARP entry for NS2. But alas, ping is still not working, because none of our rules match IP traffic yet. So we need to add one more pair of rules to allow that…
tc filter add dev ns2_veth_float protocol ip parent ffff: prio 3 \
flower \
dst_mac 14:ec:d4:01:f1:2b \
src_mac 84:12:5d:2f:d2:4c \
action mirred egress redirect dev ns1_veth_float
tc filter add dev ns1_veth_float protocol ip parent ffff: prio 3 \
flower \
dst_mac 84:12:5d:2f:d2:4c \
src_mac 14:ec:d4:01:f1:2b \
action mirred egress redirect dev ns2_veth_float
The above rules should look awfully familiar as the only thing that’s changed is the protocol type from arp to ip. But that’s all we need for this to work. So let’s see a full capture…
root@test_host_1:~# tcpdump -nne -i ns1_veth_float
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ns1_veth_float, link-type EN10MB (Ethernet), capture size 262144 bytes
02:20:09.759492 14:ec:d4:01:f1:2b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.10.2 tell 192.168.10.1, length 28
02:20:09.759594 84:12:5d:2f:d2:4c > 14:ec:d4:01:f1:2b, ethertype ARP (0x0806), length 42: Reply 192.168.10.2 is-at 84:12:5d:2f:d2:4c, length 28
02:20:09.759599 14:ec:d4:01:f1:2b > 84:12:5d:2f:d2:4c, ethertype IPv4 (0x0800), length 98: 192.168.10.1 > 192.168.10.2: ICMP echo request, id 2894, seq 1, length 64
02:20:09.759631 84:12:5d:2f:d2:4c > 14:ec:d4:01:f1:2b, ethertype IPv4 (0x0800), length 98: 192.168.10.2 > 192.168.10.1: ICMP echo reply, id 2894, seq 1, length 64
02:20:10.769757 14:ec:d4:01:f1:2b > 84:12:5d:2f:d2:4c, ethertype IPv4 (0x0800), length 98: 192.168.10.1 > 192.168.10.2: ICMP echo request, id 2894, seq 2, length 64
02:20:10.769773 84:12:5d:2f:d2:4c > 14:ec:d4:01:f1:2b, ethertype IPv4 (0x0800), length 98: 192.168.10.2 > 192.168.10.1: ICMP echo reply, id 2894, seq 2, length 64
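Since the rules are spread out across this post, here is a recap sketch that prints the whole set in one go. The function name one_way_rules is hypothetical, and it prints each filter on a single line instead of using backslash continuations. It assumes a host where none of the tc commands have been run yet, since the qdisc add line fails if the qdisc already exists.

```shell
# Hypothetical helper: print the ingress qdisc and the three redirect filters
# for one direction. Nothing is executed here.
# Arguments: <ingress dev> <source MAC> <egress dev> <peer MAC>
one_way_rules() {
    in_dev=$1
    src_mac=$2
    out_dev=$3
    peer_mac=$4
    echo "tc qdisc add dev ${in_dev} ingress"
    # One filter per line of the table below: protocol, priority, destination MAC.
    # prio 1 = ARP request (broadcast), prio 2 = ARP reply, prio 3 = IPv4.
    while read -r proto prio dst_mac; do
        echo "tc filter add dev ${in_dev} protocol ${proto} parent ffff: prio ${prio} flower dst_mac ${dst_mac} src_mac ${src_mac} action mirred egress redirect dev ${out_dev}"
    done <<EOF
arp 1 ff:ff:ff:ff:ff:ff
arp 2 ${peer_mac}
ip 3 ${peer_mac}
EOF
}

# Both directions between the two floating VETH ends used in this lab.
one_way_rules ns1_veth_float 14:ec:d4:01:f1:2b ns2_veth_float 84:12:5d:2f:d2:4c
one_way_rules ns2_veth_float 84:12:5d:2f:d2:4c ns1_veth_float 14:ec:d4:01:f1:2b
```

Laid out this way, it is easier to see that each direction needs the same three matches, with only the interfaces and MAC addresses swapped.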
And there you have it! A full L3 conversation done entirely with tc rules! While this isn’t a super useful example, I hope it gets across the point that tc is capable of a lot! In the next post, we’ll look at some more useful tc actions and extend our lab a little bit. Stay tuned!