In our last post we talked about how Kubernetes handles pod networking. Pods are an important networking construct in Kubernetes, but by themselves they have certain limitations. Consider for instance how pods are allocated. The cluster takes care of running the pods on nodes – but how do we know which nodes it chose? Put another way – if I want to consume a service in a pod, how do I know how to get to it? We saw at the very end of the last post that the pods themselves could be reached directly by their allocated pod IP address (an anti-pattern for sure, but it still works), but what happens when you have 3 or 4 replicas? Services aim to solve these problems for us by providing a means to talk to one or more pods grouped by labels. Let’s dive right in…
To start with, let’s look at our lab as we left it at the end of our last post…
If you’ve been following along with me there are some pods currently running. Let’s clear the slate and delete the two existing test deployments we had out there…
user@ubuntu-1:~$ kubectl delete deployment pod-test-1
deployment "pod-test-1" deleted
user@ubuntu-1:~$ kubectl delete deployment pod-test-3
deployment "pod-test-3" deleted
user@ubuntu-1:~$
So now that we’ve cleaned out the existing deployments let’s define a new one…
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: deploy-test-1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: web-front-end
        version: v1
    spec:
      containers:
      - name: tiny-web-server-1
        image: jonlangemak/web1_8080
        ports:
        - containerPort: 8080
          name: web-port
          protocol: TCP
This is pretty straightforward with the exception of two things. First, I’m now using much smaller container images that are based off of an excellent post I read on making a small Go web server using the tiny pause container as the base image. My previous test images were huge so this is the first step I’m taking toward rightsizing them. Second, you’ll notice that in our spec we define two labels: one to define the application (in this case ‘web-front-end’) and another to define the version (in this case ‘v1’). So let’s create a YAML file on our master called ‘deploy-test-1.yaml’ and load this deployment…
user@ubuntu-1:~$ kubectl create -f deploy-test-1.yaml
deployment "deploy-test-1" created
user@ubuntu-1:~$ kubectl get deployments
NAME            DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy-test-1   1         1         1            1           7s
user@ubuntu-1:~$ kubectl get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP           NODE
deploy-test-1-1702481282-8dr4l   1/1       Running   0          12s       10.100.3.7   ubuntu-5
user@ubuntu-1:~$
user@ubuntu-1:~$ curl http://10.100.3.7:8080
This is Web Server 1 running on 8080!user@ubuntu-1:~$
user@ubuntu-1:~$
Above we do a couple of things. We first load the definition with kubectl. We then verify that the deployment is defined and that the pods have loaded. In this case we can see that the pod has been deployed on the host ubuntu-5 and has an IP address of 10.100.3.7. At this point – the only way to access the pod is directly by its pod IP address. If you noted in the above deployment definition we said that the container’s port was 8080. By doing a curl to the pod IP on port 8080 we can see that we can reach the service.
This by itself is not very interesting and only really describes normal pod networking behavior. If another pod in this cluster wanted to reach the service in this pod you’d have to provide it the pod IP address. That’s not very dynamic and, considering that pods may die and be restarted, it’s rather prone to failure. To solve this Kubernetes uses the service construct. Let’s look at a service definition…
kind: Service
apiVersion: v1
metadata:
  name: svc-test-1
spec:
  selector:
    app: web-front-end
  ports:
  - protocol: TCP
    port: 80
    targetPort: web-port
The main thing a service defines is a selector. That is – which pods the service should send traffic to. In this case, that selector is ‘app: web-front-end’. If you’ll recall – our deployment listed this label as part of its specification. Next – we need to define the ports and protocols the service should use. In this case we’re using TCP and the port definition specifies the port the service will listen on, in this case 80. At this point I think it’s easier to think of the service as a load balancer. The ‘port’ definition defines the port that the front-end virtual IP address will listen on. The ‘targetPort’ specifies what the back-end hosts are listening on, or what the traffic should be load balanced to. In this case, the back-end is any pod that matches our selector. Interestingly enough – instead of specifying a numeric port here you can specify a port name. Recall from our deployment specification that we gave the port a name as part of the port definition, in this case ‘web-port’. Let’s use that with our service definition rather than the numerical definition of 8080.
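To make the ‘port’ versus ‘targetPort’ relationship a little more concrete, here is a minimal Python sketch of how a named target port could be resolved against a pod’s container ports. The dictionaries simply mirror the YAML above, and the function name resolve_target_port is something I made up for illustration; this is a model of the idea, not how Kubernetes itself implements it.

```python
# Hypothetical in-memory copies of the service and pod definitions shown above.
service = {
    "selector": {"app": "web-front-end"},
    "ports": [{"protocol": "TCP", "port": 80, "targetPort": "web-port"}],
}

pod = {
    "labels": {"app": "web-front-end", "version": "v1"},
    "ports": [{"containerPort": 8080, "name": "web-port", "protocol": "TCP"}],
}


def resolve_target_port(service_port, pod):
    """Return the numeric back-end port for one service port entry."""
    target = service_port["targetPort"]
    if isinstance(target, int):
        # A numeric targetPort is used as-is.
        return target
    # A named targetPort is looked up in the pod's own port list.
    for port in pod["ports"]:
        if port.get("name") == target and port["protocol"] == service_port["protocol"]:
            return port["containerPort"]
    raise LookupError(f"pod exposes no port named {target!r}")


front_end = service["ports"][0]["port"]
back_end = resolve_target_port(service["ports"][0], pod)
print(f"service port {front_end} -> pod port {back_end}")
```

The point of the indirection is that the service only cares about the name; the pod decides which number sits behind it.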
Let’s define this file as ‘svc-test-1.yaml’ on our Kubernetes master and load it into the cluster…
user@ubuntu-1:~$ kubectl create -f svc-test-1.yaml
service "svc-test-1" created
user@ubuntu-1:~$ kubectl get services
NAME         CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kubernetes   10.11.12.1     <none>        443/TCP   12d
svc-test-1   10.11.12.125   <none>        80/TCP    7s
user@ubuntu-1:~$
Once loaded we check to make sure that the cluster sees the service. Notice that the service has been assigned an IP address out of the ‘service_network_cidr’ we defined when we built this cluster using Ansible. Going back to our load balancer analogy – this is our VIP. So now let’s head over to one of the worker nodes and try to access the service…
user@ubuntu-2:~$ curl http://10.11.12.125
This is Web Server 1 running on 8080!user@ubuntu-2:~$
user@ubuntu-2:~$
Excellent! So the host can access the service IP directly. But what about other pods? To test this let’s fire up a quick pod with a single container in it that we can use as a testing point…
user@ubuntu-1:~$ kubectl run net-test --image=jonlangemak/net_tools
deployment "net-test" created
user@ubuntu-1:~$
user@ubuntu-1:~$ kubectl get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP           NODE
deploy-test-1-1702481282-8dr4l   1/1       Running   0          39m       10.100.3.7   ubuntu-5
net-test-645963977-081dx         1/1       Running   0          39s       10.100.2.9   ubuntu-4
user@ubuntu-1:~$
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://10.11.12.125
This is Web Server 1 running on 8080!user@ubuntu-1:~$
In the above output we used the kubectl ‘run’ sub-command to start a pod with a single container using the image ‘jonlangemak/net_tools’. This image is quite large since it uses Ubuntu as its base image, but it serves as a nice testing endpoint. Once the pod is running we can use the kubectl ‘exec’ sub-command to run commands directly from within the container much like you would locally by using ‘docker exec’. In this case, we curl to the IP address assigned to the service and get the response we’re looking for. Great!
So while this is a win – we’re sort of back in the same boat as before. Any client looking to access the service running in the pod now needs to know the service’s IP address. That’s not much different than needing to know the pod’s IP address, is it? The fix for this is Kube-DNS…
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 8080!user@ubuntu-1:~$
As you can see above – services can be resolved by name so long as you are running the Kube-DNS cluster add-on. When you register a service the master will take care of inserting a service record for it in Kube-DNS. The containers can then resolve the service directly by name so long as the kubelet process has the correct DNS information (the ‘cluster-domain’ parameter as part of its service definition). If it’s configured correctly it will configure the container’s resolv.conf file to include the appropriate DNS server (which also happens to be a service itself) and search domains…
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx more /etc/resolv.conf
search default.svc.k8s.cluster.local svc.k8s.cluster.local k8s.cluster.local interubernet.local
nameserver 10.11.12.254
options ndots:5
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx nslookup svc-test-1
Server:     10.11.12.254
Address:    10.11.12.254#53

Name:    svc-test-1.default.svc.k8s.cluster.local
Address: 10.11.12.125

user@ubuntu-1:~$
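If you’re wondering how a bare name like ‘svc-test-1’ turns into a fully qualified one, the rough Python sketch below models the usual resolver behavior for the ‘search’ and ‘ndots’ options. The search list and ndots value are copied from the resolv.conf above; the function name candidate_names is my own, and the real resolver has more corner cases than this handles.

```python
# Values copied from the container's resolv.conf shown above.
SEARCH = [
    "default.svc.k8s.cluster.local",
    "svc.k8s.cluster.local",
    "k8s.cluster.local",
    "interubernet.local",
]
NDOTS = 5


def candidate_names(name, search=SEARCH, ndots=NDOTS):
    """List the names a resolver would try, in order, for a lookup."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: walk the search list first, the bare name last.
    return expanded + [name]


for candidate in candidate_names("svc-test-1"):
    print(candidate)
```

The first candidate is the service’s name in the ‘default’ namespace, which is the name nslookup reported above.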
So now we know what services can do, but we don’t know how they do it. Let’s now dig into the mechanics of how this all works. To do that, let’s start by doing some packet captures. Our topology currently looks like this…
As we’ve seen already the net-test pod can access the deploy-test-1 pod both via its pod IP address as well as through the service. Let’s start by doing a packet capture as close to the source container (net-test) as possible. In this case, that would be on the VETH interface that connects the container to the cbr0 bridge on the host ubuntu-4. To do that we need to find the VETH interface name that’s associated with the pause container which the net-test container is connected to.
Note: If you aren’t sure what a pause container is take a look at my last post.
In my case, it’s easy to tell since there’s only one pod running on this host. If there are more, you can trace the name down by matching up interfaces as follows…
user@ubuntu-4:~$ sudo docker ps
CONTAINER ID        IMAGE                                      COMMAND                  CREATED             STATUS              PORTS               NAMES
b4e6055d859b        jonlangemak/net_tools                      "/usr/sbin/apache2 -D"   2 hours ago         Up 2 hours                              k8s_net-test.37214b1_net-test-645963977-081dx_default_ce1232a2-1e0a-11e7-ac2c-000c293e4951_82cb2ea0
2e662ca2a8a1        gcr.io/google_containers/pause-amd64:3.0   "/pause"                 2 hours ago         Up 2 hours                              k8s_POD.d8dbe16c_net-test-645963977-081dx_default_ce1232a2-1e0a-11e7-ac2c-000c293e4951_b52ac112
user@ubuntu-4:~$
user@ubuntu-4:~$ sudo docker exec -it b4e6055d859b ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 0a:58:0a:64:02:09 brd ff:ff:ff:ff:ff:ff link-netnsid 0
user@ubuntu-4:~$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 00:0c:29:83:83:dd brd ff:ff:ff:ff:ff:ff
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:0f:14:ac:bd brd ff:ff:ff:ff:ff:ff
4: cbr0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc htb state UP mode DEFAULT group default qlen 1000
    link/ether 0a:58:0a:64:02:01 brd ff:ff:ff:ff:ff:ff
11: veth75b33c5c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cbr0 state UP mode DEFAULT group default
    link/ether 4a:be:9c:7c:98:3b brd ff:ff:ff:ff:ff:ff link-netnsid 0
user@ubuntu-4:~$
First we get the container ID so that we can ‘exec’ into the container and look at its interfaces. We see that its eth0 interface (really that of the pause container, but it’s the same network namespace) is matched up with interface 11. Then on the host we see that the VETH interface with name veth75b33c5c is interface 11. So that’s the interface we want to capture on…
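If you had a lot of pods on a host, matching these up by eye would get old quickly. Here is a small Python sketch of the same matching logic, working on trimmed copies of the two ‘ip link show’ outputs above. The helper names parse_links and host_veth_for are made up for this example, and the parsing assumes the ‘index: name@ifN:’ header format seen in this lab’s output.

```python
import re

# Header lines only, trimmed from the 'ip link show' outputs above.
CONTAINER_LINKS = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
3: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
"""
HOST_LINKS = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
4: cbr0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc htb state UP
11: veth75b33c5c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cbr0 state UP
"""

# Captures: interface index, interface name, optional peer index after '@if'.
LINK_RE = re.compile(r"^(\d+): ([^:@\s]+)(?:@if(\d+))?:", re.MULTILINE)


def parse_links(text):
    """Map interface index -> (name, peer index or None)."""
    links = {}
    for index, name, peer in LINK_RE.findall(text):
        links[int(index)] = (name, int(peer) if peer else None)
    return links


def host_veth_for(container_text, host_text, container_if="eth0"):
    """Find the host-side interface paired with the container's interface."""
    host = parse_links(host_text)
    for name, peer in parse_links(container_text).values():
        if name == container_if and peer is not None:
            return host[peer][0]
    raise LookupError(f"no peer index found for {container_if}")


print(host_veth_for(CONTAINER_LINKS, HOST_LINKS))
```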
user@ubuntu-4:~$ sudo tcpdump -i veth75b33c5c -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth75b33c5c, link-type EN10MB (Ethernet), capture size 262144 bytes
15:09:22.577895 IP 10.100.2.9.39502 > 10.11.12.254.53: 55349+ A? svc-test-1.default.svc.k8s.cluster.local. (58)
15:09:22.577943 IP 10.100.2.9.39502 > 10.11.12.254.53: 20607+ AAAA? svc-test-1.default.svc.k8s.cluster.local. (58)
15:09:22.579454 IP 10.11.12.254.53 > 10.100.2.9.39502: 20607* 0/1/0 (112)
15:09:22.579491 IP 10.11.12.254.53 > 10.100.2.9.39502: 55349* 1/0/0 A 10.11.12.125 (74)
15:09:22.581645 IP 10.100.2.9.39942 > 10.11.12.125.80: Flags [S], seq 85935889, win 29200, options [mss 1460,sackOK,TS val 231232293 ecr 0,nop,wscale 7], length 0
15:09:22.582078 IP 10.11.12.125.80 > 10.100.2.9.39942: Flags [S.], seq 1420990349, ack 85935890, win 28960, options [mss 1460,sackOK,TS val 231214788 ecr 231232293,nop,wscale 7], length 0
15:09:22.582106 IP 10.100.2.9.39942 > 10.11.12.125.80: Flags [.], ack 1, win 229, options [nop,nop,TS val 231232293 ecr 231214788], length 0
15:09:22.582172 IP 10.100.2.9.39942 > 10.11.12.125.80: Flags [P.], seq 1:75, ack 1, win 229, options [nop,nop,TS val 231232293 ecr 231214788], length 74: HTTP: GET / HTTP/1.1
15:09:22.582541 IP 10.11.12.125.80 > 10.100.2.9.39942: Flags [.], ack 75, win 227, options [nop,nop,TS val 231214788 ecr 231232293], length 0
15:09:22.582854 IP 10.11.12.125.80 > 10.100.2.9.39942: Flags [P.], seq 1:155, ack 75, win 227, options [nop,nop,TS val 231214788 ecr 231232293], length 154: HTTP: HTTP/1.1 200 OK
15:09:22.582879 IP 10.100.2.9.39942 > 10.11.12.125.80: Flags [.], ack 155, win 237, options [nop,nop,TS val 231232293 ecr 231214788], length 0
15:09:22.582960 IP 10.100.2.9.39942 > 10.11.12.125.80: Flags [F.], seq 75, ack 155, win 237, options [nop,nop,TS val 231232293 ecr 231214788], length 0
15:09:22.583343 IP 10.11.12.125.80 > 10.100.2.9.39942: Flags [F.], seq 155, ack 76, win 227, options [nop,nop,TS val 231214788 ecr 231232293], length 0
15:09:22.583362 IP 10.100.2.9.39942 > 10.11.12.125.80: Flags [.], ack 156, win 237, options [nop,nop,TS val 231232294 ecr 231214788], length 0
^C
14 packets captured
14 packets received by filter
0 packets dropped by kernel
user@ubuntu-4:~$
The capture above was taken while SSH’d directly into the ubuntu-4 host. To generate the traffic I used the kubectl ‘exec’ sub-command on ubuntu-1 to execute a curl command on the net-test container as shown below…
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 8080!user@ubuntu-1:~$
The capture above is interesting as it shows the container communicating with both the DNS service (10.11.12.254) and the service we created as svc-test-1 (10.11.12.125). In both cases, the container believes it is communicating directly with the service. That is, the service IP is used as the destination in outgoing packets and seen as the source in the reply packets. So now that we know what the container sees let’s move up a hop in the networking stack and see what traffic is traversing the host’s physical network interface…
user@ubuntu-4:~$ sudo tcpdump -i ens32 -nn host 10.100.2.9
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens32, link-type EN10MB (Ethernet), capture size 262144 bytes
16:06:29.824455 IP 10.100.2.9.44270 > 10.100.1.7.53: 9264+ A? svc-test-1.default.svc.k8s.cluster.local. (58)
16:06:29.824522 IP 10.100.2.9.44270 > 10.100.1.7.53: 35968+ AAAA? svc-test-1.default.svc.k8s.cluster.local. (58)
16:06:29.826058 IP 10.100.1.7.53 > 10.100.2.9.44270: 35968* 0/1/0 (112)
16:06:29.826096 IP 10.100.1.7.53 > 10.100.2.9.44270: 9264* 1/0/0 A 10.11.12.125 (74)
16:06:29.827886 IP 10.100.2.9.39960 > 10.100.3.7.8080: Flags [S], seq 2543062326, win 29200, options [mss 1460,sackOK,TS val 232089105 ecr 0,nop,wscale 7], length 0
16:06:29.828239 IP 10.100.3.7.8080 > 10.100.2.9.39960: Flags [S.], seq 291435186, ack 2543062327, win 28960, options [mss 1460,sackOK,TS val 232071600 ecr 232089105,nop,wscale 7], length 0
16:06:29.828279 IP 10.100.2.9.39960 > 10.100.3.7.8080: Flags [.], ack 1, win 229, options [nop,nop,TS val 232089105 ecr 232071600], length 0
16:06:29.828326 IP 10.100.2.9.39960 > 10.100.3.7.8080: Flags [P.], seq 1:75, ack 1, win 229, options [nop,nop,TS val 232089105 ecr 232071600], length 74: HTTP: GET / HTTP/1.1
16:06:29.828535 IP 10.100.3.7.8080 > 10.100.2.9.39960: Flags [.], ack 75, win 227, options [nop,nop,TS val 232071600 ecr 232089105], length 0
16:06:29.828910 IP 10.100.3.7.8080 > 10.100.2.9.39960: Flags [P.], seq 1:155, ack 75, win 227, options [nop,nop,TS val 232071600 ecr 232089105], length 154: HTTP: HTTP/1.1 200 OK
16:06:29.828941 IP 10.100.2.9.39960 > 10.100.3.7.8080: Flags [.], ack 155, win 237, options [nop,nop,TS val 232089105 ecr 232071600], length 0
16:06:29.829025 IP 10.100.2.9.39960 > 10.100.3.7.8080: Flags [F.], seq 75, ack 155, win 237, options [nop,nop,TS val 232089105 ecr 232071600], length 0
16:06:29.829225 IP 10.100.3.7.8080 > 10.100.2.9.39960: Flags [F.], seq 155, ack 76, win 227, options [nop,nop,TS val 232071600 ecr 232089105], length 0
16:06:29.829256 IP 10.100.2.9.39960 > 10.100.3.7.8080: Flags [.], ack 156, win 237, options [nop,nop,TS val 232089105 ecr 232071600], length 0
^C
14 packets captured
14 packets received by filter
0 packets dropped by kernel
user@ubuntu-4:~$
Now this is interesting. Here we see the same traffic but as it leaves the minion or node. Notice anything different? The traffic has the same source address (10.100.2.9) but now reflects the ‘real’ destination. The last 10 lines of the capture above show the HTTP request we made with the curl command as it leaves the ubuntu-4 host. Notice that not only does the destination now reflect the pod 10.100.3.7, but the destination port is now also 8080. If we continue to think of the Kubernetes service as a load balancer, the first capture (depicted in red below) would be the client to VIP traffic and the second capture (depicted in blue below) would show the load balancer to back-end traffic. It looks something like this…
As it turns out – services are actually implemented with iptables rules. The Kubernetes host is performing a simple destination NAT after which normal IP routing takes over and does its job. Let’s now dig into the iptables configuration to see exactly how this is implemented.
Side note: I’ve been trying to refer to ‘netfilter rules’ as ‘iptables rules’ since the netfilter term sometimes throws people off. Netfilter is the actual kernel framework used to implement packet filtering. IPtables is just a popular tool used to interact with netfilter. Despite this – netfilter is often referred to as iptables and vice versa. So if I use both terms, just know I’m talking about the same thing.
If we take a quick look at the iptables configuration of one of our hosts you’ll see quite a few iptables rules already in place…
user@ubuntu-5:~$ sudo iptables-save
# Generated by iptables-save v1.6.0 on Mon Apr 10 21:58:15 2017
*filter
:INPUT ACCEPT [10:976]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [9:936]
:KUBE-FIREWALL - [0:0]
:KUBE-SERVICES - [0:0]
-A INPUT -j KUBE-FIREWALL
-A OUTPUT -j KUBE-FIREWALL
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
COMMIT
# Completed on Mon Apr 10 21:58:15 2017
# Generated by iptables-save v1.6.0 on Mon Apr 10 11:58:15 2017
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:KUBE-HOSTPORTS - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODEPORTS - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SEP-55K34TL23KTTOOX5 - [0:0]
:KUBE-SEP-MAAXFC2P2J2MJC4T - [0:0]
:KUBE-SEP-NMM3QW2QWCLSESBJ - [0:0]
:KUBE-SEP-O3PLWFREHT2JRQ6X - [0:0]
:KUBE-SEP-OA6FICRP4YS6R3CE - [0:0]
:KUBE-SEP-PYL75ZSSBC3CSUAE - [0:0]
:KUBE-SERVICES - [0:0]
:KUBE-SVC-ERIFXISQEP7F7OF4 - [0:0]
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
:KUBE-SVC-SWP62QIEGFZNLQE7 - [0:0]
:KUBE-SVC-TCOU7JCQXEZGVUNU - [0:0]
-A PREROUTING -m comment --comment "kube hostport portals" -m addrtype --dst-type LOCAL -j KUBE-HOSTPORTS
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kube hostport portals" -m addrtype --dst-type LOCAL -j KUBE-HOSTPORTS
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING ! -d 10.0.0.0/8 -m comment --comment "kubenet: SNAT for outbound traffic from cluster" -m addrtype ! --dst-type LOCAL -j MASQUERADE
-A POSTROUTING -s 127.0.0.0/8 -o cbr0 -m comment --comment "SNAT for localhost access to hostports" -j MASQUERADE
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-SEP-55K34TL23KTTOOX5 -s 10.100.0.9/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-55K34TL23KTTOOX5 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 10.100.0.9:53
-A KUBE-SEP-MAAXFC2P2J2MJC4T -s 10.100.1.7/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-MAAXFC2P2J2MJC4T -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 10.100.1.7:53
-A KUBE-SEP-NMM3QW2QWCLSESBJ -s 10.20.30.71/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-NMM3QW2QWCLSESBJ -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-NMM3QW2QWCLSESBJ --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.20.30.71:6443
-A KUBE-SEP-O3PLWFREHT2JRQ6X -s 10.100.0.9/32 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-O3PLWFREHT2JRQ6X -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 10.100.0.9:53
-A KUBE-SEP-OA6FICRP4YS6R3CE -s 10.100.3.7/32 -m comment --comment "default/svc-test-1:" -j KUBE-MARK-MASQ
-A KUBE-SEP-OA6FICRP4YS6R3CE -p tcp -m comment --comment "default/svc-test-1:" -m tcp -j DNAT --to-destination 10.100.3.7:8080
-A KUBE-SEP-PYL75ZSSBC3CSUAE -s 10.100.1.7/32 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-PYL75ZSSBC3CSUAE -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 10.100.1.7:53
-A KUBE-SERVICES -d 10.11.12.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 10.11.12.254/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES -d 10.11.12.254/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 10.11.12.125/32 -p tcp -m comment --comment "default/svc-test-1: cluster IP" -m tcp --dport 80 -j KUBE-SVC-SWP62QIEGFZNLQE7
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-55K34TL23KTTOOX5
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-SEP-MAAXFC2P2J2MJC4T
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-NMM3QW2QWCLSESBJ --mask 255.255.255.255 --rsource -j KUBE-SEP-NMM3QW2QWCLSESBJ
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-NMM3QW2QWCLSESBJ
-A KUBE-SVC-SWP62QIEGFZNLQE7 -m comment --comment "default/svc-test-1:" -j KUBE-SEP-OA6FICRP4YS6R3CE
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-O3PLWFREHT2JRQ6X
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns" -j KUBE-SEP-PYL75ZSSBC3CSUAE
COMMIT
# Completed on Mon Apr 10 21:58:15 2017
user@ubuntu-5:~$
Side note: I prefer to look at the iptables configuration rather than the iptables command output when tracing the chains. You could also use a command like ‘sudo iptables -nvL -t nat’ to look at the NAT entries shown above. This is useful when looking for things like hits on certain rules, but be advised that this won’t help you with the current implementation of kube-proxy. The iptables policy is constantly refreshed, which clears any counters for given rules. That issue is discussed here as part of another problem.
These rules are implemented on the nodes by the kube-proxy service. The service is responsible for getting updates from the master about new services and then programming the appropriate iptables rules to make the service reachable for pods running on the host. If we look at the logs for the kube-proxy service we can see it picking up some of these service creation events…
Apr 10 23:23:44 ubuntu-5 kube-proxy[20407]: I0410 23:23:44.021269 20407 proxier.go:472] Adding new service "default/svc-test-1:" at 10.11.12.125:80/TCP
Previous versions of the kube-proxy service actually handled the traffic directly rather than relying on netfilter rules for processing. This is still an option, and is configurable as part of the kube-proxy service definition, but it’s considerably slower than using netfilter. The difference is that the kube-proxy service runs in user space whereas the netfilter rules are processed in the Linux kernel.
Looking at the above output of the iptables rules it can be hard to sort out what we’re looking for so let’s trim it down slightly and call out how the process works to access the svc-test-1 service…
Since the container is generating what the host will consider forward traffic (it does not originate or terminate on one of the device’s IP interfaces) we only need to concern ourselves with the PREROUTING and POSTROUTING chains of the NAT table. It’s important to also note here that the same iptables configuration will be made on each host. This is because any host could possibly have a pod that wants to talk to a service.
Looking at the above image we can see the path a packet would take as it traverses the PREROUTING chain of the NAT table. The red arrows indicate a miss and the green arrows indicate a match occurring along with the associated action. In most cases, the action (called a target in netfilter speak) is to ‘jump’ to another chain. If we start at the top black arrow we can see that there are 4 rules that we match on…
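As a quick reference, the Python sketch below lists which built-in NAT table chains the first packet of a connection passes through depending on where it starts and ends. The function name nat_chains is hypothetical, and the sketch deliberately leaves out host-to-itself (loopback) traffic. It also hints at why the curl from the host itself worked earlier: locally generated traffic enters through OUTPUT, and the rule set above has a jump to KUBE-SERVICES in that chain too.

```python
def nat_chains(originates_on_host, terminates_on_host):
    """Return the built-in nat chains traversed by a new connection."""
    if originates_on_host and terminates_on_host:
        raise ValueError("loopback traffic is not modelled in this sketch")
    if originates_on_host:
        # Traffic from a process in the host's own network namespace.
        return ["OUTPUT", "POSTROUTING"]
    if terminates_on_host:
        # Traffic addressed to one of the host's own IP addresses.
        return ["PREROUTING", "INPUT"]
    # Everything else is forwarded, including pod traffic.
    return ["PREROUTING", "POSTROUTING"]


print("pod -> service:", nat_chains(False, False))
print("host -> service:", nat_chains(True, False))
```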
- The first match occurs at the bottom of the PREROUTING chain. There is no match criteria specified so all traffic that reaches this point will match this rule. The rule specifies a jump target pointing at the KUBE-SERVICES chain.
- When we get to the KUBE-SERVICES chain we don’t match until the second to last rule which is looking for traffic that is destined to 10.11.12.125 (the IP of our service), is TCP, and has a destination port of 80. The target for this rule is another jump pointing at the KUBE-SVC-SWP62QIEGFZNLQE7 chain.
- There’s only one rule in the KUBE-SVC-SWP62QIEGFZNLQE7 chain and it once again lists no matching criteria, only a jump target pointing at the KUBE-SEP-OA6FICRP4YS6R3CE chain.
- When we get to the KUBE-SEP-OA6FICRP4YS6R3CE chain we don’t match on the first rule so we roll down to the second. The second rule is looking for traffic that is TCP and specifies a target of DNAT. The DNAT specifies to change the destination of the traffic to 10.100.3.7 on port 8080. DNAT is considered a terminating target so processing of the PREROUTING chain ends with this match.
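The four steps above can be sketched in a few lines of Python. This is a toy model, not netfilter: the CHAINS table is a hand-simplified copy of the relevant rules from the iptables-save output (comments, the nodeports rule, and the recent module are left out), chains that aren’t listed are treated as empty, and the walk function is my own invention.

```python
# Each rule is (match criteria, target). An empty criteria dict always matches.
CHAINS = {
    "PREROUTING": [
        ({"dst_type_local": True}, "KUBE-HOSTPORTS"),
        ({}, "KUBE-SERVICES"),
    ],
    "KUBE-SERVICES": [
        ({"dst": "10.11.12.1", "proto": "tcp", "dport": 443}, "KUBE-SVC-NPX46M4PTMTKRN6Y"),
        ({"dst": "10.11.12.254", "proto": "udp", "dport": 53}, "KUBE-SVC-TCOU7JCQXEZGVUNU"),
        ({"dst": "10.11.12.254", "proto": "tcp", "dport": 53}, "KUBE-SVC-ERIFXISQEP7F7OF4"),
        ({"dst": "10.11.12.125", "proto": "tcp", "dport": 80}, "KUBE-SVC-SWP62QIEGFZNLQE7"),
    ],
    "KUBE-SVC-SWP62QIEGFZNLQE7": [({}, "KUBE-SEP-OA6FICRP4YS6R3CE")],
    "KUBE-SEP-OA6FICRP4YS6R3CE": [
        ({"src": "10.100.3.7"}, "KUBE-MARK-MASQ"),
        ({"proto": "tcp"}, ("DNAT", "10.100.3.7", 8080)),
    ],
}


def walk(chain, packet, trail):
    """Walk a chain depth-first; return the DNAT (ip, port) or None."""
    trail.append(chain)
    for criteria, target in CHAINS.get(chain, []):
        if any(packet.get(key) != value for key, value in criteria.items()):
            continue  # a miss: fall through to the next rule
        if isinstance(target, tuple):
            return target[1:]  # DNAT is terminating
        result = walk(target, packet, trail)
        if result is not None:
            return result
    return None  # end of chain: processing returns to the calling chain


packet = {"src": "10.100.2.9", "dst": "10.11.12.125", "proto": "tcp",
          "dport": 80, "dst_type_local": False}
trail = []
result = walk("PREROUTING", packet, trail)
print(" -> ".join(trail))
print("DNAT to", result)
```

The trail it prints is the same set of four chains called out in the list above.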
When a DNAT is performed netfilter takes care of making sure that any return traffic is also NAT’d back to the original IP. This is why the container only sees its communication occurring with the service IP address.
This was a pretty straightforward example of a service so let’s now look at what happens when we have more than one pod that matches the service’s label selector. Let’s test that out to see…
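Here is a toy Python model of that behavior, just to show why the reply looks like it came from the service. The ConnTrack class and its method names are made up for this sketch and only loosely resemble what the kernel’s connection tracking does; the addresses and ports are the ones from the first capture.

```python
class ConnTrack:
    """Toy connection tracker: remembers DNAT decisions per flow."""

    def __init__(self):
        self._reverse = {}

    def dnat(self, packet, new_dst, new_dport):
        """Rewrite the destination and remember how to undo it for replies."""
        translated = dict(packet, dst=new_dst, dport=new_dport)
        # Replies come back with source and destination swapped.
        reply_key = (new_dst, new_dport, packet["src"], packet["sport"])
        self._reverse[reply_key] = (packet["dst"], packet["dport"])
        return translated

    def restore_reply(self, reply):
        """Put the original (service) address back as the reply's source."""
        key = (reply["src"], reply["sport"], reply["dst"], reply["dport"])
        original_dst, original_dport = self._reverse[key]
        return dict(reply, src=original_dst, sport=original_dport)


tracker = ConnTrack()
request = {"src": "10.100.2.9", "sport": 39942, "dst": "10.11.12.125", "dport": 80}
on_the_wire = tracker.dnat(request, "10.100.3.7", 8080)

reply = {"src": "10.100.3.7", "sport": 8080, "dst": "10.100.2.9", "dport": 39942}
seen_by_container = tracker.restore_reply(reply)

print("leaves the host as:", on_the_wire)
print("container sees reply:", seen_by_container)
```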
user@ubuntu-1:~$ kubectl scale --replicas=3 deployment/deploy-test-1
deployment "deploy-test-1" scaled
user@ubuntu-1:~$
user@ubuntu-1:~$ kubectl get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP            NODE
deploy-test-1-1702481282-8dr4l   1/1       Running   0          1d        10.100.3.7    ubuntu-5
deploy-test-1-1702481282-rfgxx   1/1       Running   0          6s        10.100.2.10   ubuntu-4
deploy-test-1-1702481282-wpqxw   1/1       Running   0          6s        10.100.0.10   ubuntu-2
net-test-645963977-081dx         1/1       Running   0          23h       10.100.2.9    ubuntu-4
user@ubuntu-1:~$
Above we can see that we’ve now scaled our deployment from 1 pod to 3. This means we should now have 3 pods that match the service definition. Let’s take a look at our iptables rule set now…
The above depicts the ruleset in place for the PREROUTING chain on one of the minions. I’ve removed all of the rules that didn’t result in a target being hit to make it easier to see what’s happening. This looks a lot like the output we saw above with the exception of the KUBE-SVC-SWP62QIEGFZNLQE7 chain. Notice that some of the rules are using the statistic module and appear to be using it to calculate probability. This allows the service construct to act as a sort of load balancer. The idea is that each of the rules in the KUBE-SVC-SWP62QIEGFZNLQE7 chain will get hit 1/3 of the time. This means that traffic to the service IP will be distributed relatively equally across all of the pods that match the service selector label.
Looking at the numbers used to specify probability you might be confused as to how this would provide equal load balancing to all three pods. But if you think about it some more, you’ll see that these numbers actually lead to an almost perfect 1/3 split between all back-end pods. I find it helps to think of the probability in terms of flow…
If we process the rules sequentially the first rule in the chain will get hit about 1/3 (0.33332999982) of the time. This means that about 2/3 (0.66667000018) of the time the first rule will not be hit and processing will flow to the second rule. The second rule has a 1/2 (.5) probability of being hit. However – the second rule is only receiving 2/3 of the traffic since the first rule is getting hit 1/3 of the time. One half of two thirds is one third. That means that if the second rule misses half of the time, then 1/3 will end up at the last rule of the chain which will always get hit since it doesn’t have a probability statement. So what we end up with is a pretty equal distribution between the pods that are a part of the service. At this point, our service now looks like this with connections toward the service having the possibility of hitting any of the three available back-end pods…
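The same arithmetic can be checked with a few lines of Python. The sketch assumes the pattern seen here (1/3, then 1/2, then an unconditional rule) generalizes to 1/n, 1/(n-1), and so on for n back-end pods; the function names are mine.

```python
from fractions import Fraction


def rule_probabilities(endpoints):
    """Per-rule hit probability: 1/n, 1/(n-1), ..., 1 for the last rule."""
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_shares(probabilities):
    """Share of all traffic each rule ends up with when processed in order."""
    shares = []
    remaining = Fraction(1)  # traffic that has not matched an earlier rule
    for probability in probabilities:
        shares.append(remaining * probability)
        remaining *= 1 - probability
    return shares


for rule, share in enumerate(effective_shares(rule_probabilities(3)), start=1):
    print(f"rule {rule}: {share} of the traffic")

# The same calculation with the rounded value from the text.
approximate = effective_shares([0.33332999982, 0.5, 1.0])
print([round(share, 5) for share in approximate])
```

With exact fractions every rule gets precisely one third; with the rounded probability from the rule set the shares are only off in the fifth or sixth decimal place.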
It’s important to call out here that this is providing relatively simple load balancing. While it works well – it relies on the pods providing fungible services. That is – each back-end pod should provide the same service and not be dependent on any sort of state with the client. Since the netfilter rules are processed per flow, there’s no guarantee that we’ll end up on the same back-end pod the next time we talk to the service. In fact there’s a good chance we won’t.
Now that we know how services work – let’s talk about some other interesting things you can do with them. You’ll recall above that we defined the service by using a target port name rather than a numerical port. This allows us some flexibility in terms of what the service can use as endpoints. An example that’s often given is one where your application changes the port it’s using. For instance, our pods are currently using the port 8080. But perhaps a new version of our pods uses 9090 instead. This is where using port names rather than port numbers comes in handy. So long as our pod definition uses the same name, the numbers can be totally different. For instance, let’s define this deployment file on our Kubernetes master as deploy-test-2.yaml…
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: deploy-test-2
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: web-front-end
        version: v2
    spec:
      containers:
      - name: tiny-web-server-1
        image: jonlangemak/web1_9090
        ports:
        - containerPort: 9090
          name: web-port
          protocol: TCP
Notice that the container port is 9090 but we use the same name for the port. Now create the deployment…
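To make the named-port idea concrete, here is a rough sketch of the lookup in plain Python. It is only a model of the behavior described above, not how Kubernetes implements it: the `resolve_target_port` function is hypothetical, and the two port lists simply mirror the `ports` sections of the v1 and v2 deployments.

```python
def resolve_target_port(target_port, container_ports):
    """Resolve a service targetPort (a name or a number) against one pod's ports."""
    if isinstance(target_port, int):
        return target_port  # numeric target ports are used as-is
    for port in container_ports:
        if port.get("name") == target_port:
            return port["containerPort"]
    return None  # this pod exposes no port with that name


# Port definitions copied from the two deployments in this post.
v1_ports = [{"name": "web-port", "containerPort": 8080, "protocol": "TCP"}]
v2_ports = [{"name": "web-port", "containerPort": 9090, "protocol": "TCP"}]

print(resolve_target_port("web-port", v1_ports))  # 8080
print(resolve_target_port("web-port", v2_ports))  # 9090
```

The same name resolves to a different number per pod, which is why the service definition doesn’t have to change when the application moves to a new port.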
user@ubuntu-1:~$ kubectl create -f deploy-test-2.yaml
deployment "deploy-test-2" created
user@ubuntu-1:~$ kubectl get pods
NAME                             READY     STATUS    RESTARTS   AGE
deploy-test-1-1702481282-8dr4l   1/1       Running   0          1d
deploy-test-1-1702481282-rfgxx   1/1       Running   0          1h
deploy-test-1-1702481282-wpqxw   1/1       Running   0          1h
deploy-test-2-2110180743-7wqhr   1/1       Running   0          10s
net-test-645963977-081dx         1/1       Running   0          1d
user@ubuntu-1:~$
After deploying it, check to make sure the pod is running. Once it comes into a running status, try to curl the service URL (http://svc-test-1) again from your net-test container. I’m going to do it through the kubectl ‘exec’ sub-command on the master, but you could also do it directly on the host with ‘docker exec’…
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 8080!
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 8080!
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 8080!
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 9090!
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 8080!
user@ubuntu-1:~$
Notice how the service is picking up the new pod? That’s because the pods from both deployments carry the ‘app=web-front-end’ label that the service is looking for. We can confirm this by showing all of the pods that match that label…
user@ubuntu-1:~$ kubectl get pods --selector=app=web-front-end
NAME                             READY     STATUS    RESTARTS   AGE
deploy-test-1-1702481282-8dr4l   1/1       Running   0          1d
deploy-test-1-1702481282-rfgxx   1/1       Running   0          1h
deploy-test-1-1702481282-wpqxw   1/1       Running   0          1h
deploy-test-2-2110180743-7wqhr   1/1       Running   0          9m
user@ubuntu-1:~$
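The matching rule behind that selector is simple enough to sketch: a pod matches when every key/value pair in the selector also appears in the pod’s labels. The snippet below is a toy model, not the Kubernetes implementation. The pod names are shortened placeholders, and the label on the net-test pod is an assumption made purely for illustration.

```python
def matches(selector, labels):
    """True when every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels for the two web deployments come from their YAML definitions.
pods = {
    "deploy-test-1 pod": {"app": "web-front-end", "version": "v1"},
    "deploy-test-2 pod": {"app": "web-front-end", "version": "v2"},
    "net-test pod": {"app": "net-test"},  # assumed label, illustration only
}


def select(selector):
    """Return the names of all pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items() if matches(selector, labels)]


print(select({"app": "web-front-end"}))
print(select({"app": "web-front-end", "version": "v2"}))
```

The first call returns both web pods regardless of version, while adding a second pair to the selector narrows the match to the v2 pod only.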
If we wanted to migrate between the old and new versions of the pods, we could first scale up the new pod…
user@ubuntu-1:~$ kubectl scale --replicas=3 deployment/deploy-test-2
deployment "deploy-test-2" scaled
user@ubuntu-1:~$ kubectl get pods --selector=app=web-front-end
NAME                             READY     STATUS    RESTARTS   AGE
deploy-test-1-1702481282-8dr4l   1/1       Running   0          1d
deploy-test-1-1702481282-rfgxx   1/1       Running   0          1h
deploy-test-1-1702481282-wpqxw   1/1       Running   0          1h
deploy-test-2-2110180743-0cxsw   1/1       Running   0          18s
deploy-test-2-2110180743-7wqhr   1/1       Running   0          16m
deploy-test-2-2110180743-z9l40   1/1       Running   0          18s
user@ubuntu-1:~$
Then we can use the kubectl ‘edit’ sub-command to edit the service. This is done with the ‘kubectl edit service/svc-test-1’ command, which brings up a vi-like text editor for you to make changes to the service. In this case, we want the service to be more specific, so we tell it to look for an additional label. Specifically, the ‘version=v2’ label…
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2017-04-10T16:03:24Z
  name: svc-test-1
  namespace: default
  resourceVersion: "1681740"
  selfLink: /api/v1/namespaces/default/services/svc-test-1
  uid: 39f04a88-1e07-11e7-ac2c-000c293e4951
spec:
  clusterIP: 10.11.12.125
  ports:
  - port: 80
    protocol: TCP
    targetPort: web-port
  selector:
    app: web-front-end
    version: v2
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
Notice the new ‘version: v2’ selector added above. Once edited, save the file like you normally would (ESC, :wq, ENTER). The changes will be made to the service immediately. We can see this by viewing the service and then searching for pods that match the new selector…
user@ubuntu-1:~$ kubectl get services -o wide
NAME         CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE       SELECTOR
kubernetes   10.11.12.1     <none>        443/TCP   13d       <none>
svc-test-1   10.11.12.125   <none>        80/TCP    1d        app=web-front-end,version=v2
user@ubuntu-1:~$
user@ubuntu-1:~$ kubectl get pods --selector=app=web-front-end,version=v2
NAME                             READY     STATUS    RESTARTS   AGE
deploy-test-2-2110180743-0cxsw   1/1       Running   0          9m
deploy-test-2-2110180743-7wqhr   1/1       Running   0          26m
deploy-test-2-2110180743-z9l40   1/1       Running   0          9m
user@ubuntu-1:~$
And if we execute our test again, we should see responses only from the version 2 pods…
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 9090!
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 9090!
user@ubuntu-1:~$ kubectl exec -it net-test-645963977-081dx curl http://svc-test-1
This is Web Server 1 running on 9090!
user@ubuntu-1:~$
I hinted at this earlier but it’s worth calling out as well. When the kube-proxy service defines rules for the service to work for the pods, it also defines rules for the services to be accessible from the hosts themselves. We saw this at the beginning of the post when the host ubuntu-2 was able to access the service directly by its assigned service IP address. In this case, since the host itself is originating the traffic, different chains are processed. Specifically, the OUTPUT chain is processed, which has this rule steering the traffic to the KUBE-SERVICES chain…
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
From that point, the processing is largely similar to what we saw from the pod perspective. One thing to point out though is that since the hosts are not configured to use Kube-DNS, they cannot, by default, resolve the services by name.
In the next post we’ll talk about how you can use services to provide external access into your Kubernetes cluster. Stay tuned!