Kubernetes


In the last 4 posts we’ve examined the fundamentals of Kubernetes networking…

Kubernetes networking 101 – Pods

Kubernetes networking 101 – Services

Kubernetes networking 101 – (Basic) External access into the cluster

Kubernetes Networking 101 – Ingress resources

My goal with these posts has been to focus on the primitives and to show how a Kubernetes cluster handles networking internally as well as how it interacts with the upstream or external network. Now that we’ve seen that, I want to dig into a networking plugin for Kubernetes: Calico. Calico is interesting to me as a network engineer because of the wide variety of functionality that it offers. To start with though, we’re going to focus on a basic installation. To do that, I’ve updated my Ansible playbook for deploying Kubernetes to incorporate Calico. The playbook can be found here. If you’ve been following along up until this point, you have a couple of options.

  • Rebuild the cluster – I emphasized when we started all this that the cluster should be designed exclusively for testing. Starting from scratch is always best in my opinion if you’re looking to make sure you don’t have any lingering configuration. To do that you can follow the steps here up until it asks you to deploy the KubeDNS pods. You need to deploy Calico before deploying any pods to the cluster!
  • Download and rerun the playbook – This should work as well but I’d encourage you to delete all existing pods before doing this (even the ones in the Kube-System namespace!). There are configuration changes that occur on both the master and the minion nodes, so once the playbook has run you’ll want to make sure that all the services have been restarted. The playbook should do that for you, but if you’re having issues check there first.

Regardless of which path you choose, I’m going to assume from this point on that you have a fresh Kubernetes cluster which was deployed using my Ansible role. Using my Ansible role is not a requirement, but it does some things for you which I’ll explain along the way, so no worries if you aren’t using it. The goal of this post is to talk about Calico. The lab being used is just a detail if you want to follow along.

So now that we have our lab sorted out – let’s talk about deploying Calico.  One of the nicest things about Calico is that it can be deployed through Kubernetes.  Awesome!  The recommended way to deploy it is to use the Calico manifest which they define over on their site under the Standard Hosted Installation directions.  If you’re using my Ansible role, a slightly edited version of this manifest can be found on your master in /var/lib/kubernetes/pod_defs.  Let’s take a look at what it defines…

That’s a lot so let’s walk through what the manifest defines. The first thing the manifest defines is a config-map that Calico uses to define high level parameters about the Calico installation. Calico relies on an ETCD key-value store for some of its functions so this is where we define the location of that. In this case, I’m using the same one that I’m using for Kubernetes. Again, this is a lab, and they don’t recommend doing that in non-lab environments. So in my case, I point the etcd_endpoints parameter to the host Ubuntu-1 on port 2379. Since we’re using cert based auth for ETCD I also need to tell Calico where the certs are for that. To do that you just need to un-comment lines 46-48 in the config-map. Do not change these values on the assumption that they need to point at a real file location on the host!

The second item the manifest defines is a Kubernetes secret which we populate with the ETCD TLS information if we’re using it. We are, so we need to populate these fields (lines 46-48) with base64 encoded versions of each of these items. Again, this is something that Ansible will do for you if you use my role. If not, you need to manually insert the values (I removed them from the file just to save space). We haven’t talked about secrets specifically but they are a means to share secret information with objects inside the Kubernetes cluster.
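If you are populating the secret by hand, each value is simply the base64 encoding of the corresponding file’s bytes. Here is a minimal Python sketch of that step. The PEM content below is a placeholder I made up, and the helper names are hypothetical; the real key names come from the manifest itself.

```python
import base64


def encode_secret_value(raw: bytes) -> str:
    """Return the base64 text that goes into a Secret's data field."""
    return base64.b64encode(raw).decode("ascii")


def decode_secret_value(encoded: str) -> bytes:
    """Reverse the encoding, handy for checking what you pasted."""
    return base64.b64decode(encoded)


# Placeholder bytes standing in for a certificate file read from disk.
pem = b"-----BEGIN CERTIFICATE-----\n(placeholder)\n-----END CERTIFICATE-----\n"
encoded = encode_secret_value(pem)

# Round-tripping confirms nothing was mangled along the way.
assert decode_secret_value(encoded) == pem
```

In practice you would read each file with open(path, "rb").read() and paste the resulting string into the matching field of the secret.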

The third item the manifest defines is a daemon-set. Daemon-sets are a means to deploy a specific workload to every Kubernetes node or minion. So say I had a logging system that I wanted on each system. Deploying it as a daemon-set allows Kubernetes to manage that for me. If I join a new node to the cluster, Kubernetes will start the logging system on that node as well. So in this case, the daemon-set is for Calico and consists of two containers. The node container is the brains of the operation and what does most of the heavy lifting. This is also where we changed the CALICO_IPV4POOL_CIDR parameter from the default to 10.100.0.0/16. This is not required but I wanted to keep the pod IP addresses in that subnet for my lab. The install-cni container takes care of creating the correct CNI definitions on each host so that Kubernetes can consume Calico through a CNI plugin. Once it completes this task it goes to sleep and never wakes back up. We’ll talk more about the CNI definitions below.

The fourth and final piece of the manifest defines the Calico policy controller. We won’t be talking about that piece of Calico in this post so just hold tight on that one for now.

So let’s deploy this file to the cluster…

Alright – now let’s run our net-test pod again so we have a testing point…

Once running let’s check and see what the pod received for networking.

First we notice that the eth0 interface is actually one end of a VETH pair. We see that its peer is interface index 5 which on the host is an interface called cali182f84bfeba@if4. So the container’s network namespace is connected back to the host using a VETH pair. This is very similar to how most container networking solutions work with one minor change. The host side of the VETH pair is not connected to a bridge. It just lives by itself in the default or root namespace. We’ll talk more about the implications of this later on in this post. Next we notice that the pod received an IP address of 10.100.163.129. This doesn’t seem unusual since that was the pod CIDR we had defined in previous labs, but if you look at the kube-controller-manager service definition you’ll notice that we no longer configure that option…

Notice that the --cluster-cidr parameter is missing entirely and that the --allocate-node-cidrs parameter has been changed to false. This means that Kubernetes is no longer allocating pod CIDR networks to the nodes. So how are the pods getting IP addresses now? The answer to that lies in the kubelet configuration…

Our --network-plugin setting changed from kubenet to cni. This means that we’re using native CNI in order to provision container networking. When doing so, Kubernetes acts as follows…

The CNI plugin is selected by passing Kubelet the --network-plugin=cni command-line option. Kubelet reads a file from --cni-conf-dir (default /etc/cni/net.d) and uses the CNI configuration from that file to set up each pod’s network. The CNI configuration file must match the CNI specification, and any required CNI plugins referenced by the configuration must be present in --cni-bin-dir (default /opt/cni/bin).
If there are multiple CNI configuration files in the directory, the first one in lexicographic order of file name is used.
In addition to the CNI plugin specified by the configuration file, Kubernetes requires the standard CNI lo plugin, at minimum version 0.2.0.
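To make the “first one in lexicographic order” rule concrete, here is a rough Python sketch of that selection logic. It is not the kubelet’s code. The extension filter and the file names in the demo are my own assumptions for illustration.

```python
import os
import tempfile

# Assumption: only files with these extensions count as CNI configuration.
CNI_EXTENSIONS = (".conf", ".conflist", ".json")


def select_cni_config(conf_dir):
    """Return the name of the first config file in lexicographic order, or None."""
    candidates = sorted(
        name
        for name in os.listdir(conf_dir)
        if name.endswith(CNI_EXTENSIONS)
        and os.path.isfile(os.path.join(conf_dir, name))
    )
    return candidates[0] if candidates else None


# Demo against a throwaway directory; the file names are hypothetical.
with tempfile.TemporaryDirectory() as conf_dir:
    for name in ("99-loopback.conf", "10-calico.conf", "calico-kubeconfig"):
        with open(os.path.join(conf_dir, name), "w") as handle:
            handle.write("{}")
    chosen = select_cni_config(conf_dir)

print(chosen)
```

This is why CNI configuration files commonly carry a numeric prefix: it gives you control over which definition wins when more than one is present.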

Since we didn’t specify --cni-conf-dir or --cni-bin-dir the kubelet will look in the default path for each. So let’s check out what’s in the --cni-conf-dir (/etc/cni/net.d) now…

There’s quite a bit here and all of these files were written by Calico, specifically by the install-cni container. We can verify that by checking its logs…

As we can see from the log of the container on each host, the CNI container created the binaries if they didn’t exist (these may have already existed if you were using the previous lab build). It then created the CNI policy and the associated kubeconfig file for CNI to use. It also created the /etc/cni/net.d/calico-tls directory and placed the certs required to talk to etcd in that directory. It got this information from the Kubernetes secret /calico-secrets which is really the information from the secret calico-etcd-secrets that we created in the Calico manifest. The secret just happens to be mounted into the container as calico-secrets. The CNI definition also specifies that a plugin of calico should be used, which we’ll find does exist in the /opt/cni/bin directory. It also specifies an IPAM plugin of calico-ipam meaning that Calico is also taking care of our IP address assignment. One other interesting thing to point out is that the CNI definition lists the information required to talk to the Kubernetes API. To do this, it’s using the default pod token. If you’re curious how the pods get the token to talk to the API server check out this piece of documentation that talks about default service accounts and credentials in Kubernetes. Lastly, the install-cni container created a kubeconfig file which specifies some further Kubernetes connectivity parameters.

So running the Calico manifest did quite a lot for us. Each node now has the Calico CNI plugins and the means to talk to the Kubernetes API. So now we know that Calico is driving the IP address allocation for the hosts, but what about the actual networking side of things? Let’s take a closer look at the routing for the net-test container…

Well this is strange. The default route is pointing at 169.254.1.1. Let’s look on the host this container is running on and see what interfaces exist…

Nothing matching that IP address here. So what’s going on? How can a container route at an IP that doesn’t exist? Let’s walk through what’s happening. Some of you reading this might have noticed that 169.254.1.1 is an IPv4 link local address. The container has a default route pointing at a link local address meaning that the container expects this IP address to be reachable on its directly connected interface, in this case, the container’s eth0 interface. The container will attempt to ARP for that IP address when it wants to route out through the default route. Since our container hasn’t talked to anything yet, we have the opportunity to attempt to capture its ARP request on the host. Let’s set up a TCPDUMP on the host ubuntu-3 and then use kubectl exec on the master to try talking to the outside world…

In the top output you can see we have the container send a single ping to 4.2.2.2. This will surely follow the container’s default route and cause it to ARP for its gateway at 169.254.1.1. In the bottom output you see the capture on the host Ubuntu-3. Notice we did the capture on the interface cali182f84bfeba which is the host side of the VETH pair connecting the container back to the root or default network namespace on the host. In the output of the TCPDUMP we see the container with a source of 10.100.163.129 send an ARP request. The reply comes from 2e:7e:32:de:8c:a3 which, if we reference the above output, we’ll see is the MAC address of the host side of the VETH pair, cali182f84bfeba. So you might be wondering how on earth the host is replying to an ARP request for an IP address that it doesn’t have on any interface. The answer is proxy-arp. If we check the host side VETH interface we’ll see that proxy-arp is enabled…

By enabling proxy-arp on this interface Calico is instructing the host to reply to the ARP request on behalf of someone else, that is, by proxy. The rules for proxy-ARP are simple. A host which has proxy-ARP enabled will reply to ARP requests with its own MAC address when…

  • The host receives an ARP request on an interface which has proxy-ARP enabled.
  • The host knows how to reach the destination
  • The interface the host would use to reach the destination is not the same one that it received the ARP request on

So in this case, the container is sending an ARP request for 169.254.1.1. Despite this being a link-local address, the host would attempt to route this following its default route out the host’s physical interface. This means we’ve met all three requirements so the host will reply to the ARP request with its MAC address.
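The three requirements can be written down as a small decision function. This is only a Python model of the rules as listed above, not how the kernel implements them. The uplink interface name ens32 is hypothetical, and the route tables are invented for the example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Route:
    prefix: str          # destination in CIDR form, e.g. "0.0.0.0/0"
    out_interface: str   # interface the host would send the packet out of


def best_route(routes, target_ip):
    """Longest-prefix match; returns None when no route covers the target."""
    address = ipaddress.ip_address(target_ip)
    matches = [r for r in routes if address in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r.prefix).prefixlen)


def host_answers_arp(in_interface, target_ip, proxy_arp_enabled, routes):
    """Apply the three proxy-ARP requirements in order."""
    # 1. Proxy-ARP must be enabled on the interface that received the request.
    if in_interface not in proxy_arp_enabled:
        return False
    # 2. The host must know how to reach the requested address.
    route = best_route(routes, target_ip)
    if route is None:
        return False
    # 3. The way out must not be the interface the request arrived on.
    return route.out_interface != in_interface


VETH = "cali182f84bfeba"
UPLINK = "ens32"  # hypothetical name for the host's physical interface
default_only = [Route("0.0.0.0/0", UPLINK)]

normal = host_answers_arp(VETH, "169.254.1.1", {VETH}, default_only)
proxy_off = host_answers_arp(VETH, "169.254.1.1", set(), default_only)
no_route = host_answers_arp(VETH, "169.254.1.1", {VETH}, [])
same_interface = host_answers_arp(
    VETH, "169.254.1.1", {VETH},
    default_only + [Route("169.254.0.0/16", VETH)],
)

print(normal, proxy_off, no_route, same_interface)
```

Only the first case answers the ARP request. The other three each break one of the requirements, so the container would never learn a MAC address for its gateway.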

Note: If you’re curious about these requirements go ahead and try them out yourself. For requirement 1 you can disable proxy-arp on the interface with echo 0 > /proc/sys/net/ipv4/conf/<interface name goes here>/proxy_arp. For requirement 2 simply remove the host’s default route (make sure you have a 10’s route or some other means to reach the host before you do that!) like so: sudo ip route del 0.0.0.0/0. For the third requirement point the route 169.254.0.0/16 at the VETH interface itself like this: sudo ip route add 169.254.0.0/16 dev <Calico VETH interface name>. If you do any of these, the container will no longer be able to access the outside world. Part of me wonders if this makes it a bit fragile but I also assume that most hosts will have a default route.

The ARP process for the container would look like this…

In this case, the proxy ARP requirements are met since the host has a default route it can follow for the destination of 169.254.1.1 so it replies to the container with its own MAC address. At this point, the container believes it has a valid ARP entry for its default gateway and will start initiating normal traffic toward the host. It’s a pretty clever configuration but one that takes some time to understand.

I had mentioned above that the host side of the container VETH pair just lived in the host’s default or root namespace. In other container implementations, this interface would be attached to a common bridge so that all connected containers could talk to one another directly. In that scenario, the bridge would commonly be allocated an IP address giving the host an IP address on the same subnet as the containers. This would allow the host to talk (do things like ARP) to the container directly. Having the bridge also allows the containers themselves to talk directly to one another through the bridge. This describes a layer 2 scenario where the host and all containers attached to the bridge can ARP for each other’s IP addresses directly. Since we don’t have the bridge, we need to tell the host how to route to each container. If we look at the host’s routing table we’ll see that we have a /32 route for the IP of our net-test container…

The route points the IP address at the host side of the VETH pair. We also notice some other unusual routes in the host’s routing table…

These routes are inserted by Calico and represent the subnets allocated by Calico to all of the other hosts in our Kubernetes cluster. We can see that Calico is allocating a /26 network to each host…

10.100.243.0/26 – Ubuntu-2
10.100.163.128/26 – Ubuntu-3
10.100.5.192/26 – Ubuntu-4
10.100.138.192/26 – Ubuntu-5
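As a quick sanity check on those allocations, Python’s ipaddress module can show which host’s block a given pod IP falls into and how many /26 blocks fit in the 10.100.0.0/16 pool. The block-to-host mapping below is copied from the list above; the helper function is just for illustration.

```python
import ipaddress

POOL = ipaddress.ip_network("10.100.0.0/16")

# Per-host blocks as listed above.
BLOCKS = {
    "Ubuntu-2": ipaddress.ip_network("10.100.243.0/26"),
    "Ubuntu-3": ipaddress.ip_network("10.100.163.128/26"),
    "Ubuntu-4": ipaddress.ip_network("10.100.5.192/26"),
    "Ubuntu-5": ipaddress.ip_network("10.100.138.192/26"),
}


def host_for_pod(pod_ip):
    """Return the host whose /26 block contains pod_ip, or None."""
    address = ipaddress.ip_address(pod_ip)
    for host, block in BLOCKS.items():
        if address in block:
            return host
    return None


addresses_per_block = BLOCKS["Ubuntu-3"].num_addresses
blocks_in_pool = sum(1 for _ in POOL.subnets(new_prefix=26))

print(host_for_pod("10.100.163.129"))   # the net-test pod
print(addresses_per_block, blocks_in_pool)
```

The net-test pod’s address of 10.100.163.129 lands in Ubuntu-3’s block, which matches the host we ran the capture on earlier.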

Notice that these destinations are reachable through the tunl0 interface which is Calico’s IPIP overlay transport tunnel. This means that we don’t need to tell the upstream or physical network how to get to each POD CIDR range since it’s being done in the overlay. This also means that we can no longer reach the pod IP address directly. This conforms more closely with what the Kubernetes documentation describes when it says that the pod networks are not routable externally.  In our previous examples they were reachable since we were manually routing the subnets to each host.

We’ve just barely scratched the surface of Calico in this post but it should be enough to get you up and running. In the next post we’ll talk about how Calico shares routing and reachability information between the hosts.

I called my last post ‘basic’ external access into the cluster because I didn’t get a chance to talk about the ingress object.  Ingress resources are interesting in that they allow you to use one object to load balance to different back-end objects.  This could be handy for several reasons and allows you a more fine-grained means to load balance traffic.  Let’s take a look at an example of using the Nginx ingress controller in our Kubernetes cluster.

To demonstrate this we’re going to continue using the same lab that we used in previous posts but for the sake of level setting we’re going to start by clearing the slate. Let’s delete all of the objects in the cluster and then we’ll start building them from scratch so you can see every step of the way how we set up and use the ingress.

Since this will kill our net-test pod, let’s start that again…

Recall that we used this pod as a testing endpoint so we could simulate traffic originating from a pod so it’s worth keeping around.

Alright – now that we have an empty cluster the first thing we need to do is build some things we want to load balance to. In this case, we’ll define two deployments. Save this definition as a file called back-end-deployment.yaml on your Kubernetes master…

Notice how we defined two deployments in the same file and separated the definitions with a ---. Next we want to define a service that can be used to reach each of these deployments. Save this service definition as back-end-service.yaml

Notice how the service selectors are looking for the specific labels that match each pod. In this case, we’re looking for app and version with the version differing between each deployment. We can now deploy these definitions and ensure everything is created as expected…

I created a new folder called ingress to store all these definitions in.

These deployments and services represent the pods and services that we’ll be doing the actual load balancing to. They appear to have deployed successfully so let’s move onto the next step.

Next we’ll start building the actual ingress. To start with we need to define what’s referred to as a default back-end. Default back-ends serve as the default endpoint that the ingress will send traffic to in the event it doesn’t match any other rules. In our case the default back-end will consist of a deployment and a service that matches the deployed pods to make them easily reachable. First define the default back-end deployment. Save this as a file called default-back-end-deployment.yaml

Next let’s define the service that will match the default back-end. Save this file as default-back-end-service.yaml

Now let’s deploy both the definition for the default back-end deployment as well as the service…

Great! This looks just like we’d expect but let’s do some extra validation from our net-test pod to make sure the pods and services are working as expected before we get too far into the ingress configuration…

If you aren’t comfortable with services see my post on them here.

As expected pods can resolve the services by DNS name and we can successfully reach each service.   In the case of the default back-end we get a 404. Since all of the back end pods are reachable we can move on to defining the ingress itself. The Nginx ingress controller comes in the form of a deployment. The deployment definition looks like this…

Go ahead and save this file as nginx-ingress-controller-deployment.yaml on your server. However, before we can deploy this definition we need to deploy a config-map. Config-maps are a Kubernetes construct used to handle non-private configuration information. Since the Nginx ingress controller above expects a config-map, we need to deploy that before we can deploy the ingress controller…

In this case, we’re using a config-map to pass service level parameters to the pod. Specifically, we’re passing the enable-vts-status: 'true' parameter which is required for us to see the VTS page of the Nginx load balancer. Save this as nginx-ingress-controller-config-map.yaml on your server and then deploy both the config-map and the Nginx ingress controller deployment…

Alright – so we’re still looking good here. The pod generated from the deployment is running. If you want to perform another quick smoke test at this point you can try connecting to the Nginx controller pod directly from the net-test pod. Doing so should result in landing at the default back-end since we have not yet told the ingress what it should do…

Excellent! Next we need to define the ingress policy or ingress object. Doing so is just like defining any other object in Kubernetes: we use a YAML definition like the one below…

The rules defined in the ingress will be read by the Nginx ingress controller and turned into Nginx load balancing rules. Pretty slick right? In this case, we define 3 rules. Let’s walk through the rules one at a time from top to bottom.

  1. Is looking for http traffic headed to the host website8080.com.  If it receives traffic matching this host it will load balance it to the pods that match the selector for the service backend-svc-1
  2. Is looking for http traffic headed to the host website9090.com.  If it receives traffic matching this host it will load balance it to the pods that match the selector for the service backend-svc-2
  3. Is looking for traffic destined to the host website.com on multiple different paths…
    1. Is looking for http traffic matching a path of /eightyeighty.  If it receives traffic matching this path it will load balance it to the pods that match the selector for the service back-end-svc-1
    2. Is looking for http traffic matching a path of /ninetyninety.  If it receives traffic matching this path it will load balance it to the pods that match the selector for the service back-end-svc-2
    3. Is looking for http traffic matching a path of /nginx_status.  If it receives traffic matching this path it will load balance it to the pods that match the selector for the service nginx-ingress (not yet defined)
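To see how these rules combine, here is a small Python model of the decision the load balancer ends up making: match on host first, then on path, and fall through to the default back-end otherwise. It is a simplification (plain prefix matching, no regard for how Nginx actually evaluates locations). The default back-end name is hypothetical, and I’ve assumed the backend-svc-N spelling for the service names.

```python
# Hypothetical name for the default back-end service.
DEFAULT_BACKEND = "default-http-backend"

# Host -> ordered list of (path prefix, service). Service spellings are assumed.
RULES = {
    "website8080.com": [("/", "backend-svc-1")],
    "website9090.com": [("/", "backend-svc-2")],
    "website.com": [
        ("/eightyeighty", "backend-svc-1"),
        ("/ninetyninety", "backend-svc-2"),
        ("/nginx_status", "nginx-ingress"),
    ],
}


def pick_backend(host, path="/"):
    """Return the service a request would be sent to under the rules above."""
    for prefix, service in RULES.get(host or "", []):
        if path.startswith(prefix):
            return service
    # No host match, or no path match under that host.
    return DEFAULT_BACKEND


print(pick_backend(None))                           # no host header
print(pick_backend("website8080.com"))
print(pick_backend("website.com"))                  # host matches, path does not
print(pick_backend("website.com", "/ninetyninety"))
```

Note that a request for website.com with no matching path still ends up at the default back-end, since that host only has path-specific rules.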

Those rules are pretty straightforward and things we’re used to dealing with on traditional load balancing platforms. Let’s go ahead and save this definition as nginx-ingress.yaml and deploy it to the cluster…

We can see that the ingress has been created successfully. If we had been watching the logs of the Nginx ingress controller pod as we deployed the ingress we would have seen these log entries shortly after defining the ingress resource in the cluster…

The ingress controller is constantly watching the API server for ingress configuration. Directly after defining the ingress policy, the controller started building the configuration in the Nginx load balancer. Now that it’s defined, we should be able to do some preliminary testing within the cluster. Once again from our net-test pod we can run the following tests…

We can run tests from within the cluster by connecting directly to the pod IP address of the Nginx ingress controller which in this case is 10.100.3.34. You’ll notice the first test fails and we end up at the default back-end. This is because we didn’t pass a host header. In the second example we pass the website8080.com host header and get the correct response. In the third example we pass the website9090.com host header and also receive the response we’re expecting. In the fourth example we attempt to connect to website.com and receive once again the default back-end response of 404. If we then try the appropriate paths we’ll see we once again start getting the correct responses.
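If you would rather script these checks than run them by hand, the same idea can be expressed with Python’s standard library: connect to the controller’s pod IP and set the Host header yourself. This sketch only builds the requests so it runs anywhere. The helper name is mine, and actually sending them assumes you are somewhere that can reach the pod network.

```python
import urllib.request

INGRESS_POD_IP = "10.100.3.34"  # pod IP of the Nginx ingress controller in this lab


def build_request(host_header=None, path="/"):
    """Build a request aimed at the controller pod, optionally with a Host header."""
    url = f"http://{INGRESS_POD_IP}{path}"
    headers = {"Host": host_header} if host_header else {}
    return urllib.request.Request(url, headers=headers)


no_host = build_request()
site_8080 = build_request("website8080.com")
site_path = build_request("website.com", "/eightyeighty")

for request in (no_host, site_8080, site_path):
    print(request.full_url, request.get_header("Host"))

# To actually send one from a machine that can reach the pod network:
#   with urllib.request.urlopen(site_8080, timeout=5) as response:
#       print(response.status, response.read()[:200])
```

Keep in mind that urlopen raises an HTTPError on a 404, so the default back-end case would need a try/except around the call.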

The last piece that’s missing is external access. In our case, we need to expose the ingress to the upstream network. Since we’re not running in a cloud environment, the best option for that would be with a nodePort type service like this…

I used a nodePort service here but you certainly could have used the externalIP construct instead. That would allow you to access the URLs on their normal port.

Notice that this is looking for pods that match the selector nginx-ingress-lb and provides service functionality for two different ports. The first will be port 80 which we’re asking it to provide on the host’s interface on port (nodePort) 30000. This will service the actual inbound requests to the websites. The second port is 18080 and we’re asking it to provide that on nodePort 32767. This will let us view the Nginx VTS monitoring page of the load balancer. Let’s save this definition as nginx-ingress-controller-service.yaml and deploy it to our cluster…

Now we should be able to reach all of our URLs from outside of the cluster either by passing the host header manually as we did above, or by creating DNS records to resolve the names to a Kubernetes node. If you want to access the Nginx VTS monitoring page through a web browser you’ll need to go the DNS route. I created local DNS zones for each domain to test and was then successful in reaching the website from my workstation’s browser…

If you added the DNS records you should also be able to reach the VTS monitoring page of the Nginx ingress controller as well…

When I first saw this, I was immediately surprised by something. Does anything look strange to you? I was surprised to see that the upstream pools listed the actual pod IP addresses. Recall when we defined our ingress policy we listed the destinations as the Kubernetes services. My initial assumption was that the Nginx ingress controller would then simply be resolving the service name to an IP address and using the single service IP as its pool. That is, the ingress controller was just load balancing to a normal Kubernetes service. Turns out that’s not the case. The ingress controller relies on the services to keep track of the pods but doesn’t actually use the service construct to get traffic to the pods. Since Kubernetes is keeping track of which pods are in a given service the ingress controller can just query the API server to get a list of pods that are currently alive and match the selector for the service. In this manner, traffic is load balanced directly to a pod rather than through a service construct. If we mapped out a request to one of our test websites through the Nginx ingress controller it would look something like this…

If you aren’t comfortable with how nodePort services work check out my last post.

In this case, I’ve pointed the DNS zones for website.com, website8080.com, and website9090.com to the host ubuntu-2. The diagram above shows a client session headed to website9090.com. Note that the client still believes that its TCP session (orange line) is with the host ubuntu-2 (10.20.30.72). The nodePort service is doing its job and sending the traffic over to the Nginx ingress controller. In doing so, the host hides the traffic behind its own source IP address to ensure the traffic returns successfully (blue line). This is entirely nodePort service functionality. What’s new is that when the Nginx pod talks to the back end pool, in this case 10.100.2.28, it does so directly pod to pod (green line).

As you can see – the ingress is allowing us to handle traffic to multiple different back ends now.  The ingress policy can be changed by editing the object using kubectl edit ingress nginx-ingress. So for instance, let’s say that we wanted to move the website8080.com to point to the pods that are selected by backend-svc-2 rather than backend-svc-1. Simply edit the ingress to look like this…

Then save the configuration and try to reconnect to website8080.com once again…

The Nginx ingress controller watches for changes in configuration on the API server and then implements those changes. If we had been watching the logs on the Nginx ingress controller container we would have seen something like this in the log entries after we made our change…

The ingress controller is also watching for changes to the service. For instance, if we now scaled our deploy-test-2 deployment we could see the Nginx pool size increase to account for the new pods. Here’s what VTS looks like before the change…

Then we can scale the deployment up with this command…

And after a couple of seconds VTS will show the newly deployed pods as part of the pool…

We can also modify properties of the controller itself by modifying the configMap that the controller is reading for its configuration. One of the more interesting options we can enable on the Nginx ingress controller is sticky sessions. Since the ingress is load balancing directly to the pods rather than to a service it’s possible for it to maintain sessions between the back-end pool members. We can enable this by editing the config map: run kubectl edit configmap nginx-ingress-controller-conf and then add the highlighted line to the configMap…

Once again, the Nginx ingress controller will detect the change and reload its configuration. Now if we access website8080.com repeatedly from the same host we should see the load is sent to the same pod. In this case the pod 10.100.0.25…

The point of this post was just to show you the basics of how the ingress works from a network perspective. There are many other use cases for them and many other ingress controllers to choose from besides Nginx. Take a look at the official documentation for them as well as these other posts that I found helpful during my research…

Jay Gorrell – Kubernetes Ingress

Daemonza – Kubernetes nginx-ingress-controller

In our last post we talked about an important Kubernetes networking construct – the service. Services provide a means for pods running within the cluster to find other pods and also provide rudimentary load balancing capabilities. We saw that services can create DNS entries within Kube-DNS which makes the service accessible by name as well. So now that we know how you can use services to access pods within the cluster it seems prudent to talk about how things outside of the cluster can access these same services. It might make sense to use the same service construct to provide this functionality, but recall that the services are assigned IP addresses that are only known to the cluster. In reality, the service CIDR isn’t actually routed anywhere but the Kubernetes nodes know how to interact with service IPs because of the netfilter rules programmed by the kube-proxy. The service network just needs to be unique so that the containers running in the pod will follow their default route out to the host where the netfilter rules will come into play. So really the service network is sort of non-existent from a routing perspective as it’s only locally significant to each host. This means that it can’t really be used by external clients since they won’t know how to route to it either. That being said, we have a few other options we can use, most of which still rely on the service construct. Let’s look at them one at a time…

ExternalIP

In this mode – you are essentially just assigning a service an external IP address.  This IP address can be anything you want it to be but the catch is that you’re on the hook to make sure that the external network knows to send that traffic to the Kubernetes cluster nodes.  In other words – you have to ensure that traffic destined to the assigned external IP makes it to a Kubernetes node.  From there, the service construct will take care of getting it where it needs to be.  To demonstrate this, let’s take a look at our lab where we left it after our last post

We had two different pod deployments running in the cluster in addition to the ‘net-test’ deployment but we won’t need that for this example. We had also defined a service called ‘svc-test-1’ that is currently targeting the pods of the ‘deploy-test-2’ deployment matching the selectors app=web-front-end and version=v2. As we did when we changed the service selector, let’s once again edit the service and add another parameter. To edit the service use this command…

In the editor, add the ‘externalIPs:’ list parameter followed by the IP address of 169.10.10.1 as shown in the highlighted field below…

When done, save the service definition by closing the file as you typically would (ESC, :wq, ENTER). You should get confirmation that the service was edited when you return to the console. What we just did was tell Kubernetes to answer on the IP address of 169.10.10.1 for the service ‘svc-test-1’. To test this out, let’s point a route on our gateway for 169.10.10.0/24 at the host ubuntu-2. Note that ubuntu-2 is the only host that is not currently running a pod that will match the service selector…

This makes ubuntu-2 a strange choice to point the route at but highlights how the external traffic gets handled within the cluster.  With our route in place, let’s try to access the 169.10.10.1 IP address from a remote host…

Awesome, so it works!  Let’s now dig into how it works.  Since services are implemented with netfilter rules, and the IP was assigned as part of the service, we can assume that the externalIP configuration likely uses netfilter rules as well.  So let’s start there.  For the sake of easily pointing out how the externalIPs were implemented, I took an ‘iptables-save’ before and after the modification of the service.  Afterwards, I diffed the files and these three lines were added to the iptables-save output after the externalIPs were implemented in the service…

So what do these rules do? The first rule is looking for traffic that is TCP and destined to the IP address of 169.10.10.1 on port 80. This has a target of jump and points to a chain called ‘KUBE-MARK-MASQ’. This chain has the following rules…

This rule matches all traffic and has a target of ‘MARK’ which is a non-terminating target. The traffic in this case will be marked with a value of ‘0x4000/0x4000’. So what do I mean by marked? In this case ‘--set-xmark’ is setting a marking on the packet that is internal to the host. That is – this marking is only locally significant. Since the MARK target is non-terminating we jump back to the KUBE-SERVICES chain after the marking has occurred. The next line is looking for traffic that is…
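To make the value/mask notation a little less abstract, here is a small Python sketch of what ‘--set-xmark value/mask’ does to a packet mark according to the iptables extensions documentation: it zeroes the bits given by the mask and then XORs the value in.  The function name set_xmark is mine, and I’m assuming the 0x4000/0x4000 value that kube-proxy uses by default.

```python
def set_xmark(mark, value, mask=0xFFFFFFFF):
    """Model '--set-xmark value/mask': clear the mask bits, then XOR in value."""
    return ((mark & ~mask) ^ value) & 0xFFFFFFFF

MASQ_BIT = 0x4000  # assumed default kube-proxy masquerade mark

# A packet with no mark ends up with only the masquerade bit set.
print(hex(set_xmark(0x0, MASQ_BIT, MASQ_BIT)))

# Bits outside the mask that were already set are left alone.
print(hex(set_xmark(0x1, MASQ_BIT, MASQ_BIT)))
```

Because the mask only covers that one bit, other marks that might already be on the packet survive the trip through KUBE-MARK-MASQ.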

  • Destined to 169.10.10.1
  • Has a protocol of TCP
  • Has a destination port of 80
  • Has not entered through a bridge interface (! --physdev-is-in)
  • Has a source address that is not local to the host (! --src-type LOCAL)

The last two match criteria ensure that this is not traffic originating from a pod or from the host itself that is destined to a service.
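The match criteria above can be sketched as a simple predicate.  This is only a model in Python to show how the five conditions combine, not how netfilter evaluates anything internally; the Packet class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    dst: str
    proto: str
    dport: int
    entered_via_bridge: bool  # what '--physdev-is-in' would match
    src_is_local: bool        # what '--src-type LOCAL' would match

def matches_external_ip_rule(p):
    """True when all five criteria from the list above hold."""
    return (
        p.dst == "169.10.10.1"
        and p.proto == "tcp"
        and p.dport == 80
        and not p.entered_via_bridge
        and not p.src_is_local
    )

# Traffic from a remote client arriving on the physical interface.
external = Packet("169.10.10.1", "tcp", 80, entered_via_bridge=False, src_is_local=False)
# Traffic from a local pod, which enters the host through the bridge.
from_pod = Packet("169.10.10.1", "tcp", 80, entered_via_bridge=True, src_is_local=False)

print(matches_external_ip_rule(external))
print(matches_external_ip_rule(from_pod))
```

Only the first packet satisfies the rule, which is the point of the two negated matches.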

If the last two match criteria are foreign to you I suggest you take a look at the MAN page for the IPTables extensions.  It’s definitely worth bookmarking.  

Since the second rule is a match for our external traffic we follow that JUMP target into the KUBE-SVC-SWP62QIEGFZNLQE7 chain.  At that point – the load balancing works just like an internal service.  It’s worth pointing out that the masquerade rule is crucial to all of this working.  Let’s look at an example of what this might look like if we didn’t have the masquerade rule…

Let’s walk through what will happen without the masquerade rule shown above…

  • An external user, in this case 192.168.127.50, makes a request to the external IP of 169.10.10.1.
  • The request reaches the gateway, does a route lookup, and sees that there’s a route for 169.10.10.0/24 pointing at ubuntu-2 (10.20.30.72)
  • The request reaches the host ubuntu-2 and hits the above mentioned IPTables rules.  Without the masquerade rule, the only rule that gets hit is the one for passing the traffic to the service chain KUBE-SVC-SWP62QIEGFZNLQE7.  Normal service processing occurs as explained in our last post and a pod out of the service gets selected, in this case the pod 10.100.3.8 on host ubuntu-5.
  • Traffic is destination NAT’d to 10.100.3.8 and makes its way to the host ubuntu-5.
  • The pod receives the traffic and attempts to reply to the TCP connection based on the source IP address of the request.  The source in this case is unchanged and the host ubuntu-5 attempts to reply directly to the user at 192.168.127.50.
  • The user making the request receives the reply from 10.100.3.8 and drops the packet since it hasn’t initiated a session to that IP address.

So as you can see – this just won’t work.  This is why we need the masquerade rule.  When we use the rule, the processing looks like this…

This looks much better.  In this example the flows specify the correct source and destination since the host ubuntu-2 is now hiding its request to the pod behind its own interface.  This ensures that the reply from the pod 10.100.3.8 will come back to the host ubuntu-2.  This is an important step because this is the host which performed the initial service DNAT.   If the reply does not come back through this host, the DNAT to the pod cannot be reversed.  Reversing the DNAT in this case means changing the source of the reply packet back to the original pre-DNAT destination of 169.10.10.1.   So as you can see – the masquerade rule is quite important to ensuring that the externalIP construct works.
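The two walkthroughs can be condensed into a toy model.  The Python below is only a sketch that tracks source and destination addresses, not real conntrack behavior.  The client, external IP, node and pod addresses come from the example above; treating 10.20.30.72 as the address ubuntu-2 masquerades behind is my assumption, since the actual address depends on which interface the traffic leaves on.

```python
CLIENT = "192.168.127.50"
EXTERNAL_IP = "169.10.10.1"
NODE = "10.20.30.72"   # ubuntu-2; assumed to be the masquerade address
POD = "10.100.3.8"     # pod selected by the service, on ubuntu-5

def forward(src, dst, masquerade):
    """Return the (src, dst) the pod sees after ubuntu-2 handles the request."""
    dst = POD              # DNAT performed by the service chain
    if masquerade:
        src = NODE         # SNAT: hide the client behind ubuntu-2
    return src, dst

def reply_seen_by_client(masquerade):
    """Return the source address on the reply as it arrives at the client."""
    src, dst = forward(CLIENT, EXTERNAL_IP, masquerade)
    reply_src, reply_dst = dst, src   # the pod answers whoever it saw as the source
    if reply_dst == NODE:
        # The reply passes back through ubuntu-2, which reverses both NATs.
        reply_src, reply_dst = EXTERNAL_IP, CLIENT
    return reply_src

for masq in (False, True):
    src = reply_seen_by_client(masq)
    verdict = "accepted" if src == EXTERNAL_IP else "dropped"
    print(f"masquerade={masq}: reply from {src} -> {verdict}")
```

Without the masquerade the client sees a reply from 10.100.3.8 and drops it; with it, the reply appears to come from 169.10.10.1, which is the address the client originally talked to.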

NodePort

If you’re not interested in dealing with routing a new subnet to the hosts, your other option would be what’s referred to as nodeport.  Nodeport works a lot like the original Docker bridge mode for publishing ports.  It makes a service available externally on the nodes through a different port.  That port is by default in the range of 30000-32767 but can be modified by changing the ‘--service-node-port-range’ configuration flag on the API server.  To change our service to nodeport we simply delete the externalIPs definition we inserted during the previous example and change the service type to ‘NodePort’ as shown below…
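As a quick illustration of the port range, here is a small Python sketch that checks whether a port falls inside a given nodeport range.  The default range comes from the paragraph above; the function names and the ‘low-high’ string parsing are my own and just mirror the way the range is written in this post.

```python
DEFAULT_NODE_PORT_RANGE = (30000, 32767)

def parse_range(text):
    """Turn a 'low-high' string such as '30000-32767' into a tuple of ints."""
    low, high = text.split("-")
    return int(low), int(high)

def in_node_port_range(port, port_range=DEFAULT_NODE_PORT_RANGE):
    """True when port lies inside the (inclusive) nodeport range."""
    low, high = port_range
    return low <= port <= high

print(in_node_port_range(30487))                              # inside the default range
print(in_node_port_range(8080))                               # outside the default range
print(in_node_port_range(30487, parse_range("20000-25000")))  # outside a custom range
```

A port that is valid under the default range may no longer be valid if the range on the API server has been changed, which is worth remembering if you pick nodeports by hand.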

After we save the change, we can view the services again with kubectl…

Notice that our port column for the service now lists more than one port. The first port is that of the internal service. The second port (30487) is the nodeport, or the port that we can use externally to reach the service. One point about nodeport that I’d like to mention is that it’s an overlay on top of a typical clusterip service.  In the externalip example above, notice that we didn’t change the type of the service, we just added the externalIPs to the spec.  In the case of nodeport, you need to change the service type.  If you’re using a service within the cluster you might be concerned that making this change would remove the clusterip configuration and prevent pods from accessing the service.  That is not the case.  Nodeport works on top of the clusterip service configuration so you will always have a clusterip when you configure nodeport.
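If you ever need to pull those two ports apart in a script, something like the Python sketch below would do it.  I’m assuming the port column is formatted like ‘80:30487/TCP’ (service port, then nodeport, then protocol); the function name is made up, and the service port of 80 is taken from the rules we looked at earlier.

```python
def parse_ports_column(value):
    """Split one port column entry such as '80:30487/TCP' into its parts."""
    ports, _, proto = value.partition("/")
    service_port, _, node_port = ports.partition(":")
    return {
        "service_port": int(service_port),
        "node_port": int(node_port) if node_port else None,
        "protocol": proto or "TCP",
    }

# A nodeport service lists both ports.
print(parse_ports_column("80:30487/TCP"))
# A plain clusterip service only has the internal port.
print(parse_ports_column("80/TCP"))
```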

At this point, we should be able to reach the service by connecting to any Kubernetes node on that given port…

Let’s now do a similar stare and compare with the iptables rules on each host.  Once again, I’ll compare the rules in place after the configuration to a copy of the rules I had before we started our work.  These are the only lines that were added to get the nodeport functionality working…

These lines are pretty straightforward and perform similar tasks to what we saw above with the externalip functionality. The first line is looking for traffic destined to port 30487 which it will then pass to the KUBE-MARK-MASQ chain so that the traffic will be masqueraded. We have to do this for the same reason we explained above. The second line is also looking for traffic destined to port 30487 and when matched will pass the traffic to the specific chain for the service to handle the load balancing. But how do we get to this chain? If we look at the KUBE-SERVICES chain we will see this entry at the bottom of the chain…

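This rule has always been present in the ruleset; the chain it references (KUBE-NODEPORTS) just never had any rules in it until now.  (iptables won’t let a rule jump to a chain that doesn’t exist, so the chain itself was already there, just empty.)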
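Here is a rough Python model of that before and after state.  It only tracks which targets a packet to a given port would hit in KUBE-NODEPORTS, and it assumes the nodeport rules point at the same KUBE-SVC-SWP62QIEGFZNLQE7 service chain we saw in the externalip example; the function name and the data layout are made up for illustration.

```python
SERVICE_CHAIN = "KUBE-SVC-SWP62QIEGFZNLQE7"  # assumed to be the same service chain as before

def nodeport_targets(chains, dport):
    """Return, in order, the targets a packet to dport hits in KUBE-NODEPORTS."""
    rules = chains.get("KUBE-NODEPORTS", [])
    return [target for port, target in rules if port == dport]

# Before the change the chain is empty, so the jump from KUBE-SERVICES is a no-op.
before = {"KUBE-NODEPORTS": []}
# After the change it holds the mark rule followed by the jump to the service chain.
after = {"KUBE-NODEPORTS": [(30487, "KUBE-MARK-MASQ"), (30487, SERVICE_CHAIN)]}

print(nodeport_targets(before, 30487))
print(nodeport_targets(after, 30487))
```

Both rules are hit for the same packet because the mark is non-terminating, just like in the externalip case.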

Nodeport offers a couple of advantages over externalip. Namely, we can have more than one load balancing target. With externalip, all of the traffic is headed to the same IP address. While you could certainly route that IP to more than one host, letting the network load balance for you, you’d need to worry about how to update the routing when a host fails. With nodeport, it’s reasonable to think about using an external load balancer that references a back-end pool of all of the Kubernetes nodes or minions…

The pool could reference a specific port for each service which would be front-ended on the load balancer by a single VIP.  In this model, if a node went away the load balancer could have the intelligence (in the form of a health check) to automatically pull that node from the pool.  Keep in mind that the destination the load balancer is sending traffic to will not necessarily host the pod that is answering the client request.  However – that’s just the nature of Kubernetes services so that’s pretty much table stakes at this point.
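To sketch the idea, the Python below filters a pool of nodes down to the ones that pass a health check.  This is not how any particular load balancer works; the function names are mine, tcp_check is just one plausible kind of check, and the demo uses a stand-in check so that it runs without a cluster.

```python
import socket

def tcp_check(host, port, timeout=1.0):
    """Health check: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_backends(nodes, node_port, is_healthy=tcp_check):
    """Return (node, port) pairs for every node that passes the health check."""
    return [(node, node_port) for node in nodes if is_healthy(node, node_port)]

# Stand-in check for the demo: pretend ubuntu-5 has gone away.
def fake_check(node, port):
    return node != "ubuntu-5"

pool = healthy_backends(["ubuntu-2", "ubuntu-5"], 30487, is_healthy=fake_check)
print(pool)
```

Swapping fake_check for tcp_check would turn this into a real (if very basic) probe against the nodeport on each node.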

Lastly – if it’s more convenient, you can also manually specify the nodeport you wish to use.  In this instance, I edited the spec to specify a nodeport of 30000…
