Kubernetes


In our last post we talked about how Kubernetes handles pod networking.  Pods are an important networking construct in Kubernetes but by themselves they have certain limitations.  Consider for instance how pods are allocated.  The cluster takes care of running the pods on nodes – but how do we know which nodes it chose?  Put another way – if I want to consume a service in a pod, how do I know how to get to it?  We saw at the very end of the last post that the pods themselves could be reached directly by their allocated pod IP address (an anti-pattern for sure but it still works) but what happens when you have 3 or 4 replicas?  Services aim to solve these problems for us by providing a means to talk to one or more pods grouped by labels.  Let’s dive right in…

To start with, let’s look at our lab where we left off at the end of our last post…

 

If you’ve been following along with me there are some pods currently running.  Let’s clear the slate and delete the two existing test deployments we had out there…

So now that we’ve cleaned out the existing deployments let’s define a new one…

This is pretty straightforward with the exception of two things. First – I’m now using much smaller container images that are based on an excellent post I read on making a small Go web server using the tiny pause container as the base image.  My previous test images were huge so this is the first step I’m taking toward rightsizing them.  Secondly – you’ll notice that in our spec we define two labels.  One defines the application (in this case ‘web-front-end’) and another defines the version (in this case ‘v1’).  So let’s create a YAML file on our master called ‘deploy-test-1.yaml’ and load this deployment…
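The manifest itself isn’t reproduced here, but a minimal sketch of what deploy-test-1.yaml might look like – assuming a hypothetical image name and the Deployment API group in use at the time – is shown below. Note the two labels and the named container port that we’ll reference later.

```yaml
apiVersion: extensions/v1beta1   # the Deployment API group in use at the time
kind: Deployment
metadata:
  name: deploy-test-1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: web-front-end              # identifies the application
        version: v1                     # identifies the version
    spec:
      containers:
      - name: tiny-web-server           # hypothetical container name
        image: jonlangemak/web-server:v1   # hypothetical image name
        ports:
        - containerPort: 8080
          name: web-port                # named port we reference from the service later
```

Loading it is then just a matter of running ‘kubectl create -f deploy-test-1.yaml’ on the master.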

Above we do a couple of things. We first load the definition with kubectl. We then verify that the deployment is defined and that the pods have loaded. In this case we can see that the pod has been deployed on the host ubuntu-5 and the pod has an IP address of 10.100.3.7. At this point – the only way to access the pod is directly by its pod IP address. If you noted in the above deployment definition, we said that the container’s port was 8080. By doing a curl to the pod IP on port 8080 we can see that we can reach the service.

This by itself is not very interesting and only really describes normal pod networking behavior.  If another pod in this cluster wanted to reach the service in this pod you’d have to provide it the pod IP address.  That’s not very dynamic and considering that pods may die and be restarted it’s rather prone to failure.  To solve this Kubernetes uses the service construct.  Let’s look at a service definition…

The main thing a service defines is a selector.  That is – what the service should be used for.  In this case, that selector is ‘app: web-front-end’.  If you’ll recall – our deployment listed this label as part of its specification.  Next – we need to define the ports and protocols the service should use.  In this case we’re using TCP and the port definition specifies the port the service will listen on, in this case 80.  At this point I think it’s easier to think of the service as a load balancer.  The ‘port’ definition defines the port that the front end virtual IP address will listen on.  The ‘targetPort’ specifies what the back-end hosts are listening on or what the traffic should be load balanced to.  In this case, the back-end is any pod that matches our selector.  Interestingly enough – instead of specifying a numeric port here you can specify a port name.  Recall from our deployment specification that we gave the port a name as part of the port definition, in this case ‘web-port’.  Let’s use that with our service definition rather than the numerical definition of 8080.
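Putting that together, a sketch of the service definition looks something like this (the selector, ports, and service name are the ones described in this post):

```yaml
kind: Service
apiVersion: v1
metadata:
  name: svc-test-1
spec:
  selector:
    app: web-front-end      # any pod carrying this label becomes a back-end
  ports:
  - protocol: TCP
    port: 80                # the port the service (VIP) listens on
    targetPort: web-port    # the named container port rather than the number 8080
```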

Let’s define this file as ‘svc-test-1.yaml’ on our Kubernetes master and load it into the cluster…

Once loaded we check to make sure that the cluster sees the service.  Notice that the service has been assigned an IP address out of the ‘service_network_cidr’ we defined when we built this cluster using Ansible.  Going back to our load balancer analogy – this is our VIP IP address.  So now let’s head over to one of the worker nodes and try to access the service…

Excellent!  So the host can access the service IP directly. But what about other pods? To test this let’s fire up a quick pod with a single container in it that we can use as a testing point…

In the above output we used the kubectl ‘run’ sub-command to start a pod with a single container using the image ‘jonlangemak/net_tools’. This image is quite large since it is using Ubuntu as its base image, but it serves as a nice testing endpoint.  Once the pod is running we can use the kubectl ‘exec’ sub-command to run commands directly from within the container much like you would locally by using ‘docker exec’. In this case, we curl to the IP address assigned to the service and get the response we’re looking for. Great!
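The commands themselves look roughly like this (a sketch – the exact pod name that gets generated will differ, and 10.11.12.125 is the service IP assigned in this lab):

```bash
# Start a test pod running the net_tools image
kubectl run net-test --image=jonlangemak/net_tools

# Once it's running, curl the service IP from inside the container
kubectl exec <net-test-pod-name> -- curl http://10.11.12.125
```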

So while this is a win – we’re sort of back in the same boat as before.  Any client looking to access the service running in the pod now needs to know the service’s IP address.  That’s not much different than needing to know the pod’s IP address, is it?  The fix for this is Kube-DNS…

As you can see above – services can be resolved by name so long as you are running the Kube-DNS cluster add-on.  When you register a service the master will take care of inserting a service record for it in Kube-DNS.  The containers can then resolve the service directly by name so long as the kubelet process has the correct DNS information (the ‘cluster-domain’ parameter as part of its service definition).  If it’s configured correctly it will configure the container’s resolv.conf file to include the appropriate DNS server (which also happens to be a service itself) and search domains…
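For reference, the resolv.conf inside a container ends up looking something like this (a sketch – 10.11.12.254 is the kube-dns service IP in this lab, and the search domains assume the default ‘cluster.local’ cluster domain and the ‘default’ namespace):

```
nameserver 10.11.12.254
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```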

So now we know what services can do, but we don’t know how they do it. Let’s now dig into the mechanics of how this all works.  To do that, let’s start by doing some packet captures.  Our topology currently looks like this…

 

As we’ve seen already the net-test pod can access the deploy-test-1 pod both via its pod IP address as well as through the service.  Let’s start by doing a packet capture as close to the source container (net-test) as possible.  In that case, that would be on the VETH interface that connects the container to the cbr0 bridge on the host ubuntu-4.  To do that we need to find the VETH interface name that’s associated with the pause container which the net-test container is connected to.

Note: If you aren’t sure what a pause container is take a look at my last post.

In my case, it’s easy to tell since there’s only one pod running on this host.  If there are more, you can trace the name down by matching up interfaces as follows…

First we get the container ID so that we can ‘exec’ into the container and look at its interfaces. We see that its eth0 interface (really that of the pause container, but same network namespace) is matched up with interface 11. Then on the host we see that the VETH interface with name veth75b33c5c is interface 11. So that’s the interface we want to capture on…
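The matching exercise looks roughly like this (a sketch – container IDs and interface numbers will obviously differ):

```bash
# Find the container ID of the pod's container
sudo docker ps | grep net-test

# Inside the container, eth0 reports the ifindex of its veth peer (e.g. eth0@if11)
sudo docker exec <container-id> ip addr show eth0

# On the host, find the veth interface with that index (11 in this example)
ip -o link | grep ^11:
```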

The capture above was taken while SSH’d directly into the ubuntu-4 host.  To generate the traffic I used the kubectl ‘exec’ sub-command on ubuntu-1 to execute a curl command on the net-test container as shown below…
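Something along these lines (a sketch – the interface name is the one we found above):

```bash
# On ubuntu-4: capture on the pod's host-side veth interface
sudo tcpdump -i veth75b33c5c -nn

# On ubuntu-1 (the master): generate traffic by curling the service by name
kubectl exec <net-test-pod-name> -- curl http://svc-test-1
```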

The capture above is interesting as it shows the container communicating with both the DNS service (10.11.12.254) and the service we created as svc-test-1 (10.11.12.125).  In both cases, the container believes it is communicating directly with the service.  That is – the service IP is used as the destination in outgoing packets and seen as the source in the reply packets.  So now that we know what the container sees let’s move up a hop in the networking stack and see what traffic is traversing the host’s physical network interface…

Now this is interesting.  Here we see the same traffic but as it leaves the minion or node.  Notice anything different?  The traffic has the same source address (10.100.2.9) but now reflects the ‘real’ destination.  The highlighted lines above show the HTTP request we made with the curl command as it leaves the ubuntu-4 host.  Notice that not only does the destination now reflect the pod 10.100.3.7, but the destination port is now also 8080.  If we continue to think of the Kubernetes service as a load balancer, the first capture (depicted in red below) would be the client to VIP traffic and the second capture (depicted in blue below) would show the load balancer to back-end traffic.  It looks something like this…

As it turns out – services are actually implemented with iptables rules.  The Kubernetes host is performing a simple destination NAT after which normal IP routing takes over and does its job. Let’s now dig into the iptables configuration to see exactly how this is implemented.

Side note: I’ve been trying to refer to ‘netfilter rules’ as ‘iptables rules’ since the netfilter term sometimes throws people off.  Netfilter is the actual kernel framework used to implement packet filtering.  Iptables is just a popular tool used to interact with netfilter.  Despite this – netfilter is often referred to as iptables and vice versa.  So if I use both terms, just know I’m talking about the same thing.

If we take a quick look at the iptables configuration of one of our hosts you’ll see quite a few iptables rules already in place…

Side note: I prefer to look at the iptables configuration rather than the iptables command output when tracing the chains. You could also use a command like ‘sudo iptables -nvL -t nat’ to look at the NAT entries we’ll be examining. This is useful when looking for things like hits on certain policies but be advised that this won’t help you with the current implementation of kube-proxy. The iptables policy is constantly refreshed, clearing any counters for given rules. That issue is discussed here as part of another problem.

These rules are implemented on the nodes by the kube-proxy service. The service is responsible for getting updates from the master about new services and then programming the appropriate iptables rules to make the service reachable for pods running on the host. If we look at the logs for the kube-proxy service we can see it picking up some of these service creation events…

Previous versions of the kube-proxy service actually handled the traffic directly rather than relying on netfilter rules for processing. This is still an option, and configurable as part of the kube-proxy service definition, but it’s considerably slower than using netfilter. The difference being that the kube-proxy service runs in user space whereas the netfilter rules are processed in the Linux kernel.

Looking at the above output of the iptables rules it can be hard to sort out what we’re looking for so let’s trim it down slightly and call out how the process works to access the svc-test-1 service…

Note: I know that’s tiny so if you can’t make it out click on the image to open it in a new window.

Since the container is generating what the host will consider forward traffic (it does not originate or terminate on one of the device’s IP interfaces) we only need to concern ourselves with the PREROUTING and POSTROUTING chains of the NAT table. It’s important to also note here that the same iptables configuration will be made on each host.  This is because any host could possibly have a pod that wants to talk to a service.

Looking at the above image we can see the path a packet would take as it traverses the NAT PREROUTING table.  The red arrows indicate a miss and the green arrows indicate a match occurring along with the associated action.  In most cases, the action (called a target in netfilter speak) is to ‘jump’ to another chain.  If we start at the top black arrow we can see that there are 4 targets that we match on…

  • The first match occurs at the bottom of the PREROUTING chain.  There is no match criteria specified so all traffic that reaches this point will match this rule.  The rule specifies a jump target pointing at the KUBE-SERVICES chain.
  • When we get to the KUBE-SERVICES chain we don’t match until the second to last rule which is looking for traffic that is destined to 10.11.12.125 (the IP of our service), is TCP, and has a destination port of 80.  The target for this rule is another jump pointing at the KUBE-SVC-SWP62QIEGFZNLQE7 chain.
  • There’s only one rule in the KUBE-SVC-SWP62QIEGFZNLQE7 chain and it once again lists no matching criteria, only a jump target pointing at the KUBE-SEP-OA6FICRP4YS6R3CE chain.
  • When we get to the KUBE-SEP-OA6FICRP4YS6R3CE chain we don’t match on the first rule so we roll down to the second.  The second rule is looking for traffic that is TCP and specifies a target of DNAT.  The DNAT specifies to change the destination of the traffic to 10.100.3.7 on port 8080.  DNAT is considered a terminating target so processing of the PREROUTING chain ends with this match.

When a DNAT is performed netfilter takes care of making sure that any return traffic is also NAT’d back to the original IP.  This is why the container only sees its communication occurring with the service IP address.
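Pieced together, the relevant rules look roughly like this in iptables-save format (a trimmed-down sketch – comment matches and the first, non-matching rule in the endpoint chain are omitted):

```
-A PREROUTING -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.11.12.125/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-SWP62QIEGFZNLQE7
-A KUBE-SVC-SWP62QIEGFZNLQE7 -j KUBE-SEP-OA6FICRP4YS6R3CE
-A KUBE-SEP-OA6FICRP4YS6R3CE -p tcp -m tcp -j DNAT --to-destination 10.100.3.7:8080
```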

This was a pretty straightforward example of a service so let’s now look at what happens when we have more than one pod that matches the service’s label selector.  Let’s test that out and see…

Above we can see that we’ve now scaled our deployment from 1 pod to 3. This means we should now have 3 pods that match the service definition. Let’s take a look at our iptables rule set now…

The above depicts the ruleset in place for the PREROUTING chain on one of the minions.  I’ve removed all of the rules that didn’t result in a target being hit to make it easier to see what’s happening.  This looks a lot like the output we saw above with the exception of the KUBE-SVC-SWP62QIEGFZNLQE7 chain.  Notice that some of the rules are using the statistic module and appear to be using it to calculate probability.  This allows the service construct to act as a sort of load balancer.  The idea is that each of the rules in the KUBE-SVC-SWP62QIEGFZNLQE7 chain will get hit 1/3 of the time.  This means that traffic to the service IP will be distributed relatively equally across all of the pods that match the service selector label.

Looking at the numbers used to specify probability you might be confused as to how this would provide equal load balancing to all three pods.  But if you think about it some more, you’ll see that these numbers actually lead to an almost perfect 1/3 split between all back-end pods.  I find it helps to think of the probability in terms of flow…

If we process the rules sequentially the first rule in the chain will get hit about 1/3 (0.33332999982) of the time.  This means that about 2/3 (0.66667000018) of the time the first rule will not be hit and processing will flow to the second rule.  The second rule has a 1/2 (.5) probability of being hit.  However – the second rule is only receiving 2/3 of the traffic since the first rule is getting hit 1/3 of the time.  One half of two thirds is one third.  That means that if the second rule misses half of the time, then 1/3 will end up at the last rule of the chain which will always get hit since it doesn’t have a probability statement.  So what we end up with is a pretty equal distribution between the pods that are a part of the service.  At this point, our service now looks like this with connections toward the service having the possibility of hitting any of the three available back-end pods…
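In iptables-save form the scaled-out service chain looks roughly like this (a sketch – the endpoint chain names here are placeholders; the probabilities are the ones discussed above):

```
-A KUBE-SVC-SWP62QIEGFZNLQE7 -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-<endpoint-1>
-A KUBE-SVC-SWP62QIEGFZNLQE7 -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-<endpoint-2>
-A KUBE-SVC-SWP62QIEGFZNLQE7 -j KUBE-SEP-<endpoint-3>
```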

It’s important to call out here that this is providing relatively simple load balancing.  While it works well – it relies on the pods providing fungible services.  That is – each back-end pod should provide the same service and not be dependent on any sort of state with the client.  Since the netfilter rules are processed per flow, there’s no guarantee that we’ll end up on the same back-end pod the next time we talk to the service.  In fact there’s a good chance we won’t.

Now that we know how services work – let’s talk about some other interesting things you can do with them.  You’ll recall above that we defined the service by using a target port name rather than a numerical port.  This allows us some flexibility in terms of what the service can use as endpoints.  An example that’s often given is one where your application changes the port it’s using.  For instance, our pods are currently using port 8080.  But perhaps a new version of our pods uses 9090 instead.  This is where using port names rather than port numbers comes in handy.  So long as our pod definition uses the same name, the numbers can be totally different.  For instance, let’s define this deployment file on our Kubernetes master as deploy-test-2.yaml…
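The manifest isn’t reproduced here either, but a sketch of deploy-test-2.yaml – assuming a hypothetical v2 image name – might look like this:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: deploy-test-2
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: web-front-end
        version: v2
    spec:
      containers:
      - name: tiny-web-server-v2           # hypothetical container name
        image: jonlangemak/web-server:v2   # hypothetical image name
        ports:
        - containerPort: 9090
          name: web-port                   # same port name, different number
```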

Notice that the container port is 9090 but we use the same name for the port.  Now create the deployment…

After deploying it check to make sure the pod is running. Once it comes into a running status try to curl to the service URL (http://svc-test-1) again from your net-test container. I’m going to do it through the kubectl ‘exec’ sub-command on the master but you could also do it directly on the host with ‘docker exec’…

Notice how the service is picking up the new pod? That’s because the pods both share the ‘app=web-front-end’ label that the service is looking for. We can confirm this by showing all of the pods that match that label…

If we wanted to migrate between the old and new versions of the pods, we could first scale up the new pod…

Then we can use the kubectl ‘edit’ sub-command to edit the service. This is done with the ‘kubectl edit service/svc-test-1’ command which will bring up a vi-like text editor for you to make changes to the service. In this case, we want the service to be more specific so we tell it to look for an additional label. Specifically, the ‘version=v2’ label…
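The selector section of the service ends up looking something like this once edited:

```yaml
  selector:
    app: web-front-end
    version: v2
```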

Notice the highlighted line above where we added the new selector. Once edited, save the file like you normally would (ESC, :wq, ENTER). The changes will be made to the service immediately. We can see this by viewing the service and then searching for pods that match the new selector…

And if we execute our test again – we should see only responses from the version 2 pod…

I hinted at this earlier but it’s worth calling out as well. When the kube-proxy service defines rules for the service to work for the pods, it also defines rules for the services to be accessible from the hosts themselves. We saw this at the beginning of the post when the host ubuntu-2 was able to access the service directly by its assigned service IP address. In this case, since the server itself is originating the traffic, different chains are processed. Specifically, the OUTPUT chain is processed which has this rule facilitating getting the traffic to the KUBE-SERVICES chain…
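That rule is simply an unconditional jump in the OUTPUT chain of the NAT table, roughly:

```
-A OUTPUT -j KUBE-SERVICES
```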

From that point, the processing is largely similar to what we saw from the pod perspective. One thing to point out though is that since the hosts are not configured to use Kube-DNS they can not, by default, resolve the services by name.

In the next post we’ll talk about how you can use services to provide external access into your Kubernetes cluster.  Stay tuned!

Some time ago I wrote a post entitled ‘Kubernetes networking 101‘.  Looking at the published date I see that I wrote that more than 2 years ago!  That being said – I think it deserves a refresher.  This time around, I’m going to split the topic into smaller posts in the hopes that I can more easily maintain them as things change.  In today’s post we’re going to cover how networking works for Kubernetes pods.  So let’s dive right in!

In my last post – I described a means in which you can quickly deploy a small Kubernetes cluster using Ansible.  I’ll be using that environment for all of the examples shown in these posts.  To refresh our memory – let’s take another quick look at the topology…

The lab consists of 5 hosts with ubuntu-1 acting as the Kubernetes master and the remaining nodes acting as Kubernetes minions (often called nodes now but I can’t break the habit).  At the end of our last post we had what looked like a working Kubernetes cluster and had deployed our first service and pods to it.  Prior to deploying to the cluster we had to add some routing in the form of static routes on our Layer 3 gateway.  This step ensured that the allocated cluster (or pod) network was routed to the proper host.  Since the last post I’ve rebuilt the lab many times.  Given that the cluster network allocation is random (assigned as the nodes come up) the subnet allocation, and hence the static routes, have changed.  My static routes now look like this…

So now that we’re back to a level state – let’s talk about pods.  We already have some pods deployed as part of the kube-dns deployment but to make things easier to understand let’s look at deploying a new pod manually so we can examine what happens during a pod deployment.

I didn’t point this out in the first post – but out of the box kubectl will only work on your master node.  It will not work anywhere else without further configuration.  We’ll talk about that in an upcoming post where we discuss the Kubernetes API server.  For now – make sure you’re running kubectl directly on your master.

Our first pod will be simple.  To run it, execute this command on your master…
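The exact command isn’t shown here, but it amounts to something like this (the pod name is arbitrary; the image is the one we inspect later in the post):

```bash
kubectl run pod-test-1 --image=jonlangemak/web_server_1
```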

What we’re doing here is simply asking the cluster to run a single container. Since the smallest deployment unit within Kubernetes is a pod, it will run this single container in a pod. But what is a pod? Kubernetes defines a pod as

A pod (as in a pod of whales or pea pod) is a group of one or more containers (such as Docker containers), the shared storage for those containers, and options about how to run the containers. Pods are always co-located and co-scheduled, and run in a shared context. A pod models an application-specific “logical host” – it contains one or more application containers which are relatively tightly coupled — in a pre-container world, they would have executed on the same physical or virtual machine.

So while this sounds awfully application specific, there is at least one thing we can infer about a pod’s network from this definition.  The description describes a pod as a group of containers that model a single logical host.  If we carry that over to the network world, to me that implies a single network endpoint.  Boil that down further and it implies a single IP address.  Reading further into the description we find…

Containers within a pod share an IP address and port space, and can find each other via localhost. They can also communicate with each other using standard inter-process communications like SystemV semaphores or POSIX shared memory. Containers in different pods have distinct IP addresses and can not communicate by IPC.

So our initial assumption was right.  A pod has a single IP address and can access other containers in the same pod over the localhost interface.  To summarize – all containers in the same pod share the same network namespace.  So let’s take a look at a running pod and see what’s actually been implemented by running our pod.  To find the pod, ask kubectl to return a list of all of the known pods.  Don’t forget the ‘-o wide’ parameter which tells the output to include the pod IP address and the node…

So in our case the pod is running on the host ubuntu-3. You can see that the pod received an IP address out of the cluster CIDR which was previously allocated, and routed, to the host ubuntu-3. So let’s move over to the ubuntu-3 host and see what’s going on. We’ll first examine the running containers on the host…

This host happens to also be running one of the kube-dns replicas so for now only focus on the above highlighted lines. We can see that our container jonlangemak/web_server_1 is in fact running on this host. Let’s inspect the container’s network configuration to see what we can find out…
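One way to check this with Docker directly is to look at the container’s network mode (a sketch – use whichever container ID ‘docker ps’ shows for the web_server_1 container):

```bash
# A container joined to another container's network namespace reports
# "container:<id>" here, where <id> is the pause container it's mapped to
sudo docker inspect --format '{{.HostConfig.NetworkMode}}' <container-id>
```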

From this output we can tell that this container is running in what I call mapped container mode.  Mapped container mode describes when one container joins the network namespace of an existing container.  If we look at the ID of the container that jonlangemak/web_server_1 is mapped to, we can see that it belongs to the second container we highlighted in the above output – gcr.io/google_containers/pause-amd64:3.0.  So we knew that all containers within the same pod share the same network namespace but how does the pause container fit into the picture?

The reason for the pause container is actually pretty easy to understand when you think about order of operations.  Let’s consider a different scenario for a moment.  Let’s consider a pod definition which specifies three containers.  As we already mentioned – all containers within a single pod share the same network namespace.  In order to do that, one method might be to launch one of the pod containers, and then attach all subsequent containers to that first container.  Something like this…

But this doesn’t work for a couple of reasons.  Containers are spawned as soon as the image is downloaded and ready so it would be hard to determine which container would be ready first.  However – even if we don’t consider the logic required to determine which container the others should join, what happens when container1 in the diagram above dies?  Say it encounters an error, or a bug was introduced that causes it to die.  When it dies, we just lost our anchoring point for the pod…

Let’s imagine container1 spawned first.  After that container2 spawned and managed to connect itself to container1’s network namespace.  Shortly thereafter container1 encountered an error and died.  Not only is container2 in trouble, but now container3 has no place to connect to.

If you’re interested in seeing the results it’s not hard to replicate just by running a couple Docker containers on their own.  Start container1, then add container2 to container1’s network namespace (–net=container:<container1’s name>), then kill container1.  Container2 will likely stay running but won’t have any network access and no other containers can join container1 since it’s not running.
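If you want to try it, the experiment looks roughly like this (the image names are just examples):

```bash
# Start an anchor container and join a second container to its network namespace
sudo docker run -d --name container1 nginx
sudo docker run -d --name container2 --net=container:container1 busybox sleep 3600

# Kill the anchor
sudo docker kill container1

# container2 keeps running but has lost its network connectivity,
# and nothing else can join container1 now that it's stopped
sudo docker exec container2 ip addr
```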

A better approach is to run a known good container and join all of the pod containers to it…

In this scenario – we know that the pause container will run and we don’t have to worry about what container comes up first since we know that all containers can join the pause container.  In this case – the pause container serves as an anchoring point for the pod and makes it easy to determine what network namespace the pod containers should join.  Makes pretty good sense right?

Now let’s look at an actual scenario where a pod has more than one container.  To deploy that we’ll need to define a configuration file to pass to kubectl.  Save this on your master server as pod2-test.yaml…
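A sketch of what pod2-test.yaml might look like, assuming both images serve on port 80 – the point being that both containers in the pod end up trying to use the same port:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod2
spec:
  containers:
  - name: web1
    image: jonlangemak/web_server_1
    ports:
    - containerPort: 80
  - name: web2
    image: jonlangemak/web_server_2
    ports:
    - containerPort: 80     # same port as the first container – this is the problem
```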

Notice that this specification defines two containers, jonlangemak/web_server_1 and jonlangemak/web_server_2. To have Kubernetes load this pod run the following command…

Now let’s check the status of our pod deployment…

Notice that pod2 lists a status of error. Instead of going to that node to check the logs manually, let’s retrieve them through kubectl…

The logs from the first container look fine but the second show some errors as highlighted above. Have you figured out what we did wrong yet? Recall that a pod is a single network namespace.  Per our pod definition above we attempted to load two containers, in the same pod, that were using the same port number. The first one loaded successfully but when the second tried to bind to its defined port it failed since it overlapped with the other pod container. The solution to this is to run two containers that are listening on two different ports. Let’s define another YAML specification called pod3-test.yaml…
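A sketch of pod3-test.yaml – the second container image here is a hypothetical one that listens on 8080 instead of 80:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod3
spec:
  containers:
  - name: web1
    image: jonlangemak/web_server_1         # listens on port 80
    ports:
    - containerPort: 80
  - name: web2
    image: jonlangemak/web_server_2_8080    # hypothetical image listening on 8080
    ports:
    - containerPort: 8080
```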

Let’s clean up our last test pod and then deploy this new pod…

Great! Now the pod is running as expected. If we go to host ubuntu-4 we’ll see a single pause container for the pod as well as our two pod containers running in Docker…

If we inspect the two pod containers we’ll see that they are connected to the pause container as expected…

At this point we’ve seen how pause containers work, how to deploy pods with one or more containers, as well as some of the limitations of multi-container pods. What we haven’t talked about is where the pod IP address comes into play. Let’s talk through that next by examining the configuration for our first pod we defined that’s living on ubuntu-3.

Once the pod IP is allocated it is assigned to the pause container.  The work of downloading the containers defined as part of the pod definition and mapping them into the pause container’s network namespace begins.  As I hinted at in earlier posts, Kubernetes now leverages CNI to provide container networking.  If you looked closely at the systemd service definition for the kubelet running on the nodes you’d see a line that defines what network plugin is being used…

We can see that in this case it’s the kubenet plugin. Kubenet is the built-in network plugin provided with Kubernetes. Despite being built in it still requires the CNI components bridge, lo, and host-local. Since it uses the host-local IPAM driver we know where to look for its IP address allocations from our previous CNI posts.

As we described in the previous article – the host-local IPAM driver stores IP allocations in the ‘/var/lib/cni/networks/<network name>’ directory.  If we browse this directory we see that there are two allocations.  10.100.1.2 is being used by kube-dns and we know that 10.100.1.3 is being used by our first pod based on the output from kubectl above.

Since we know that the pod’s IP address is 10.100.1.3, we can look at that file to get the container ID of the container using that IP address. In this case, we can see that the ID lines up with the pause container ID our pod is using…
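On the host that looks roughly like this (a sketch – the directory name under /var/lib/cni/networks/ depends on how kubenet named the network; ‘kubenet’ here is an assumption, so check what’s actually present):

```bash
# List the host-local IPAM allocations for the pod network
ls /var/lib/cni/networks/kubenet/

# Each file is named after an allocated IP and contains the ID of the
# (pause) container holding that address
cat /var/lib/cni/networks/kubenet/10.100.1.3
```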

If we look at the documentation for the kubenet plugin we’ll see it works in the following manner…

Kubenet creates a Linux bridge named cbr0 and creates a veth pair for each pod with the host end of each pair connected to cbr0. The pod end of the pair is assigned an IP address allocated from a range assigned to the node either through configuration or by the controller-manager.

If we look at the interfaces on the ubuntu-3 host we will see the cbr0 interface along with one side of a VETH pair that lists cbr0 as its parent…

In this case there are two VETH pair interfaces listed but we can easily tell which one belongs to our pod by checking the VETH pair mapping in the pod…
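A couple of commands make the matching easy (a sketch):

```bash
# On ubuntu-3: list the host-side veth interfaces attached to the cbr0 bridge
ip -o link show master cbr0

# Inside the pod's network namespace: eth0 reports the ifindex of its
# host-side peer (e.g. eth0@if6); assumes the image includes the ip tool
sudo docker exec <container-id> ip addr show eth0
```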

Notice that the pod believes that the other end of the VETH pair is interface number 6 (eth0@if6) which lines up with the VETH interface ‘vethfa911188@if3’ on the host. So at this point we know how pods are networked on the local host. The whole setup looks something like this…

Now let’s talk about what this means from a network perspective.  As part of the cluster setup we routed the 10.100.1.0/24 network to the host ubuntu-3 physical interface address of 10.20.30.73.  This means that if I initiate a network request toward a pod IP address it will end up on the host.  Additionally, since the host sees 10.100.1.0/24 as connected – it will attempt to ARP for any IP address in that subnet.  This means that in this case (natively routed pod networks) the pods are accessible directly…

So long as the routing is in place, I can even access the pod networks from my desktop machine which is on a remote subnet but uses the same gateway…

You’d also find that you can connect to the containers running in the third pod (10.100.2.2) in the same manner.  In this case, since there are two containers we can access them on their respective ports of 80 and 8080…

Should pods be accessed this way?  Not really but it is a troubleshooting step you can perform if your network routing allows for the connectivity.  Pods should be accessed through services which we’ll discuss in the next post.

Some of you will recall that I had previously written a set of SaltStack states to provision a bare metal Kubernetes cluster.  The idea was that you could use it to quickly deploy (and redeploy) a Kubernetes lab so you could get more familiar with the project and do some lab work on a real cluster.  Kubernetes is a fast moving project and I think you’ll find that those states likely no longer work with all of the changes that have been introduced into Kubernetes.  As I looked to refresh the posts I found that I was now much more comfortable with Ansible than I was with SaltStack so this time around I decided to write the automation using Ansible (I did also update the SaltStack version but I’ll be focusing on Ansible going forward).

However – before I could automate the configuration I had to remind myself how to do the install manually. To do this, I leaned heavily on Kelsey Hightower’s ‘Kubernetes the hard way‘ instructions.  These are a great resource and if you haven’t installed a cluster manually before I suggest you do that before attempting to automate an install.  You’ll find that the Ansible role I wrote VERY closely mimics Kelsey’s directions so if you’re looking to understand what the role does I’d suggest reading through Kelsey’s guide.  A big thank you to him for publishing it!

So let’s get started…

This is what my small lab looks like.  A couple of brief notes on the base setup…

  • All hosts are running a fresh install of Ubuntu 16.04.  The only options selected for package installation were for the OpenSSH server so we could access the servers via SSH
  • The servers all have static IPs as shown in the diagram and a default gateway as listed on the L3 gateway
  • All servers reference a local DNS server 10.20.30.13 (not pictured) and are resolvable in the local ‘interubernet.local’ domain (ex: ubuntu-1.interubernet.local).  They are also reachable via short name since the configuration also specifies a search domain (interubernet.local).
  • All of these servers can reach the internet through the L3 gateway and use the aforementioned DNS server to resolve public names.  This is important since the nodes will download binaries from the internet during cluster configuration.
  • In my example – I’m using 5 hosts.  You don’t need 5 but I think you’d want at least 3 so you could have 1 master and 2 minions but the role is configurable so you can have as many as you want
  • I’ll be using the first host (ubuntu-1) as both the Kubernetes master as well as the Ansible controller.  The remaining hosts will be Ansible clients and Kubernetes minions
  • The servers all have a user called ‘user’ which I’ll be using throughout the configuration

With that out of the way let’s get started.  The first thing we want to do is get Ansible up and running.  To do that, we’ll start on the Ansible controller (ubuntu-1) by getting SSH prepped for Ansible.  We’ll begin by generating an SSH key for our user (remember this is a new box, you might already have a key)…

To do this, we use the ‘ssh-keygen’ command to create a key for the user. In my case, I just hit enter to accept the defaults and to not set a password on the key (remember – this is a lab). Next we need to copy the public key to all of the servers that the Ansible controller needs to talk to. To perform the copy we’ll use the ‘ssh-copy-id’ command to move the key to all of the hosts…
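The two commands look like this (defaults accepted at the ssh-keygen prompts):

```bash
# Generate a key pair for the local user (no passphrase in this lab)
ssh-keygen

# Copy the public key to each host, for example ubuntu-5
ssh-copy-id user@ubuntu-5
```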

Above I copied the key over for the user ‘user’ on the server ubuntu-5. You’ll need to do this for all 5 servers (including the master).  Now that that’s in place, make sure that the keys are working by trying to SSH directly into the servers…

While you’re in each server make sure that Python is installed on each host. Besides the above SSH setup – having Python installed is the only other requirement for the hosts to be able to communicate and work with the Ansible controller…

In my case Python wasn’t installed (these were really stripped down OS installs so that makes sense) but there’s a good chance your servers will already have Python. Once all of the clients are tested we can move on to install Ansible on the controller node. To do this we’ll use the following commands…

I won’t bother showing the output since these are all pretty standard operations. Note that in addition to installing Ansible we are also installing Python pip. Some of the Jinja templating I do with the playbook requires the Python netaddr library.  After you install Ansible and pip, take care of installing the netaddr package to get that out of the way…
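The exact commands weren’t captured here, but on Ubuntu 16.04 a typical sequence is something like:

```bash
# Install Ansible and pip from the Ubuntu repositories
sudo apt-get update
sudo apt-get install -y ansible python-pip

# The role's Jinja templating needs the Python netaddr library
sudo pip install netaddr
```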

Now we need to tell Ansible what hosts we’re working with. This is done by defining hosts in the ‘/etc/ansible/hosts’ file. The Kubernetes role I wrote expects two host groups. One called ‘masters’ and one called ‘minions’. When you edit the host file for the first time there will likely be a series of comments with examples. To clean things up I like to delete all of the example comments and just add the two required groups. In the end my Ansible hosts file looks like this…
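A sketch of the finished file, using the lab host names:

```ini
[masters]
ubuntu-1

[minions]
ubuntu-2
ubuntu-3
ubuntu-4
ubuntu-5
```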

You’ll note that the ‘masters’ group is plural but at this time the role only supports defining a single master.

Now that we told Ansible what hosts it should talk to we can verify that Ansible can talk to them. To do that, run this command…

You should see a ‘pong’ result from each host indicating that it worked. Pretty easy right? Now we need to install the role. To do this we’ll create a new role directory called ‘kubernetes’ and then clone my repository into it like this…
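Roughly like this (the repository URL is omitted here – use the one for the role):

```bash
# Verify Ansible can reach every host in the inventory
ansible all -m ping

# Create the role directory and clone the repository into it (note the trailing '.')
sudo mkdir -p /etc/ansible/roles/kubernetes
cd /etc/ansible/roles/kubernetes
sudo git clone <repository URL> .
```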

Make sure you put the ‘.’ at the end of the git command otherwise git will create a new folder in the kubernetes directory to put all of the files in.  Once you’ve downloaded the repository you need to update the variables that Ansible will use for the Kubernetes installation. To do that, you’ll need to edit the role’s variable file which should now be located at ‘/etc/ansible/roles/kubernetes/vars/main.yaml’. Let’s take a look at that file…

I’ve done my best to make ‘useful’ comments in here but there’s a lot more that needs to be explained (and will be in a future post) but for now you definitely need to pay attention to the following items…

  • The host_roles list needs to be updated to reflect your actual hosts.  You can add more or have fewer but the type of the host you define in this list needs to match what group it’s a member of in your Ansible hosts file.  That is, a minion type in the var file needs to be in the minion group in the Ansible host file.
  • Under cluster_info you need to make sure you pick two networks that don’t overlap with your existing network.
    • For service_network_cidr pick an unused /24.  This won’t ever get routed on your network but it should be unique.
    • For cluster_node_cidr pick a large network that you aren’t using (a /16 is ideal).  Kubernetes will allocate a /24 for each host out of this network to use for pods.  You WILL need to route this traffic on your L3 gateway but we’ll walk through how to do that once we get the cluster up and online.

Once the vars file is updated the last thing we need to do is tell Ansible what to do with the role. To do this, we’ll build a simple playbook that looks like this…
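A minimal sketch of such a playbook:

```yaml
---
- hosts: masters:minions
  roles:
    - kubernetes
```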

The playbook says “Run the role kubernetes on hosts in the masters and the minions group”.  Save the playbook as a YAML file somewhere on the system (in my case I just saved it in ~ as k8s_install.yaml). Now all we need to do is run it! To do that run this command…
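Assuming you saved it as ~/k8s_install.yaml, that’s:

```bash
ansible-playbook ~/k8s_install.yaml --ask-become-pass
```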

Note the addition of the ‘–ask-become-pass’ parameter. When you run this command, Ansible will ask you for the sudo password to use on the hosts. Many of the tasks in the role require sudo access so this is required. An alternative to having to pass this parameter is to edit the sudoers file on each host and allow the user ‘user’ to perform passwordless sudo. However – using the parameter is just easier to get you up and running quickly.

Once you run the command Ansible will start doing its thing. The output is rather verbose and there will be lots of it…

If you encounter any failures using the role please contact me (or better yet open an issue in the repo on GitHub).  Once Ansible finishes running we should be able to check the status of the cluster almost immediately…

The component status should return a status of ‘Healthy’ for each component and the nodes should all move to a ‘Ready’ state.  The nodes will take a minute or two in order to transition from ‘NotReady’ to ‘Ready’ state so be patient. Once it’s up we need to work on setting up our network routing. As I mentioned above we need to route the network Kubernetes used for the pod networks to the appropriate hosts. To find which hosts got which network we can use this command which is pointed out in the ‘Kubernetes the hard way’ documentation…
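The command from that guide looks roughly like this:

```bash
kubectl get nodes \
  --output=jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address} {.spec.podCIDR} {"\n"}{end}'
```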

Slick – ok. So now it’s up to us to make sure that each of those /24’s gets routed to each of those hosts. On my gateway, I want the routing to look like this…

Make sure you add the routes on your L3 gateway before you move on.  Once routing is in place we can deploy our first pods.  The first pod we’re going to deploy is kube-dns which is used by the cluster for name resolution.  The Ansible role already took care of placing the pod definition files for kube-dns on the controller for you, you just need to tell the cluster to run them…
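Something along these lines (the file names are placeholders – use whatever the role actually dropped on the master):

```bash
# Load the kube-dns pod and service definitions
kubectl create -f <kube-dns pod definition>.yaml
kubectl create -f <kube-dns service definition>.yaml

# Check on both (kube-dns runs in the kube-system namespace)
kubectl get pods --namespace=kube-system
kubectl get services --namespace=kube-system
```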

As you can see there is both a service and a pod definition you need to install by passing the YAML file to the kubectl command with the ‘-f’ parameter. Once you’ve done that we can check to see the status of both the service and the pod…

If all went well you should see that each pod is running all three containers (denoted by the 3/3) and that the service is present.  At this point you’ve got yourself your very own Kubernetes cluster.  In the next post we’ll walk through deploying an example pod and step through how Kubernetes does networking.  Stay tuned!
