In our last post, we looked at how Kubernetes handles the bulk of its networking. What we haven't covered yet is how to access services deployed in the Kubernetes cluster from outside the cluster. Services that live in pods can be accessed directly since each pod has its own routable IP address, but what if we want something a little more dynamic? What if we used a replication controller to scale our web front end? We have the Kubernetes service, but what I would call its VIP range (the portal net) isn't routable on the network. There are a couple of ways to solve this problem. Let's walk through the problem and talk about a couple of ways to solve it. I'll demonstrate the way I chose to solve it, but that doesn't imply there aren't other, possibly better, ways as well.
As we've seen, Kubernetes has a built-in load balancer which it refers to as a service. A service is a group of pods that all provide the same function. Services are accessible to other pods through an IP address allocated out of the cluster's portal net range. This all works out rather well for the pods, but only because of the way we get traffic to the service. Recall from our earlier posts that we need to use some iptables (netfilter) tricks to get traffic to the load balancing mechanism. That mechanism is the Kubernetes proxy service, and it lives on each and every Kubernetes node. That being the case, the concept of services only works for devices whose traffic passes through the Kubernetes proxy, which means only devices that are in the cluster.
So to solve for external access into the cluster, it seems reasonable that we'd need a different type of 'external' load balancer. But what would an external load balancer use as its backend pool members? The pod IP addresses are routable, so we could just load balance directly to the pod IP addresses. However, pods tend to be ephemeral in nature; we would be constantly updating the load balancer pools as pods came and went for a particular service group. On the flip side, if we could leverage the service construct, the cluster would take care of finding the pods for us. Luckily for us, there's a way to do this. Let's look at a quick sample service definition…
id: "webfrontend" kind: "Service" apiVersion: "v1beta1" port: 9090 containerPort: 80 PublicIPs: [10.20.30.62,10.20.30.63,192.168.10.64,192.168.10.65] selector: name: "web80" labels: name: "webservice"
Notice we added a new variable called 'PublicIPs'. By defining public IP addresses, we're telling the cluster that we want this service reachable on these particular IP addresses. You'll also note that the public IPs I define are the IP addresses of the physical Kubernetes nodes. This is purely a matter of simplicity; I could assign any IP address I wanted as a 'public IP' so long as the network knew how to get traffic for it to the Kubernetes cluster.
Note: This works quite differently if you're running on GCE. There you can just tell the service to provision an external load balancer and the magic of Google's network does it for you. Recall that in these posts I'm dealing with a bare metal lab.
So what does this really do? Let's deploy it to the cluster and then check one of our Kubernetes nodes' netfilter rules by dumping the rules with 'iptables-save'…
Note: I'm not going to step through all the commands I use to build this lab. If you don't know how to deploy constructs into Kubernetes, go back and read this post.
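If you do want to follow along, the deployment and the rule dump look something like this. I'm assuming the definition above is saved as webfrontend.yaml and that you're deploying with kubectl as in the earlier posts:

# Create the service from the definition above
kubectl create -f webfrontend.yaml

# Then, on one of the Kubernetes nodes, dump the netfilter rules and look for the service port
iptables-save | grep 9090

And here's a rough sketch of what comes back on a node like kubminion1 (not my literal output; the userspace kube-proxy of this vintage programs the KUBE-PORTALS-CONTAINER and KUBE-PORTALS-HOST chains, and the 40000 below stands in for whatever random local proxy port it picked):

-A KUBE-PORTALS-CONTAINER -d 10.100.49.241/32 -p tcp -m tcp --dport 9090 -j REDIRECT --to-ports 40000
-A KUBE-PORTALS-CONTAINER -d 10.20.30.62/32 -p tcp -m tcp --dport 9090 -j REDIRECT --to-ports 40000
# ...one REDIRECT rule like the above for each of the other public IPs...
-A KUBE-PORTALS-HOST -d 10.100.49.241/32 -p tcp -m tcp --dport 9090 -j DNAT --to-destination 10.20.30.62:40000
-A KUBE-PORTALS-HOST -d 10.20.30.62/32 -p tcp -m tcp --dport 9090 -j DNAT --to-destination 10.20.30.62:40000
# ...one DNAT rule like the above for each of the other public IPs...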
Normally we'd expect to see a rule in each block just referencing the service IP address of 10.100.49.241. In this case, we see 4 more rules in each block, one for each public IP we defined in the service. Notice how each rule is exactly the same with the exception of the destination IP address. This tells us that the node will handle traffic destined to these 4 new IP addresses in the exact same manner that it handles traffic for the service IP. So that's awesome! However, it's only so awesome. This type of setup means that each Kubernetes node can handle requests on port 9090 for the web80 service. But how do we handle that from a user perspective? We don't want a user going right to a Kubernetes node since the nodes themselves should be considered ephemeral. So we need another abstraction layer here to make this seamless to the end user.
This is where the external load balancer kicks in. In my case, I chose to use HAproxy since there's a pre-built Docker image available on Docker Hub. Let's run through the config and then I'll circle back and talk about some specifics. I need a Docker host to run the HAproxy container on, and since I want it to be 'outside' the Kubernetes cluster (that is, not running the kube-proxy service), I chose to just use the kubmasta host. The first thing we need is a workable HAproxy config file. The one I generated looks like this…
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    user haproxy
    group haproxy

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option forwardfor
    option http-server-close
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
    stats enable
    stats auth user:kubernetes
    stats uri /haproxyStats

frontend all
    bind *:80

    #Define the host we're looking for
    acl host_web80 hdr(host) -i web80.interubernet.local
    acl host_web8080 hdr(host) -i web8080.interubernet.local

    #Decide what backend pool to use for each host
    use_backend webservice80 if host_web80
    use_backend webservice8080 if host_web8080

backend webservice80
    balance roundrobin
    option httpclose
    option forwardfor
    server kubminion1 10.20.30.62:9090 check
    server kubminion2 10.20.30.63:9090 check
    server kubminion3 192.168.10.64:9090 check
    server kubminion4 192.168.10.65:9090 check
    option httpchk HEAD /index.html HTTP/1.0

backend webservice8080
    balance roundrobin
    option httpclose
    option forwardfor
    server kubminion1 10.20.30.62:9091 check
    server kubminion2 10.20.30.63:9091 check
    server kubminion3 192.168.10.64:9091 check
    server kubminion4 192.168.10.65:9091 check
    option httpchk HEAD /index.html HTTP/1.0
There's quite a bit to digest here, and if you haven't used HAproxy before this can be a little confusing. Let's hit on the big items. Under the defaults section I enable the statistics page and tell HAproxy which URI it should be accessible at. I also define a username and password for authentication to that page.
The next section defines a single frontend and several backend pools. The frontend binds the service to all (*) interfaces on port 80. This is important, and something that I didn't think about until I had a 'duh, I'm running this in a container' moment. I initially tried to bind the service to a specific IP address. This doesn't work. Since the host is running a default Docker network configuration, the IP assigned to the container is random and we're going to have to use port mappings to get traffic into the container. So binding to all available interfaces is really the easiest option (there are other ways to do this, but this is the simplest). The remainder of the frontend section defines 2 rules for load balancing. I define host_web80 to match any request destined to 'web80.interubernet.local' and host_web8080 to match any request destined to 'web8080.interubernet.local'. This also implies that I have DNS records that look like this…
A – web80.interubernet.local – 10.20.30.61
A – web8080.interubernet.local – 10.20.30.61
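If you want to double-check the name resolution before going any further, a quick lookup from any client should work; both names should come back pointing at the kubmasta address (the output format will vary by resolver):

# Both names should resolve to 10.20.30.61
nslookup web80.interubernet.local
nslookup web8080.interubernet.local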
Recall I mentioned that I'm using the default Docker network configuration, so when I run the container I'll map the ports I need to ports on the host's (kubmasta, 10.20.30.61) physical interface.
The backend sections define the pools. As you can see, I define each Kubernetes minion as well as a specific health check for each server. The default 'check' HAproxy uses is just a layer 4 port probe to the backend host. While this will ensure that the host is up and talking to the Kubernetes cluster, it does NOT ensure that the pods we want to talk to are actually running. Recall that when we define a service, the rules get pushed to all of the hosts and Kubernetes starts searching for pods with a label that matches the service's label selector. If there are no available pods, the L4 port check will still succeed. For that reason, we define a layer 7 health check that verifies the index.html file exists.
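If you want to see what that layer 7 check amounts to, you can approximate it by hand from the HAproxy host. Something along these lines (assuming the web80 service and a matching pod are already up, which we'll get to shortly) should come back with a 200:

# Send a HEAD request to the service port on a node, just like the httpchk probe does
curl -I http://10.20.30.62:9090/index.html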
The HAproxy image can pull a custom configuration into the container through a mapped volume. In my case, I created a folder (/root/haproxy-config/) and put my configuration file (haproxy.cfg) in that folder. Then to run the container I used this command…
docker run -d -p 80:80 -v ~/haproxy-config:/haproxy-override dockerfile/haproxy
In addition to mapping the volume, I also map port 80 on the host (10.20.30.61) to port 80 on the container. Once your host downloads the image, you should be able to verify it’s running and that port 80 has been mapped…
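For example, a quick look at the running containers should show the mapping (your container ID and image will obviously differ):

# Confirm the container is up and that 0.0.0.0:80->80/tcp shows in the PORTS column
docker ps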
So now that we have the HAproxy configuration in place, let’s define the service we listed above as well as a second service that looks like this…
id: "webfrontend2" kind: "Service" apiVersion: "v1beta1" port: 9091 containerPort: 8080 PublicIPs: [10.20.30.62,10.20.30.63,192.168.10.64,192.168.10.65] selector: name: "web8080" labels: name: "webservice"
Once both services are defined, the next step is to define the replication controllers for the backend pods. Before we do that, let’s take a quick look at the HAproxy dashboard to see what it sees…
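Based on the stats settings in the config above, the dashboard lives at the /haproxyStats URI on the kubmasta address, protected by the credentials we set with 'stats auth'. You can open it in a browser or poke at it from the CLI, something like:

# The stats page uses basic auth with the user:kubernetes credentials from the config
curl -u user:kubernetes http://10.20.30.61/haproxyStats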
So it looks like it doesn’t see any of the backend pools as being available. We haven’t deployed any of our pods yet so this is normal. Keep in mind that if we didn’t do the HTTP check on the backends this would show as up since the Kubernetes service is currently in place on the cluster. We’ll define the backend pools through a Kubernetes replication controller. Let’s start with the first one…
id: web-controller
apiVersion: v1beta1
kind: ReplicationController
desiredState:
  replicas: 1
  replicaSelector:
    name: web80
  podTemplate:
    desiredState:
      manifest:
        version: v1beta1
        id: webpod
        containers:
          - name: webpod
            image: jonlangemak/docker:web_container_80
            ports:
              - containerPort: 80
    labels:
      name: web80
This shouldn’t look new. Nothing special here, except for the fact that we’re only deploying a single replica. Let’s deploy it to the cluster and then check back with HAproxy…
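As before, I'm assuming the controller definition is saved to a file (web-controller.yaml is just the name I'd use); deploying and checking on it looks something like:

# Create the replication controller and wait for its single replica to start
kubectl create -f web-controller.yaml
kubectl get pods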
We can now see that the entire cluster is online. Are you confused as to why it shows the service up on all 4 nodes when there should only be a single pod running on one of them? Remember, we're still using services, so the pod appears to be reachable through all 4 nodes. Let's look at a diagram to show you what I mean…
In the case of the health checks, the probes come from the HAproxy container and are sent to what it believes to be the backend servers that will service the requests. That's not really the case. What we really have is two layers of load balancing. The request comes in from HAproxy on port 9090, which is the port we defined in our service (red line). The Kubernetes host receives the traffic, netfilter catches it, and sends it on an assigned random port to the Kubernetes proxy service (orange line). The Kubernetes proxy knows that there's currently only one pod matching its label selector, so it sends the traffic directly to that pod on port 80. This causes HAproxy to think that all 4 Kubernetes nodes are hosting the service it's looking for when it's really only running on kubminion1.
Now let’s create our second replication controller for the web8080 traffic…
id: web-controller-2
apiVersion: v1beta1
kind: ReplicationController
desiredState:
  replicas: 2
  replicaSelector:
    name: web8080
  podTemplate:
    desiredState:
      manifest:
        version: v1beta1
        id: webpod
        containers:
          - name: webpod
            image: jonlangemak/docker:web_container_8080
            ports:
              - containerPort: 8080
    labels:
      name: web8080
Once we deploy this controller to the cluster we can check HAproxy again…
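Deploying it is the same drill as the first controller (again, web-controller-2.yaml is just my assumed file name):

# Create the second replication controller; this one runs 2 replicas
kubectl create -f web-controller-2.yaml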
Now HAproxy believes that both backends are up. Let's do a couple of quick tests to verify things are working as we expect. Recall that we need to use the DNS names for this to work since that's how we're mapping traffic to the correct backend pool…
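From a client that can resolve those DNS names, a couple of simple requests should land on the two different backend pools (the exact page content depends on what the web_container images serve):

# Should be matched by the host_web80 ACL and sent to the webservice80 backend
curl http://web80.interubernet.local

# Should be matched by the host_web8080 ACL and sent to the webservice8080 backend
curl http://web8080.interubernet.local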
So things seem to be working as expected. Let’s wrap up by showing what the flow would look like for each of these user connections on our diagram…
So as you can see, what's really happening isn't super straightforward. Adding a second layer of load balancing with HAproxy certainly makes this a little more confusing, but it also gives you a lot of resiliency and flexibility. For instance, we can keep deploying services in this manner until we run out of ports. The web80 service used port 9090 and web8080 used 9091. I could very easily define another service in Kubernetes on port 9092 (or any other port) and then create a new HAproxy frontend rule and associated backend pool.
Like I said, this is just one way to do it and I’m not even convinced that it’s the ‘right’ way to solve this. One of the key benefits of Kubernetes is that it reduces the need for excessive port mapping. Using HAproxy in this manner reintroduces some port madness. However, I think it’s rather minimal and much easier to scale than when you’re doing port mapping at the host level.
I'm anxious to hear how others solved this and whether anyone has any feedback. Thanks for reading!
This is a great 101, however having two load balancers doesn't seem to be a solid solution IMO. Kubernetes is moving really fast and it isn't easy to keep up with the speed. I'm currently thinking about a solution involving an API load balancer such as vulcand working with the kube API server. What do you think about that?
For sure. That's totally the direction we want to go. vulcand is a great direction, but there's no reason we couldn't do something similar with our current configuration. Leveraging a tool like confd, we could automate the provisioning of the HAproxy configuration.
I totally agree that doing this manually isn't 'web scale'. But if you're looking for a way to do this in a small environment or test bed, this is a good start. I'm not sure of any other way to get around the 2 load balancer thing. I mean, we want to leverage the Kubernetes constructs like services for pod discovery.
Do you have any ideas on how to do this without services and without sending traffic right to a pod?
Thanks!
Hi Jon,
Thanks for your post, it's very practical for my hands-on trial. But I still have some questions for you:
1. Kubernetes always claims it provides a primitive load balancer function, but I don't know how to use it. What I know is either using GCE or relying on another external load balancer like HAProxy as in your post. Where is Kubernetes's original load balancer?
2. In your example, IP+port is bound in the HAProxy backend, which means that once a new service is created the HAProxy configuration needs to be updated; that's obviously not flexible. I'm not familiar with HAProxy; is there a way to bind only the IP addresses in the backend with a port range? For example, bind the four minions with port range 1000-2000 in the backend, so that when a user visits 10.20.30.61:1001 (master/HAProxy), the request is forwarded to port 1001 on one of the four backend minions.
best regards
I am curious about these points as well!
I want to share about Docker, Swarm, Kubernetes, Socketplane, and Mesos, along with use cases.
This is an absolutely phenomenal series of posts. Thanks so much for the effort that obviously went into the setup, the experiments and the writing. I’ve been using Docker for nearly two years and I don’t think I fully understood the networking model until I read these through from the beginning. We’ve been using kubernetes on GCE for about a month now, and I definitely did not understand the networking model there, so again, thank you. I was struggling not so much with the tools available to permit ingress from the outside world to our services, but rather with why they are what they are. You cleared it right up. I did have to read certain parts twice, but I don’t think that was your fault :).
Thanks! I’m glad you found the blogs useful!
Jon, FANTASTIC WORK!!!! Really, I think most DEVs love K8s for its abstraction but leave the heavy lifting (HA, networking, security, auth, etc.) to the OPS guys, which usually are totally here. Your work really really helps a lot. Please continue with this series, esp. since K8s and Docker are moving fast in the networking space (docker network).
Nice post, but I'm not convinced this would work in production. There is a problem right at the top: the DNS name has to be integrated with the load balancer so that it actually resolves to the correct set of IPs.
So that one small green HAproxy would actually need to be an HA/balanced service itself.
And then it becomes tricky to manage it automatically.
Hi Jon,
thanks for your series of posts, which helped a lot in diving into Kubernetes operations.
You can pull the IPs of the pods from etcd when you have defined a service over those pods. Taking the service name as the key, you can pull them like
etcdctl get /registry/services/endpoints/default/.
Using confd to build a haproxy.cfg, for instance, the template looks like
{{$data := json (getv "/registry/services/endpoints/default/")}}
{{range $data.subsets}}
{{range .addresses}}
server {{.ip}} check
{{end}}
{{end}}
Cheers and thanks again
Hi Jon,
just saw that wordpress prunes the meta-variables I used.
It is meant to be etcdctl get /registry/services/endpoints/default/THESERVICENAME
and accordingly
{{$data := json (getv "/registry/services/endpoints/default/THESERVICENAME")}}
{{range $data.subsets}}
{{range .addresses}}
server {{.ip}} check
{{end}}
{{end}}
Hi Chris,
I have configured 3 Kubernetes masters with etcd. I have installed confd and haproxy. I am looking for some guidance on updating haproxy dynamically. Do you have any basic template?