AWS Elastic Kubernetes Service (EKS) — Things to look out for

Rahul Natarajan
9 min read · Mar 12, 2020

Kubernetes is fast becoming the de facto standard for running containerised workloads, and most public cloud vendors like AWS, Azure and GCP provide their own managed Kubernetes services: EKS (AWS), AKS (Azure) and GKE (GCP). Enterprises with an existing footprint on public cloud prefer managed Kubernetes platforms over self-managed Kubernetes because of their simplicity: they take away the complexity typically associated with running production-grade Kubernetes clusters. If you have provisioned and managed a Kubernetes cluster in production, you'll know how hard it can get from a manageability standpoint. To understand the complexity, take a look at 'Kubernetes The Hard Way' by Kelsey Hightower; this is one of the primary reasons why managed Kubernetes platforms are so popular.

We decided to go with AWS EKS for all our containerised applications; given our large AWS footprint, it was a fairly straightforward decision. In this article, I'm going to talk about some of the foundational things you need to watch out for if you are planning to use EKS to run your containerised applications. This article is not a work of fiction: everything I talk about here is something we've encountered and learned over the last couple of years running numerous containerised applications on EKS. It is also not a definitive guide to running containerised applications on EKS; remember, 'There's more than one way to skin a cat'.

1. You’ll need more IP addresses than you think

EKS comes pre-configured with the VPC CNI plugin, which provides robust networking for pods. The VPC CNI plugin assigns pods IP addresses from the VPC CIDR, and this can become a problem over time if you don't pay attention. For example, let's assume that you have created three subnets with a /24 CIDR and associated them with EKS worker nodes. With this setup, you get about 750 IP addresses to share between the worker nodes (EC2 instances) and the pods launched in the cluster. If you have 50 worker nodes in the cluster, that leaves room for only about 700 pods, which is not a lot if you are running multiple applications within that VPC. Luckily, we ran into this issue in our shared non-prod environment during the early build phase, and it helped us plan our production network size appropriately.

But how did we solve the problem? EKS allows clusters to be created in a VPC with additional IPv4 CIDR blocks in the 100.64.0.0/10 and 198.19.0.0/16 ranges. By adding secondary CIDR blocks to a VPC from these ranges, in conjunction with the CNI custom networking feature, pods no longer consume any RFC 1918 IP addresses in the VPC. Introducing a secondary CIDR range relieved the pressure on our primary CIDR ranges and gave us room for more pods in the cluster. You can also tackle this problem with alternative CNI plugins such as Calico or Flannel, but we wanted to keep it simple and were quite content with the VPC CNI plugin. A sketch of the custom networking setup is shown below.
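To make this concrete, here is a minimal sketch of the CNI custom networking configuration. The subnet and security group IDs are placeholders; in practice you create one ENIConfig per availability zone pointing at a subnet carved out of the secondary CIDR block, and enable custom networking on the aws-node DaemonSet (for example with `kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true`, plus ENI_CONFIG_LABEL_DEF set to the zone label so each node picks up the right ENIConfig).

```yaml
# One ENIConfig per availability zone, named after the zone so the VPC CNI
# can match nodes to it (IDs below are placeholders).
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: ap-southeast-1a
spec:
  subnet: subnet-0123456789abcdef0      # subnet from the 100.64.0.0/10 secondary CIDR
  securityGroups:
    - sg-0123456789abcdef0              # security group attached to the pod ENIs
```

With this in place, the worker nodes keep their primary-CIDR addresses while pod ENIs draw from the secondary range.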

2. Auto-scaling is easier said than done

We leverage two kinds of auto-scaling in our EKS cluster:

  • Cluster Autoscaler — For scaling EKS worker node groups
  • Horizontal Pod Autoscaler — For scaling the application microservices

We encountered a lot of issues with our application whenever Cluster Autoscaler or HPA triggered a scale-down event. The erratic application behaviour was caused by aggressive scaling thresholds and the lack of a pod termination handling mechanism.

Cluster Autoscaler

Cluster Autoscaler is quite straightforward: you define a scale-down utilisation threshold when deploying it (we deployed it through Helm), and it takes care of the rest. Cluster Autoscaler increases the size of the cluster when:

  • there are pods that failed to schedule on any of the current nodes due to insufficient resources.
  • adding a node similar to the nodes currently present in the cluster would help.

Cluster Autoscaler decreases the size of the cluster when some nodes are consistently unneeded for a significant amount of time. A node is unneeded when it has low utilisation and all of its important pods can be moved elsewhere. We started off with a high scale-down utilisation threshold (65%), which was bad for our application because of the frequent scaling events (especially scale-downs). After a couple of iterations we got the number right (30%), and we never had to touch Cluster Autoscaler again except for version upgrades. A sample Helm values snippet is shown below.
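As a minimal sketch, a Helm values file for the cluster-autoscaler chart that sets the 30% threshold we landed on might look like this. The exact keys depend on the chart version you use, and the cluster name and region are placeholders.

```yaml
# values.yaml for the cluster-autoscaler Helm chart (placeholder names)
autoDiscovery:
  clusterName: my-eks-cluster               # lets Cluster Autoscaler discover node groups via ASG tags
awsRegion: ap-southeast-1
extraArgs:
  scale-down-utilization-threshold: 0.3     # the 30% threshold we eventually settled on
  scale-down-unneeded-time: 10m             # how long a node must be unneeded before removal
```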

HPA

We initially scaled our microservices based on the CPU utilisation reported by the metrics server, which didn't yield the desired results. Since we were running Java-based microservices, we realised that CPU-based scaling was not the right way to scale them, and it was very tempting to look at memory as a possible replacement. But scaling Java applications based on memory utilisation is a really bad idea because of how the JVM manages memory (heap and non-heap).

We solved this problem by scaling our microservices on custom metrics, using the Prometheus adapter. We wrote our own Prometheus queries to derive the custom metrics and exposed them to the HPA via the Prometheus adapter. We opted for a compound custom metric that combines transactions per second and CPU utilisation; an illustrative HPA manifest is shown below.
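Here is a minimal sketch of an HPA driven by a custom metric served through the Prometheus adapter. The service and metric names are hypothetical, and the Prometheus query and adapter rule behind our actual compound metric are not reproduced here.

```yaml
apiVersion: autoscaling/v2                 # older clusters use autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-service                     # hypothetical microservice
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: transactions_per_second    # custom metric exposed by the Prometheus adapter
        target:
          type: AverageValue
          averageValue: "100"              # scale out when average TPS per pod exceeds 100
```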

3. You’ll need PreStop lifecycle hook

If you are used to running applications on VMs, you would have written a bunch of shutdown scripts to gracefully handle all the inflight transactions within your application. But when you move to containers, it is very easy to overlook this aspect, and that can adversely impact your application.

In our case, auto-scaling was causing a lot of problems, especially during scale-down. For example, whenever Cluster Autoscaler or HPA triggered a scale-down event, we witnessed a huge drop in transactions per second (TPS) for a brief period of time. We initially thought this was caused by how we had configured Cluster Autoscaler and HPA, but even after tweaking their settings we still saw this erratic application behaviour. Then we realised: what happens to all the inflight transactions within the pods when they get terminated or rescheduled to another node? We didn't have an answer, and that was the root cause of the issue.

We addressed this issue by introducing PreStop lifecycle hooks with custom logic to handle inflight transactions before a pod gets terminated by Kubernetes. Understanding the Kubernetes termination lifecycle will certainly help in these kinds of situations.

Kubernetes waits for a specified time called the termination grace period, which defaults to 30 seconds. It's important to note that this countdown runs in parallel with the preStop hook and the SIGTERM handling: Kubernetes will not wait beyond the grace period for the preStop hook or your application's shutdown logic to finish before force-killing the container.
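As a minimal sketch (our actual drain logic is application-specific and not shown), a preStop hook and a matching grace period look like this; even a short sleep gives load balancers and kube-proxy time to remove the pod's endpoint before SIGTERM arrives.

```yaml
# Fragment of a Deployment pod template (image and timings are placeholders)
spec:
  terminationGracePeriodSeconds: 60          # must cover the preStop hook plus SIGTERM handling
  containers:
    - name: orders-service
      image: registry.example.com/orders-service:1.0.0
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 15"]   # or call the application's drain endpoint
```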

4. You are better off with platform services for stateful components

We had quite a few stateful components/services within our Kubernetes cluster, such as MongoDB, Apache Ignite, Confluent Kafka and Elasticsearch. It didn't take long for us to realise that managing them can be quite stressful and energy-sapping. For example, for our MongoDB cluster we started off with provisioned IOPS EBS volumes at an initial value of 1,000. During performance testing we had to constantly tweak the IOPS value; it quickly reached 4,000, and the number just kept increasing, so much so that we no longer knew what the right IOPS figure was.

We ended up replacing self-managed MongoDB with Amazon DocumentDB and immediately saw a performance improvement; it also freed our platform engineering team from the operational burden. During our performance test runs we noticed IOPS going as high as 14,000, and DocumentDB had no issues handling that. With DocumentDB we didn't have to explicitly specify an IOPS number, which was a huge advantage because we no longer had to worry about getting it right. We did the same with Apache Ignite, replacing it with Amazon ElastiCache for Redis, and with Elasticsearch, replacing it with Elastic Cloud Enterprise (ECE).

The bottom line is: if you have a better PaaS/SaaS alternative for a stateful service and no security or regulatory constraints in the way, go for it.

5. Pay attention to PDB and anti-affinity

Pod Disruption Budgets (PDBs) and anti-affinity can make or break your application. In our case, we had maxUnavailable set to 50% in our PDBs, and it had a huge impact on our application's performance. What this setting meant was that Kubernetes was allowed to take down 50% of a service's instances at once during voluntary disruptions. Since we were not yet handling pod termination gracefully at the application level, losing 50% of the capacity at the same time had a severe performance impact and resulted in lost transactions from an end-user perspective. It took us a couple of iterations to get this setting right, and the right value is completely dictated by your application. A sample PDB manifest is shown below.
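A PDB is only a few lines of YAML. The budget below is a placeholder, not a recommendation; as noted above, the right value depends entirely on your application.

```yaml
apiVersion: policy/v1                   # policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: orders-service-pdb
spec:
  maxUnavailable: 1                     # allow only one pod to be voluntarily disrupted at a time
  selector:
    matchLabels:
      app: orders-service               # placeholder label
```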

Pod anti-affinity lets you constrain which nodes a pod is eligible to be scheduled on, based on the labels of pods already running on those nodes rather than labels on the nodes themselves. If you are dealing with a big application with a lot of microservices, anti-affinity can leave you with a lot of worker nodes and inflated costs. Use anti-affinity only if you absolutely have to; in our case we consciously decided to run just one instance of a service per worker node for reliability reasons, as shown in the sketch below. We are constantly tweaking this setting to understand its impact on the application, and you should take a similar approach if you don't know where to start.
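Here is a sketch of the kind of anti-affinity rule that keeps a service to one instance per worker node (labels are placeholders). A hard `requiredDuringScheduling` rule like this is what drives the node count up; `preferredDuringScheduling` is the cheaper, softer alternative.

```yaml
# Fragment of a Deployment pod template
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: orders-service               # keep pods of this service apart
          topologyKey: kubernetes.io/hostname   # "apart" means on different nodes
```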

6. Segregate workloads into Managed Node Groups

EKS managed node groups allow you to segregate different workloads within your cluster. We run Java Spring Boot microservices alongside services like Confluent Kafka, HashiCorp Vault and Consul within the same EKS cluster. We ran into issues when we wanted to use Cluster Autoscaler, because we didn't want to auto-scale services like Kafka and MongoDB; we only wanted to scale our application microservices. We tackled this by running the application microservices in their own worker node groups, with the other services in separate node groups of their own. This setup allowed us to target specific worker node groups for auto-scaling, and it also simplified the overall management of the cluster. Here is a sample Terraform code snippet for launching EKS worker node groups.
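The sketch below uses the `aws_eks_node_group` resource with placeholder names, sizes and references (cluster, IAM role, subnets); the point is simply that application and platform workloads land in separate, independently sized groups.

```hcl
# Autoscaled node group for application microservices
resource "aws_eks_node_group" "apps" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "apps"
  node_role_arn   = aws_iam_role.eks_nodes.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["m5.xlarge"]

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 20        # Cluster Autoscaler scales within these bounds
  }

  labels = {
    workload = "apps"
  }
}

# Fixed-size node group for platform services (Kafka, Vault, Consul, etc.)
resource "aws_eks_node_group" "platform" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "platform"
  node_role_arn   = aws_iam_role.eks_nodes.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["r5.xlarge"]

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 3         # not autoscaled
  }

  labels = {
    workload = "platform"
  }
}
```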

7. Spot instances can save a fortune (at least in dev)

Launching an EKS cluster is quite straightforward: use CloudFormation or Terraform to spin up the cluster, and in no time you'll have a fully functional Kubernetes cluster at your disposal. By default, the worker nodes are launched as on-demand instances, and that can have a massive cost implication if you don't fully understand the nature of your application.

Our application microservices don't persist any state in the local container file system, which allowed us to scale them freely based on application load. We also saw the potential for huge cost savings by using spot instances instead of on-demand or reserved instances. Here is a sample Terraform code snippet for launching spot worker nodes.
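A sketch of what this can look like with managed node groups is below. Note the assumptions: `capacity_type = "SPOT"` requires a reasonably recent Terraform AWS provider (older setups typically used self-managed node groups with a spot price instead), and all names, sizes and references are placeholders. Listing several instance types improves the odds of getting Spot capacity.

```hcl
resource "aws_eks_node_group" "apps_spot" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "apps-spot"
  node_role_arn   = aws_iam_role.eks_nodes.arn
  subnet_ids      = var.private_subnet_ids
  capacity_type   = "SPOT"                                    # request Spot capacity instead of On-Demand
  instance_types  = ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]  # diversify across instance types

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 20
  }

  labels = {
    lifecycle = "spot"
  }
}
```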

In all our dev environments we use only spot instances, and in production we use a combination of spot and reserved instances. Additionally, you can stop EBS-backed spot instances and start them at will, instead of relying on the "stop" interruption behaviour that only kicks in when the instance is interrupted. You are not charged for instance usage while the instance is stopped, which can yield additional cost savings, especially in dev environments.

8. Don’t second guess pod QoS

In my opinion, pod Quality of Service (QoS) is the most overlooked Kubernetes feature with a big impact on your application. If you don't know what pod QoS is, the Kubernetes documentation on QoS classes is a good starting point. With QoS you can influence Kubernetes' scheduling and eviction decisions, resulting in more predictable application performance. For our application, we classified our microservices into three categories based on their criticality: high, medium and low. For microservices with high or medium criticality we went with the Guaranteed QoS class, and for low-criticality microservices we went with the Burstable QoS class. The BestEffort QoS class is generally not recommended for production, and I would definitely suggest you stay away from it. A sketch of the two classes we use is shown below.
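QoS classes aren't set directly; they fall out of how you set resource requests and limits on each container (the values below are placeholders). Requests equal to limits for every container gives Guaranteed, requests lower than limits gives Burstable, and no requests or limits at all gives BestEffort.

```yaml
# Guaranteed: requests and limits are identical for every container in the pod
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 2Gi
---
# Burstable: requests are set lower than limits (or only requests are set)
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```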

Hope this article was useful and worth your time. Kubernetes can be quite overwhelming given the plethora of choices out there. The best way to get your Kubernetes production journey underway is to get the foundational stuff right. Enough said!
