A Practical Guide to AWS Elastic Kubernetes Service Cross-Cluster Service Discovery using Consul

Service discovery has gained a lot of prominence over the last few years due to the growing popularity of microservices-based applications in enterprises. Kubernetes has become the most popular choice for running container-based workloads, and it provides out-of-the-box service discovery and service registry capabilities in the form of CoreDNS, which allows services deployed within a cluster to be discovered by other services in the same cluster. Given the dynamic nature of microservices, where services scale up and down based on load, service discovery becomes all the more important in ensuring that the application performs reliably.

Service discovery in a single-cluster environment is straightforward; however, it gets more complicated when services from different Kubernetes clusters need to interact with each other. For service-to-service communication across Kubernetes clusters, you can no longer rely on CoreDNS alone. You need a mechanism that allows a Kubernetes cluster to discover services running in other clusters, so that services can talk to them the same way they talk to local services. In this article I'm going to talk about how to enable cross-cluster service discovery using HashiCorp Consul. Syncing Kubernetes services to the Consul catalog makes them accessible to any node that is part of the Consul cluster, including nodes in other, distinct Kubernetes clusters.

  1. For this example, we have two Kubernetes clusters, cluster 1 and cluster 2, deployed in the same VPC but in different subnets.
  2. The Consul server sits in cluster 1, and the Consul client agents in cluster 2 join the Consul cluster running in cluster 1. This setup allows the Consul agents to discover Kubernetes services in cluster 2 and add them to the Consul service registry. Services added to the Consul service registry get synced to cluster 1 automatically in the form of Kubernetes ExternalName services, which allows services in cluster 1 to discover services from external clusters and talk to them like local services.
  3. We have the Auditbeat service running in cluster 1 and the Elasticsearch, Logstash and Kibana (ELK) stack running in cluster 2. The Auditbeat service connects to the Logstash service in cluster 2 through this service discovery mechanism.
  4. External cluster services discovered by Consul are synced to Kubernetes with a .service.consul suffix.
  5. By default, any DNS request in cluster 1 goes to CoreDNS. However, when the Auditbeat service tries to resolve the Logstash service, CoreDNS will not know how to answer the query because the Logstash service is outside of its jurisdiction.
  6. In order for the Auditbeat service to resolve the Logstash service in cluster 2, we have to tell CoreDNS in cluster 1 how to handle those DNS queries. This can be achieved by adding a stub domain for .consul. The stub domain is responsible for handling any .consul DNS queries in cluster 1, and it is backed by the Consul DNS service in cluster 1, which gets deployed along with the Consul server.
  7. Consul DNS acts as the service registry for all the external services, whereas CoreDNS does that job for all the internal services. DNS queries for any external services will be forwarded to Consul DNS by CoreDNS.

Let’s go ahead and see this in action.

Deploy Consul server & Consul DNS in EKS Cluster 1 using Helm

Step 1: Determine the latest version of the Consul Helm chart by visiting this GitHub repo. Clone the chart at that version.
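
A sketch of this step, assuming the hashicorp/consul-helm repository; the version tag below is only an example, so replace it with the latest release you found on GitHub.

    git clone https://github.com/hashicorp/consul-helm.git
    git -C consul-helm checkout v0.20.0   # example tag; use the latest release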

Step 2: Ensure you’ve checked out the correct version with helm inspect chart:
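
Assuming the clone from the previous step:

    helm inspect chart ./consul-helm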

Step 3: Before installing the Consul server in cluster 1, make sure to turn on the following settings in the values.yaml file to enable service sync with Kubernetes.
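
A minimal sketch of the relevant values.yaml settings, based on the consul-helm chart options; exact key names can vary slightly between chart versions.

    global:
      name: consul
      datacenter: dc1
    server:
      enabled: true
      replicas: 3
    ui:
      enabled: true
    syncCatalog:
      enabled: true
      toConsul: true   # sync Kubernetes services into the Consul catalog
      toK8S: true      # sync Consul catalog services back into Kubernetes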

Step 4: Now install Consul with this configuration. Using Helm 3, run:
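
For example (the release name consul is an assumption; add -f <your-values>.yaml if you keep the overrides in a separate file instead of editing the chart's values.yaml):

    helm install consul ./consul-helm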

If using Helm 2, run:
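
    helm install --name consul ./consul-helm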

Verify that the Consul server and sync pods are running. Due to a limitation of anti-affinity rules with DaemonSets, a client-mode agent runs alongside server-mode agents in Kubernetes. The server agents are run as a StatefulSet, using persistent volume claims to store the server state. This also ensures that the node ID is persisted so that servers can be rescheduled onto new IP addresses without causing issues. The server agents are configured with anti-affinity rules so that they are placed on different nodes. A readiness probe is configured that marks the pod as ready only when it has established a leader.
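
A quick check from the command line; the pod names below assume the release name consul used above and will vary in your environment.

    kubectl get pods
    # expected (names are illustrative):
    #   consul-server-0/1/2        server agents (StatefulSet)
    #   consul-<id>                client agents (DaemonSet, one per node)
    #   consul-sync-catalog-<id>   catalog sync pod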

Access the Consul UI via a NodePort service or a LoadBalancer. In this case, I exposed the Consul UI as a NodePort service to keep costs down.
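
A minimal sketch of the UI settings in values.yaml, assuming the consul-helm chart's ui block:

    ui:
      enabled: true
      service:
        type: NodePort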

My EKS worker nodes are running in public subnets with public IP addresses attached to them, hence I'm able to access the UI service exposed as a NodePort via the public IP address.

In the above screenshot you can see the Consul server nodes. Once we deploy Consul agents in cluster 2, we should be able to see those nodes also.

Verify that the Consul DNS service has been created.
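
For example (the service name assumes the release named consul above):

    kubectl get svc consul-dns
    # note the ClusterIP of this service; it is the stub-domain target used in cluster 1
    # (10.100.178.229 later in this walkthrough)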

Deploy Consul Client Agents only in EKS Cluster 2 using Helm

The only difference between deploying the Consul server and the client agents is the values.yaml used with Helm. For the Consul client agent deployment, you have to set the following flags in values.yaml before deploying (see the sketch after the next paragraph).

You also have to specify the Consul server address so that the client agent nodes can join the cluster. In this example, I'm specifying the private IP addresses of the Consul server nodes manually, but in a dynamic environment these IP addresses can change, so explicitly hard-coding them like this is not recommended. You can use cloud auto-join with tags to handle this in an automated fashion.
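
A minimal values.yaml sketch for the client-only deployment in cluster 2. The server IP addresses, prefix and tag are illustrative, and the exact keys depend on the consul-helm chart version.

    global:
      name: consul
      datacenter: dc1
    server:
      enabled: false          # no Consul servers in this cluster
    client:
      enabled: true
      join:
        - "10.0.1.10"         # private IPs of the Consul server nodes in cluster 1 (examples)
        - "10.0.1.11"
        - "10.0.1.12"
    syncCatalog:
      enabled: true
      toConsul: true
      k8sPrefix: "eks2-"      # prefix applied to services synced from this cluster
      k8sTag: "eks2"          # tag applied to services synced from this cluster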

Now install Consul in cluster 2 with this configuration. Using Helm 3, run:
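
With your kubectl/Helm context pointed at cluster 2 (release name and values file path are examples):

    helm install consul ./consul-helm -f values.yaml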

If using Helm 2, run:
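
    helm install --name consul ./consul-helm -f values.yaml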

Verify that the Consul client agent and sync pods are running. We can also verify from the UI that the client agent nodes have successfully joined the Consul server.
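
Two quick checks from the command line (the pod name is a placeholder):

    kubectl get pods
    # expect the consul client DaemonSet pods and the consul-sync-catalog pod to be Running

    # list cluster members from one of the client agents to confirm they joined the servers
    kubectl exec <consul-client-pod> -- consul members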

In the above screenshot, you can see that Consul client agent nodes from other clusters have successfully joined the Consul server.

Add DNS Stub domain in Cluster 1

In order to redirect any DNS queries for the .consul domain to Consul DNS, we have to add a stub domain. This can be achieved by updating the CoreDNS ConfigMap in the kube-system namespace.

Edit the coredns ConfigMap as shown in the following screenshot.
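
The ConfigMap can be opened for editing with:

    kubectl -n kube-system edit configmap coredns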

In cluster 1, the Consul DNS service is located at 10.100.178.229. To configure it in CoreDNS, create the following stanza in the CoreDNS ConfigMap.
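
A sketch of the stanza, following the standard CoreDNS stub-domain pattern and using the Consul DNS ClusterIP above:

    consul:53 {
        errors
        cache 30
        forward . 10.100.178.229
    }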

After making this change, make sure that the CoreDNS pods are running fine in the kube-system namespace. In order for the changes to take effect, scale the CoreDNS deployment down to 0 to get rid of the existing pods and then scale it back up. The new pods will pick up the updated ConfigMap settings.
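
For example (use your cluster's original replica count when scaling back up; 2 is the EKS default):

    kubectl -n kube-system scale deployment coredns --replicas=0
    kubectl -n kube-system scale deployment coredns --replicas=2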

Deploy ELK stack in Cluster 2 using Helm

Deploy the Elasticsearch, Logstash and Kibana stack using Helm charts in cluster 2. I used the charts available in the Helm stable repo to deploy the ELK stack.

Elasticsearch & Kibana — https://github.com/helm/charts/tree/master/stable/elastic-stack

Logstash — https://github.com/helm/charts/tree/master/stable/logstash
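
A sketch of the installs, assuming the (now archived) stable chart repository and example release names:

    # add the stable repo if it is not already configured
    helm repo add stable https://charts.helm.sh/stable
    helm repo update

    # Elasticsearch + Kibana via the elastic-stack chart, Logstash via its own chart
    helm install elastic stable/elastic-stack
    helm install logstash stable/logstash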

Validate Service Discovery

Now that we have Consul running on both Kubernetes clusters with service sync enabled, you should be able to see all the services deployed in cluster 2 in the Consul service registry, and they also get synced to cluster 1 as ExternalName services.

Services discovered in cluster 2 have been added to the Consul registry with 'eks2' as the service prefix and tag. You can specify the prefix and tag in the Helm values.yaml used for the client agent deployment (see the client agent values.yaml sketch earlier).

Services discovered in cluster 2 have been automatically added as ExternalName services in cluster 1. Let's take a look at the Logstash service in the Consul service registry, which got synced from cluster 2. We'll use this service name when deploying Auditbeat in cluster 1 to push all the audit logs to Logstash.

Deploy Auditbeat in Cluster 1 using Helm

Auditbeat is a lightweight shipper to audit the activities of users and processes on your systems. For example, you can use Auditbeat to collect and centralize audit events from the Linux Audit Framework.

Before deploying Auditbeat, you have to make a few changes to the values.yaml file to point Auditbeat to the Logstash service in cluster 2. In the following example, I'm using the Logstash service name from the Consul registry in order to connect to the Logstash service in cluster 2.
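
The key change is the Logstash output host. A minimal sketch of the relevant auditbeat.yml output section, assuming the 'eks2-' prefix configured earlier, a Logstash service named logstash in cluster 2, and Logstash's default Beats port 5044; the exact values.yaml layout depends on the Auditbeat chart you use.

    output.logstash:
      # Consul DNS name of the synced Logstash service (name is an assumption)
      hosts: ["eks2-logstash.service.consul:5044"]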

We can verify that we are able to resolve the Logstash service in cluster 2 from cluster 1 using a DNS utility pod. Take a look at this Kubernetes documentation to deploy the DNS utility pod in cluster 1. After deploying it, get a shell into the pod.
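
Following the Kubernetes DNS debugging documentation, for example:

    kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
    kubectl exec -i -t dnsutils -- sh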

Now perform an nslookup for the Logstash service name and see whether it gets resolved.
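
From inside the dnsutils pod (the service name assumes the 'eks2-' prefix used in this walkthrough):

    nslookup eks2-logstash.service.consul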

In the screenshot above, you can see that we are able to resolve the Logstash service running in cluster 2.

Verify Auditbeat logs in Kibana

Now that we have established connectivity between the Auditbeat service in cluster 1 and the Logstash service in cluster 2, you should be able to see the audit logs in Kibana. Logs received by Logstash from Auditbeat get forwarded to Elasticsearch and visualized via Kibana. I've exposed Kibana as a NodePort service in cluster 2, and we can access it using an EKS worker node's public IP address. From the following screenshot, we can see that the logs are getting delivered to Elasticsearch.

Even though Logstash is exposed as a ClusterIP service in cluster 2, we are able to access it from cluster 1. This works because we are using the default VPC CNI plugin that ships with EKS. With the VPC CNI plugin, every pod gets a private IP address from the underlying subnet, so the pods are reachable from the other cluster even though the service is only exposed as ClusterIP. For this to work when the clusters sit in two different VPCs, you also have to enable network connectivity between the VPCs and define security group and NACL rules that allow traffic between the clusters.

Other options for Cross-cluster Service Discovery

It is very common for microservices-based workloads to use a service mesh like Istio for service discovery and routing. If you are using Istio, you can consider the following options for cross-cluster service discovery.

  1. Istio Service Entry + Istio CoreDNS + CoreDNS — Involves a lot of manual effort to manage the Service registry
  2. Istio CoreDNS + Istio Admiral — You’ll have to rely entirely on Istio CoreDNS.
  3. AWS App Mesh + AWS Cloud Map

We tried these options in our environments and we felt that Consul is the best fit for cross-cluster service discovery and routing. But don't take my word for it; please try this out in your environment and see the results for yourself.

Cross-cluster service discovery can make your application more reliable, and in the age of modern dynamic systems you can no longer rely on a static service registry that needs to be updated manually whenever something changes. Coupling a service like Consul with a service mesh will make your applications even more resilient and secure. Please note that Consul can also be used as a service mesh to simplify your service discovery and routing needs for microservices-based applications.
