Linkerd service mesh in production

Alex Lundberg
CommonBond Product & Engineering
9 min read · Mar 21, 2020


The need for improved monitoring

Recently our company made a push for better metrics on our microservices. Namely, we wanted to quickly know:

  • How long services took to respond, which endpoints on a service took the longest, and how response times were distributed.
  • When a service failed, how it failed (status code, logs), and which route on the service failed.
  • Which dependencies a service has, and which service is at fault when an outage occurs.

We already had monitoring that offered some coverage: response code and latency metrics from our ingress controller, and more detailed metrics from individual applications instrumented for Prometheus. Often these did not give enough information to quickly diagnose issues and pinpoint the cause of failures. So to improve our monitoring, we looked to add a service mesh to our Kubernetes clusters.

How a service mesh helps

A service mesh works by injecting your application pods with a sidecar that proxies all of the application’s network traffic. The sidecar gathers metrics such as latency, request rate, success rate, status code, and source and destination service for each request. The sidecars also establish encrypted TLS connections with each other, giving you end-to-end encryption within your cluster and preventing network sniffing. Finally, the sidecars add intelligent routing between your services. This can include things like canary deployments, blue-green deployments, retries, timeouts, load balancing, and circuit breakers.

Our need for a service mesh was driven primarily by the need for improved monitoring. Intelligent routing and mutual TLS are nice to have as well, but they were not our focus. We also wanted a service mesh that was simple and had low operational overhead.

Which service mesh do we use?

We investigated three service mesh solutions:

  • Istio
  • Maesh
  • Linkerd

Istio is currently the largest and most widely used service mesh on the market. It is incredibly comprehensive; however, it is also very challenging to set up and maintain. We had installed Istio twice and removed it both times: the documentation was a moving target, and various connectivity issues had eroded developers’ trust.

We took a cursory look at newer service mesh solutions such as Maesh. Maesh integrates cleanly with Traefik and runs a DaemonSet instead of sidecar proxies. The DaemonSet greatly reduces the number of proxies in your cluster, which means less resource consumption. However, if a DaemonSet pod fails, network connectivity is lost for every pod on that node, which makes this architecture less resilient to failure.

We found Linkerd to most closely match what we wanted from a service mesh. Linkerd describes itself as an “ultralight service mesh for Kubernetes” and has relatively low installation and maintenance overhead, while still offering comprehensive observability into our microservices. It differentiates itself from Istio by being less feature-rich, Kubernetes-only, and unopinionated about the ingress controller, offering instead a solution that is simpler to manage. Furthermore, Linkerd is well supported as a CNCF project and has many industry players using it in production.

Starting off with Linkerd

Linkerd offers a command-line tool to set up and install Linkerd. Soon I had installed Linkerd into a test cluster, installed the demo “booksapp”, instrumented route monitoring, connected the metrics to our Prometheus instance, and gained visibility into application performance with the dashboard application. The documentation on the Linkerd website was very easy to follow, and I had almost no issues setting up a minimum working version.
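
For reference, the getting-started flow follows the Linkerd documentation closely; a rough sketch of the commands involved (verify the URLs and flags against the docs for your Linkerd release) looks like this:

# Install the CLI and add it to your PATH.
curl -sL https://run.linkerd.io/install | sh
export PATH=$PATH:$HOME/.linkerd2/bin

# Validate the cluster, install the control plane, and check the result.
linkerd check --pre
linkerd install | kubectl apply -f -
linkerd check

# Install the demo app and inject the sidecar proxy into its deployments.
kubectl create ns booksapp
curl -sL https://run.linkerd.io/booksapp.yml | kubectl -n booksapp apply -f -
kubectl -n booksapp get deploy -o yaml | linkerd inject - | kubectl apply -f -

# Open the dashboard locally.
linkerd dashboard &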

Once we successfully installed a demo version of Linkerd, we installed it on our development Kubernetes cluster, and gradually injected proxies into each namespace.

Development cluster installation with Linkerd

Moving from a demo installation of Linkerd to a semi-production installation requires some changes from the default installation.

  • The Linkerd installation and configuration should be committed to code.
  • Linkerd should be installed in high availability mode.
  • Secrets should not be committed to git; they should instead be managed by a secret-management service.
  • Linkerd should function well with our GitOps solution, ArgoCD.
  • The Linkerd dashboard should be available to users without needing to install and set up the Linkerd CLI.
  • It should be easy for developers to get enhanced monitoring from Linkerd into their applications.

Linkerd offers a Helm installation and a high-availability values.yaml file, which takes care of the first two bullet points. As of release 2.7.0, Linkerd allows you to create your root authority and issuer certificate separately and manage certificate rotation with cert-manager, which takes care of bullet point three.
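
A minimal sketch of that Helm installation, assuming cert-manager maintains the issuer certificate in the linkerd-identity-issuer Secret (value names below follow the 2.7-era linkerd2 chart and may differ in other versions; values-ha.yaml ships with the chart):

# Generate the root trust anchor locally; only its public half goes into the
# Helm values, while cert-manager owns the issuer certificate and its rotation.
step certificate create identity.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca --no-password --insecure

helm repo add linkerd https://helm.linkerd.io/stable

# identity.issuer.scheme=kubernetes.io/tls tells Linkerd to read the issuer
# cert and key from the cert-manager-managed Secret instead of Helm values.
helm install linkerd2 linkerd/linkerd2 \
  -f values-ha.yaml \
  --set-file global.identityTrustAnchorsPEM=ca.crt \
  --set identity.issuer.scheme=kubernetes.io/tls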

Managing the Helm chart with ArgoCD, or any GitOps solution, is easily done; however, the app will often appear out of sync with its active manifests in Kubernetes. The rotation of the issuer certificate through cert-manager, as well as the built-in rotation that renews each control-plane component’s TLS certificate daily, means your original Helm installation will never match what is currently active. Worse, syncing the app will revert those certificates and may require restarting individual control-plane components to allow sidecar injection again. This is something to be mindful of when we need to update Linkerd in the future.
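
One way to reduce that perpetual out-of-sync noise, assuming the drift you are willing to ignore is confined to the rotated certificate Secrets, is an ignoreDifferences rule on the ArgoCD Application (resource names here are illustrative; similar rules apply to the per-component TLS Secrets):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: linkerd
spec:
  # ...project, source, and destination as usual...
  ignoreDifferences:
    # Ignore the issuer Secret that cert-manager rewrites on each rotation.
    - group: ""
      kind: Secret
      name: linkerd-identity-issuer
      jsonPointers:
        - /data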

Linkerd provides a dashboard that can be accessed by port-forwarding the linkerd-web deployment to your local machine. The dashboard provides a dependency graph of your services, lets you tap into live route calls, and offers connectivity metrics such as success rate, latency, and request rate. We wanted all developers and product managers to be able to access the dashboard without needing to install the Linkerd CLI and port-forward the linkerd-web deployment.

We discovered that the dashboard creates a WebSocket connection to the end user’s machine, and the AWS ELBs provisioned by our Traefik LoadBalancer services drop the headers needed to establish this connection. We instead set up a new load balancer and added the missing headers with an Nginx server.
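
The relevant part of that Nginx configuration is just the WebSocket upgrade headers; a sketch, assuming the default linkerd-web service name and port (adjust both to your install):

server {
    listen 80;

    location / {
        proxy_pass http://linkerd-web.linkerd.svc.cluster.local:8084;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        # These two headers carry the WebSocket upgrade that the original
        # load balancer was dropping.
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}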

Injecting an application with Linkerd

It was critical that developers could easily instrument their applications with Linkerd and get per-route monitoring. Linkerd monitors routes based on the serviceprofile CRD, which describes the routes to monitor along with their timeouts and retries. Linkerd can create serviceprofiles from a swagger spec. We added the serviceprofile to the Kubernetes manifests and distributed it to the different environments using Kustomize. The only tricky issue was that the serviceprofile name needs to exactly match the FQDN of the service, so distributing the serviceprofile to separate environments requires overwriting the serviceprofile name. We accomplished this by adding an inline Kustomize JsonPatch to replace the name for each environment:

patchesJson6902:
  - target:
      group: linkerd.io
      version: v1alpha2
      kind: ServiceProfile
      name: NAME
    patch: |-
      - op: replace
        path: "/metadata/name"
        value: "$servicename.$namespace.svc.cluster.local"

Linkerd needs to know the Kubernetes destination service for requests coming in from outside the cluster. We added a JsonPatch to each ingress to append the needed annotation:

- op: add
  path: "/metadata/annotations/ingress.kubernetes.io~1custom-request-headers"
  value: "l5d-dst-override:$servicename.$namespace.svc.cluster.local:serviceport"

We controlled sidecar injection at the namespace level, so all developers needed to do to get enhanced monitoring was generate a serviceprofile from a swagger spec, add a name patch for each environment, and add the Linkerd header to their ingress if applicable.

The Road to Production

All that needed to be done now was to gradually inject new development namespaces with Linkerd, fix any issues, then install Linkerd in production.

Issues we noticed

We discovered that one application was failing its readiness and liveness probes, and the logs told us that one of its external dependencies was taking much longer to resolve. This was because the external dependency was a domain name of the form x.x.x.x, and the Linkerd proxy assumed it was a Kubernetes service. Only after failing to find a matching Kubernetes service, and after a timeout, did the proxy look outside the cluster and correctly find the service. We created an ExternalName service that pointed to the external host and directed the application to use this service instead. The Linkerd proxy then resolved the service quickly, and our application was operational.
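
A minimal sketch of that ExternalName service (the names are hypothetical; only the shape matters):

apiVersion: v1
kind: Service
metadata:
  name: slow-dependency            # hypothetical internal name the app now uses
  namespace: myapp
spec:
  type: ExternalName
  # The external host that was slow to resolve through the proxy.
  externalName: reports.vendor.example.com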

Some of our applications logged failures during their first second but recovered on their own later. This happened because the application started up faster than the Linkerd sidecar proxy and tried to route traffic through a sidecar that didn’t exist yet. For some applications this resolved itself quickly and no intervention was needed; others, such as some reverse proxies that failed to resolve DNS at startup and then could not proxy traffic until their DNS check retriggered, were removed from sidecar injection instead. Future versions of Kubernetes are expected to offer first-class sidecar support to prevent race conditions like these on fast-booting pods, but for now we are stuck with handling these issues as they arise.
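
One way to exclude an individual workload from a namespace that otherwise has injection enabled is the linkerd.io/inject annotation on the pod template; a sketch with a hypothetical deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-reverse-proxy       # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: legacy-reverse-proxy
  template:
    metadata:
      labels:
        app: legacy-reverse-proxy
      annotations:
        # Opts this pod out of the namespace-level sidecar injection.
        linkerd.io/inject: disabled
    spec:
      containers:
        - name: proxy
          image: nginx:1.17        # placeholder image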

After adding serviceprofiles, some routes would fail to get a response. This was because the default timeout was 10 seconds, after which the Linkerd proxy returns a 504 error to the requester. We fixed this by adding a timeout to each route in our serviceprofiles:

gsed -i '/pathRegex*/a \ \ \ \ timeout: 900s' serviceprofile.yaml
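
The resulting route entries look roughly like this (the service and route names below come from the demo app, not our real profiles):

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: webapp.booksapp.svc.cluster.local
  namespace: booksapp
spec:
  routes:
  - condition:
      method: GET
      pathRegex: /books/[^/]*
    name: GET /books/{id}
    # Added by the sed one-liner above: how long the proxy waits for this
    # route before returning a 504.
    timeout: 900s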

We had an issue where the Linkerd proxy failed to proxy any network traffic. This was an intermittent bug in the stable-2.7.0 proxy that caused many users to roll back to stable-2.6.1. We rolled back to stable-2.6.1 as well, but noticed that many sidecar proxies were restarting after hitting their memory limit. This was due to a memory leak in the log-management system that was fixed in the latest stable release. We disabled logging except for critical errors, which caused our sidecar memory usage to stabilize.
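
Linkerd exposes the proxy log level as a config annotation; a sketch of applying it per namespace (the namespace name and level string are illustrative, and the change takes effect only when pods are re-injected, i.e. recreated):

# Applies to all workloads in the namespace once their pods are recreated.
kubectl annotate namespace myapp \
  config.linkerd.io/proxy-log-level=error --overwrite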

Memory Usage of the sidecar before and after adjusting log level

Building an Improved Dashboard

We needed monitoring that the built-in Linkerd Grafana panels and the Linkerd dashboard didn’t provide, namely metrics broken down by status code and route. These dimensions are present in the proxy metrics, so we created our own Grafana dashboard to monitor latency, success rate, and request rate. The dashboard is available on Grafana Labs at https://grafana.com/grafana/dashboards/11868, and a fork of it may be packaged with Linkerd in the future.

A custom dashboard that lets us analyze traffic by sender, receiver, status code, and route.

CpuMageddon

As we injected more sidecars, we noticed more scheduling failures for new pods. The reason was that each proxy sidecar requests a default of 100m CPU when running Linkerd in high-availability mode. Multiply this by hundreds or thousands of sidecars in your cluster (1,000 sidecars at 100m each is 100 cores of requested CPU), and you can very easily hit your ResourceQuotas or simply consume all the schedulable CPU on your nodes and prevent new pods from being scheduled.

Worse, since this requested CPU may not actually be used, you may fail to trigger a node scale-up, depending on your autoscaling solution. To fix this, we gathered the historical averages of proxy CPU and memory consumption and added those to our Helm installation as the proxy requests.
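
For illustration, the override lives in the Helm values alongside the HA settings; the key paths below follow the 2.7-era chart (they may differ in later versions), and the numbers are placeholders rather than our measured averages:

global:
  proxy:
    resources:
      cpu:
        request: 25m
        limit: 200m
      memory:
        request: 32Mi
        limit: 250Mi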

Installing Linkerd in Production

Over three weeks, we monitored development and fixed the issues that arose. We then moved Linkerd into production, namespace by namespace. There were no surprises here: the installation was smooth, with no downtime for any injected application. I have to thank the Linkerd team for making this process painless and straightforward.

Afterthoughts

The biggest takeaway is that our production installation of Linkerd was seamless because we took the time to discover and fix issues in development. Each of the issues we found in our development environment could have had disastrous consequences in production if left unmitigated. For anyone looking to bring Linkerd, or any service mesh, into production, it is important to migrate gradually and to allow time to discover and fix the issues specific to your setup.
Again, I want to thank the Linkerd team for their comprehensive documentation and their promptness in answering my questions on Slack.
