Ingrid's road to service mesh

A (sentimental) journey from Ingrid's beginning to Linkerd

by Krzysztof Dryś · 5 min read · Linkerd, Kubernetes, gRPC, history

Ingrid's backend has always been built on Kubernetes microservices written in Go and communicating over gRPC. While we use a lot of other technologies (for example: Redis, Google Cloud Pub/Sub and Prometheus), developers spend a significant part of their time writing Go code and sending gRPC requests.

This article is a story of how we have been learning Kubernetes and gRPC over the last few years.

The beginnings

In 2016 our cluster had no load balancing. Why? It turns out that load balancing gRPC traffic on Kubernetes is not that straightforward.

At that time we had (among others) two services: an internal configuration service and a checkout service. They were in a client-server relationship: the checkout service would call the configuration service. Whenever a checkout service pod started, it would "pick" (at random) one of the configuration service pods and then send all of its requests to that particular pod. If Kubernetes killed that configuration service pod, the checkout service pod would automatically pick a new one.

This is how "vanilla" gRPC works with Kubernetes. If you want to understand the details, read this article.
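To make the mechanism concrete, here is a minimal sketch of a Go client dialling a Kubernetes ClusterIP service (the service name and port are illustrative, not Ingrid's actual setup). gRPC keeps a single long-lived HTTP/2 connection, so every request goes to whichever pod kube-proxy picked when the connection was established:

package main

import (
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // The ClusterIP service resolves to a single virtual IP. gRPC opens one
    // HTTP/2 connection to it, so all requests from this client stick to one
    // configuration-service pod until the connection breaks (for example,
    // when Kubernetes kills that pod).
    conn, err := grpc.Dial(
        "configuration-service.default.svc.cluster.local:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer conn.Close()
    // ... create a stub from conn and send requests as usual.
}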

Going back to the question "why did Ingrid have no load balancing at that time", we can give another answer: we simply didn't need it. We didn't have that much traffic. Also, notice that even though we had no load balancing, we still had high availability. What we were missing was scalability. In the worst case, one configuration service pod had to handle all the traffic. But at that time it was acceptable.

Looking for a load balancer

Once Ingrid acquired more clients, scaling became more important to us. We started looking for a proper load balancer. We didn't have that many options, or at least we could not find them. It was around 2017, and Istio v1.0.0 would only be released a year later. Finally, we decided to go for a gRPC-native protocol called "grpclb".

This protocol is unlike your regular L4/L7 load balancer. With grpclb, a checkout service pod would first ask the grpclb server for all available configuration service pods. Then, it would load balance requests across them using a round-robin strategy.

Our job was to implement the grpclb server, which would watch cluster endpoints (just like kubectl get ep --watch) and push updates to the clients. Here is the spec we needed to implement.
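The endpoint-watching half of such a server can be sketched with client-go roughly like this (a simplified illustration with assumed names, not our production code; the real server also has to stream the resulting address list back to clients over the RPC defined in the grpclb spec):

package main

import (
    "context"
    "log"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// watchBackends streams the pod IPs behind a service, similar to
// `kubectl get ep --watch`. Service and namespace names are illustrative.
func watchBackends(ctx context.Context, client kubernetes.Interface) {
    w, err := client.CoreV1().Endpoints("default").Watch(ctx, metav1.ListOptions{
        FieldSelector: "metadata.name=configuration-service",
    })
    if err != nil {
        log.Fatalf("watch endpoints: %v", err)
    }
    for event := range w.ResultChan() {
        ep, ok := event.Object.(*corev1.Endpoints)
        if !ok {
            continue
        }
        var addrs []string
        for _, subset := range ep.Subsets {
            for _, addr := range subset.Addresses {
                addrs = append(addrs, addr.IP)
            }
        }
        // Push the fresh address list to connected grpclb clients here.
        log.Printf("%s: backends are now %v", event.Type, addrs)
    }
}

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    watchBackends(context.Background(), kubernetes.NewForConfigOrDie(cfg))
}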

Ups and downs with grpclb and certificates

Implementing the grpclb server was fun. We learned a lot about gRPC and Kubernetes. It solved our main problem: load balancing and horizontal scaling. It served us well for many years. It required relatively few resources. And it didn't introduce extra latency, because clients connect directly to the servers.

Unfortunately, we never had time to add features to it. We had round-robin load balancing, but no fancy stuff like canary deployments or traffic split.

We also had another growing pain: cert rotation. At the beginning of Ingrid we decided we would use secure traffic everywhere. Each service would mount a private certificate to prove its identity, and a root certificate to verify others' identities. Rotating these certs was becoming more and more problematic.

Looking for a service mesh

We could have added features to our grpclb server and tried to solve the cert rotation problem using cert-manager. Instead, we decided to fill two needs with one deed and adopt a service mesh.

We evaluated Istio first, but it proved to be a little too complex for us to handle. We had problems properly configuring Envoy, and after an update from v1.9 to v1.10 broke our test cluster, we decided to go for something simpler.

After reading this mesh comparison we started looking at Linkerd, as it was labeled the simplest to operate. One day later we had meshed a few services on our test cluster. Two weeks later we meshed the first production deployments.

First, we meshed the deployments that are not on the critical path of the checkout widget. When this proved stable, we started meshing more and more services on production, including the ones on the critical path.

The process was rather simple thanks to the great documentation and really helpful community on Linkerd Slack. When we asked questions there, they would typically be answered in hours (usually with a detailed explanation).

Challenges of introducing a service mesh

So far, Linkerd has lived up to the promise of being simple to operate. The community has been more than welcoming. Having said that, we still had a few struggles when introducing Linkerd. This is not a complaint against the product (which is great), but rather a testimonial.

Protocol detection

When proxying traffic, Linkerd will try to automatically detect the protocol. Most of the time this works great. But it does not work for server-speaks-first protocols, like FTP. Of course, everything is properly documented. Still, Ingrid often connects to third-party FTP servers, and it is hard to replicate someone else's FTP server in a test environment. So we had a few situations where meshing a service on production triggered an alert that a remote FTP server was not responding.

Migrating away from our own certs

Our services had been using our own certs. A service mesh (as far as we know, any service mesh) expects "insecure" traffic, so that it can apply its own TLS. We have more than 50 services, and migrating all of them required significant work.

What we ended up doing was adding an insecure port to each service. For the configuration service, the migration process was the following (a sketch of serving both ports follows the list):

  1. We added support for the insecure port,
  2. We meshed the configuration service,
  3. We meshed the checkout service and made it connect to the insecure port of the configuration service,
  4. Then we repeated (3.) for each client of the configuration service,
  5. Finally, we dropped the secure port (and the certs) of the configuration service.
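
To illustrate step 1, here is a rough sketch of what serving both ports can look like in a Go gRPC service (the ports and certificate paths are made up for the example; this is not our exact code):

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
)

func main() {
    // Legacy secure server: keeps existing clients working during the migration.
    creds, err := credentials.NewServerTLSFromFile("/certs/tls.crt", "/certs/tls.key")
    if err != nil {
        log.Fatalf("load certs: %v", err)
    }
    secure := grpc.NewServer(grpc.Creds(creds))

    // New insecure server: meshed clients connect here and Linkerd adds its own mTLS.
    plain := grpc.NewServer()

    // registerConfigurationService(secure); registerConfigurationService(plain)
    // (hypothetical registration of the same service implementation on both servers)

    go serve(secure, ":50051") // old, TLS
    serve(plain, ":50052")     // new, plaintext behind the mesh
}

func serve(s *grpc.Server, addr string) {
    lis, err := net.Listen("tcp", addr)
    if err != nil {
        log.Fatalf("listen %s: %v", addr, err)
    }
    if err := s.Serve(lis); err != nil {
        log.Fatalf("serve %s: %v", addr, err)
    }
}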

Even though we tried to script as much of this as possible, we still needed a lot of manual work. It is important to note that the same work would be required for any service mesh.

Linkerd comes with a lot of metrics, and we could use them to check whether anyone still connects to the secure port before dropping it in step 5. Here is the Prometheus query we used to see all "used" ports of the configuration service.

group(
  label_replace(
    tcp_open_connections{deployment="configuration-service", direction="inbound", target_addr!=""},
    "target_port",
    "$2",
    "target_addr",
    "([0-9\\.]+):([0-9]+)"
  )
) by (target_port)

Metrics

As said before, Linkerd comes with a lot of very useful metrics. So far, we have been getting response times from inside the service using go-grpc-prometheus. Now we can think about getting them from Linkerd instead. This makes adding services written in other languages (here we come, Rust, finally!) much easier.
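For context, this is roughly how the in-service instrumentation looks with go-grpc-prometheus (a minimal sketch, not our exact setup); with Linkerd, the proxy exports comparable latency metrics without any of this code:

package main

import (
    "log"
    "net"
    "net/http"

    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "google.golang.org/grpc"
)

func main() {
    // Attach the go-grpc-prometheus interceptors so every RPC is measured.
    server := grpc.NewServer(
        grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
        grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
    )
    // registerConfigurationService(server) // hypothetical service registration
    grpc_prometheus.Register(server)
    grpc_prometheus.EnableHandlingTimeHistogram() // response-time histograms

    // Expose the metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    go http.ListenAndServe(":9090", nil)

    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatal(err)
    }
    log.Fatal(server.Serve(lis))
}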

As often, there is a catch. A lot of new metrics means much more work for Prometheus. Luckily, we have a very nice and flexible setup based on Thanos and Cortex. Still, the increase in the number of ingested metrics is something you should be aware of when adding a service mesh. Of course, this goes for any service mesh, not only Linkerd.

Web dashboard

Compared to everything else working out of the box and being generally great, the web dashboard needs some polishing. When we open the page for the checkout service, it often throws a 502 at us. This is probably because the checkout service receives a lot of traffic and connects to many other services. Also, the architecture diagram rendered in the web dashboard does not seem to handle a cluster with more than 50 services.

Luckily, most of the functionality of the web dashboard is also available in the CLI, which works great. Also, we think that in the long run we will not depend that much on the web dashboard. Instead, we want to have as much monitoring as possible in our Grafana. This is possible because, to a large extent, the web dashboard operates on Prometheus metrics.

Dependency diagram
Diagram of the dependencies between our services, as rendered by the Linkerd web dashboard. While we could improve a thing or two, our architecture is not as bad as this image might suggest.

Ingrid 💚 Linkerd

All in all, we are very happy with our decision to go with Linkerd. It is very simple to operate and so far we haven't had any big problems with it.

What's next? We don't have full Linkerd coverage yet, so the first challenge is to mesh each and every service. Next, we will look into things like canary deployments and traffic mirroring.


Cover photo by @kamillehmann on Unsplash

Does Ingrid sound like an interesting place to work at? We are always looking for good people! Check out our open positions