A few tricks for Linkerd

Inspect your cluster traffic with Linkerd, jq and Prometheus

by Krzysztof Dryś

Get a quick overview of the traffic between your Kubernetes deployments using tools and metrics provided by Linkerd.

Linkerd CLI

Before you start

In the first part, we will be using linkerd viz to see the requests the deployments are making. Some commands require you to have the linkerd-linkerd-viz-tap-admin Kubernetes role. If you are on GKE, you need to add this role to your user explicitly:

kubectl create clusterrolebinding \
  $(whoami)-tap-admin \
  --clusterrole=linkerd-linkerd-viz-tap-admin \
  --user=$(gcloud config get-value account)

Read the documentation to learn more.
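If you are unsure whether the binding took effect, a quick sanity check (plain kubectl, nothing Linkerd-specific) is to describe it:

kubectl describe clusterrolebinding $(whoami)-tap-admin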

Slow routes

Is your service slow? Or, more precisely, is the 99th percentile of its response time too long? Maybe it is because the service is making slow calls to other services? We will inspect this using the linkerd command-line tool.

Let's say that the API served by the your-service deployment is slow, and we suspect this is because of slow calls to other services. We can use the command linkerd viz routes deployment/your-service -t 10m to see a summary of the routes your-service has been calling in the last ten minutes.

The output is something like:

> linkerd viz routes deployment/your-service -t 10m
ROUTE            SERVICE     SUCCESS    RPS         LATENCY_P50   LATENCY_P95   LATENCY_P99
/get_user        service-a   100.00%   2.1rps          75ms          98ms         100ms
/get_metadata    service-b    95.00%   3.2rps         175ms         301ms         405ms

Be aware that to get such output you need to define service profiles first. In our case, we would need to define them at least for service-a and service-b.
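If you do not have service profiles yet, one way to bootstrap them is to let Linkerd generate a profile from live traffic. This is a sketch, assuming the linkerd viz profile --tap subcommand available in recent Linkerd versions and that service-a lives in the default namespace:

# Watch live traffic to service-a for 10 seconds, generate a ServiceProfile
# manifest from the observed routes, and apply it to the cluster
linkerd viz profile -n default service-a --tap deploy/service-a --tap-duration 10s \
  | kubectl apply -f -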

OK, so the output from linkerd viz routes gives us a nice overview. To get more details, we need to change a few things. First, we will change the output format to JSON, using the -o json flag. Next, we will process the output with jq:

linkerd viz routes deployment/your-service  --to namespace/default -o json -t 10m | jq '."deployment/your-service"[] | select(.latency_ms_p99 > 100)'

We use jq to select only those routes where more than 1% of the calls were slower than 100 milliseconds.

Sample output:

[
  {
    "route": "/get_user",
    "authority": "service-a",
    "effective_success": 1,
    "effective_rps": 0.21833333333333332,
    "actual_success": 1,
    "actual_rps": 0.21833333333333332,
    "latency_ms_p50": 16,
    "latency_ms_p95": 27,
    "latency_ms_p99": 110
  },
  {
    "route": "/get_metadata",
    "authority": "service-b",
    "effective_success": 1,
    "effective_rps": 0.31833333333333336,
    "actual_success": 1,
    "actual_rps": 0.31833333333333336,
    "latency_ms_p50": 16,
    "latency_ms_p95": 27,
    "latency_ms_p99": 200
  }
]

Finally, we will watch the whole expression. Before that, to make working with quotes easier, we create a file called filter.jq:

echo '."deployment/your-service"[] | select(.latency_ms_p99 > 100)' > filter.jq

Then we can do:

watch -n1 'linkerd viz routes deployment/your-service  --to namespace/default -o json -t 10m | jq -f filter.jq'

This command will print the slow routes every second.
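Because the filter lives in its own file, you can watch for other conditions just by editing filter.jq. For example, based on the fields in the JSON output above, this variant flags routes whose effective success rate drops below 99% (the threshold is arbitrary):

echo '."deployment/your-service"[] | select(.effective_success < 0.99)' > filter.jq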

Adjusting the threshold response time in the jq expression is easy. Unfortunately, there is no way for the CLI command to return percentiles other than the 50th, 95th and 99th. Therefore, to look at the 99.9th percentile we need to switch from the CLI to Prometheus, where the histogram_quantile function will allow us to look at whatever percentile we want.
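Here is a sketch of such a query, assuming the route_response_latency_ms_bucket histogram which we will meet again in the Prometheus section below (adjust the labels to your deployment):

histogram_quantile(
  0.999,
  sum(irate(route_response_latency_ms_bucket{deployment="your-service", direction="outbound"}[1m])) by (le, dst, rt_route)
)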

Slow requests

The linkerd viz routes command shows aggregated data. Sometimes we want more detail. Using linkerd viz tap, we can inspect requests live. For example:

> linkerd viz tap deployment/service-checkout
req id=0:0 proxy=out src=10.16.6.24:50096 dst=10.16.0.205:9090 tls=true :method=POST :authority=service-a:9090 :path=/ingrid.service.sessionauth.SessionAuth/EnsureSessionToken
req id=0:0 proxy=out src=10.16.2.113:39738 dst=10.16.6.244:9090 tls=true :method=POST :authority=service-sessionauth:9090 :path=/ingrid.service.sessionauth.SessionAuth/EnsureSessionToken
rsp id=0:0 proxy=out src=10.16.2.113:39738 dst=10.16.6.244:9090 tls=true :status=200 latency=14688µs
end id=0:0 proxy=out src=10.16.2.113:39738 dst=10.16.6.244:9090 tls=true grpc-status=OK duration=82µs response-length=63B
req id=0:1 proxy=out src=10.16.2.113:55510 dst=10.16.2.118:9090 tls=true :method=POST :authority=service-sites-cache:9090 :path=/ingrid.service.config.Config/GetSite
rsp id=0:0 proxy=out src=10.16.6.24:50096 dst=10.16.0.205:9090 tls=true :status=200 latency=17418µs
end id=0:0 proxy=out src=10.16.6.24:50096 dst=10.16.0.205:9090 tls=true grpc-status=OK duration=75µs response-length=63B

This will continuously print information about the requests from and to our checkout service.

Again, we can use the JSON output format and jq to get more details. For example, to print only slow outgoing requests (the time since request init is expressed in nanoseconds, so tune the threshold accordingly):

linkerd viz tap deployment/your-service --namespace default  --to-namespace default -o json | jq '. | select( (.responseEndEvent.sinceRequestInit.nanos > 1000) and (.proxyDirection == "OUTBOUND"))'

Sample output:

{
  "source": {
    "ip": "10.16.4.212",
    "port": 51766,
    "metadata": {
      "control_plane_ns": "linkerd",
      "deployment": "your-service",
      "namespace": "default",
      "pod": "your-service-994965469-fc8d8",
      "pod_template_hash": "994965469",
      "serviceaccount": "default",
      "tls": "loopback"
    }
  },
  "destination": {
    "ip": "10.16.1.126",
    "port": 9090,
    "metadata": {
      "control_plane_ns": "linkerd",
      "deployment": "service-config",
      "namespace": "default",
      "pod": "service-config-784c8f7b5b-ff2bh",
      "pod_template_hash": "784c8f7b5b",
      "server_id": "default.default.serviceaccount.identity.linkerd.cluster.local",
      "service": "service-config",
      "serviceaccount": "default",
      "tls": "true"
    }
  },
  "routeMeta": null,
  "proxyDirection": "OUTBOUND",
  "responseEndEvent": {
    "id": {
      "base": 10628,
      "stream": 14
    },
    "sinceRequestInit": {
      "nanos": 5486208
    },
    "sinceResponseInit": {
      "nanos": 3305745
    },
    "responseBytes": 330997,
    "trailers": [
      {
        "name": "grpc-status",
        "valueStr": "0"
      },
      {
        "name": "grpc-message",
        "valueStr": ""
      }
    ],
    "grpcStatusCode": 0
  }
}
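The same approach works for other conditions visible in this JSON. For example, based on the grpcStatusCode field shown above, this variant catches failed outgoing gRPC calls instead of slow ones:

linkerd viz tap deployment/your-service --namespace default --to-namespace default -o json | jq '. | select((.proxyDirection == "OUTBOUND") and (.responseEndEvent != null) and (.responseEndEvent.grpcStatusCode != 0))'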

Prometheus metrics

So far, we have been using command-line tools to monitor slow requests. To look at historical data, we can use the Prometheus metrics exposed by Linkerd.

request_total

The first metric to look at is request_total. We can use it to get the rate of requests flowing into and out of our service, broken down by direction:

sum(rate(request_total{deployment="$service"}[1m])) by (direction)

The response from Prometheus looks like this:

metric                   value
direction="outbound"     146.2688667666063
direction="inbound"      43.61067963844134

By default, Prometheus will evaluate the query for the latest point in time. You can use tools like Grafana to visualise the historical data.
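To run these queries interactively, you can port-forward to the Prometheus instance bundled with the viz extension (assuming a default install, where it runs in the linkerd-viz namespace behind a service called prometheus):

kubectl -n linkerd-viz port-forward svc/prometheus 9090:9090
# then open http://localhost:9090 and paste the queries into the expression browser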

Looking at the direction label gives us a high-level overview of the traffic. Usually, we are also interested in where the outgoing requests are going. To get this information, we can look at the authority label of request_total. From the metrics' documentation:

authority: The value of the :authority (HTTP/2) or Host (HTTP/1.1) header of the request.

We can use this label to break down the outgoing traffic by destination:

sum(irate(request_total{deployment="$service", direction="outbound"}[1m])) by (authority)

This query should give you a response similar to:

metric                                                    value
authority="service-geo.default.svc.cluster.local:9090"    0.037083308548817
authority="169.254.169.254"                               0.2688667666063
authority="165.72.205.188"                                0.202141581147407

In our case, there are three lines in the response, each corresponding to our deployment calling another service (a query for trimming this view follows the list):

  • The first entry is another internal service, specifically service-geo, our internal geocoding solution. This is a fully qualified Kubernetes service address, meaning: connect to the service service-geo in the namespace default on port 9090.
  • The second entry (169.254.169.254) is the Google metadata server. Ingrid is hosted on GCP (Google Cloud Platform), and our deployments connect to the metadata server to authenticate with Google's APIs.
  • The last entry (165.72.205.188) is an external API our service connects to.
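If the metadata-server entry is just noise in this view, a negative label matcher drops it from the result; the same technique works for any authority you want to hide:

sum(irate(request_total{deployment="$service", direction="outbound", authority!="169.254.169.254"}[1m])) by (authority)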

More metrics

The request_total metric gives us insight into which services our deployment calls. Often, we need more detailed information: which endpoints in those services are used by our deployment, and what the response times are.

To get information about the endpoints, we need to set up service profiles first. Then, we can use route_request_total like this:

sum(irate(route_request_total{deployment="$service", direction="outbound"}[1m])) by (dst, rt_route)

Notice that we used dst and rt_route instead of authority. The first one is the destination service. The second one is the route, which is Linkerd's term for the endpoint we are calling.
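In the same spirit, and assuming the route_response_total counter with its classification label (success or failure) is exposed for your routes, you can sketch a per-route success rate like this:

sum(irate(route_response_total{deployment="$service", direction="outbound", classification="success"}[1m])) by (dst, rt_route)
  /
sum(irate(route_response_total{deployment="$service", direction="outbound"}[1m])) by (dst, rt_route)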

What about the response times? Depending on whether we have service profiles or not, we can use route_response_latency_ms (for response times per endpoint) or response_latency_ms (for response times per service). Assuming we want the first one, this will give us the 99th percentile of the response time per endpoint:

histogram_quantile(0.99, sum(irate(route_response_latency_ms_bucket{deployment="$service", direction="outbound"}[1m])) by (le, dst, rt_route))

Summary

Linkerd, among other things, helps you monitor traffic in your Kubernetes cluster. It does so in many ways: you can look at each and every request (using linkerd viz tap) or at an aggregated view (using Prometheus or linkerd viz routes). In this article, I have described a few things which I found useful when we were introducing Linkerd at Ingrid. For more information, read about telemetry and monitoring in Linkerd and the Prometheus metrics exposed by the Linkerd proxy.


Cover photo by @brettjordan on Unsplash

Does Ingrid sound like an interesting place to work? We are always looking for good people! Check out our open positions.