A few tricks for Linkerd

Inspect your cluster traffic with Linkerd, jq and Prometheus
Get a quick overview of the traffic between your Kubernetes deployments using tools and metrics provided by Linkerd.
Linkerd CLI
Before you start
In the first part, we will be using linkerd viz to see the requests the deployments are making. Some commands require you to have the linkerd-linkerd-viz-tap-admin Kubernetes role. If you are on GKE, you need to add this role to your user explicitly:
kubectl create clusterrolebinding \
$(whoami)-tap-admin \
--clusterrole=linkerd-linkerd-viz-tap-admin \
--user=$(gcloud config get-value account)
Read the documentation to learn more.
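Before running the commands below, it is also worth making a quick sanity check that Linkerd and the viz extension are healthy (both commands ship with the linkerd CLI):
linkerd check
linkerd viz check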
Slow routes
Is your service slow? Or more precisely, is the 99th percentile of the response time too long? Maybe it is because the service is making slow calls to other services? We will inspect this using the linkerd command-line tool.
Let's say that the API served by the your-service deployment is slow, and we suspect this is because of slow calls to other services. We can use the command linkerd viz routes deployment/your-service -t 10m to see a summary of the routes your-service has been using in the last ten minutes.
The output is something like:
> linkerd viz routes deployment/your-service -t 10m
ROUTE           SERVICE     SUCCESS   RPS      LATENCY_P50   LATENCY_P95   LATENCY_P99
/get_user       service-a   100.00%   2.1rps   75ms          98ms          100ms
/get_metadata   service-b    95.00%   3.2rps   175ms         301ms         405ms
Be aware that to get such output you need to define the service profiles first. In our case we would need to define them at least for service-a and service-b.
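If you have not defined them yet, the linkerd CLI can scaffold service profiles for you. A minimal sketch for service-a (the file name is a placeholder, and the exact flags may differ between Linkerd versions, so check linkerd profile --help):
linkerd profile -n default service-a --template > service-a-profile.yaml
# edit the routes in the generated file, then:
kubectl apply -f service-a-profile.yaml
If the service exposes an OpenAPI (Swagger) definition, the routes can be generated from it instead:
linkerd profile -n default --open-api service-a-swagger.json service-a | kubectl apply -f -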
OK, so the output from linkerd viz routes gives us a nice overview. To get more details, we need to change a few things. First, we will change the output format to JSON, using the -o json flag. Next, we will process the output with jq:
linkerd viz routes deployment/your-service --to namespace/default -o json -t 10m | jq '."deployment/your-service"[] | select(.latency_ms_p99 > 100)'
We use jq to select only those routes where more than 1% of the calls were slower than 100 milliseconds.
Sample output:
[
  {
    "route": "/get_user",
    "authority": "service-a",
    "effective_success": 1,
    "effective_rps": 0.21833333333333332,
    "actual_success": 1,
    "actual_rps": 0.21833333333333332,
    "latency_ms_p50": 16,
    "latency_ms_p95": 27,
    "latency_ms_p99": 110
  },
  {
    "route": "/get_metadata",
    "authority": "service-b",
    "effective_success": 1,
    "effective_rps": 0.31833333333333336,
    "actual_success": 1,
    "actual_rps": 0.31833333333333336,
    "latency_ms_p50": 16,
    "latency_ms_p95": 27,
    "latency_ms_p99": 200
  }
]
Finally, we will watch the whole expression. Before that, to make working with quotes easier, we create a file filter.jq like this:
echo '."deployment/your-service"[] | select(.latency_ms_p99 > 100)' > filter.jq
Then we can do:
watch -n1 'linkerd viz routes deployment/your-service --to namespace/default -o json -t 10m | jq -f filter.jq'
This command will print the slow routes every second.
Of course, you can modify the jq expression to get different filters. Adjusting the latency threshold is easy, and you can just as well filter on other fields, such as the success rate (see the sketch below). Unfortunately, there is no way for the CLI command to return percentiles other than 50%, 95% and 99%. Therefore, to look at the 99.9th percentile we need to switch from the CLI to Prometheus, where the histogram_quantile
function will allow us to look at whatever percentile we want.
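For example, a variant of the filter that catches routes whose success rate dropped below 99% (using the effective_success field from the JSON above) could look like this:
linkerd viz routes deployment/your-service --to namespace/default -o json -t 10m | jq '."deployment/your-service"[] | select(.effective_success < 0.99)'
And as a preview of the Prometheus approach, here is a sketch of a 99.9th percentile query, assuming the response_latency_ms histogram exposed by the Linkerd proxy and the label names used in the queries later in this article:
histogram_quantile(0.999, sum(rate(response_latency_ms_bucket{deployment="your-service", direction="outbound"}[5m])) by (le, authority))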
Slow requests
The command linkerd viz routes shows aggregated data. Sometimes we want more details. Using linkerd viz tap allows us to inspect the requests live. For example:
> linkerd viz tap deployment/service-checkout
req id=0:0 proxy=out src=10.16.6.24:50096 dst=10.16.0.205:9090 tls=true :method=POST :authority=service-a:9090 :path=/ingrid.service.sessionauth.SessionAuth/EnsureSessionToken
req id=0:0 proxy=out src=10.16.2.113:39738 dst=10.16.6.244:9090 tls=true :method=POST :authority=service-sessionauth:9090 :path=/ingrid.service.sessionauth.SessionAuth/EnsureSessionToken
rsp id=0:0 proxy=out src=10.16.2.113:39738 dst=10.16.6.244:9090 tls=true :status=200 latency=14688µs
end id=0:0 proxy=out src=10.16.2.113:39738 dst=10.16.6.244:9090 tls=true grpc-status=OK duration=82µs response-length=63B
req id=0:1 proxy=out src=10.16.2.113:55510 dst=10.16.2.118:9090 tls=true :method=POST :authority=service-sites-cache:9090 :path=/ingrid.service.config.Config/GetSite
rsp id=0:0 proxy=out src=10.16.6.24:50096 dst=10.16.0.205:9090 tls=true :status=200 latency=17418µs
end id=0:0 proxy=out src=10.16.6.24:50096 dst=10.16.0.205:9090 tls=true grpc-status=OK duration=75µs response-length=63B
This will continuously print information about the requests from and to our checkout service.
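The stream can be noisy, so it is often useful to narrow it down, for example to a single destination and path (a sketch based on the output above; see linkerd viz tap --help for the full list of filters):
linkerd viz tap deployment/service-checkout --to deployment/service-sessionauth --path /ingrid.service.sessionauth.SessionAuth/EnsureSessionToken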
Again, we can use the JSON output format and jq to get more details. For example, to print only the slow outgoing requests (the sinceRequestInit threshold is expressed in nanoseconds, so adjust it to your needs):
linkerd viz tap deployment/your-service --namespace default --to-namespace default -o json | jq '. | select( (.responseEndEvent.sinceRequestInit.nanos > 1000) and (.proxyDirection == "OUTBOUND"))'
Sample output:
{
  "source": {
    "ip": "10.16.4.212",
    "port": 51766,
    "metadata": {
      "control_plane_ns": "linkerd",
      "deployment": "your-service",
      "namespace": "default",
      "pod": "your-service-994965469-fc8d8",
      "pod_template_hash": "994965469",
      "serviceaccount": "default",
      "tls": "loopback"
    }
  },
  "destination": {
    "ip": "10.16.1.126",
    "port": 9090,
    "metadata": {
      "control_plane_ns": "linkerd",
      "deployment": "service-config",
      "namespace": "default",
      "pod": "service-config-784c8f7b5b-ff2bh",
      "pod_template_hash": "784c8f7b5b",
      "server_id": "default.default.serviceaccount.identity.linkerd.cluster.local",
      "service": "service-config",
      "serviceaccount": "default",
      "tls": "true"
    }
  },
  "routeMeta": null,
  "proxyDirection": "OUTBOUND",
  "responseEndEvent": {
    "id": {
      "base": 10628,
      "stream": 14
    },
    "sinceRequestInit": {
      "nanos": 5486208
    },
    "sinceResponseInit": {
      "nanos": 3305745
    },
    "responseBytes": 330997,
    "trailers": [
      {
        "name": "grpc-status",
        "valueStr": "0"
      },
      {
        "name": "grpc-message",
        "valueStr": ""
      }
    ],
    "grpcStatusCode": 0
  }
}
Prometheus metrics
So far, we have been using command line tools to monitor the slow requests. For looking at the historical data, we can use the Prometheus metrics exposed by Linkerd.
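If you use the Prometheus instance bundled with the linkerd-viz extension, the easiest way to experiment with the queries below is a port-forward (the namespace and service name assume a default linkerd-viz installation and may differ in your setup):
kubectl -n linkerd-viz port-forward svc/prometheus 9090:9090
The Prometheus UI is then available at http://localhost:9090.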
request_total
The first metric to look at is request_total. We can use it to get the per-second rate of requests made and received by our service, broken down by direction:
sum(rate(request_total{deployment="$service"}[1m])) by (direction)
The response from Prometheus looks like this:
| metric               | value             |
| -------------------- | ----------------- |
| direction="outbound" | 146.2688667666063 |
| direction="inbound"  | 43.61067963844134 |
By default, Prometheus will evaluate the query for the latest point in time. You can use tools like Grafana to visualise the historical data.
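If you prefer to pull the historical data yourself, you can also call the Prometheus range API directly. A sketch, assuming the port-forward from above (the timestamps and step are placeholders):
curl -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=sum(rate(request_total{deployment="your-service"}[1m])) by (direction)' \
  --data-urlencode 'start=2021-06-01T00:00:00Z' \
  --data-urlencode 'end=2021-06-01T06:00:00Z' \
  --data-urlencode 'step=60s'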
Looking at the direction label gives us a high-level overview of the traffic. Usually, we are also interested in where the outgoing requests are going. To get this information, we can look at the authority label of request_total. From the metrics' documentation:
authority: The value of the :authority (HTTP/2) or Host (HTTP/1.1) header of the request.
We can use this label to break the outgoing traffic down by destination:
sum(irate(request_total{deployment="$service", direction="outbound"}[1m])) by (authority)
This query should give you a response similar to:
| metric                                                  | value             |
| ------------------------------------------------------- | ----------------- |
| authority="service-geo.default.svc.cluster.local:9090"  | 0.037083308548817 |
| authority="169.254.169.254"                             | 0.2688667666063   |
| authority="165.72.205.188"                              | 0.202141581147407 |
In our case there are three entries in the response. Each corresponds to our deployment calling other services:
- The first entry is another internal service, specifically service-geo, our internal geocoding solution. This is a full Kubernetes service address. It means: connect to the service service-geo in the namespace default at port 9090.
- The second entry (169.254.169.254) is the Google metadata server. Ingrid is hosted on GCP (Google Cloud Platform), and our deployments connect to the metadata server to authorize to Google's APIs.
- The last entry (165.72.205.188) is an external API our service connects to.
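If you only care about the in-cluster traffic, you can narrow the authority label down to Kubernetes service addresses with a regex match, for example (a sketch; PromQL regexes are anchored, hence the trailing .*):
sum(irate(request_total{deployment="$service", direction="outbound", authority=~".+\\.svc\\.cluster\\.local.*"}[1m])) by (authority)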
More metrics
The metric request_total gives us insights about the services called by our deployment. Often, we will need more detailed information. We might want to know which endpoints in these services are used by our deployment and what the response times are.
To get the information about the endpoints, we need to set up the service profiles first. Then, we can use route_request_total like this:
sum(irate(route_request_total{deployment="$service", direction="outbound"}[1m])) by (dst, rt_route)
Notice that we used dst and rt_route instead of authority. The first one is the destination service. The second one is the route, which is Linkerd's name for the endpoint we are calling.
What about the response times? Depending on whether we have the service profiles or not, we can use route_response_latency_ms (for response times per endpoint) or response_latency_ms (for response times per service). Assuming we want the first one, this will give us the 99th percentile of the response time per endpoint:
histogram_quantile(0.99, sum(rate(route_response_latency_ms_bucket{deployment="$service", direction="outbound"}[1m])) by (le, dst, rt_route))
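Similarly, if you also want the per-route success rate (the SUCCESS column from linkerd viz routes), here is a sketch assuming the route_response_total metric and its classification label:
sum(rate(route_response_total{deployment="$service", direction="outbound", classification="success"}[1m])) by (dst, rt_route)
/
sum(rate(route_response_total{deployment="$service", direction="outbound"}[1m])) by (dst, rt_route)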
Summary
Linkerd, among other things, helps you monitor the traffic in your Kubernetes cluster. It does so in many ways. You can look at each and every request (using linkerd viz tap) or at the aggregated view (using Prometheus or linkerd viz routes). In this article, I have described a few things that I found useful when we were introducing Linkerd at Ingrid. For more information, read about telemetry and monitoring in Linkerd and the Prometheus metrics exposed by the Linkerd proxy.
Cover photo by @brettjordan on Unsplash