Service Mesh Academy

Linkerd in production 101: what you need to know

Live on
Dec 7, 2021

In this hands-on session, we’ll walk you through the basics of what you need to know to successfully deploy and operate Linkerd in production environments — from operational monitoring to setting up TLS certificates, per-route metrics, fine-grained traffic policy, and more. Whether you’re a Linkerd novice, expert, or just curious, this course will set you up for service mesh success with Linkerd, the wildly popular open source CNCF service mesh.

Transcript

(Note: this transcript has been automatically generated with light editing. It may contain errors! When in doubt, please watch the original talk!)

Welcome and logistics

Jason: Hello, and welcome, everyone. Thank you so much for joining us. I’m going to re-add the Slack link into the chat. Please join the Linkerd Slack if you haven’t had a chance. We have a workshops channel where you can go and discuss stuff with the folks that are doing the workshop, as well as some folks from Buoyant who can answer questions as we go. Today, we’re talking about running Linkerd in production, and we’ve got Charles. Charles, do you mind going to the next slide? Okay, Steven. Well, it’s workshops with an S. You have to specifically join the channel. So, we have Charles Pretzer here, who’s actually going to be leading this workshop. All right?

He is a principal field engineer here at Buoyant. He is an expert in getting folks able to be successful with Linkerd and has learned all of the tips and tricks over the years. You can find him on GitHub and Slack and Twitter, and he’ll talk to you a little bit more about himself as he gets going. Today we’re going to tell you about the workshop, and then go through everything from the checklist, what you need to know to monitor, debug, and how you’re going to handle upgrades.

A little bit about this, we are putting on a series of these workshops, and we’re calling them the Service Mesh Academy, and there’ll be a bit more announcements about that coming up soon. Today, we’re talking about the basics, what you need to know to go to prod, and there’s some assumption that you already know a little bit about Linkerd and a little bit about Kubernetes. There are lots of 101 materials that we’re happy to share with you, and if you ask for that in the workshop Slack, I’ll point you to plenty of resources.

Yeah. No, that’s fine. We can go to the next slide. It all covers an article that we published, which is our Linkerd Production Runbook, and we’ll send that in the chat as well. Thank you so much, Steven. What’s not covered? Again, the basics or really advanced topics in Linkerd, and those will have their own workshops later. All right. With that, I’m going to hand you over to Charles to learn how to get to production with Linkerd. Take it away, Charles.

Charles: Hello, everybody. Thank you for joining us, making this a truly global workshop. It’s so great to have you all here. This is going to be a very informative talk, in my opinion. As Jason mentioned, everything that we’re talking about comes from our Linkerd Runbook, which is based on the experiences and the ongoing experiences that we have with folks in the Linkerd community, as well as Buoyant customers. We saw the agenda, so now we can just jump right into what it takes to go to production, and what we’ve put together here is an essential productionization checklist. There are a few things, if you run Linkerd or kick the tires on Linkerd, you know that you can play with the CLI. It comes with its own Prometheus, a lot of very simple and easy things to get you started, which are part of the core design principles for Linkerd.

Getting ready and the repository

This going-to-production workshop is about the next step, where we dive into what it’s like to set up a cluster to run in production. Once you have a production cluster, we talk about deploying Linkerd to that cluster, and this checklist is something that you can use to get you started. We’ve got a companion repository that goes along with this, and I believe the link was in the email that was sent to you all. If you’ve got the link to the repository, you’ll see a file in there called create.sh. It’s going to do some dependency checks, and then, as long as you’ve got all the necessary binaries, it will deploy the cluster for you.

I encourage you, if you’re going to follow along in the workshop, to do that sooner rather than later. We’ve got probably 10 or 15 minutes before we get to the hands-on section, so you want to make sure that that cluster’s up and running if you want to follow along. With that being said, the reason I bring up that script is that it sets up, in your local environment, what would be a production type of environment. The script creates the cluster, deploys Linkerd using Helm, not the linkerd install command, and deploys it in high availability mode. If you look at the script, you’ll see that I’ve added a couple of custom configurations to enable the debug log levels for the control plane components, and that’s something that we’ll take a look at in the hands-on part.

In addition to those two items, going down this checklist, you’ll want to use your own image registry, but we’ll talk about that in another slide coming up. We’ll set up monitoring and alerting. We’ll discuss that as well, and then the rest of these topics, understanding how to debug Linkerd, that’s going to be our hands-on part, and understanding how to install and upgrade Linkerd, again, we’ll cover that. So, the thing that we’re not covering in this workshop is planning for and documenting your certificate management. Certificate management is one of those advanced topics that Jason just mentioned that we’ll be covering separately. So, this is your basic checklist. I promise you, with this, you can get to production and have a happy, healthy cluster running with all of the observability, security, and reliability features that you’re used to from Linkerd.

High availability

When we talk about high availability mode, what we’re doing is the values-ha.yaml file, which is in the repository, in the Linkerd chart, deploys three replicas of the critical control plane components. Today for Linkerd 2.11, that’s the destination controller, the proxy injector, and the identity component. So, within the destination component, there are multiple containers. That’s not relevant for high availability mode. What’s important to know is that, in a production type environment, you should have at least three nodes that are going to be handling traffic, which means typically, you don’t want to deploy any Linkerd control plane components to your Kubernetes control plane node, so even though the script that I have creates three nodes, typically, a production environment would have at least four nodes, and then you’d deploy Linkerd in HA mode, which means then that you get one of each of the Linkerd control plane components on each node so that in the event of a failure of one of those nodes, or two of those nodes, you still have one or two of the control plane component replicas running.

High availability is a fairly common topic, so the point here is to eliminate single points of failure. In high availability mode, we need to make sure that the proxy injector is up before any of your application pods can be injected with the Linkerd proxy. The way that this works is that there’s a webhook that detects whether a workload should be injected with the Linkerd proxy, and the proxy injector has to be up and running and healthy in order to do the work that it needs to do to inject the proxy YAML into your pod definition. High availability mode also includes node anti-affinity, which means that no single node will have more than one replica of each of the control plane components, and that is the reason we need a minimum of three nodes for these three replicas. So, this anti-affinity ensures, again, the separation of those control plane components, so that in the event of a node failure, the control plane will still be up and running.

Setting resource requests and limits

Finally, it sets resource requests and limits on control plane components. What we’ve got set in our values-ha.yaml file is the basics for running in production. We’ve certainly seen through working with other members of the community and other companies that sometimes these resource requests and limits need to be adjusted based on your workload. I’ll talk about that a little bit more when we get into the monitoring and alerting section of this workshop. Finally, one of the requirements that you’ll see in the create.sh file is that we add a label to the kube-system namespace as a whole. You don’t want Linkerd to inject the pods in your kube-system namespace. We treat those as being very sacred, and Linkerd shouldn’t be handling the traffic for kube-system.

So, we add a special label there, config.linkerd.io/admission-webhooks=disabled, to the kube-system namespace, again, to make sure that those pods are not injected. Technically, you can use this label on any namespace that you want. We only require it for kube-system. I mentioned earlier about using your own image registry. The main reason for this is that you don’t want to rely on a public registry for downloading your images. In a production environment, you want to make sure things are up and running. When those images are pulled, you want to make sure that they are there, that they’re reliable, and the best way to do that is to run your own registry that you have control over so that, if something goes wrong, if it’s down for some reason, you can lean over to the person that you’re working with, or send an email or Slack message to one of your coworkers, and say, “Can you check on this registry? It seems to not be working. We need those images.”
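As a minimal sketch of that labeling step, the snippet below builds the kubectl command that adds the label to a namespace. The helper name is hypothetical and is not taken from the workshop repository; the command is only printed here, not executed.

```python
# Label named in the talk: namespaces carrying it are skipped by the
# Linkerd proxy injector's admission webhook.
LABEL = "config.linkerd.io/admission-webhooks=disabled"


def label_namespace_cmd(namespace="kube-system"):
    """Build (but do not run) the kubectl command that labels a namespace."""
    # --overwrite makes the command safe to re-run if the label already exists.
    return ["kubectl", "label", "namespace", namespace, LABEL, "--overwrite"]


# Print the command so it can be reviewed or pasted into a CI/CD step.
print(" ".join(label_namespace_cmd()))
```

Passing the list to subprocess.run(cmd, check=True) would apply it against whichever cluster the current kubectl context points at.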

The process here is pretty straightforward. What we don’t cover here is automating it, but it can be done very easily with scripting where you would do a docker pull from the Linkerd repository anytime a release is out, usually a stable release, but if you’re using edges, that’s great. Then you would docker pull from the public repository, docker push to your image repository. Very straightforward. If you have any questions about this, please drop them in Slack.
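The scripting mentioned here could look roughly like the sketch below, which generates the pull, tag, and push commands for a list of images. The tag step is added because an image has to be renamed for the target registry before it can be pushed. The image names, the release tag, and registry.example.com are illustrative assumptions; substitute the images and registry for the release you actually run.

```python
def mirror_commands(images, source_registry, target_registry):
    """Return the docker commands that copy each image into a private registry."""
    commands = []
    for image in images:
        src = f"{source_registry}/{image}"
        dst = f"{target_registry}/{image}"
        commands.append(["docker", "pull", src])
        commands.append(["docker", "tag", src, dst])  # rename for the private registry
        commands.append(["docker", "push", dst])
    return commands


# Assumed example values, not taken from the workshop repository.
EXAMPLE_IMAGES = ["linkerd/proxy:stable-2.11.1", "linkerd/controller:stable-2.11.1"]

for cmd in mirror_commands(EXAMPLE_IMAGES, "cr.l5d.io", "registry.example.com"):
    print(" ".join(cmd))
```

Generating the commands as data keeps the sketch easy to dry-run; a release pipeline could execute each one with subprocess.run(cmd, check=True).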

Monitoring Linkerd

Okay. Let’s jump into monitoring Linkerd. This workshop is all about checklists, so we’ve got another checklist for monitoring. When we’re talking about monitoring, this is separate from the observability features that we often associate with monitoring with Linkerd, when you’re monitoring your application itself. What we’re talking about is monitoring the Linkerd control plane. So, the first thing we want to do is figure out where you’re going to put your metrics. Again, if you’ve used Linkerd in a development environment, you know that when deploying Linkerd-viz, or in previous versions of Linkerd in the control plane, we included an instance of Prometheus, and we still do that today in Linkerd-viz.

This is meant for development purposes only. When you are running in production, you’re going to want to have those metrics available to you for much longer periods than what Linkerd Prometheus offers, and you’ll also want them to be stored permanently. So, one of the things that we want to make sure that you’re doing is figure out where you’re putting your metrics. You can put them into an off-cluster resource like an external Prometheus, either cloud-hosted or one that you run yourself, or there are plenty of third-party choices like Buoyant Cloud, Datadog, New Relic. I’ve seen folks put them into, I believe, Splunk, even. You can store stuff there. So, plenty of options for you. We want you to use what is going to be best for you.

So, that being said, Buoyant Cloud will do a lot of this for you, including the next item on the checklist, which is alerting on control plane and data plane components. So, one of the main things that we want to understand in any application is the health of the services themselves. So, for the Linkerd components, like any other container or service that runs code, we want to make sure that those are healthy. For the most part, the Linkerd control plane components should always be 100% healthy. If they aren’t, we want to know why, and that’s where the alerting comes in.

So, there are options that you can use to set those up as well. Again, there are so many that we’re not going to cover any one in particular here. One of the other things we want to make sure that you understand, and that I touched on just a minute ago, is that the latency and resource usage need to be determined empirically, and that goes back to setting those resource limits in the values-ha file. So, I’ll show you an example in just a second of what one of the Linkerd control plane components looks like with a well-known number of workloads and requests per second. These are things that you’re going to have to suss out for yourself in your own environment based on the number of replicas, the number of workloads, and the amount of traffic that you have, in order to understand what those limits need to be set at, and then you can set the alerts appropriately.

TLS certificate expirations

Finally, TLS certificate expirations are something you want an alert on as well. If your certificates expire while Linkerd or the cluster is running, that is real bad. So, it’s important to give yourself enough time. With linkerd check, we look for certificates that expire within 60 days. This is just a guideline, but we definitely suggest that you pick a duration that matches your team, that gives them time to go in and rotate those certificates, and then alert on that appropriately.
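The expiry rule above can be sketched as a simple date comparison. This is an illustration of the 60-day guideline only, not the implementation of linkerd check; the function name and the sample dates are hypothetical, and the expiry date would normally be read from the certificate itself.

```python
from datetime import datetime, timedelta, timezone


def expires_soon(not_after, now, window_days=60):
    """True if the certificate expires within `window_days` of `now`, or already has."""
    return not_after - now <= timedelta(days=window_days)


# Hypothetical dates for illustration.
now = datetime(2021, 12, 7, tzinfo=timezone.utc)
near = datetime(2022, 1, 15, tzinfo=timezone.utc)  # 39 days away: should alert
far = datetime(2022, 12, 7, tzinfo=timezone.utc)   # a year away: should not alert

print(expires_soon(near, now), expires_soon(far, now))
```

The window is a parameter so that a team can choose a duration that leaves it enough time to rotate certificates.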

So, here, when we’re talking about alerting, we see alerts inside of Buoyant Cloud, and there are alerts that are set here based on success rates and P95 latencies, and we’ve categorized them as sev-two, sev-three. Just at the top here we see one on the success rate for the Linkerd destination component (we’re going to pick on the destination component a lot today). If that ever goes below 99%, send me a sev-two alert. We also have one on latency. So, if the latency is ever greater than 50 milliseconds, send me a sev-two alert. This is an example of the kind of alerting that you want, and again, these are things that you will have to decide empirically, based on your application’s performance and resource usage.
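The two example rules can be written down as a small threshold check. This is a sketch of the logic only, not Buoyant Cloud’s alerting configuration; the function name is hypothetical, the thresholds are the ones quoted in the talk, and "sev-2" is used here as shorthand for a severity-two alert.

```python
def destination_alerts(success_rate, p95_latency_ms,
                       min_success_rate=0.99, max_p95_latency_ms=50.0):
    """Return (severity, message) pairs for the destination component."""
    alerts = []
    if success_rate < min_success_rate:
        alerts.append(("sev-2", f"success rate {success_rate:.2%} is below {min_success_rate:.0%}"))
    if p95_latency_ms > max_p95_latency_ms:
        alerts.append(("sev-2", f"P95 latency {p95_latency_ms:.0f} ms is above {max_p95_latency_ms:.0f} ms"))
    return alerts


# A healthy sample produces no alerts; a degraded one trips both rules.
print(destination_alerts(0.999, 12.0))
print(destination_alerts(0.985, 80.0))
```

Both thresholds are parameters because, as the talk stresses, the right values have to be determined empirically for each environment.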

Here, again, we see some graphs that are part of Buoyant Cloud, and I told you just a minute ago that we would give you a concrete example of the amount of resources that are used based on the number of workloads and the throughput within an application. So, this is a real-life example of a medium cluster. We can see that the destination component is using roughly 820 megabytes of memory. The proxy injector is using roughly 420 megabytes, and the identity component is using roughly 165 megabytes. The CPU for all of these is below .02 usage per core, and this is based on about 500 pods and 2,000 requests per second. So, depending on your throughput, your requests per second, and the number of pods that you have running, this will be either bigger or smaller, and you can adjust those resource limits appropriately so that Kubernetes doesn’t go in and evict those pods for using too many resources.

So, I wish that I could say, “Here’s a formula that will get it done for you,” but unfortunately, applications are so different. Usage by workloads, by services, is so different across applications that there is no silver bullet for these things. So, you’ll have to do some iterating and figure out which resource limits and requests are appropriate for your environment. It’s an investment that’s worth doing, and there are no secrets to it. It’s very straightforward once you understand the profile of your application. So, if there are any questions about that, again, please drop them in Slack or in the chat, and we’ll keep moving on here into the debugging section.

Debugging

So, if you haven’t already, now would be the time to run that create script to get your cluster up and running, because, after the next few slides, that’s when I’m going to jump into the hands-on part of the workshop, because debugging Linkerd is… When I’m not doing workshops or presentations or working with some of the folks on the team here, I’m usually chatting with somebody in Slack or working with one of our customers, helping them through Linkerd issues or debugging Linkerd to understand what’s happening within their environment.

So, we have a number of tools, and the purpose of this section is to present to you the tools that you can use to debug the Linkerd control plane. This is different than some of the other talks that we’ve done where we use Linkerd to debug your application itself. In this case, we’re treating the Linkerd control plane as the application, and we’re going to use some tools to understand what’s happening in the event that the control plane is acting up, or even the data plane for that matter. So, you can start debugging the Linkerd control plane using the tools that you already use and are familiar with today. One of those tools is kubectl, and two of the commands that it has are kubectl get events and kubectl logs.

Charles Pretzer: Anytime that you see a control plane component that is in a state other than running, when I’m debugging with customers or with the Linkerd community, the first thing that I’ll look at is the output from kubectl get events for the Linkerd namespace specifically, and this is what we’ll look at in the hands-on section. This will tell you if there is a problem communicating with the API server in some cases. That can happen for many different reasons, but the events are a good place to start, and oftentimes will have the message to point you to the solution that you need in order to get the control plane into a healthy state and get those Pods up and running.

Charles Pretzer: Oftentimes, in conjunction with kubectl get events, we can identify a specific workload or a specific Pod that isn’t starting up for some reason, and we can also see that in the Pod status, and we can use kubectl logs to view the container logs. Now, I mentioned in the create script, my values-ha.yaml file has a couple of configurations, custom for this particular deployment, and those configurations are to set the log levels of the control plane components to trace log levels, which can be pretty verbose. So, this is not normally something that I would do in production, but for the purposes of this workshop, it lets us see, A, what it’s like to specify a custom configuration file, and B, those logs without having to go in and restart any of the workloads.

Charles Pretzer: In a production environment, you would leave those log levels where they are, and if an event happened, we would then go in and set those log levels to something that will output more information for us so that we can then use kubectl logs to view those container logs. Going beyond the basics, we’ve got the Linkerd CLI. So, we mentioned that we’re not going to use the Linkerd CLI to install Linkerd. What we are going to use the Linkerd CLI for is the debugging process. So, for those tools, and we’ll walk through each of those in the hands-on, we’ve got linkerd check, linkerd diagnostics, linkerd viz tap, linkerd identity, and linkerd authz. I’ll go through each of these in just a few minutes.

Charles Pretzer: I mentioned log levels earlier. There are two places… Because you’ve got the control plane and the data plane, we have the ability to control log levels in each one of those. The log levels that I just mentioned in my values-ha.yaml configuration file… Actually, sorry. It’s a values.yaml file for the control plane configuration. The proxy log levels, you have multiple ways of configuring those. One is globally in the Helm chart, and this generally… In a production environment, I wouldn’t do this, but it’s possible to do. The reason for this is because when you’re debugging data plane components, you don’t necessarily need to have all of the proxies set to a verbose log level like trace or debug. You really are looking at one workload and the replicas within that.

Charles Pretzer: So, for that we can set it per workload or per namespace through the annotation config.linkerd.io/proxy-log-level, and in this case I’ve got it set to warn, which is for all of the Rust components, the crates. Those will be set to warn, which is a less verbose level, and then anything in the linkerd2_proxy namespace will have a trace log level, which is very verbose, and again, this is not something that I would run in production for a long time, but it is a tool that we’ve used in the past and that we frequently use to understand the traffic that the Linkerd proxy is handling, both on the inbound side when it’s receiving requests, and on the outbound side.
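As a sketch of how that annotation value is put together, the snippet below joins a default level with per-target overrides and wraps the result in a patch for a workload’s pod template. The helper name is hypothetical, and the patch shape assumes the annotation is set on a Deployment’s pod template; the annotation key and the warn plus linkerd2_proxy=trace levels are the ones described in the talk.

```python
import json

ANNOTATION = "config.linkerd.io/proxy-log-level"


def proxy_log_level(default, overrides=None):
    """Join a default level and per-target overrides into one filter string."""
    parts = [default]
    for target, level in (overrides or {}).items():
        parts.append(f"{target}={level}")
    return ",".join(parts)


# Everything at warn, except the linkerd2_proxy target at trace.
value = proxy_log_level("warn", {"linkerd2_proxy": "trace"})

# Merge-patch body for a Deployment's pod template (assumed workload type).
patch = {"spec": {"template": {"metadata": {"annotations": {ANNOTATION: value}}}}}
print(json.dumps(patch))
```

The printed JSON could be supplied to kubectl patch with --type merge; changing the pod template rolls the workload, so new Pods start with the requested level.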

Charles Pretzer: So, one of the really neat things, one of my favorite things about the Linkerd proxy, is that you can actually configure this at runtime. We’re not going to do this today, but the Linkerd proxy has a configuration endpoint, the Linkerd admin endpoint, and when you port forward to that port, you can send a PUT request that uses the same syntax, this Rust syntax for setting log levels, and in this case we’re setting linkerd=debug, which means any Linkerd crate within the proxy, we’re going to set that to debug, and we send that PUT request, and that will configure the proxy log level in real time. This is super helpful when we see one particular proxy acting up within a set of replicas, so that’s indicative of an event that is not systemic, meaning that the service itself is fine. There’s just one replica that’s acting up. So, this gives us some really fine-grained control here.
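A minimal sketch of that PUT request follows. The talk does not give the port or path; 4191 and /proxy-log-level are stated here as assumptions based on the Linkerd documentation, so verify them against the version you run. The request is only built and printed, not sent.

```python
import urllib.request


def log_level_request(level_filter, host="localhost", admin_port=4191):
    """Build the PUT request that changes a running proxy's log level."""
    return urllib.request.Request(
        url=f"http://{host}:{admin_port}/proxy-log-level",  # assumed admin path
        data=level_filter.encode("utf-8"),
        method="PUT",
    )


req = log_level_request("linkerd=debug")
print(req.get_method(), req.full_url, req.data)

# To send it, port forward to the Pod's admin port in another terminal first,
# then call: urllib.request.urlopen(req)
```

Because the change applies to a single running proxy, it affects only the one replica being investigated and is gone when that Pod restarts.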

Charles Pretzer: So, going from top to bottom, we can set everything, we can set it for a specific workload or all the Pods in a namespace, or we can set the proxy log level at a specific container. So, this is very valuable and very helpful, and I can’t count on my hands the number of times that I’ve worked through this with folks, and even done this myself. The control plane components, I mentioned these earlier. The only real takeaway that you want from this slide is that the policy controller, which is new in 2.11, is written in Rust, and so it uses that same Rust syntax, linkerd=info,warn. The other components that are written in Go have a probably more familiar syntax for setting the log level, where you specify just the name: error, warn, info, debug, and trace. From left to right, that is least verbose to most verbose.

Charles Pretzer: Just like the proxy log level, you can set them globally in the Helm chart, which is what I’ve done in that values.yaml file in the repository, or you can do it per workload by modifying the individual Helm templates. So, that is the basics of debugging the Linkerd control plane. One other thing that I want to mention to you is that we have this Linkerd debug sidecar, which is a really powerful tool, but with great power comes great responsibility, I think is how they say it, and so this has to be run as a root user, but it’s a container that has tshark, tcpdump, lsof, and iproute2 binaries. So, I’ve used this with folks to capture TCP dump packets and inspect those packets individually when there’s something that’s really, really funky going on.

Charles Pretzer: I don’t want to call it a last-ditch effort, but this is when you’re using the Linkerd debug sidecar, it’s really powerful, and it’s something that we’re using when we’re really down in the weeds, trying to understand that packet level communication. Oftentimes, actually, all the time, when it’s between the Linkerd proxy and the service container, Linkerd proxy container and service container within the same Pod, so we’re looking a lot of times at just that local host traffic that’s happening within a Pod itself, between two containers. There’s a link here for you to explore more. Again, really powerful tool, and we mention it here because we’re talking about overall toolkits and what’s available to us. So, with that, let’s jump into the hands-on section. I’m going to jump over to my terminal here. [crosstalk 00:27:09].

Jason: Hey, Charles?

Charles Pretzer: Yeah?

Jason: Sorry to interrupt. I just wanted to ask folks who are listening… One, we’ve had a super active chat, so thank you very much. Does anyone have any questions for Charles before we go further? Just give me a second.

Charles Pretzer: I’m going to take a sip of water while… Yeah.

Jason: Yeah. Good water break time. All right. Sorry to interrupt you, Charles. I think we can carry on. Thanks, folks, and again, continue to ask stuff in the chat. Love to help you and make sure that you’re able to successfully complete the demo.

Charles Pretzer: Yeah. Great. So, I showed you the create script, so there’s nothing magical going on here. This is just a very quick and dirty script to get you started. This is a bit difficult to read, but what we’re doing here… There are some important parts here. Give me one second to pull over my editor because it’s going to be easier to read in there. There we go. Okay. So, this is the debugging steps. We want to look at the create script. There we go, much easier to read. I’ll make it a little bit bigger.

Jason: Charles, a couple people who are doing the create script have said that they need to independently run the command, Helm repo add Linkerd. So, Steven posted about that in the chat, and there’s also some commentary in the workshop, if you’re following along.

Charles Pretzer: I haven’t looked, but I will… Yeah, that’s a good point. We do check to see if Helm is installed, but because I already have the Linkerd repo added to my machine, I failed to add that check. So, PRs are welcome. If not, I’ll get it fixed right after this. I see some other chats in there. Thanks, Steven. Appreciate the PR. I just want to walk you through this quickly. The checks for the binaries are straightforward. What we create is three servers, a three-node Cluster using k3d, and then we create our certificate. So, we talked before about certificate management. The important thing to note here is that when you deploy Linkerd using Helm, we have to specify our own certificates, which is different from the Linkerd CLI.

Charles Pretzer: Again, what’s important here is to understand that there’s a trust anchor, which is this [CACert 00:29:55] and [CAKey 00:29:55], and from that CACert and CAKey, or certificate and key, are the issuer certificate and key, which are created as [inaudible 00:30:08] certificates. We’ll cover all those in the upcoming workshop about certificate management. The important thing to know here is that when we do Helm install, we have to make sure that we provide those to the Helm, sorry, as values to the Helm file, or Helm installation command.
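As a rough sketch of what providing those values looks like, the snippet below builds a Helm install command that passes the trust anchor and the issuer certificate and key. The value names are given as recalled from the Linkerd 2.11 Helm chart and should be checked against your chart version; the file names are hypothetical placeholders, and this is not a copy of the workshop’s create.sh.

```python
from datetime import datetime, timedelta, timezone


def issuer_expiry(now, valid_hours=8760):
    """Format an expiry timestamp; it must match the issuer certificate's real expiry."""
    return (now + timedelta(hours=valid_hours)).strftime("%Y-%m-%dT%H:%M:%SZ")


def helm_install_cmd(expiry, ca_crt="ca.crt", issuer_crt="issuer.crt",
                     issuer_key="issuer.key", ha_values="values-ha.yaml"):
    """Build (but do not run) a Helm install that supplies our own certificates."""
    return [
        "helm", "install", "linkerd2", "linkerd/linkerd2",
        "--set-file", f"identityTrustAnchorsPEM={ca_crt}",        # trust anchor
        "--set-file", f"identity.issuer.tls.crtPEM={issuer_crt}",  # issuer certificate
        "--set-file", f"identity.issuer.tls.keyPEM={issuer_key}",  # issuer key
        "--set", f"identity.issuer.crtExpiry={expiry}",
        "-f", ha_values,                                           # high availability values
    ]


expiry = issuer_expiry(datetime(2021, 12, 7, tzinfo=timezone.utc))
print(" ".join(helm_install_cmd(expiry)))
```

Computing the timestamp in Python rather than with the date command gives the same result on macOS and Linux.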

Charles Pretzer: The other thing to take note of here is that I also added the namespace label for kube-system just to show you that it’s there and that it has to be done. However you end up configuring this in your CI/CD system, you’ll just want to make sure that you add that label at some point. Yeah. So, that’s that script, very simple, and again, a similar thing for deploying Linkerd-viz. I put these out into two different scripts just to make them a little more easy to comprehend and wrap your mind around. They could be in the same script if we wanted them to. So, going to take a quick look into the chat to make sure that everybody is at least able to get started.

Jason: There was another reminder from Nathan. A couple folks have run into an issue on macOS where the script uses the date command and the macOS has [gdate 00:31:36].

Charles Pretzer: Okay.

Jason: So, we’ll post something about that in the workshop Slack as well, but you may need to do an edit if you’re on macOS.

Charles Pretzer: Good point, and our docs actually have the command for generating the date for both macOS and Linux. This is a Linux script, but we’ll get that updated as well to do some iOS, or, sorry, OS detection. Thanks for pointing that out. Okay. I also have created this… Let’s look at debugging.md. This is kind of a quick and dirty cheat sheet for all of the commands that I mentioned in the slide. So, if we take a quick look at the events that I mentioned, we’ll go in order from the kubectl commands through the Linkerd commands, and those are the steps that we’ll go through.

Charles Pretzer: So, first thing I want to do is look at the events for the Linkerd namespace. What I find really helpful is to sort by last timestamp. Otherwise, they get all jumbled up. But in this case… Oh, no resource found in Linkerd namespace? Oh, you know why? Because you’ll see that I was working last night, and there have been… All the events have expired since then. So, I’ll just restart these deployments, and while that’s happening, we can take a look at some of the other parts of the script. So, actually, I should have events now already, and again, the only reason I did that is because the previous events in there had expired. We want to get some new events so that we can actually see what’s going on here, and this is what a healthy set of events looks like when the Linkerd control plane is starting, aside from the killing because I’ve restarted the Pods.

Charles Pretzer: But you’ll see created, started, pulled. One of the things that we want to look for when debugging the control plane is anything that is not normal. So, here we have a warning. This is failed scheduling just based on the anti-affinity rules. That’s because all the Pods were already running on the nodes, and one needed to be terminated before another could be started. So, when you’re using kubectl get events, anything that’s not normal is going to stand out, and it’s something that you want to look at. Sorting by timestamp is going to make your life much easier, rather than having to go through and look at when an event was last written.
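The "anything that is not normal" rule can be sketched as a filter over the JSON form of the event list, as produced by kubectl get events -n linkerd -o json. The function name and the sample events below are made up for illustration; type, reason, and lastTimestamp are standard fields on Kubernetes events.

```python
import json


def notable_events(events_json):
    """Return events whose type is not Normal, oldest first."""
    items = json.loads(events_json)["items"]
    flagged = [e for e in items if e.get("type") != "Normal"]
    # Timestamps are UTC strings in a fixed format, so string order is time order.
    return sorted(flagged, key=lambda e: e.get("lastTimestamp") or "")


# Hypothetical sample shaped like `kubectl get events -o json` output.
sample = json.dumps({"items": [
    {"type": "Normal", "reason": "Started", "lastTimestamp": "2021-12-07T17:01:05Z"},
    {"type": "Warning", "reason": "FailedScheduling", "lastTimestamp": "2021-12-07T17:01:02Z"},
    {"type": "Normal", "reason": "Pulled", "lastTimestamp": "2021-12-07T17:01:03Z"},
    {"type": "Warning", "reason": "Unhealthy", "lastTimestamp": "2021-12-07T17:00:40Z"},
]})

for event in notable_events(sample):
    print(event["lastTimestamp"], event["type"], event["reason"])
```

For quick interactive use, kubectl get events -n linkerd --sort-by=.lastTimestamp gives the same time ordering directly in the terminal.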

Charles Pretzer: Okay. So, next thing we want to get into is the logs. So, I’m going to pick on the identity Pod today, grab one of our replicas out of the namespace here, Linkerd namespace. This is one of the identity replicas that’s running, and what we’re going to do is… Oops, wrong window. Sorry about that. What we’re going to do is we’ll get the logs for each of the containers within the identity component. There’s the Linkerd identity component itself, and that’s responsible for issuing certificates to the… Sorry. When it receives a certificate signing request from a proxy, then it signs that certificate and sends it back to the proxy, and that’s used for mutual TLS.

Charles Pretzer: There’s also the Linkerd proxy itself, so that tells us when it’s handling inbound and outbound traffic, and then another one that sometimes we have to debug, but not often, is the linkerd-init container, and that’s responsible for setting the iptables rules. A lot of times when we see issues with components not being able to communicate with each other, we have to take a look at the iptables rules and the output from the linkerd-init script to make sure that those iptables rules were set up correctly. So, let me do… We’ll look at the logs for the identity component, specifically the… Oops. You probably can’t see the Zoom toolbar. It’s in the way there, but it’s preventing me from typing. I’m kidding. My bad typing is because my fingers don’t work well in the morning.

Charles Pretzer: kubectl logs, Linkerd identity, -n linkerd, and if I do this without specifying a container… What did I do wrong here? It’s just kubectl logs, isn’t it? [inaudible 00:36:52], so another fix on my side. I will get that done. So, if we don’t specify a container, and you’ve probably seen this before, it tells us that we need to specify a container. I often do this just to see which containers are available. So, first we’ll look at identity, and there’s nothing abnormal happening here, but what I want to show you is that we can confirm that our log levels have been set because we see that these debug messages are in there. By default, the Linkerd control plane components will log, I believe, just info messages.
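
As a rough illustration of pulling logs container by container, the sketch below only builds the kubectl logs command lines for the three containers named in the talk (identity, linkerd-proxy, linkerd-init). The Pod name is a placeholder, and nothing here is taken from the workshop repository.

```python
from typing import List

# Container names mentioned in the talk for the identity Pod.
IDENTITY_CONTAINERS = ["identity", "linkerd-proxy", "linkerd-init"]


def logs_command(pod: str, container: str, namespace: str = "linkerd") -> List[str]:
    """Build the argv for `kubectl logs` against one container of a Pod."""
    return ["kubectl", "logs", pod, "-n", namespace, "-c", container]


# "linkerd-identity-example" is a placeholder Pod name, not a real one.
for name in IDENTITY_CONTAINERS:
    print(" ".join(logs_command("linkerd-identity-example", name)))
```

Returning an argv list rather than a single string keeps the helper safe to hand to subprocess.run without shell quoting.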

Charles Pretzer: So, here we can see that debug is set. In fact… Oops. Let’s see. What I want to look at is… Well, we don’t need to do that. I was just going to show you, if I looked at the manifest file for the deployment and for the log level, you would be able to see that the log levels were set to the values in my values-ha.yaml file. So, that is with the identity component logs. In trace mode, it will tell you any time a certificate has been requested, which proxy has requested it, and the associated service, as well as the identity associated with that service, and it will tell you that it has issued that certificate, or in the case where it’s not issuing certificates, it will tell you why it hasn’t issued the certificate. So, this is a valuable tool for understanding how proxies receive their certificates, and if for some reason they aren’t, we can debug the identity component.

Charles Pretzer: Let’s take a look at some of the Linkerd proxy logs, linkerd-proxy, and this is, again, a very typical startup for a Linkerd proxy. If you’re debugging something that’s actively receiving traffic, the trace level output is… You can’t read it because the screen is moving too quickly. So, again, this is something that we do for a short period of time, and we capture those logs somewhere where we can analyze them. Typically, what happens is that if we see unexpected behavior, we set the log level. We reproduce the behavior, unset the log level, and then analyze those logs. So, that’s the typical debugging workflow that we work through with folks in the community and with our customers.

Charles Pretzer: Finally, just because I want to show you what the init container logs, the script itself outputs all of the iptables rules. So, it tells you what the current state of iptables is, and here we see that there were no iptables rules that were already set. This can be different if you’ve got multiple init containers that are manipulating iptables, and that’s one of the reasons that we write out the existing rules first, to be able to debug any potential conflicts or routing issues from other containers. Then it actually writes all of the commands, the actual iptables commands that are used by the Linkerd proxy. Finally, it shows you the output from iptables once everything is done, that is, once the script has run completely. So, this is a really helpful tool in understanding the low-level iptables routing and network behavior that happens within the Pod itself.

Charles Pretzer: So, that’s the basics of using kubectl logs and kubectl get events. Let’s jump into the Linkerd CLI for debugging. I’ll run through these quickly. You’ve got linkerd check. This is the basic… If anything goes wrong with the Linkerd control plane, we want to run linkerd check. This is going to keep running. It checks the extensions as well if any extensions are deployed, and if there’s anything in here that’s unexpected, then it’s going to tell you. For example, my control plane proxies are not running the current version, and I think this is a bug that was fixed in a recent edge. Let’s see. We’ve got linkerd check, linkerd version. linkerd diagnostics is a really powerful tool, so let me get another… Let me see what I have. Yeah, perfect. Okay.

Charles Pretzer: So, we want to do linkerd… Let me move this up so it’s more readable. dg is short for diagnostics. We want to get proxy-metrics. Sorry. I’ve lost it behind Zoom again. Let me make this a little bit smaller. Okay. We want to get it off of that identity Pod. So, this is going to give you the raw Prometheus metrics, and where we use this a lot is looking at the TCP connections. So, it’s a bit difficult to read, but here, if we see something where connections are being refused by the Pod, a lot of times what we’ll look at is the tcp_close_total compared to the… Let’s see. There’s a tcp_open_total. Having a hard time seeing it here. Sorry about that. Oh, here we go, tcp_open_total. We compare that with tcp_close_total. If there’s a big disparity, we know that there are a bunch of open connections, and we also display the open connections. So, this is a tool that I’ve used to debug issues where we see connections being refused or closed by the service container.
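
The comparison of opened and closed connections can be automated once the raw metrics are saved to text. The sketch below is a minimal, assumption-laden example: it expects Prometheus text-format lines for tcp_open_total and tcp_close_total with a direction label, and the sample values are invented rather than copied from the demo.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches lines such as: tcp_open_total{direction="inbound",peer="src"} 42
_LINE = re.compile(r'^(tcp_open_total|tcp_close_total)(?:\{(.*)\})?\s+([0-9.eE+-]+)$')
_DIRECTION = re.compile(r'direction="([^"]*)"')


def open_minus_closed(metrics_text: str) -> Dict[str, float]:
    """Sum opened and closed TCP connection counters per direction and return the gap."""
    gap: Dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and unrelated metrics
        name, labels, value = match.groups()
        found = _DIRECTION.search(labels or "")
        direction = found.group(1) if found else "unknown"
        gap[direction] += float(value) if name == "tcp_open_total" else -float(value)
    return dict(gap)


# Invented sample in Prometheus text format.
SAMPLE = """\
# TYPE tcp_open_total counter
tcp_open_total{direction="inbound",peer="src"} 120
tcp_close_total{direction="inbound",peer="src",errno=""} 118
tcp_open_total{direction="outbound",peer="dst"} 40
tcp_close_total{direction="outbound",peer="dst",errno=""} 10
tcp_close_total{direction="outbound",peer="dst",errno="ECONNREFUSED"} 5
"""

print(open_minus_closed(SAMPLE))
```

A large positive gap for one direction suggests many connections still open on that side, which is the disparity described above.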

Charles Pretzer: Another one that I want to go through is the linkerd identity command, and this gives you the ability to directly inspect the certificates that are associated with a Linkerd proxy. So, here we see the identity component has… There are three replicas of the identity component, and the certificates that they’re using show our common name. Again, this is an inspection of the certificate. It will tell you when it expires. We’ll use this to check that the certificates are healthy in the event that two Pods can’t communicate with each other.

Charles Pretzer: Finally, new in Linkerd 2.11 is the linkerd authz command, and this tells you all of the ServerAuthorization and Server CRDs that are currently active in your environment. So, this is specifically for looking at policy and understanding which policies are in place and which Pods can communicate with each other, or which services can communicate with each other. So, in this case we can see that we’ve got servers and authorizations for all of the components in the linkerd-viz namespace, and if we were to look at those separately, we would be able to look at what’s allowed and denied. That’s all part of a different workshop.

Charles Pretzer: We went over this in one of the community meetings, I think it was September or October, where we talked about policies and understanding the syntax behind those. So, those are the main tools for debugging using the Linkerd CLI. In some cases, we’ve had to dig into the [inaudible 00:45:16] logs. For seasoned Kubernetes operators, this is something that… It’s never comfortable, but it should be something that you understand how to do, looking at the [inaudible 00:45:28] logs themselves to see if there is anything going on in the system that looks suspicious. So, that is how we debug Linkerd, the control plane and the data plane, in production.

Charles Pretzer: Let’s jump back over to our slides, and I hope you were all able to follow along with that. Apologies that the script was Linux-focused. I hope that you all will be able to follow along on your own. So, let’s take a quick look at upgrading and rolling back, and then we’ll jump into some questions. So, upgrading Linkerd, this is another checklist. We just forgot to put the word checklist there. Always test in a lower environment. This is nothing new here. The data plane is designed to work with future control planes, so you upgrade the control plane first, and then gradually restart the Pods that are meshed with Linkerd so that they get the new version of the proxy.

Charles Pretzer: So, a really good example of this is here, where we show upgrading 2.10.2 to 2.11.1. You upgrade the control plane to 2.11.1. Those proxies that are still running 2.10.2 will continue to function, and on your own time you can go through and restart those workloads so that those Pods get the 2.11.1 proxy. So, the compatibility is there. That’s one of the key features that is tested for when a new version of Linkerd is being built, whether it’s an edge or a stable. One of the questions we get a lot is about upgrading Linkerd, and one of the things I’ve seen happen is folks will leapfrog versions. Always, always, always, please, always upgrade sequentially.

Charles Pretzer: So, if you’re on 2.9, go to 2.10, then 2.11. Don’t skip 2.10 to get to 2.11. This is just because there are changes that happen across versions, and we document these all very well in the upgrade notes. If there’s anything in there, specifically a breaking change, then it will be documented, and you’ll be able to understand what needs to happen as you upgrade and any additional changes that you need to make, or if it’s just a seamless upgrade, then it’ll say, “Let’s just go ahead and upgrade.” But the main thing is don’t skip versions because there may be special notes for each of the releases.
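
The sequential-upgrade rule can be written down as a tiny helper. This is an illustrative sketch only: the list of minor versions is limited to the ones named in the talk, and the function name is made up.

```python
from typing import List, Sequence

# Stable minor versions mentioned in the talk; extend for your own range.
STABLE_MINORS = ["2.9", "2.10", "2.11"]


def upgrade_path(current: str, target: str, releases: Sequence[str] = STABLE_MINORS) -> List[str]:
    """Return every minor version to step through, in order, from current to target."""
    start, end = releases.index(current), releases.index(target)
    if end < start:
        raise ValueError("target is older than current; that is a rollback")
    return list(releases[start + 1 : end + 1])


print(upgrade_path("2.9", "2.11"))
```

Each entry in the returned list is one upgrade to perform, with its upgrade notes read before moving on to the next.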

Charles Pretzer: One quick note about upgrading with Helm is that there are some flags that you should become familiar with. I only learned about these recently myself. We document them well in the upgrade docs. --reuse-values and --reset-values are flags that are a part of the Helm binary, and I’ve created a little table here. The matrix, whether you want to reuse values or reset values, can be driven by the decision of whether you’ve got any configuration overrides, or if you’re using vanilla Linkerd as configured out of the box. Have a look at these. Have a look at the docs. For the most part, the default behavior is fine. I haven’t seen a ton of people… Actually, I’ve never seen anybody ask questions about these particular flags.

Charles Pretzer: So, most of the time, you’ll be able to Helm upgrade without having to specify any of these flags, which means that there is a default behavior that occurs. Reuse values is the default behavior if there are no overrides, and reset values is the default if there are overrides. The difference between the two is that reuse values will take what’s currently in the configuration and apply it, insofar as it can, to the new version, and reset values takes the configuration from the release you’re upgrading to, so with 2.11.1, if you use reset values, it will overwrite anything that’s in the values of the 2.10.2 release. So, that’s upgrading. Rolling back is fundamentally the same as upgrading with Helm.
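
The flag matrix can be summarized as a small decision function. This sketch reflects one reading of the Helm and Linkerd upgrade documentation (with no flag, values are reused when no overrides are passed and reset when overrides are passed, and --reset-values takes precedence if both flags are given); treat it as an aid for reasoning rather than a statement of how Helm is implemented. The function name is invented.

```python
def effective_values_mode(reuse_flag: bool, reset_flag: bool, has_overrides: bool) -> str:
    """Return which behavior `helm upgrade` is expected to apply for the given flags."""
    if reset_flag:
        return "reset-values"  # --reset-values wins when both flags are passed
    if reuse_flag:
        return "reuse-values"
    # No flag given: the default depends on whether overrides were supplied.
    return "reset-values" if has_overrides else "reuse-values"


for overrides in (False, True):
    mode = effective_values_mode(False, False, overrides)
    print(f"no flags, overrides={overrides}: {mode}")
```

Here "overrides" means values passed on the command line with --set, --set-file, or -f.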

Charles Pretzer: There’s one little gotcha with Helm that I always have a hard time wrapping my brain around, is that you can Helm upgrade to a previous version. It’s just Helm semantics, and you can Helm upgrade from 2.11 to 2.10.2, and everything will work as expected. I will point out again breaking changes are documented in the Helm upgrade guide, and I know I mentioned earlier that you should use version control for your Helm configuration files, but this is going to help you a ton if you need to do any rollbacks. So, that’s the basics for rolling back. There’s not a ton to talk about here.

Charles Pretzer: With that, I think we’ve covered everything from monitoring and alerting, as well as identifying and using the tools for debugging Linkerd, the data plane and the control plane, and upgrades and rollbacks. We’ll be handling questions throughout the day in the workshops channel, and going forward, I hope that folks have been able to get their questions answered during the chat. Let’s take a minute, the next few minutes. Jason, do you have any questions for me that I didn’t cover that you think would be helpful for the folks here, or are there some interesting questions that you saw come in that we should answer?

Jason: Yeah. Good one from Nathan. In general, is it safe to Helm rollback the control plane if you want to roll back in a hurry?

Charles Pretzer: What is in a hurry? Yeah. So, to me that means you’ve just upgraded from 2.10 to 2.11, let’s say, and you’ve decided that you want to roll back to 2.10. Yes. For me, I think that’s totally fine. The thing to think about there is whether you have rolled any of the data plane components… Sorry. Yeah. Yeah, right. Any of the data plane components. So, if any of the proxies have been upgraded to 2.11 in that example, you want to make sure that as soon as you’ve rolled back to 2.10, you also restart those workloads as well to get those proxies back to 2.10.2. In most cases, the proxies will work in that… I’m trying to think. I shouldn’t make that blanket statement. Make sure that the proxies are at the version of your control plane or lower for compatibility’s sake.
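
The closing rule of thumb, keep proxies at the control plane version or lower, can be checked mechanically. The sketch below is hypothetical: the workload names and versions are invented, and in practice the proxy versions would have to be collected from the cluster by some other means.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "2.10.2" into (2, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def proxies_needing_restart(control_plane: str, proxies: Dict[str, str]) -> List[str]:
    """List workloads whose proxy is newer than the control plane."""
    limit = parse_version(control_plane)
    return sorted(name for name, v in proxies.items() if parse_version(v) > limit)


# Hypothetical workloads after rolling the control plane back to 2.10.2.
observed = {"web": "2.11.1", "api": "2.10.2", "emoji": "2.11.1"}
print(proxies_needing_restart("2.10.2", observed))
```

Comparing integer tuples avoids the string-comparison trap where "2.9" would sort after "2.10".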

Jason: We have another good one from [Shumet 00:52:42], I believe, and I’m sorry if I mispronounced your name. If I have set up Linkerd using the CLI, what is the best way to move it to a Helm-based deployment?

Charles Pretzer: Good question. So, if you set up Linkerd using the CLI and you let Linkerd generate the certificates for you, then you’re in a situation where you’ll want to rotate those certificates out, including the trust anchor, which is why we need to have the workshop for certificate management. Off the top of my head, you will… Well, if you’re in a position to, I would redeploy Linkerd from scratch with Helm. That’s probably the most seamless route, but not everybody’s in that position, and I totally get that. So, the thing that I would do is create those new certificates or get the certificates from your organization’s security team, and then go through the steps for manually rotating the certificates, the trust anchors, specifically, and then when you Helm deploy, you would specify those certificates as well.

Jason: Okay. So, folks who are listening, there are documents on the Linkerd docs that discuss exactly how you would go through rotating the certificates within Linkerd, and so that’s what Charles is referring to there.

Charles Pretzer: Yeah. Yep. I’m trying to think. No. So, actually, I ran through this exercise yesterday where I did a Helm deployment with one certificate, and then did Helm upgrade with a different certificate, and as you can see, the problem there is that you now have mismatched certificates in the proxies and in the control plane, which means the trust is broken and the system wouldn’t work. So, you want to make sure that you have those certificates in place before you do any upgrading using Helm from the Linkerd CLI.

Jason: Yeah, and Demetrius actually was really helpful and posted the link for everyone. Thank you so much for that, Demetrius.

Charles Pretzer: Yeah. That’s awesome. Thank you. What else, Jason? We got 90 seconds.

Jason: We have one that we’re not going to touch because it’s a roadmap one regarding Linkerd [inaudible 00:55:20], so Morton, we can talk about that a little bit offline. For folks that are joining, we’re about at time, but we would love to have you join the Linkerd Slack, and I’m going to post the link to join our Slack one more time. Feel free to reach out to myself or Charles anytime, and also come to that workshops channel to ask us about anything you saw in today’s workshop.

Charles Pretzer: That’s also the place to get updates about upcoming workshops as well.

Jason: Yeah. So, with that, why don’t we plug the next workshop in the series?

Charles Pretzer: You’ve got it. Next workshop.

Jason: Let me post the link to Slack. I’m so sorry about that. Sorry. Go ahead, Charles.

Charles Pretzer: No, no problem. Next workshop is locking down your Kubernetes Cluster with Linkerd. So, we’ll talk more about security here. That’s going to happen on January 13th, which is a Thursday, and all the times are there, so register today, and we look forward to seeing you there. We love doing these workshops, and I can’t remember if Jason mentioned this earlier, but as we’ve started the Service Mesh Academy series, we’re basing it off of the feedback that you all give us.

Charles Pretzer: So, when you ask questions about Linkerd or specifically fill out the form to say, “This is what I want to learn about,” we take that and we say, “People are asking a lot about running Linkerd in production, and how do I debug it? Let’s do that workshop.” So, please continue to give us your ideas. Tell us what you want to hear about, and we are also interested in some of the edge cases out there that might be fun to track down that other people may be running into as well. Yeah. I’ll leave it at that. Thank you so much for joining. Thank you for being members of the Service Mesh Academy. I’m pretty excited for these workshops, and we look forward to doing more and seeing you all there.

Jason: Yeah. Thank you so much, everyone.

Yeah. No, that’s fine. We can go to the next slide. What’s covered comes from an article that we published, which is our Linkerd Production Runbook, and we’ll send that in the chat as well. Thank you so much, Steven. What’s not covered? Again, the basics or really advanced topics in Linkerd, and those will have their own workshops later. All right. With that, I’m going to hand you over to Charles to learn how to get to production with Linkerd. Take it away, Charles.

Charles: Hello, everybody. Thank you for joining us, making this a truly global workshop. It’s so great to have you all here. This is going to be a very informative talk, in my opinion. As Jason mentioned, everything that we’re talking about comes from our Linkerd Runbook, which is based on the experiences and the ongoing experiences that we have with folks in the Linkerd community, as well as Buoyant customers. We saw the agenda, so now we can just jump right into what it takes to go to production, and what we’ve put together here is an essential productionization checklist. There are a few things, if you run Linkerd or kick the tires on Linkerd, you know that you can play with the CLI. It comes with its own Prometheus, a lot of very simple and easy things to get you started, which are part of the core design principles for Linkerd.

Getting ready and the repository

This going to production workshop is going to talk about the next step, where we dive into what it’s like to set up a cluster to run in production. Once you have a production cluster, we talk about deploying Linkerd to that cluster, and this checklist is something that you can use to get you started. We’ve got a companion repository that goes along with this, and I believe the link was in the email that you all were sent. If you’ve got the link to the repository, you’ll see a file in there called create.sh. It’s going to do some dependency checks, and then, as long as you’ve got all the necessary binaries, it will deploy the cluster for you.

I encourage you, if you’re going to follow along in the workshop, to do that sooner rather than later. We’ve got probably 10 or 15 minutes before we get to the hands-on section, so you want to make sure that that cluster’s up and running if you want to follow along. With that being said, the reason I bring up that script is that it sets up in your local environment what would be a production type of environment. The script creates the cluster, deploys Linkerd using Helm, not the linkerd install command, and deploys it in high availability mode. If you look at the script, you’ll see that I’ve added a couple of custom configurations to enable the debug log levels for the control plane components, and that’s something that we’ll take a look at in the hands-on part.

In addition to those two items, going down this checklist, you’ll want to use your own image registry, but we’ll talk about that in another slide coming up. We’ll set up monitoring and alerting. We’ll discuss that as well, and then the rest of these topics, understanding how to debug Linkerd, that’s going to be our hands-on part, and understanding how to install and upgrade Linkerd, again, we’ll cover that. So, the thing that we’re not covering in this workshop is planning for and documenting your certificate management. Certificate management is one of those advanced topics that Jason just mentioned that we’ll be covering separately. So, this is your basic checklist. I promise you, with this, you can get to production and have a happy, healthy cluster running with all of the observability, security, and reliability features that you’re used to from Linkerd.

High availability

When we talk about high availability mode, what we’re doing is using the values-ha.yaml file, which is in the repository, in the Linkerd chart, and which deploys three replicas of the critical control plane components. Today for Linkerd 2.11, that’s the destination controller, the proxy injector, and the identity component. Within the destination component, there are multiple containers, but that’s not relevant for high availability mode. What’s important to know is that, in a production type environment, you should have at least three nodes that are going to be handling traffic. Typically, you don’t want to deploy any Linkerd control plane components to your Kubernetes control plane node, so even though the script that I have creates three nodes, a production environment would typically have at least four nodes. Then you’d deploy Linkerd in HA mode, which means that you get one replica of each of the Linkerd control plane components on each node, so that in the event of a failure of one or two of those nodes, you still have two or one of the control plane component replicas running.

High availability is a fairly common topic, so the point here is to eliminate single points of failure. In high availability mode, we need to make sure that the proxy injector is up before any of your application pods can be injected with the Linkerd proxy. The way that this works is that there’s a webhook, and that detects whether a workload should be injected with the Linkerd proxy, and that proxy injector has to be up and running and healthy in order to do the work that it needs to do to inject the proxy YAML into your pod definition. High availability mode also includes node anti-affinity, which means that no single node will have more than one replica of each of the control plane components, and that is the reason we need a minimum of three nodes for these three replicas. So, this anti-affinity ensures that, in the event of a node failure, the control plane will still be up and running.
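
To make the anti-affinity guarantee concrete, here is a small sketch that checks a list of (component, node) placements for violations, that is, any node running more than one replica of the same component. The node names and the placement list are invented for illustration; real placements would come from the cluster.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def anti_affinity_violations(placements: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (component, node) pairs where a node runs more than one replica."""
    counts = Counter(placements)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical placements: identity has two replicas on node-2.
placements = [
    ("destination", "node-1"), ("destination", "node-2"), ("destination", "node-3"),
    ("identity", "node-1"), ("identity", "node-2"), ("identity", "node-2"),
    ("proxy-injector", "node-1"), ("proxy-injector", "node-2"), ("proxy-injector", "node-3"),
]
print(anti_affinity_violations(placements))
```

With three replicas per component and no violations allowed, fewer than three nodes cannot satisfy the rule, which is why one replica stays unscheduled in that case.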

Setting resource requests and limits

Finally, it sets resource requests and limits on control plane components. What we’ve got set in our values-ha.yaml file is the basics for running in production. We’ve certainly seen through working with other members of the community and other companies that sometimes these resource requests and limits need to be adjusted based on your workload. I’ll talk about that a little bit more when we get into the monitoring and alerting section of this workshop. Finally, one of the requirements that you’ll see in the create.sh file is that we add a label to the kube-system namespace as a whole. You don’t want Linkerd to inject the pods in your kube-system namespace. We treat those as being very sacred, and Linkerd shouldn’t be handling the traffic for kube-system.

So, we add a special label there, config.linkerd.io/admission-webhooks=disabled, to the kube-system namespace, again, to make sure that those pods are not injected. Technically, you can use this label on any namespace that you want. We only require it for kube-system. I mentioned earlier about using your own image registry. The main reason for this is because you don’t want to rely on a public registry for downloading your images. In a production environment, you want to make sure things are up and running. When those images are pulled, you want to make sure that they are there, that they’re reliable, and the best way to do that is to run your own registry that you have control over so that, if something goes wrong, if it’s down for some reason, you can lean over to the person that you’re working with, or send an email or Slack message to one of your coworkers, and say, “Can you check on this registry? It seems to not be working. We need those images.”

The process here is pretty straightforward. What we don’t cover here is automating it, but it can be done very easily with scripting where you would do a docker pull from the Linkerd repository anytime a release is out, usually a stable release, but if you’re using edges, that’s great. Then you would docker pull from the public repository, docker push to your image repository. Very straightforward. If you have any questions about this, please drop them in Slack.
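
The pull-and-push loop described above could be scripted along these lines. This is a sketch under several assumptions: the source registry path, the image names, the tag format, and the target registry are all illustrative and should be replaced with the values from the release you actually run. It adds a docker tag step between pull and push, and it only builds the commands rather than executing them.

```python
from typing import List


def mirror_commands(images: List[str], version: str, source: str, target: str) -> List[List[str]]:
    """Build docker pull/tag/push argv lists that copy images to a private registry."""
    commands: List[List[str]] = []
    for image in images:
        src = f"{source}/{image}:{version}"
        dst = f"{target}/{image}:{version}"
        commands += [["docker", "pull", src], ["docker", "tag", src, dst], ["docker", "push", dst]]
    return commands


# All four arguments below are assumptions for illustration.
plan = mirror_commands(
    images=["controller", "proxy"],
    version="stable-2.11.1",
    source="cr.l5d.io/linkerd",
    target="registry.example.com/linkerd",
)
for argv in plan:
    print(" ".join(argv))
```

Each argv list can be passed to subprocess.run(argv, check=True) from a release-triggered job once the image list has been confirmed.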

Monitoring Linkerd

Okay. Let’s jump into monitoring Linkerd. This workshop is all about checklists, so we’ve got another checklist for monitoring. When we’re talking about monitoring, this is separate from the observability features that we often associate with Linkerd, where you’re monitoring your application itself. What we’re talking about is monitoring the Linkerd control plane. So, the first thing we want to do is figure out where you’re going to put your metrics. Again, if you’ve used Linkerd in a development environment, you know that Linkerd-viz, or the control plane in previous versions of Linkerd, included an instance of Prometheus, and we still do that today in Linkerd-viz.

This is meant for development purposes only. When you are running in production, you’re going to want to have those metrics available to you for much longer periods than what Linkerd Prometheus offers, and you’ll also want them to be stored permanently. So, one of the things that we want to make sure that you’re doing is figure out where you’re putting your metrics. You can put them into an off-cluster resource like an external Prometheus, either cloud-hosted or one that you run yourself, or there are plenty of third-party choices like Buoyant Cloud, Datadog, New Relic. I’ve seen folks put them into, I believe, Splunk, even. You can store stuff there. So, plenty of options for you. We want you to use what is going to be best for you.

So, that being said, Buoyant Cloud will do a lot of this for you, including the next item on the checklist, which is alerting on control plane and data plane components. So, one of the main things that we want to understand in any application is the health of the services themselves. So, for the Linkerd components, like any other container or service that runs code, we want to make sure that those are healthy. For the most part, the Linkerd control plane components should always be 100% healthy. If they aren’t, we want to know why, and that’s where the alerting comes in.

So, there are options that you can use to set those up as well. Again, there are so many that we’re not going to cover any one in particular here. One of the other things we want to make sure that you understand, and that I touched on just a minute ago, is that the latency and resource usage need to be determined empirically, and that goes back to setting those resource limits in the values-ha.yaml file. So, I’ll show you an example in just a second of what one of the Linkerd control plane components looks like with a well known number of workloads and requests per second. These are things that you’re going to have to suss out for yourself in your own environment based on the number of replicas, the number of workloads, and the amount of traffic that you have, in order to understand what those limits need to be set at, and then you can set the alerts appropriately.

TLS certificate expirations

Finally, TLS certificate expirations are something you want an alert on as well. If your certificates expire while Linkerd is running on the cluster, that is really bad. So, it’s important to give yourself enough time. With linkerd check, we look for certificates that expire within 60 days. This is just a guideline, but we definitely suggest that you pick a duration that matches your team, that gives them time to go in and rotate those certificates, and then alert on that appropriately.
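
An expiry alert boils down to comparing a certificate’s not-after date with an alert window. The sketch below uses the 60-day guideline from the talk as a default; how the not-after date is obtained (from the certificate itself or from a monitoring system) is left out, and the dates are invented.

```python
from datetime import datetime, timedelta, timezone


def expires_soon(not_after: datetime, now: datetime, window_days: int = 60) -> bool:
    """True when the certificate expires within the alert window (or already has)."""
    return not_after - now <= timedelta(days=window_days)


# Invented dates for illustration.
today = datetime(2021, 12, 7, tzinfo=timezone.utc)
issuer_expiry = datetime(2022, 1, 15, tzinfo=timezone.utc)
anchor_expiry = datetime(2022, 6, 1, tzinfo=timezone.utc)

print("issuer needs rotation soon:", expires_soon(issuer_expiry, today))
print("trust anchor needs rotation soon:", expires_soon(anchor_expiry, today))
```

Passing the current time in as an argument keeps the check easy to test and lets each team pick its own window.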

So, here we see, when we’re talking about alerting, we see alerts inside of Buoyant Cloud, and there are alerts that are set here based on success rates and P95 latencies, and we’ve categorized them as sev-2, sev-3. Just at the top here we see the success rate for the Linkerd destination component, and we’re going to pick on the destination component a lot today. If that ever goes below 99%, send me a sev-2 alert. We also have one on latency. So, if the latency is ever greater than 50 milliseconds, send me a sev-2 alert. This is an example of the kind of alerting that you want, and again, these are things that you will have to decide empirically, based on your application’s performance and resource usage.
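
The two example alert rules can be expressed as a simple threshold check. This sketch is not how Buoyant Cloud evaluates alerts; it just encodes the 99% success rate and 50 ms P95 latency thresholds quoted in the talk, reading the transcript’s alert category as sev-2 (severity 2), which is an assumption.

```python
from typing import List

# Thresholds quoted in the talk for the destination component.
MIN_SUCCESS_RATE = 0.99
MAX_P95_LATENCY_MS = 50.0


def destination_alerts(success_rate: float, p95_latency_ms: float) -> List[str]:
    """Return the sev-2 alerts that the observed values would trigger."""
    alerts = []
    if success_rate < MIN_SUCCESS_RATE:
        alerts.append(f"sev-2: success rate {success_rate:.2%} is below {MIN_SUCCESS_RATE:.0%}")
    if p95_latency_ms > MAX_P95_LATENCY_MS:
        alerts.append(f"sev-2: p95 latency {p95_latency_ms:.0f} ms is above {MAX_P95_LATENCY_MS:.0f} ms")
    return alerts


for message in destination_alerts(success_rate=0.98, p95_latency_ms=75):
    print(message)
```

The constants are the part to tune empirically for each environment; the comparison logic stays the same.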

Here, again, we see some graphs that are part of Buoyant Cloud, and I told you just a minute ago that we would give you a concrete example of the amount of resources that are used based on the number of workloads and the throughput within an application. So, this is a real-life example of a medium cluster. We can see that the destination component is using roughly 820 megabytes of memory. The proxy injector is using roughly 420 megabytes, and the identity component is using roughly 165 megabytes. The CPU for all of these is below .02 usage per core, and this is based on about 500 pods and 2,000 requests per second. So, depending on your throughput, your requests per second, and the number of pods that you have running, this will be either bigger or smaller, and you can adjust those resource limits appropriately so that Kubernetes doesn’t go in and evict those pods for using too many resources.

So, I wish that I could say, “Here’s a formula that will get it done for you,” but unfortunately, applications are so different. Usage by workloads, by services, is so different across applications that there is no silver bullet for these things. So, you’ll have to do some iterating and figure out which resource limits and requests are appropriate for your environment. It’s an investment that’s worth doing, and there are no secrets to it. It’s very straightforward once you understand the profile of your application. So, if there are any questions about that, again, please drop them in Slack or in the chat, and we’ll keep moving on here into the debugging section.

Debugging

So, if you haven’t already, now would be the time to run that create script to get your cluster up and running, because, after the next few slides, that’s when I’m going to jump into the hands-on part of the workshop, because debugging Linkerd is… When I’m not doing workshops or presentations or working with some of the folks on the team here, I’m usually chatting with somebody in Slack or working with one of our customers, helping them through Linkerd issues or debugging Linkerd to understand what’s happening within their environment.

So, we have a number of tools, and the purpose of this section is to present to you the tools that you can use to debug the Linkerd control plane. This is different than some of the other talks that we’ve done where we use Linkerd to debug your application itself. In this case, we’re treating the Linkerd control plane as the application, and we’re going to use some tools to understand what’s happening in the event that the control plane is acting up, or even the data plane for that matter. So, to debug the Linkerd control plane, start with the tools that you already use and are familiar with today. One of those tools is kubectl, and two of the commands that it has are kubectl get events and kubectl logs.

Charles Pretzer: Anytime that you see a control plane component that is in a state other than running, when I’m debugging with customers or with the Linkerd community, the first thing that I’ll look at is the output from kubectl get events for the linkerd namespace specifically, and this is what we’ll look at in the hands-on section. In some cases, this will tell you if there is a problem communicating with the API server. That can happen for many different reasons, but the events are a good place to start, and oftentimes will have the message to point you to the solution that you need in order to get the control plane into a healthy state and get those Pods up and running.

Charles Pretzer: Oftentimes, in conjunction with kubectl get events, we can identify a specific workload or a specific Pod that isn’t starting up for some reason, and we can also see that in the Pod status, and we can use kubectl logs to view the container logs. Now, I mentioned in the create script, my values-ha.yaml file has a couple of configurations, custom for this particular deployment, and those configurations set the log levels of the control plane components to trace, which can be pretty verbose. So, this is not normally something that I would do in production, but I’ve done it for the purposes of this workshop, so that we can, A, see what it’s like to specify a custom configuration file, and B, look at those logs without having to go in and restart any of the workloads.

Charles Pretzer: In a production environment, you would leave those log levels where they are, and if an event happened, we would then go in and set those log levels to something that will output more information for us so that we can then use kubectl logs to view those container logs. Going beyond the basics, we’ve got the Linkerd CLI. So, we mentioned that we’re not going to use the Linkerd CLI to install Linkerd. What we are going to use the Linkerd CLI for is the debugging process. So, for those tools, and we’ll walk through each of those in the hands-on, we’ve got linkerd check, linkerd diagnostics, linkerd viz tap, linkerd identity, and linkerd authz. I’ll go through each of these in just a few minutes.

Charles Pretzer: I mentioned log levels earlier. There are two places… Because you’ve got the control plane and the data plane, we have the ability to control log levels in each one of those. The log levels that I just mentioned in my values-ha.yaml configuration file… Actually, sorry. It’s a values.yaml file for the control plane configuration. The proxy log levels, you have multiple ways of configuring those. One is globally in the Helm chart, and this generally… In a production environment, I wouldn’t do this, but it’s possible to do. The reason for this is because when you’re debugging data plane components, you don’t necessarily need to have all of the proxies set to a verbose log level like trace or debug. You really are looking at one workload and the replicas within that.

Charles Pretzer: So, for that we can set the level per workload or per namespace through the annotation config.linkerd.io/proxy-log-level. In this case I’ve got it set to warn for all of the Rust components, the crates, which is a less verbose level, and then anything in the linkerd2_proxy namespace will have a trace log level, which is very verbose. Again, this is not something that I would run in production for a long time, but it is a tool that we’ve used in the past and that we frequently use to understand the traffic that the Linkerd proxy is handling, both on the inbound side when it’s receiving requests, and on the outbound side.
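A per-workload setting like the one described above could be applied with kubectl patch, roughly as follows. This is a hedged sketch rather than the command used in the workshop: the function name is hypothetical, and the namespace and deployment in the usage comment are placeholders.

```shell
# Set the proxy log level for one workload by annotating its pod template.
# Changing the pod template triggers a rollout, so the pods are restarted.
#   $1 = namespace, $2 = deployment name, $3 = log level string
set_workload_proxy_log_level() {
  kubectl -n "$1" patch deployment "$2" --type merge -p \
    "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"config.linkerd.io/proxy-log-level\":\"$3\"}}}}}"
}

# Usage (namespace and deployment names are placeholders):
#   set_workload_proxy_log_level my-namespace my-deployment "warn,linkerd2_proxy=trace"
```

Removing the annotation, or setting it back to a quieter level, ends the verbose logging after the next rollout.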

Charles Pretzer: So, one of the really neat things, one of my favorite things about the Linkerd proxy, is that you can actually configure this at runtime. We’re not going to do this today, but the Linkerd proxy has a configuration endpoint on its admin port, and when you port forward to that port, you can send a PUT request that uses the same syntax, this Rust syntax for setting log levels. In this case we’re setting linkerd=debug, which means any linkerd crate within the proxy gets set to debug. We send that PUT request, and that will configure the proxy log level in real time. This is super helpful when we see one particular proxy acting up within a set of replicas, which is indicative of an event that is not systemic, meaning that the service itself is fine. There’s just one replica that’s acting up. So, this gives us some really fine-grained control here.
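The runtime change can be sketched with kubectl port-forward and curl. The admin port 4191 and the /proxy-log-level path follow the Linkerd documentation on modifying the proxy log level; the function name is hypothetical and the namespace and pod name are placeholders.

```shell
# In one terminal, forward the admin port of the misbehaving pod
# (namespace and pod name are placeholders):
#   kubectl -n <namespace> port-forward pod/<pod-name> 4191
#
# In another terminal, send the new level to the admin endpoint.
#   $1 = log level string, $2 = local port (defaults to 4191)
set_proxy_log_level_runtime() {
  curl -sS -X PUT --data "$1" "http://localhost:${2:-4191}/proxy-log-level"
}

# Usage:
#   set_proxy_log_level_runtime "linkerd=debug"
```

Because only the forwarded pod is touched, the other replicas keep their existing log level.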

Charles Pretzer: So, going from top to bottom, we can set everything, we can set the level for a specific workload or for all the Pods in a namespace, or we can set the proxy log level on a specific container. So, this is very valuable and very helpful, and I can’t count on my hands the number of times that I’ve worked through this with folks, and even done this myself. The control plane components, I mentioned these earlier. The only real takeaway that you want from this slide is that the policy controller, which is new in 2.11, is written in Rust, and so it uses that same Rust syntax, linkerd=info,warn. The other components, which are written in Go, have a probably more familiar syntax for setting the log level, where you specify just the name: error, warn, info, debug, or trace. From left to right, that is least verbose to most verbose.

Charles Pretzer: Just like the proxy log level, you can set them globally in the Helm chart, which is what I’ve done in that values.yaml file in the repository, or you can do it per workload by modifying the individual Helm templates. So, that is the basics of debugging the Linkerd control plane. One other thing that I want to mention to you is that we have this Linkerd debug sidecar, which is a really powerful tool, but with great power comes great responsibility, I think is how they say it, and so this has to be run as a root user, but it’s a container that has tshark, tcpdump, lsof, and iproute2 binaries. So, I’ve used this with folks to capture TCP dump packets and inspect those packets individually when there’s something that’s really, really funky going on.

Charles Pretzer: I don’t want to call it a last-ditch effort, but this is when you’re using the Linkerd debug sidecar, it’s really powerful, and it’s something that we’re using when we’re really down in the weeds, trying to understand that packet level communication. Oftentimes, actually, all the time, when it’s between the Linkerd proxy and the service container, Linkerd proxy container and service container within the same Pod, so we’re looking a lot of times at just that local host traffic that’s happening within a Pod itself, between two containers. There’s a link here for you to explore more. Again, really powerful tool, and we mention it here because we’re talking about overall toolkits and what’s available to us. So, with that, let’s jump into the hands-on section. I’m going to jump over to my terminal here. [crosstalk 00:27:09].
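For reference, enabling the debug sidecar and capturing traffic could look roughly like this. It is a sketch based on the Linkerd debug container documentation, not on the workshop scripts; the function names are hypothetical, and the tcpdump flags are just one reasonable choice.

```shell
# Re-inject one deployment with the debug sidecar enabled.
#   $1 = namespace, $2 = deployment name
enable_debug_sidecar() {
  kubectl -n "$1" get deployment "$2" -o yaml \
    | linkerd inject --enable-debug-sidecar - \
    | kubectl apply -f -
}

# Capture traffic from inside the sidecar container (named linkerd-debug).
#   $1 = namespace, $2 = pod name
capture_pod_traffic() {
  kubectl -n "$1" exec "$2" -c linkerd-debug -- tcpdump -i any -nn
}
```

Since the sidecar runs as root, it makes sense to remove it again once the capture is done.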

Jason: Hey, Charles?

Charles Pretzer: Yeah?

Jason: Sorry to interrupt. I just wanted to ask folks who are listening… One, we’ve had a super active chat, so thank you very much. Does anyone have any questions for Charles before we go further? Just give me a second.

Charles Pretzer: I’m going to take a sip of water while… Yeah.

Jason: Yeah. Good water break time. All right. Sorry to interrupt you, Charles. I think we can carry on. Thanks, folks, and again, continue to ask stuff in the chat. Love to help you and make sure that you’re able to successfully complete the demo.

Charles Pretzer: Yeah. Great. So, I showed you the create script, so there’s nothing magical going on here. This is just a very quick and dirty script to get you started. This is a bit difficult to read, but what we’re doing here… There are some important parts here. Give me one second to pull over my editor because it’s going to be easier to read in there. There we go. Okay. So, this is the debugging steps. We want to look at the create script. There we go, much easier to read. I’ll make it a little bit bigger.

Jason: Charles, a couple people who are doing the create script have said that they need to independently run the command, Helm repo add Linkerd. So, Steven posted about that in the chat, and there’s also some commentary in the workshop, if you’re following along.

Charles Pretzer: I haven’t looked, but I will… Yeah, that’s a good point. We do check to see if Helm is installed, but because I already have the Linkerd repo added to my machine, I failed to add that check. So, PRs are welcome. If not, I’ll get it fixed right after this. I see some other chats in there. Thanks, Steven. Appreciate the PR. I just want to walk you through this quickly. The checks for the binaries are straightforward. What we create is three servers, a three-node Cluster using k3d, and then we create our certificate. So, we talked before about certificate management. The important thing to note here is that when you deploy Linkerd using Helm, we have to specify our own certificates, which is different from the Linkerd CLI.

Charles Pretzer: Again, what’s important here is to understand that there’s a trust anchor, which is this [CACert 00:29:55] and [CAKey 00:29:55], and from that CACert and CAKey, or certificate and key, are the issuer certificate and key, which are created as [inaudible 00:30:08] certificates. We’ll cover all those in the upcoming workshop about certificate management. The important thing to know here is that when we do Helm install, we have to make sure that we provide those to the Helm, sorry, as values to the Helm file, or Helm installation command.
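The certificate and install steps described here can be sketched as follows. The commands follow the Linkerd 2.11 documentation for generating certificates with the step CLI and installing with Helm; the workshop’s actual create script may differ. The function names and the release name linkerd2 are assumptions for illustration.

```shell
# Create a trust anchor, then an issuer certificate and key signed by it.
create_linkerd_certs() {
  step certificate create root.linkerd.cluster.local ca.crt ca.key \
    --profile root-ca --no-password --insecure
  step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
    --profile intermediate-ca --not-after 8760h --no-password --insecure \
    --ca ca.crt --ca-key ca.key
}

# Install the control plane and pass the certificates in as Helm values.
#   $1 = issuer expiry timestamp, for example 2022-12-07T00:00:00Z
install_linkerd_with_helm() {
  helm repo add linkerd https://helm.linkerd.io/stable
  helm install linkerd2 linkerd/linkerd2 \
    --set-file identityTrustAnchorsPEM=ca.crt \
    --set-file identity.issuer.tls.crtPEM=issuer.crt \
    --set-file identity.issuer.tls.keyPEM=issuer.key \
    --set identity.issuer.crtExpiry="$1"
}
```

A custom values file, such as the values-ha.yaml mentioned earlier, would be added to the install command with -f.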

Charles Pretzer: The other thing to take note of here is that I also added the namespace label for kube-system just to show you that it’s there and that it has to be done. However you end up configuring this in your CI/CD system, you’ll just want to make sure that you add that label at some point. Yeah. So, that’s that script, very simple, and again, a similar thing for deploying Linkerd-viz. I put these out into two different scripts just to make them a little more easy to comprehend and wrap your mind around. They could be in the same script if we wanted them to. So, going to take a quick look into the chat to make sure that everybody is at least able to get started.

Jason: There was another reminder from Nathan. A couple folks have run into an issue on macOS where the script uses the date command and the macOS has [gdate 00:31:36].

Charles Pretzer: Okay.

Jason: So, we’ll post something about that in the workshop Slack as well, but you may need to do an edit if you’re on macOS.
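One way to handle the date difference is to detect which flavor of date is installed. This is a minimal sketch, not the fix that went into the workshop repository; the function name is hypothetical, and the -u flag is added here so that the trailing Z really means UTC.

```shell
# Print a timestamp one year (8760 hours) from now, for the issuer expiry.
issuer_expiry() {
  if date --version >/dev/null 2>&1; then
    # GNU date (Linux) understands --version and -d
    date -u -d '+8760 hour' +"%Y-%m-%dT%H:%M:%SZ"
  else
    # BSD date (macOS) uses -v for relative adjustments
    date -u -v+8760H +"%Y-%m-%dT%H:%M:%SZ"
  fi
}

# Usage:
#   exp=$(issuer_expiry)
```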

Charles Pretzer: Good point, and our docs actually have the command for generating the date for both macOS and Linux. This is a Linux script, but we’ll get that updated as well to do some OS detection. Thanks for pointing that out. Okay. I also have created this… Let’s look at debugging.md. This is kind of a quick and dirty cheat sheet for all of the commands that I mentioned in the slide. So, if we take a quick look at the events that I mentioned, we’ll go in order from the kubectl commands through the Linkerd commands. Those are the steps that we’ll go through.

Charles Pretzer: So, first thing I want to do is look at the events for the Linkerd namespace. What I find really helpful is to sort by last timestamp. Otherwise, they get all jumbled up. But in this case… Oh, no resource found in Linkerd namespace? Oh, you know why? Because you’ll see that I was working last night, and there have been… All the events have expired since then. So, I’ll just restart these deployments, and while that’s happening, we can take a look at some of the other parts of the script. So, actually, I should have events now already, and again, the only reason I did that is because the previous events in there had expired. We want to get some new events so that we can actually see what’s going on here, and this is what a healthy set of events looks like when the Linkerd control plane is starting, aside from the killing because I’ve restarted the Pods.

Charles Pretzer: But you’ll see created, started, pulled. One of the things that we want to look for when debugging the control plane is anything that is not Normal. So, here we have a warning. This is failed scheduling just based off of the anti-affinity rules. That’s because all the Pods were already running on the nodes, and one needed to be terminated before another could be started. So, when you’re using kubectl get events, anything that’s not Normal is going to stand out, and that’s what you want to look at. Sorting by timestamp is going to make your life much easier than having to go through and look at when an event was last written.
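The event checks described above could be wrapped up like this. The kubectl flags are standard; the function names are hypothetical and the namespace is assumed to be linkerd, as in the workshop.

```shell
# Events for the linkerd namespace, sorted so the most recent come last.
linkerd_events() {
  kubectl get events -n linkerd --sort-by=.lastTimestamp
}

# Only the events whose type is not Normal (for example, Warning).
linkerd_abnormal_events() {
  kubectl get events -n linkerd --field-selector 'type!=Normal' --sort-by=.lastTimestamp
}
```

An empty result from the first function can simply mean the events have expired, as happened in the demo.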

Charles Pretzer: Okay. So, next thing we want to get into is the logs. So, I’m going to pick on the identity Pod today, grab one of our replicas out of the namespace here, Linkerd namespace. This is one of the identity replicas that’s running, and what we’re going to do is… Oops, wrong window. Sorry about that. What we’re going to do is we’ll get the logs for each of the containers within the identity component. There’s the Linkerd identity component itself, and that’s responsible for issuing certificates to the… Sorry. When it receives a certificate signing request from a proxy, then it signs that certificate and sends it back to the proxy, and that’s used for mutual TLS.

Charles Pretzer: There’s also the Linkerd proxy itself, so that tells us when it’s handling inbound and outbound traffic, and then another one that sometimes we have to debug, but not often, is the linkerd-init container, and that’s responsible for setting the iptables rules. A lot of times when we see issues with components not being able to communicate with each other, we have to take a look at the iptables rules and the output from the linkerd-init script to make sure that those iptables rules were set up correctly. So, let me do… We’ll look at the logs for the identity component, specifically the… Oops. You probably can’t see the Zoom toolbar. It’s in the way there, but it’s preventing me from typing. I’m kidding. My bad typing is because my fingers don’t work well in the morning.

Charles Pretzer: kubectl logs, linkerd-identity, -n linkerd, and if I do this without specifying a container… What did I do wrong here? It’s just kubectl logs, isn’t it? [inaudible 00:36:52], so another fix on my side. I will get that done. So, if we don’t specify a container, you’ve probably seen this before, we’re told that we need to specify a container. I often do this just to see which containers are available. So, first we’ll look at identity, and there’s nothing abnormal happening here, but what I want to show you is that we can confirm that our log levels have been set because we see that these debug messages are in there. By default, the Linkerd control plane components will log, I believe, just info messages.

Charles Pretzer: So, here we can see that debug is set. In fact… Oops. Let’s see. What I want to look at is… Well, we don’t need to do that. I was just going to show you, if I looked at the manifest file for the deployment and for the log level, you would be able to see that the log levels were set to the values in my values-ha.yaml file. So, that is with the identity component logs. In trace mode, it will tell you any time a certificate has been requested, which proxy has requested it, and the associated service, as well as the identity associated with that service, and it will tell you that it has issued that certificate, or in the case where it’s not issuing certificates, it will tell you why it hasn’t issued the certificate. So, this is a valuable tool for understanding how proxies receive their certificates, and if for some reason they aren’t, we can debug the identity component.

Charles Pretzer: Let’s take a look at some of the Linkerd proxy logs, linkerd-proxy, and this is, again, a very typical startup for a Linkerd proxy. If you’re debugging something that’s actively receiving traffic, the trace level output is… You can’t read it because the screen is moving too quickly. So, again, this is something that we do for a short period of time, and we capture those logs somewhere. Typically, what happens is that if we see unexpected behavior, we set the log level, reproduce the behavior, unset the log level, and then analyze those logs. So, that’s the typical debugging workflow that we work through with folks in the community and with our customers.

Charles Pretzer: Finally, just because I want to show you what the init container logs look like: the script itself outputs all of the iptables rules. So, it tells you what the current state of iptables is, and here we see that there were no iptables rules that were already set. This can be different if you’ve got multiple init containers that are manipulating iptables, and that’s one of the reasons that we write out those existing rules first, to be able to debug any potential conflicts or routing issues from other containers. Then it actually writes all of the commands, the actual iptables commands that set up the rules used by the Linkerd proxy. Finally, it shows you the output from iptables once everything is done, so that is the state once the script has run completely. This is a really helpful tool in understanding the low-level iptables routing and network behavior that happens within the Pod itself.
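The log commands walked through above could be collected into two small helpers. The container names (identity, linkerd-proxy, linkerd-init) are the ones named in the talk; the function names are hypothetical, and the label selector is the one Linkerd’s manifests use for control plane components.

```shell
# List the identity replicas in the linkerd namespace.
identity_pods() {
  kubectl get pods -n linkerd -l linkerd.io/control-plane-component=identity
}

# Logs for one container of a control plane pod.
#   $1 = pod name, $2 = container: identity, linkerd-proxy, or linkerd-init
control_plane_logs() {
  kubectl logs -n linkerd "$1" -c "$2"
}

# Usage (pod name is a placeholder):
#   control_plane_logs <identity-pod> identity
#   control_plane_logs <identity-pod> linkerd-proxy
#   control_plane_logs <identity-pod> linkerd-init
```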

Charles Pretzer: So, that’s the basics of using kubectl logs and kubectl get events. Let’s jump into the Linkerd CLI for debugging. I’ll run through these quickly. You’ve got linkerd check. This is the basic… If anything goes wrong with the Linkerd control plane, we want to run linkerd check. This is going to keep running. It checks the extensions as well if any extensions are deployed, and if there’s anything in here that’s unexpected, then it’s going to tell you. For example, my control plane proxies are not running the current version, and I think this is a bug that was fixed in a recent edge. Let’s see. We’ve got linkerd check, linkerd version. linkerd diagnostics is a really powerful tool, so let me get another… Let me see what I have. Yeah, perfect. Okay.

Charles Pretzer: So, we want to do Linkerd… Let me move this up so it’s more readable. DG is short for diagnostics. We want to get proxy metrics. Sorry. I’ve lost it behind Zoom again. Let me make this a little bit smaller. Okay. We want to get it off of that identity Pod. So, this is going to give you the raw Prometheus metrics, and where we use this a lot is looking at the TCP connections. So, it’s a bit difficult to read, but here, if we see something where connections are being refused by the Pod, a lot of times what we’ll look at is the TCP close total compared to the… Let’s see. There’s a TCP open total. Having a hard time seeing it here. Sorry about that. Oh, here we go, TCP open total. We compare that with TCP close total. If there’s a big disparity, we know that there are a bunch of open connections, and we also display the open connections. So, this is a tool that I’ve used to debug issues where we see connections being refused or closed by the service container.
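Since the raw metrics are hard to read, the open versus close comparison can be done with a small filter. This is a sketch that sums the tcp_open_total and tcp_close_total counters across all label sets; the function name is hypothetical and the pod name in the usage comment is a placeholder.

```shell
# Read Prometheus text from stdin and summarize TCP opens versus closes.
tcp_open_close_summary() {
  awk '
    /^tcp_open_total[{ ]/  { opened += $NF }
    /^tcp_close_total[{ ]/ { closed += $NF }
    END { printf "opened=%d closed=%d still_open=%d\n", opened, closed, opened - closed }
  '
}

# Usage (pod name is a placeholder):
#   linkerd diagnostics proxy-metrics -n linkerd po/<identity-pod> | tcp_open_close_summary
```

A large still_open value is the disparity described above and is a cue to look at the individual connection metrics.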

Charles Pretzer: Another one that I want to go through is the Linkerd identity command, and this gives you the ability to inspect directly the certificates that are associated with a Linkerd proxy. So, here we see the identity component has… There are three identity components, or replicas of the identity component, and the certificates that it’s using show our common name. Again, this is an inspection of the certificate. It will tell you when it expires. We’ll use this to check to make sure that the certificates are healthy in the event that two Pods can’t communicate with each other.

Charles Pretzer: Finally, new in Linkerd 2.11 is the linkerd authz command, and this tells you all of the server authorizations and server CRDs that are currently active in your environment. So, this is specifically for looking at policy and understanding which policies are in place and which Pods can communicate with each other, or which services can communicate with each other. So, in this case we can see that we’ve got servers and authorizations for all of the components in the linkerd-viz namespace, and if we were to look at those separately, we would be able to look at what’s allowed and denied. That’s all part of a different workshop.

Charles Pretzer: We went over this in one of the community meetings, I think it was September or October, where we talked about policies and understanding the syntax behind those. So, those are the main tools for debugging using the Linkerd CLI. In some cases, we’ve had to dig into the [inaudible 00:45:16] logs. For seasoned Kubernetes operators, this is something that… It’s never comfortable, but it should be something that you understand how to do, looking at the [inaudible 00:45:28] logs themselves to see if there is anything going on in the system that looks suspicious. So, that is how we debug Linkerd, the control plane and the data plane, in production.

Charles Pretzer: Let’s jump back over to our slides, and I hope you were all able to follow along with that. Apologies that the script was Linux-focused. I hope that you all will be able to follow along on your own. So, let’s take a quick look at upgrading and rolling back, and then we’ll jump into some questions. So, upgrading Linkerd, this is another checklist. We just forgot to put the word checklist there. Always test in a lower environment. This is nothing new here. The data plane is designed to work with newer control planes, so you upgrade the control plane first, and then gradually restart the Pods that are meshed with Linkerd so that they get the new version of the proxy.

Charles Pretzer: So, a really good example of this is here, where we show upgrading 2.10.2 to 2.11.1. You upgrade the control plane to 2.11.1. Those proxies that are still running 2.10.2 will continue to function, and on your own time you can go through and restart those workloads so that those Pods get the 2.11.1 proxy. So, the compatibility is there. That’s one of the key features that is tested for when a new version of Linkerd is being built, whether it’s an edge or a stable. One of the questions we get a lot is about upgrading Linkerd, and one of the things I’ve seen happen is folks will leapfrog versions. Always, always, always, please, always upgrade sequentially.
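Restarting meshed workloads after a control plane upgrade could be done with kubectl rollout, roughly as below. The function names are hypothetical, and the sketch only covers Deployments; other workload kinds would need the same treatment.

```shell
# Restart every Deployment in a namespace so its Pods are re-injected
# with the proxy version that matches the upgraded control plane.
#   $1 = namespace
restart_meshed_deployments() {
  kubectl -n "$1" rollout restart deployment
}

# Wait until one Deployment has finished rolling out.
#   $1 = namespace, $2 = deployment name
wait_for_rollout() {
  kubectl -n "$1" rollout status "deployment/$2"
}
```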

Charles Pretzer: So, if you’re on 2.9, go to 2.10, then 2.11. Don’t skip 2.10 to get to 2.11. This is just because there are changes that happen across versions, and we document these all very well in the upgrade notes. If there’s anything in there, specifically a breaking change, then it will be documented, and you’ll be able to understand what needs to happen as you upgrade, including any additional changes that you need to make. Or if it’s just a seamless upgrade, then it’ll say, “Let’s just go ahead and upgrade.” But the main thing is don’t skip versions, because there may be special notes for each of the releases.
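The sequential rule can be made concrete with a tiny helper that lists every minor release to pass through. This is purely illustrative; the function name is hypothetical and it only handles 2.x minor versions.

```shell
# Print each minor version to install, in order, between two 2.x releases.
#   $1 = current minor (9 for 2.9), $2 = target minor (11 for 2.11)
upgrade_path() {
  minor=$(( $1 + 1 ))
  while [ "$minor" -le "$2" ]; do
    echo "2.$minor"
    minor=$(( minor + 1 ))
  done
}

# Usage: upgrade_path 9 11 prints 2.10 and then 2.11, one per line.
```

Each step in that list is where the upgrade notes for that release should be read before moving on.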

Charles Pretzer: One quick note about upgrading with Helm is that there are some flags that you should become familiar with. I only learned about these recently myself. We document them well in the upgrade docs. Reuse values and reset values are flags that are a part of the Helm binary, and I’ve created a little table here. The matrix, whether you want to reuse values or reset values, can be driven by the decision of whether you’ve got any configuration overrides, or if you’re using vanilla Linkerd as configured out of the box. Have a look at these. Have a look at the docs. For the most part, the default behavior is fine. I haven’t seen a ton of people… Actually, I’ve never seen anybody ask questions about these particular flags.

Charles Pretzer: So, most of the time, you’ll be able to Helm upgrade without having to specify any of these flags, which means that there is a default behavior that occurs. Reuse values is the default behavior if there are no overrides, and reset values is the default if there are overrides. The difference between the two is that reuse values will take what’s currently in the configuration and apply it, insomuch as it can, to the new version, while reset values takes the configuration from the new release. So if you use reset values when going to 2.11.1, the chart values from 2.11.1 will overwrite anything that was in the 2.10.2 release. So, that’s upgrading. Rolling back is fundamentally the same as upgrading with Helm.
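The two flags could be used explicitly like this. The flags are standard Helm flags; the function names are hypothetical, and the release name and values file are whatever was used at install time.

```shell
# Upgrade and keep the values stored with the currently deployed release.
#   $1 = release name
upgrade_reusing_values() {
  helm upgrade "$1" linkerd/linkerd2 --reuse-values
}

# Upgrade starting from the new chart's defaults, then apply an overrides file.
#   $1 = release name, $2 = values file kept in version control
upgrade_resetting_values() {
  helm upgrade "$1" linkerd/linkerd2 --reset-values -f "$2"
}
```

Running helm repo update beforehand makes sure the newer chart version is available locally.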

Charles Pretzer: There’s one little gotcha with Helm that I always have a hard time wrapping my brain around, is that you can Helm upgrade to a previous version. It’s just Helm semantics, and you can Helm upgrade from 2.11 to 2.10.2, and everything will work as expected. I will point out again breaking changes are documented in the Helm upgrade guide, and I know I mentioned earlier that you should use version control for your Helm configuration files, but this is going to help you a ton if you need to do any rollbacks. So, that’s the basics for rolling back. There’s not a ton to talk about here.
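A rollback along these lines might look like the sketch below. It assumes the chart version matches the Linkerd version (as it did for the 2.10 and 2.11 stable charts) and that the values file for the older version is in version control; the function names are hypothetical.

```shell
# Show the revisions Helm has recorded for a release.
#   $1 = release name
show_release_history() {
  helm history "$1"
}

# "Upgrade" to an older chart version, which is how the rollback is done here.
#   $1 = release name, $2 = chart version (for example 2.10.2), $3 = values file
downgrade_with_upgrade() {
  helm upgrade "$1" linkerd/linkerd2 --version "$2" -f "$3"
}
```

After the control plane is back on the older version, any workloads that already picked up the newer proxy should be restarted as well.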

Charles Pretzer: With that, I think we’ve covered everything from monitoring and alerting, as well as identifying and using the tools for debugging Linkerd, the data plane and the control plane, and upgrades and rollbacks. We’ll be handling questions throughout the day in the workshops channel, and going forward, I hope that folks have been able to get their questions answered during the chat. Let’s take a minute, the next few minutes. Jason, do you have any questions for me that I didn’t cover that you think would be helpful for the folks here, or are there some interesting questions that you saw come in that we should answer?

Jason: Yeah. Good one from Nathan. In general, is it safe to Helm rollback the control plane if you want to roll back in a hurry?

Charles Pretzer: What is in a hurry? Yeah. So, to me that means you’ve just upgraded from 2.10 to 2.11, let’s say, and you’ve decided that you want to roll back to 2.10. Yes. For me, I think that’s totally fine. The thing to think about there is whether you have rolled any of the data plane components… Sorry. Yeah. Yeah, right. Any of the data plane components. So, if any of the proxies have been upgraded to 2.11 in that example, you want to make sure that as soon as you’ve rolled back to 2.10, you also restart those workloads as well to get those proxies back to 2.10.2. In most cases, the proxies will work in that… I’m trying to think. I shouldn’t make that blanket statement. Make sure that the proxies are at the version of your control plane or lower for compatibility’s sake.

Jason: We have another good one from [Shumet 00:52:42], I believe, and I’m sorry if I mispronounced your name. If I have set up Linkerd using the CLI, what is the best way to move it to a Helm-based deployment?

Charles Pretzer: Good question. So, if you set up Linkerd using the CLI and you let Linkerd generate the certificates for you, then you’re in a situation where you’ll want to rotate those certificates out, including the trust anchor, which is why we need to have the workshop for certificate management. Off the top of my head, you will… Well, if you’re in a position to, I would redeploy Linkerd from scratch with Helm. That’s probably the most seamless route, but not everybody’s in that position, and I totally get that. So, the thing that I would do is create those new certificates or get the certificates from your organization’s security team, and then go through the steps for manually rotating the certificates, the trust anchors, specifically, and then when you Helm deploy, you would specify those certificates as well.

Jason: Okay. So, folks who are listening, there are documents on the Linkerd docs that discuss exactly how you would go through rotating the certificates within Linkerd, and so that’s what Charles is referring to there.

Charles Pretzer: Yeah. Yep. I’m trying to think. No. So, actually, I ran through this exercise yesterday where I did a Helm deployment with one certificate, and then did Helm upgrade with a different certificate, and as you can see, the problem there is that you now have mismatched certificates in the proxies and in the control plane, which means the trust is broken and the system wouldn’t work. So, you want to make sure that you have those certificates in place before you do any upgrading using Helm from the Linkerd CLI.

Jason: Yeah, and Demetrius actually was really helpful and posted the link for everyone. Thank you so much for that, Demetrius.

Charles Pretzer: Yeah. That’s awesome. Thank you. What else, Jason? We got 90 seconds.

Jason: We have one that we’re not going to touch because it’s a roadmap one regarding Linkerd [inaudible 00:55:20], so Morton, we can talk about that a little bit offline. For folks that are joining, we’re about at time, but we would love to have you join the Linkerd Slack, and I’m going to post the link to join our Slack one more time. Feel free to reach out to myself or Charles anytime, and also come to that workshops channel to just ask us about anything you saw in today’s workshop.

Charles Pretzer: That’s also the place to get updates about upcoming workshops as well.

Jason: Yeah. So, with that, why don’t we plug the next workshop in the series?

Charles Pretzer: You’ve got it. Next workshop.

Jason: Let me post the link to Slack. I’m so sorry about that. Sorry. Go ahead, Charles.

Charles Pretzer: No, no problem. Next workshop is locking down your Kubernetes Cluster with Linkerd. So, we’ll talk more about security here. That’s going to happen on January 13th, which is a Thursday, and all the times are there, so register today, and we look forward to seeing you there. We love doing these workshops, and I can’t remember if Jason mentioned this earlier, but as we’ve started the Service Mesh Academy series, we’re basing it off of the feedback that you all give us.

Charles Pretzer: So, when you ask questions about Linkerd or specifically fill out the form to say, “This is what I want to learn about,” we take that and we say, “People are asking a lot about running Linkerd in production, and how do I debug it? Let’s do that workshop.” So, please continue to give us your ideas. Tell us what you want to hear about, and we are also interested in some of the edge cases out there that might be fun to track down that other people may be running into as well. Yeah. I’ll leave it at that. Thank you so much for joining. Thank you for being members of the Service Mesh Academy. I’m pretty excited for these workshops, and we look forward to doing more and seeing you all there.

Jason: Yeah. Thank you so much, everyone.