
Karpenter - A New Way to Manage Kubernetes Node Groups

Fernando Battistella

One of the most common discussions when adopting Kubernetes is around autoscaling. You can autoscale your workloads horizontally or vertically, but the main challenge has always been the nodes.

The hypervisor has no visibility into what the containers inside a virtual machine are actually consuming, nor is it aware of the workloads' resource requirements, and without that information the cloud provider can't reliably handle node autoscaling. The solution was to let something that does have that information handle it, and so we have the Cluster Autoscaler.

The Cluster Autoscaler automatically adjusts the size of an autoscaling group (ASG) when a pod fails to run in the cluster due to insufficient resources, or when nodes in the cluster are underutilized for a set period of time and their pods can fit onto other existing nodes.

Looking at the above description, it seems like the Cluster Autoscaler is just fine, and in most cases it is, but what if you need a new type of node that isn't available yet in your cluster's node groups?

Most organizations will have their clusters deployed using some kind of infrastructure-as-code tool like Terraform or AWS CloudFormation, which means that updates to this codebase are necessary whenever the node groups change. Configuring the details and restrictions of these node groups is not always a straightforward process either.

New nodes can also take a while to become available to Kubernetes, and once they are available you might still run into race conditions scheduling pods onto those nodes.

Recently, AWS released Karpenter to address these issues and bring a more native approach to managing your cluster nodes.

Let's take a look at how both solutions work, along with their current pros and cons.

Cluster Autoscaler and Karpenter

How does the Cluster Autoscaler work?

  • We deploy a workload to the cluster
  • The Kubernetes scheduler can't find a node that fits our pod
  • The pod is marked as Pending and Unschedulable
  • The Cluster Autoscaler looks for pods in a Pending state
  • It increases the ASG's desired count if the pending pods don't fit on the current nodes
  • The ASG creates a new instance
  • The instance joins the cluster
  • The Kubernetes scheduler finds the new node and, if the pod fits, assigns the pod to it

So the Cluster Autoscaler doesn't really deal with the nodes themselves; it just adjusts the AWS ASG, lets AWS take care of everything else on the infrastructure side, and relies on the Kubernetes scheduler to assign the pod to a node.

While this works, it can introduce a number of failure modes, like a race condition where a different pod is assigned to your new node before your original pod, triggering the whole loop again and leaving your pod pending even longer.

What about Karpenter?

Karpenter does not manipulate ASGs; it handles the instances directly. Instead of writing code to deploy a new node group and then targeting your workload to that group, you just deploy your workload, and Karpenter creates an EC2 instance that matches your constraints, provided it has a matching Provisioner. A Provisioner in Karpenter is a manifest that describes a node group, and you can have multiple Provisioners for different needs, just like node groups.
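
To make that concrete, here is a minimal Provisioner sketch using the v1alpha5 API current at the time of writing (the name and instance profile are placeholders; we'll build full ones in the demo below):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-my-cluster # placeholder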

OK, if it's like node groups, what's the advantage? The trick is in the way Karpenter works. Let's do the same exercise we did for the Cluster Autoscaler, but now with Karpenter.

  • We deploy a workload to the cluster
  • The Kubernetes scheduler can't find a node that fits our pod
  • The pod is marked as Pending and Unschedulable
  • Karpenter evaluates the resources and constraints of the unschedulable pods against the available Provisioners and creates matching EC2 instances
  • The instance(s) join the cluster
  • Karpenter immediately binds the pods to the new node(s), without waiting for the Kubernetes scheduler

Just by not relying on ASGs and handling the nodes itself, Karpenter cuts the time needed to provision a new node: it doesn't have to wait for an ASG to respond to a sizing change, and can request a new instance in seconds.

In our tests, a pending pod got a node created for it in 2 seconds and was running in about 1 minute on average, versus 2 to 5 minutes with the Cluster Autoscaler.

The possible race condition we talked about before can't occur in this model, as the pods are immediately bound to the new nodes.

Another interesting thing the Provisioner can do is set a TTL for empty nodes, so a node that has no pods other than DaemonSet pods is terminated when the TTL is reached.

It can also keep nodes current by enforcing a TTL on nodes in general, meaning a node is recycled once its TTL is reached.
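
Both behaviors are plain fields on the Provisioner spec. A fragment like this enables them (the values are illustrative; the expiry field name is from the v1alpha5 API):

spec:
  ttlSecondsAfterEmpty: 30         # terminate a node empty of non-DaemonSet pods after 30s
  ttlSecondsUntilExpired: 604800   # recycle every node after 7 days, keeping the fleet current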

OK! So Karpenter is great, let's dump the Cluster Autoscaler! Not so fast! There is one Cluster Autoscaler feature that Karpenter is missing: rebalancing nodes. The Cluster Autoscaler can drain a node when its utilization falls under a certain threshold and its pods fit on other nodes.
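
For comparison, that scale-down behavior in the Cluster Autoscaler is tuned through its command-line flags; a fragment of its container args might look like this (the values shown are the project defaults):

        command:
          - ./cluster-autoscaler
          - --scale-down-enabled=true
          - --scale-down-utilization-threshold=0.5
          - --scale-down-unneeded-time=10m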

Talk is Cheap! Show me the demo!

Let's get this running! We're following the getting started guide from karpenter.sh, with a couple of twists.

At the time this post was written, Karpenter 0.5.2 was the latest version available.

First the good old warning for all demo code.

WARNING! This code is for use in testing only; broad permissions are given to Karpenter, and no effort was made to secure the cluster.

Now go and check out our git repository at https://github.com/ops-guru/karpenter-blog-post.

We will use Terraform and Helm to deploy:

  • a VPC and subnets
  • an EKS cluster with one node (we need to run Karpenter somewhere, right?)
  • an IAM role to allow Karpenter to manipulate the AWS resources it needs to manage nodes for us (more details on those in the Getting Started with Terraform page at Karpenter's website)
  • Karpenter, via its Helm chart, with access to its IAM role through IAM Roles for Service Accounts

To that end, we will first export a couple of environment variables:

  • AWS_PROFILE is our AWS CLI profile configured with our credentials (if yours are in your default profile you can skip this one)
  • AWS_DEFAULT_REGION to select which region to create resources in
  • CLUSTER_NAME to give our cluster a nice name
  • KUBECONFIG and KUBE_CONFIG_PATH to tell kubectl, helm, and terraform where our kubeconfig file is (terraform will create it for us)

export AWS_PROFILE=opsguru
export AWS_DEFAULT_REGION=ca-central-1
export CLUSTER_NAME=opsguru-karpenter-test
export KUBECONFIG=${PWD}/kubeconfig_${CLUSTER_NAME}
export KUBE_CONFIG_PATH=${KUBECONFIG}

Let's create our cluster and deploy Karpenter into it. Initialize Terraform, then review the plan and confirm. EKS cluster creation takes around 10 minutes.

terraform init
terraform apply -var cluster_name=${CLUSTER_NAME} -var region=${AWS_DEFAULT_REGION}

Now that you've got some coffee, let's talk node groups.

Our demo assumes we want two node groups in the cluster: one using on-demand instances, the other using spot instances.

How can we do this with Karpenter? We just need to define a Provisioner for each of these groups. Instead of rambling on about it, let's look at the Provisioner resources for our two node groups.

Our on-demand instances are for our cluster addons, so we will want a taint to ensure only cluster addons are deployed there. We also want to restrict the node types to m5.large and m5.2xlarge instances in both of our availability zones.

cat <<EOF > node_group_addons.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: addons-ondemand
spec:
  requirements:
    - key: node.kubernetes.io/instance-type # If not included, all instance types are considered
      operator: In
      values: ["m5.large", "m5.2xlarge"]
    - key: "topology.kubernetes.io/zone" # If not included, all zones are considered
      operator: In
      values: ["${AWS_DEFAULT_REGION}a", "${AWS_DEFAULT_REGION}b"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
  labels: # Kubernetes labels
    managed-by: karpenter
    purpose: addons
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
    tags: # AWS EC2 Tags
      managed-by: karpenter
  ttlSecondsAfterEmpty: 30 # If a node is empty of non daemonset pods for this ttl, it is removed
  taints:
    - key: opsguru.io/addons
      effect: NoSchedule
EOF

What are we looking at?

  • Any pod that selects managed-by: karpenter via its nodeSelector, tolerates our opsguru.io/addons taint, and can fit on an m5.large or m5.2xlarge node will have a node provisioned for it, if needed
  • The nodes will be on-demand type nodes
  • The nodes will be deployed in either our AZ a or b
  • If a node stays empty for more than 30 seconds, we terminate it
  • Kubernetes labels managed-by: karpenter and purpose: addons will be added to the nodes
  • An EC2 tag managed-by: karpenter will be applied to the nodes

Our spot instances are for any other workloads; we will not taint them, and we will use c5 instances. Any workload that can't fit on our initial cluster node (the one created with Terraform) and doesn't tolerate the opsguru.io/addons taint from the on-demand group should be scheduled on these nodes.

cat <<EOF > node_group_general_spot.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-general
spec:
  requirements:
    - key: node.kubernetes.io/instance-type # If not included, all instance types are considered
      operator: In
      values: ["c5.large", "c5.2xlarge"]
    - key: "topology.kubernetes.io/zone" # If not included, all zones are considered
      operator: In
      values: ["${AWS_DEFAULT_REGION}a", "${AWS_DEFAULT_REGION}b"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
  labels: # Kubernetes labels
    managed-by: karpenter
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
    tags: # AWS EC2 Tags
      managed-by: karpenter
  ttlSecondsAfterEmpty: 30 # If a node is empty of non daemonset pods for this ttl, it is removed
EOF

This one is quite similar to the first Provisioner, but we're using spot instances instead of on-demand, c5 instance types, and no taint.

Now that we have our provisioners defined, let's install Karpenter using Helm.

helm repo add karpenter https://charts.karpenter.sh
helm repo update
helm install karpenter \
    -n karpenter \
    --create-namespace \
    --version 0.5.2 \
    --set serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn=$(terraform output -raw iam_role_arn) \
    --set controller.clusterName=${CLUSTER_NAME} \
    --set controller.clusterEndpoint=$(terraform output -raw cluster_endpoint) \
    --wait \
    karpenter/karpenter
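
Before moving on, it's worth a quick sanity check that the chart deployed correctly:

kubectl get pods -n karpenter

You should see Karpenter's pods (a controller and, in this chart version, a webhook) in a Running state.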

OK! We've got almost everything we need to see this working; we're just missing one little thing: actual workloads :D

You can apply the workloads folder from our git repository; it contains two manifests:

  • addon.yaml - a deployment of a pause container, with a nodeSelector on the label purpose: addons, tolerating the taint defined in the Provisioner, with 1 replica (a sketch of roughly what it looks like follows below)
  • general.yaml - a deployment of a pause container, with a nodeSelector on the label managed-by: karpenter, with 20 replicas
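
For reference, addon.yaml looks roughly like this (a sketch; the pod labels are illustrative, and the exact manifest is in the repository):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: addon
spec:
  replicas: 1
  selector:
    matchLabels:
      app: addon # illustrative label
  template:
    metadata:
      labels:
        app: addon
    spec:
      nodeSelector:
        purpose: addons # matches the label set by the addons-ondemand Provisioner
      tolerations:
        - key: opsguru.io/addons # tolerate the Provisioner's taint
          operator: Exists
          effect: NoSchedule
      containers:
        - name: pause
          image: k8s.gcr.io/pause
          resources:
            requests:
              cpu: "1"
              memory: 100Mi

With both manifests applied, check the pod status: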
kubectl get pods -o=custom-columns="NAME:.metadata.name,STATUS:.status.conditions[*].reason,MESSAGE:.status.conditions[*].message,NODE:.spec.nodeName"
NAME                               STATUS          MESSAGE                                                                         NODE
addon-7fc784b5d-fg2dx              Unschedulable   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.   <none>
general-workloads-5df49fcb-2hhqg   Unschedulable   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.   <none>
general-workloads-5df49fcb-4mlqt   Unschedulable   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.   <none>
general-workloads-5df49fcb-4zx4v   Unschedulable   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.   <none>
general-workloads-5df49fcb-5788h   Unschedulable   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.   <none>
general-workloads-5df49fcb-7b76r   Unschedulable   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.   <none>
...

With these deployed, you can see that all 21 pods are Pending: their status says they're Unschedulable, the one node we have in the cluster doesn't match their constraints (nodeSelector), and they have no node assigned.

Let's check the status of our nodes:

kubectl get nodes
NAME                                          STATUS   ROLES    AGE     VERSION
ip-10-0-1-47.ca-central-1.compute.internal    Ready    <none>   46h     v1.21.5-eks-bc4871b

kubectl describe node ip-10-0-1-47.ca-central-1.compute.internal
Name:               ip-10-0-1-47.ca-central-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ca-central-1
                    failure-domain.beta.kubernetes.io/zone=ca-central-1a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-1-47.ca-central-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=m5.large
                    topology.kubernetes.io/region=ca-central-1
                    topology.kubernetes.io/zone=ca-central-1a
...

Indeed, the existing node doesn't have the labels we're using in the nodeSelector of either of our workloads.

Now let's deploy our first Provisioner, addons-ondemand.

kubectl apply -f node_group_addons.yaml
provisioner.karpenter.sh/addons-ondemand created

If you're following the Karpenter controller logs, you will see a node being provisioned and the pod bound to it immediately.

kubectl logs -n karpenter -l karpenter=controller -f
2021-12-17T18:49:33.800Z        INFO    controller.provisioning Batched 1 pods in 1.000321584s  {"commit": "870e2f6", "provisioner": "addons-ondemand"}
2021-12-17T18:49:33.804Z        INFO    controller.provisioning Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [m5.large m5.2xlarge]   {"commit": "870e2f6", "provisioner": "addons-ondemand"}
2021-12-17T18:49:36.061Z        INFO    controller.provisioning Launched instance: i-03ffbc75bd75a68e7, hostname: ip-10-0-1-114.ca-central-1.compute.internal, type: m5.large, zone: ca-central-1a, capacityType: on-demand     {"commit": "870e2f6", "provisioner": "addons-ondemand"}
2021-12-17T18:49:36.098Z        INFO    controller.provisioning Bound 1 pod(s) to node ip-10-0-1-114.ca-central-1.compute.internal      {"commit": "870e2f6", "provisioner": "addons-ondemand"}

If you check our pods again, you will see that the addon pod is scheduled to a node.

kubectl get pods -o=custom-columns="NAME:.metadata.name,STATUS:.status.conditions[*].reason,MESSAGE:.status.conditions[*].message,NODE:.spec.nodeName"
NAME                               STATUS          MESSAGE                                                                                                                                                  NODE
addon-7fc784b5d-fg2dx              <none>          <none>                                                                                                                                                   ip-10-0-1-114.ca-central-1.compute.internal
general-workloads-5df49fcb-2hhqg   Unschedulable   0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {opsguru.io/addons: }, that the pod didn't tolerate.   <none>
general-workloads-5df49fcb-4mlqt   Unschedulable   0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {opsguru.io/addons: }, that the pod didn't tolerate.   <none>
general-workloads-5df49fcb-4zx4v   Unschedulable   0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {opsguru.io/addons: }, that the pod didn't tolerate.   <none>
...

You will also notice that our general workloads are still Unschedulable, but the message now says that 2 nodes don't match: one doesn't match the selector, and the other has a taint the workload doesn't tolerate.

Let's see our nodes now.

kubectl get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
ip-10-0-1-114.ca-central-1.compute.internal   Ready    <none>   11m   v1.21.5-eks-bc4871b
ip-10-0-1-47.ca-central-1.compute.internal    Ready    <none>   46h   v1.21.5-eks-bc4871b

There is our new node! Let's see what Karpenter got us.

kubectl describe node ip-10-0-1-114.ca-central-1.compute.internal
Name:               ip-10-0-1-114.ca-central-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ca-central-1
                    failure-domain.beta.kubernetes.io/zone=ca-central-1a
                    karpenter.sh/capacity-type=on-demand
                    karpenter.sh/provisioner-name=addons-ondemand
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-1-114.ca-central-1.compute.internal
                    kubernetes.io/os=linux
                    managed-by=karpenter
                    node.kubernetes.io/instance-type=m5.large
                    purpose=addons
                    topology.kubernetes.io/region=ca-central-1
                    topology.kubernetes.io/zone=ca-central-1a
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 17 Dec 2021 11:49:36 -0700
Taints:             opsguru.io/addons:NoSchedule
...
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7934464Ki
  pods:                        29
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1930m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7244288Ki
  pods:                        29
...

Our addon requires 1 core and 100Mi of memory; it has a nodeSelector pointing at the label purpose with the value addons, and tolerates the opsguru.io/addons taint.

Our Provisioner addons-ondemand matches all these conditions, and among its instance type options m5.large can fit our pod (you can see that the node has 1930m allocatable, and our pod needs 1000m). Since the request matches a Provisioner's settings, we got a node for the workload.

What about our other pods? Well, let's get their Provisioner up!

kubectl apply -f node_group_general_spot.yaml
provisioner.karpenter.sh/spot-general created

Once we apply the Provisioner, you will see in Karpenter's logs:

2021-12-17T21:53:22.009Z        INFO    controller.provisioning Waiting for unschedulable pods  {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:34.896Z        INFO    controller.provisioning Batched 20 pods in 1.410203663s {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:34.906Z        INFO    controller.provisioning Computed packing of 3 node(s) for 20 pod(s) with instance type option(s) [c5.2xlarge]   {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.533Z        INFO    controller.provisioning Launched instance: i-082db60871ae40c9d, hostname: ip-10-0-1-162.ca-central-1.compute.internal, type: c5.2xlarge, zone: ca-central-1a, capacityType: spot        {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.533Z        INFO    controller.provisioning Launched instance: i-03d7f3f1d4bffdea4, hostname: ip-10-0-2-46.ca-central-1.compute.internal, type: c5.2xlarge, zone: ca-central-1b, capacityType: spot {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.533Z        INFO    controller.provisioning Launched instance: i-09dc16d84a292604c, hostname: ip-10-0-2-169.ca-central-1.compute.internal, type: c5.2xlarge, zone: ca-central-1b, capacityType: spot        {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.591Z        INFO    controller.provisioning Bound 7 pod(s) to node ip-10-0-1-162.ca-central-1.compute.internal      {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.666Z        INFO    controller.provisioning Bound 7 pod(s) to node ip-10-0-2-46.ca-central-1.compute.internal       {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.830Z        INFO    controller.provisioning Bound 6 pod(s) to node ip-10-0-2-169.ca-central-1.compute.internal      {"commit": "870e2f6", "provisioner": "spot-general"}

Our 20 pods were split across 3 nodes; we can confirm that they are all scheduled by re-running our earlier command to check their status:

kubectl get pods -o=custom-columns="NAME:.metadata.name,STATUS:.status.conditions[*].reason,MESSAGE:.status.conditions[*].message,NODE:.spec.nodeName"
NAME                               STATUS   MESSAGE   NODE
addon-7fc784b5d-fg2dx              <none>   <none>    ip-10-0-1-114.ca-central-1.compute.internal
general-workloads-5df49fcb-7f2mf   <none>   <none>    ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-7rls5   <none>   <none>    ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-9qs99   <none>   <none>    ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-bqnvc   <none>   <none>    ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-d775z   <none>   <none>    ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-g5kdd   <none>   <none>    ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-gxkn9   <none>   <none>    ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-jhq85   <none>   <none>    ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-jvnhl   <none>   <none>    ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-nfhq5   <none>   <none>    ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-qpkdb   <none>   <none>    ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-scmdp   <none>   <none>    ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-tgtct   <none>   <none>    ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-ts4pt   <none>   <none>    ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-v6cql   <none>   <none>    ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-wqhtl   <none>   <none>    ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-xpw52   <none>   <none>    ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-xzgkq   <none>   <none>    ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-z47dd   <none>   <none>    ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-zpd6s   <none>   <none>    ip-10-0-2-46.ca-central-1.compute.internal

We should now have 5 nodes: 1 original node from Terraform, 1 from our addons-ondemand Provisioner, and 3 from the spot-general Provisioner.

kubectl get nodes
NAME                                          STATUS   ROLES    AGE     VERSION
ip-10-0-1-114.ca-central-1.compute.internal   Ready    <none>   3h7m    v1.21.5-eks-bc4871b
ip-10-0-1-162.ca-central-1.compute.internal   Ready    <none>   3m57s   v1.21.5-eks-bc4871b
ip-10-0-1-47.ca-central-1.compute.internal    Ready    <none>   2d1h    v1.21.5-eks-bc4871b
ip-10-0-2-169.ca-central-1.compute.internal   Ready    <none>   3m57s   v1.21.5-eks-bc4871b
ip-10-0-2-46.ca-central-1.compute.internal    Ready    <none>   3m57s   v1.21.5-eks-bc4871b

Let's dig a bit into our new nodes now. Which instance types do we have?

kubectl get nodes -l karpenter.sh/provisioner-name=spot-general -o jsonpath='{.items[*].metadata.labels.node\.kubernetes\.io\/instance-type}'
c5.2xlarge c5.2xlarge c5.2xlarge

Our general-workloads deployment pods differ from the addon deployment only in their nodeSelector and the lack of a toleration for the opsguru.io/addons taint. Their nodeSelector label is set to managed-by: karpenter, which also matches the addons-ondemand Provisioner, but without the toleration they can only match the new Provisioner.

With the Provisioner matched, Karpenter now needs to decide which instance type to use: c5.large or c5.2xlarge. A c5.large has 2 vCPUs and 4GiB of memory, so it should only be able to take one of our pods (2 vCPUs leave ~1900m allocatable, and we need 1000m per pod). That would mean one instance per pod, which is quite a lot of waste (almost half of each instance would sit unused).

A c5.2xlarge, on the other hand, has 8 vCPUs and 16GiB of memory, which should fit 7 of our pods per instance (8 vCPUs leave ~7900m allocatable). This matches what we're seeing: 3 nodes, with 7 pods on one instance, 7 on another, and 6 on the last, 20 pods scheduled in the best way our Provisioner allows.
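
If you want to double-check that packing yourself, a quick pipeline (assuming a POSIX shell) counts the pods on each node:

kubectl get pods -o wide --no-headers | awk '{print $7}' | sort | uniq -c

The seventh column of kubectl get pods -o wide is the node name, so this prints a per-node pod count.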

Cleanup

Thanks for coming to our TED TALK! errr, quick review of Karpenter.

Now let's clean up, and see one more feature of Karpenter along the way.

In both of our Provisioners we set ttlSecondsAfterEmpty: 30, which means that if a node has no pods (other than DaemonSet pods) for more than 30 seconds, it will be terminated.

We won't take their word for it; let's check!

Let's delete our deployments:

kubectl delete deployment general-workloads addon
deployment.apps "general-workloads" deleted
deployment.apps "addon" deleted

In Karpenter's logs we can see the nodes getting a ttl and then being cordoned, drained and terminated.

2021-12-17T22:29:23.877Z        INFO    controller.node Added TTL to empty node {"commit": "870e2f6", "node": "ip-10-0-1-162.ca-central-1.compute.internal"}
2021-12-17T22:29:23.932Z        INFO    controller.node Added TTL to empty node {"commit": "870e2f6", "node": "ip-10-0-2-46.ca-central-1.compute.internal"}
2021-12-17T22:29:24.031Z        INFO    controller.node Added TTL to empty node {"commit": "870e2f6", "node": "ip-10-0-2-169.ca-central-1.compute.internal"}
2021-12-17T22:29:24.239Z        INFO    controller.node Added TTL to empty node {"commit": "870e2f6", "node": "ip-10-0-1-114.ca-central-1.compute.internal"}
2021-12-17T22:29:53.889Z        INFO    controller.node Triggering termination after 30s for empty node {"commit": "870e2f6", "node": "ip-10-0-1-162.ca-central-1.compute.internal"}
2021-12-17T22:29:53.915Z        INFO    controller.termination  Cordoned node   {"commit": "870e2f6", "node": "ip-10-0-1-162.ca-central-1.compute.internal"}
2021-12-17T22:29:53.948Z        INFO    controller.node Triggering termination after 30s for empty node {"commit": "870e2f6", "node": "ip-10-0-2-46.ca-central-1.compute.internal"}
2021-12-17T22:29:53.970Z        INFO    controller.termination  Cordoned node   {"commit": "870e2f6", "node": "ip-10-0-2-46.ca-central-1.compute.internal"}
2021-12-17T22:29:54.042Z        INFO    controller.node Triggering termination after 30s for empty node {"commit": "870e2f6", "node": "ip-10-0-2-169.ca-central-1.compute.internal"}
2021-12-17T22:29:54.068Z        INFO    controller.termination  Cordoned node   {"commit": "870e2f6", "node": "ip-10-0-2-169.ca-central-1.compute.internal"}
2021-12-17T22:29:54.070Z        INFO    controller.termination  Deleted node    {"commit": "870e2f6", "node": "ip-10-0-1-162.ca-central-1.compute.internal"}
2021-12-17T22:29:54.147Z        INFO    controller.termination  Deleted node    {"commit": "870e2f6", "node": "ip-10-0-2-46.ca-central-1.compute.internal"}
2021-12-17T22:29:54.247Z        INFO    controller.termination  Deleted node    {"commit": "870e2f6", "node": "ip-10-0-2-169.ca-central-1.compute.internal"}
2021-12-17T22:29:54.261Z        INFO    controller.node Triggering termination after 30s for empty node {"commit": "870e2f6", "node": "ip-10-0-1-114.ca-central-1.compute.internal"}
2021-12-17T22:29:54.290Z        INFO    controller.termination  Cordoned node   {"commit": "870e2f6", "node": "ip-10-0-1-114.ca-central-1.compute.internal"}
2021-12-17T22:29:54.425Z        INFO    controller.termination  Deleted node    {"commit": "870e2f6", "node": "ip-10-0-1-114.ca-central-1.compute.internal"}

With the workloads and nodes gone, we are left with our initial cluster, which Terraform will gladly destroy.

terraform destroy -var cluster_name=${CLUSTER_NAME} -var region=${AWS_DEFAULT_REGION}

Conclusion

Pros

This same demo with the Cluster Autoscaler would be a few minutes slower, which depending on your workloads might or might not be crucial, but at a larger scale (think several services with hundreds of pods each) this speed difference by itself becomes a major advantage.

Depending on how you manage tenancy in your clusters, you could even deploy the Provisioner as part of your application through a Helm chart, or just have an easier time managing node groups in general.

Cons

Karpenter still doesn't have a mechanism for removing underutilized nodes when their workloads can fit elsewhere, a feature present in the Cluster Autoscaler. This could possibly be handled by the Descheduler, but that could be a whole other blog post :)

The Cluster Autoscaler has been around for a good while and is thoroughly battle-tested, while Karpenter is relatively new and might be rough around the edges.

Karpenter only works on AWS right now, though it can be extended to other cloud providers.

Final Thoughts

Karpenter is extremely promising, and its pros will outweigh the cons in most cases. It is not an all-or-nothing solution either: you can run it in parallel with the Cluster Autoscaler and get the best of both worlds.

There is a lot about Karpenter we didn't cover here; take a look at the Related Links section at the bottom for some documentation and videos on it.

We are looking forward to seeing how this tool develops!

Related Links


Written by:

Fernando Battistella, Principal Architect at OpsGuru - Fernando has over two decades of experience in IT, with the last six years architecting cloud-native solutions for companies of all sizes. Specialized in Kubernetes and the Cloud Native ecosystem, he has helped multiple organizations design, build, migrate, operate and train their teams in cloud-native technologies and platforms.
