One of the most common discussions when adopting Kubernetes is around autoscaling. You can autoscale your workloads horizontally or vertically, but the main challenge has always been the nodes.
The hypervisor has no visibility into what a container inside a virtual machine is actually consuming, nor is it aware of the workload's resource requirements, and without that information the cloud provider can't reliably handle node autoscaling. The solution was to let something that does have that information handle it, and so we have the Cluster Autoscaler.
The Cluster Autoscaler automatically adjusts the size of an autoscaling group (ASG) when a pod fails to run in the cluster due to insufficient resources, or when nodes in the cluster are underutilized for a set period of time and their pods can fit onto other existing nodes.
Looking at the above description, it seems like the Cluster Autoscaler is just fine, and in most cases it is, but what if you need a new type of node that isn't available yet in your cluster's node groups?
Most organizations deploy their clusters using some kind of infrastructure-as-code tool like Terraform or AWS CloudFormation, which means that changing the node groups requires updates to that codebase. Configuring the details and restrictions of these node groups is not always a straightforward process either.
New nodes can also take a while to become available to Kubernetes, and once they are available you might still run into race conditions when scheduling pods onto them.
Recently, AWS released Karpenter to address these issues and bring a more native approach to managing your cluster nodes.
Let's take a look at how both solutions work, along with their current pros and cons.
How does the Cluster Autoscaler work?
At a high level, the flow looks like this:

1. A pod is created, but no existing node can take it, so it sits in the Pending state and is marked Unschedulable.
2. The Cluster Autoscaler notices the Unschedulable pod and increases the desired size of the matching ASG.
3. AWS launches a new instance, which eventually joins the cluster as a node.
4. The Kubernetes scheduler assigns the Pending pod to the new node.

So the Cluster Autoscaler doesn't really deal with the nodes themselves: it just adjusts the AWS ASG, lets AWS take care of everything else on the infrastructure side, and relies on the Kubernetes scheduler to assign the pod to a node.
While this works, it can introduce a number of failure modes, like a race condition where another pod gets assigned to your new node first, triggering the whole loop again and leaving your original pod Pending for even longer.
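For reference, on AWS the Cluster Autoscaler finds its ASGs through tags on the groups. A minimal sketch of the relevant container arguments, assuming the standard auto-discovery tags (the expander choice here is just an illustration, random is the default):

command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --expander=least-waste
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${CLUSTER_NAME}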
What about Karpenter?
Karpenter does not manipulate ASGs; it handles the instances directly. Instead of writing code to deploy a new node group and then targeting your workload at that group, you just deploy your workload, and Karpenter creates an EC2 instance that matches your constraints, provided it has a matching Provisioner. A Provisioner in Karpenter is a manifest that describes a node group, and you can have multiple Provisioners for different needs, just like node groups.
Ok, if it's like node groups, what is the advantage? The catch is in the way Karpenter works. Let's do the same exercise we did for the Cluster Autoscaler, but now with Karpenter.
The flow is noticeably shorter:

1. A pod is created, but no existing node can take it, so it sits in the Pending state and is marked Unschedulable.
2. Karpenter evaluates the Unschedulable pods against the available Provisioners, creates matching EC2 instances, and binds the pods to them directly.

Just by not relying on ASGs and handling the nodes itself, Karpenter cuts down the time needed to provision a new node: it doesn't have to wait for an ASG to respond to a change in its sizing, and can request a new instance in seconds.
In our tests, a pending pod got a node created for it in 2 seconds and was running in about 1 minute on average, versus 2 to 5 minutes with the Cluster Autoscaler.
The race condition we talked about before is also impossible in this model, as the pods are immediately bound to the new nodes.
Another interesting thing the Provisioner can do is set a TTL for empty nodes, so a node that has no pods other than DaemonSet pods is terminated when the TTL is reached.
It can also keep nodes current by enforcing a general TTL on nodes, meaning a node is recycled once that TTL is reached.
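Both behaviours live on the Provisioner spec; a minimal sketch (the 7-day expiry below is just an example value):

spec:
  ttlSecondsAfterEmpty: 30         # terminate a node once it has held no non-DaemonSet pods for 30 seconds
  ttlSecondsUntilExpired: 604800   # example: recycle every node after 7 days, even a busy one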
Ok! So Karpenter is great, let's dump the Cluster Autoscaler! Not so fast! There is one Cluster Autoscaler feature that Karpenter is missing: rebalancing nodes. The latter can drain a node when its utilization falls under a certain threshold and its pods fit on other nodes.
Let's get this running! We're following the getting started guide from karpenter.sh, with a couple of twists.
At the time this post was written, Karpenter 0.5.2 was the latest version available.
First, the good old warning for all demo code:
WARNING! This code is for use in testing only; broad permissions are given to Karpenter, and no effort was made to secure the cluster.
Now go and check out our git repository from https://github.com/ops-guru/karpenter-blog-post.
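If you don't have a copy yet, something along these lines will do:

git clone https://github.com/ops-guru/karpenter-blog-post.git
cd karpenter-blog-post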
We will use Terraform to deploy the EKS cluster and its supporting resources, and Helm to deploy Karpenter.
To that end we will first export a couple of environment variables:

- AWS_PROFILE is our AWS CLI profile configured with our credentials (if yours are in your default profile you can skip this one)
- AWS_DEFAULT_REGION selects which region to create resources in
- CLUSTER_NAME gives our cluster a nice name
- KUBECONFIG and KUBE_CONFIG_PATH tell kubectl, helm and terraform where our kubeconfig file is (which will be created by terraform for us)

export AWS_PROFILE=opsguru
export AWS_DEFAULT_REGION=ca-central-1
export CLUSTER_NAME=opsguru-karpenter-test
export KUBECONFIG=${PWD}/kubeconfig_${CLUSTER_NAME}
export KUBE_CONFIG_PATH=${KUBECONFIG}
Let's create our cluster and the resources Karpenter needs. Initialize Terraform, then check the plan and confirm. EKS cluster creation takes around 10 minutes.
terraform init
terraform apply -var cluster_name=${CLUSTER_NAME} -var region=${AWS_DEFAULT_REGION}
Now that you've got some coffee, let's talk node groups.
Our demo will assume we want two node groups in the cluster: one using on-demand instances, another using spot instances.
How can we do this with Karpenter? We just need to define a Provisioner for each of these groups. Instead of rambling about it, let's take a look at the Provisioner resources for our two node groups.
Our on-demand instances are for our cluster addons, so we will want a taint to ensure only cluster addons are deployed there. We also want to restrict the node types to m5.large and m5.2xlarge instances in both our availability zones.
cat <<EOF > node_group_addons.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: addons-ondemand
spec:
  requirements:
    - key: node.kubernetes.io/instance-type # If not included, all instance types are considered
      operator: In
      values: ["m5.large", "m5.2xlarge"]
    - key: "topology.kubernetes.io/zone" # If not included, all zones are considered
      operator: In
      values: ["${AWS_DEFAULT_REGION}a", "${AWS_DEFAULT_REGION}b"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
  labels: # Kubernetes labels
    managed-by: karpenter
    purpose: addons
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
    tags: # AWS EC2 Tags
      managed-by: karpenter
  ttlSecondsAfterEmpty: 30 # If a node is empty of non daemonset pods for this ttl, it is removed
  taints:
    - key: opsguru.io/addons
      effect: NoSchedule
EOF
What are we looking at?

- Any pending pod that matches the label managed-by: karpenter, tolerates our opsguru.io/addons taint, and can fit into a m5.large or m5.2xlarge node will have a node provisioned for it, if needed
- The Kubernetes labels managed-by: karpenter and purpose: addons will be added to the nodes
- The EC2 tag managed-by: karpenter will be applied to the nodes

Our spot instances are for any other workloads; we will not taint them, and we will use c5 instances. Any workloads that can't fit on our initial cluster node (the one created with Terraform) and do not tolerate the opsguru.io/addons taint from the on-demand group should be scheduled onto these nodes.
cat <<EOF > node_group_general_spot.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-general
spec:
  requirements:
    - key: node.kubernetes.io/instance-type # If not included, all instance types are considered
      operator: In
      values: ["c5.large", "c5.2xlarge"]
    - key: "topology.kubernetes.io/zone" # If not included, all zones are considered
      operator: In
      values: ["${AWS_DEFAULT_REGION}a", "${AWS_DEFAULT_REGION}b"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
  labels: # Kubernetes labels
    managed-by: karpenter
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
    tags: # AWS EC2 Tags
      managed-by: karpenter
  ttlSecondsAfterEmpty: 30 # If a node is empty of non daemonset pods for this ttl, it is removed
EOF
This one is quite similar to the first Provisioner, but we're using spot instances instead of on-demand, c5 type nodes, and no taint.
Now that we have our provisioners defined, let's install Karpenter using Helm.
helm repo add karpenter https://charts.karpenter.sh
helm repo update
helm install karpenter \
-n karpenter \
--create-namespace \
--version 0.5.2 \
--set serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn=$(terraform output -raw iam_role_arn) \
--set controller.clusterName=${CLUSTER_NAME} \
--set controller.clusterEndpoint=$(terraform output -raw cluster_endpoint) \
--wait \
karpenter/karpenter
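Before moving on, it's worth a quick sanity check that the controller came up (the karpenter=controller label is the same one we'll use to tail the logs later):

kubectl get pods -n karpenter -l karpenter=controller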
Ok! We've got almost everything we need to see this working; we're just missing one little thing: actual workloads :D
You can apply the workloads folder from our git repository; we have two manifests there:

- addon: a deployment with a nodeSelector of purpose: addons, tolerating the taint defined in the Provisioner, with 1 replica
- general-workloads: a deployment with a nodeSelector of managed-by: karpenter, with 20 replicas

kubectl get pods -o=custom-columns="NAME:.metadata.name,STATUS:.status.conditions[*].reason,MESSAGE:.status.conditions[*].message,NODE:.spec.nodeName"
NAME STATUS MESSAGE NODE
addon-7fc784b5d-fg2dx Unschedulable 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. <none>
general-workloads-5df49fcb-2hhqg Unschedulable 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. <none>
general-workloads-5df49fcb-4mlqt Unschedulable 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. <none>
general-workloads-5df49fcb-4zx4v Unschedulable 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. <none>
general-workloads-5df49fcb-5788h Unschedulable 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. <none>
general-workloads-5df49fcb-7b76r Unschedulable 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. <none>
...
With these deployed you can see that all 21 pods are Pending, that their status says they're Unschedulable because the one node we have in the cluster does not match their constraints (nodeSelector), and that they have no node assigned.
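To make those constraints concrete, the scheduling-relevant part of the addon deployment's pod template looks roughly like this (a sketch; the container name is illustrative, and the exact manifests are in the workloads folder):

    spec:
      nodeSelector:
        purpose: addons
      tolerations:
        - key: opsguru.io/addons
          operator: Exists
          effect: NoSchedule
      containers:
        - name: addon # illustrative name
          resources:
            requests:
              cpu: "1"      # the 1 core mentioned later in this post
              memory: 100Mi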
Let's check the status of our nodes:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-47.ca-central-1.compute.internal Ready <none> 46h v1.21.5-eks-bc4871b
kubectl describe node ip-10-0-1-47.ca-central-1.compute.internal
Name: ip-10-0-1-47.ca-central-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5.large
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=ca-central-1
failure-domain.beta.kubernetes.io/zone=ca-central-1a
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-1-47.ca-central-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=m5.large
topology.kubernetes.io/region=ca-central-1
topology.kubernetes.io/zone=ca-central-1a
...
The existing node indeed doesn't have the labels we're trying to use for our nodeSelector in any of our workloads.
Now let's deploy our first Provisioner, addons-ondemand.
kubectl apply -f node_group_addons.yaml
provisioner.karpenter.sh/addons-ondemand created
If you're following the Karpenter controller logs, you will see a node being provisioned and the pod bound to it immediately.
kubectl logs -n karpenter -l karpenter=controller -f
2021-12-17T18:49:33.800Z INFO controller.provisioning Batched 1 pods in 1.000321584s {"commit": "870e2f6", "provisioner": "addons-ondemand"}
2021-12-17T18:49:33.804Z INFO controller.provisioning Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [m5.large m5.2xlarge] {"commit": "870e2f6", "provisioner": "addons-ondemand"}
2021-12-17T18:49:36.061Z INFO controller.provisioning Launched instance: i-03ffbc75bd75a68e7, hostname: ip-10-0-1-114.ca-central-1.compute.internal, type: m5.large, zone: ca-central-1a, capacityType: on-demand {"commit": "870e2f6", "provisioner": "addons-ondemand"}
2021-12-17T18:49:36.098Z INFO controller.provisioning Bound 1 pod(s) to node ip-10-0-1-114.ca-central-1.compute.internal {"commit": "870e2f6", "provisioner": "addons-ondemand"}
If you check our pods again, you will see that the addon pod is scheduled to a node.
kubectl get pods -o=custom-columns="NAME:.metadata.name,STATUS:.status.conditions[*].reason,MESSAGE:.status.conditions[*].message,NODE:.spec.nodeName"
NAME STATUS MESSAGE NODE
addon-7fc784b5d-fg2dx <none> <none> ip-10-0-1-114.ca-central-1.compute.internal
general-workloads-5df49fcb-2hhqg Unschedulable 0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {opsguru.io/addons: }, that the pod didn't tolerate. <none>
general-workloads-5df49fcb-4mlqt Unschedulable 0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {opsguru.io/addons: }, that the pod didn't tolerate. <none>
general-workloads-5df49fcb-4zx4v Unschedulable 0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {opsguru.io/addons: }, that the pod didn't tolerate. <none>
...
You will also notice that our general workloads are still Unschedulable, but the message now says that 2 nodes don't match: one doesn't match the selector, and the other has a taint the workload doesn't tolerate.
Let's see our nodes now.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-114.ca-central-1.compute.internal Ready <none> 11m v1.21.5-eks-bc4871b
ip-10-0-1-47.ca-central-1.compute.internal Ready <none> 46h v1.21.5-eks-bc4871b
There is our new node! Let's see what Karpenter got us.
kubectl describe node ip-10-0-1-114.ca-central-1.compute.internal
Name: ip-10-0-1-114.ca-central-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5.large
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=ca-central-1
failure-domain.beta.kubernetes.io/zone=ca-central-1a
karpenter.sh/capacity-type=on-demand
karpenter.sh/provisioner-name=addons-ondemand
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-1-114.ca-central-1.compute.internal
kubernetes.io/os=linux
managed-by=karpenter
node.kubernetes.io/instance-type=m5.large
purpose=addons
topology.kubernetes.io/region=ca-central-1
topology.kubernetes.io/zone=ca-central-1a
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 17 Dec 2021 11:49:36 -0700
Taints: opsguru.io/addons:NoSchedule
...
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7934464Ki
pods: 29
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 1930m
ephemeral-storage: 18242267924
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7244288Ki
pods: 29
...
Our addon requires 1 core and 100MB of memory; it has a nodeSelector pointing to the label purpose with value addons, and tolerates the opsguru.io/addons taint.
Our Provisioner addons-ondemand matches all these conditions, and among its instance type options we have m5.large, which can fit our pod (you can see that the node has 1930m allocatable, and our pod needs 1000m). Since the request matches a Provisioner's settings, we got a node for the workload.
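You can confirm those requests straight from the API; your pod name will differ from ours:

kubectl get pod addon-7fc784b5d-fg2dx -o jsonpath='{.spec.containers[*].resources.requests}'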
What about our other pods? Well, let's get their Provisioner up!
kubectl apply -f node_group_general_spot.yaml
provisioner.karpenter.sh/spot-general created
Once we apply the Provisioner you will see in Karpenter's logs:
2021-12-17T21:53:22.009Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:34.896Z INFO controller.provisioning Batched 20 pods in 1.410203663s {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:34.906Z INFO controller.provisioning Computed packing of 3 node(s) for 20 pod(s) with instance type option(s) [c5.2xlarge] {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.533Z INFO controller.provisioning Launched instance: i-082db60871ae40c9d, hostname: ip-10-0-1-162.ca-central-1.compute.internal, type: c5.2xlarge, zone: ca-central-1a, capacityType: spot {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.533Z INFO controller.provisioning Launched instance: i-03d7f3f1d4bffdea4, hostname: ip-10-0-2-46.ca-central-1.compute.internal, type: c5.2xlarge, zone: ca-central-1b, capacityType: spot {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.533Z INFO controller.provisioning Launched instance: i-09dc16d84a292604c, hostname: ip-10-0-2-169.ca-central-1.compute.internal, type: c5.2xlarge, zone: ca-central-1b, capacityType: spot {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.591Z INFO controller.provisioning Bound 7 pod(s) to node ip-10-0-1-162.ca-central-1.compute.internal {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.666Z INFO controller.provisioning Bound 7 pod(s) to node ip-10-0-2-46.ca-central-1.compute.internal {"commit": "870e2f6", "provisioner": "spot-general"}
2021-12-17T21:53:38.830Z INFO controller.provisioning Bound 6 pod(s) to node ip-10-0-2-169.ca-central-1.compute.internal {"commit": "870e2f6", "provisioner": "spot-general"}
Our 20 pods were split across 3 nodes; we can confirm that they are all scheduled by re-running our previous command to check their status:
kubectl get pods -o=custom-columns="NAME:.metadata.name,STATUS:.status.conditions[*].reason,MESSAGE:.status.conditions[*].message,NODE:.spec.nodeName"
NAME STATUS MESSAGE NODE
addon-7fc784b5d-fg2dx <none> <none> ip-10-0-1-114.ca-central-1.compute.internal
general-workloads-5df49fcb-7f2mf <none> <none> ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-7rls5 <none> <none> ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-9qs99 <none> <none> ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-bqnvc <none> <none> ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-d775z <none> <none> ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-g5kdd <none> <none> ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-gxkn9 <none> <none> ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-jhq85 <none> <none> ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-jvnhl <none> <none> ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-nfhq5 <none> <none> ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-qpkdb <none> <none> ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-scmdp <none> <none> ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-tgtct <none> <none> ip-10-0-1-162.ca-central-1.compute.internal
general-workloads-5df49fcb-ts4pt <none> <none> ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-v6cql <none> <none> ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-wqhtl <none> <none> ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-xpw52 <none> <none> ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-xzgkq <none> <none> ip-10-0-2-169.ca-central-1.compute.internal
general-workloads-5df49fcb-z47dd <none> <none> ip-10-0-2-46.ca-central-1.compute.internal
general-workloads-5df49fcb-zpd6s <none> <none> ip-10-0-2-46.ca-central-1.compute.internal
We should now have 5 nodes: 1 original node from Terraform, 1 from our addons-ondemand Provisioner, and 3 from the spot-general Provisioner.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-114.ca-central-1.compute.internal Ready <none> 3h7m v1.21.5-eks-bc4871b
ip-10-0-1-162.ca-central-1.compute.internal Ready <none> 3m57s v1.21.5-eks-bc4871b
ip-10-0-1-47.ca-central-1.compute.internal Ready <none> 2d1h v1.21.5-eks-bc4871b
ip-10-0-2-169.ca-central-1.compute.internal Ready <none> 3m57s v1.21.5-eks-bc4871b
ip-10-0-2-46.ca-central-1.compute.internal Ready <none> 3m57s v1.21.5-eks-bc4871b
Let's dig a bit into our new nodes now. Which instance types do we have?
kubectl get nodes -l karpenter.sh/provisioner-name=spot-general -o jsonpath='{.items[*].metadata.labels.node\.kubernetes\.io\/instance-type}'
c5.2xlarge c5.2xlarge c5.2xlarge
Our general-workloads deployment pods differ from the addon deployment only in their nodeSelector and the lack of a toleration for the opsguru.io/addons taint. Their nodeSelector label is set to managed-by: karpenter, which also matches the addons-ondemand Provisioner, but without the toleration for its taint they can only match the new Provisioner.
With the Provisioner matched, Karpenter now needs to decide which instance type to use between c5.large and c5.2xlarge. A c5.large has 2 vCPUs and 4GB of memory, so it should only be able to take one of our pods (2 vCPUs give us ~1900m allocatable, and we need 1000m per pod); that would require one instance per pod, which wastes quite a lot of resources (almost half of each instance would sit unused).
A c5.2xlarge, on the other hand, has 8 vCPUs and 16GB of memory, which should fit 7 of our pods per instance (8 vCPUs give us ~7900m allocatable). This matches what we're seeing: 3 nodes, with 7 pods on one instance, 7 on another, and 6 on the last, 20 pods scheduled in the best way our Provisioner allows.
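If you want to double-check the allocatable CPU behind this math, the nodes report it directly:

kubectl get nodes -l karpenter.sh/provisioner-name=spot-general -o jsonpath='{.items[*].status.allocatable.cpu}'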
Thanks for coming to our TED TALK! errr, quick review of Karpenter.
Now let's clean up and see one more feature of Karpenter.
In both our Provisioners we have the setting ttlSecondsAfterEmpty: 30, which means that if a node has no pods (other than DaemonSet pods) for more than 30 seconds, it will be terminated.
We won't take their word for it, let's check it!
Let's delete our deployments:
kubectl delete deployment general-workloads addon
deployment.apps "general-workloads" deleted
deployment.apps "addon" deleted
In Karpenter's logs we can see the nodes getting a TTL and then being cordoned, drained and terminated.
2021-12-17T22:29:23.877Z INFO controller.node Added TTL to empty node {"commit": "870e2f6", "node": "ip-10-0-1-162.ca-central-1.compute.internal"}
2021-12-17T22:29:23.932Z INFO controller.node Added TTL to empty node {"commit": "870e2f6", "node": "ip-10-0-2-46.ca-central-1.compute.internal"}
2021-12-17T22:29:24.031Z INFO controller.node Added TTL to empty node {"commit": "870e2f6", "node": "ip-10-0-2-169.ca-central-1.compute.internal"}
2021-12-17T22:29:24.239Z INFO controller.node Added TTL to empty node {"commit": "870e2f6", "node": "ip-10-0-1-114.ca-central-1.compute.internal"}
2021-12-17T22:29:53.889Z INFO controller.node Triggering termination after 30s for empty node {"commit": "870e2f6", "node": "ip-10-0-1-162.ca-central-1.compute.internal"}
2021-12-17T22:29:53.915Z INFO controller.termination Cordoned node {"commit": "870e2f6", "node": "ip-10-0-1-162.ca-central-1.compute.internal"}
2021-12-17T22:29:53.948Z INFO controller.node Triggering termination after 30s for empty node {"commit": "870e2f6", "node": "ip-10-0-2-46.ca-central-1.compute.internal"}
2021-12-17T22:29:53.970Z INFO controller.termination Cordoned node {"commit": "870e2f6", "node": "ip-10-0-2-46.ca-central-1.compute.internal"}
2021-12-17T22:29:54.042Z INFO controller.node Triggering termination after 30s for empty node {"commit": "870e2f6", "node": "ip-10-0-2-169.ca-central-1.compute.internal"}
2021-12-17T22:29:54.068Z INFO controller.termination Cordoned node {"commit": "870e2f6", "node": "ip-10-0-2-169.ca-central-1.compute.internal"}
2021-12-17T22:29:54.070Z INFO controller.termination Deleted node {"commit": "870e2f6", "node": "ip-10-0-1-162.ca-central-1.compute.internal"}
2021-12-17T22:29:54.147Z INFO controller.termination Deleted node {"commit": "870e2f6", "node": "ip-10-0-2-46.ca-central-1.compute.internal"}
2021-12-17T22:29:54.247Z INFO controller.termination Deleted node {"commit": "870e2f6", "node": "ip-10-0-2-169.ca-central-1.compute.internal"}
2021-12-17T22:29:54.261Z INFO controller.node Triggering termination after 30s for empty node {"commit": "870e2f6", "node": "ip-10-0-1-114.ca-central-1.compute.internal"}
2021-12-17T22:29:54.290Z INFO controller.termination Cordoned node {"commit": "870e2f6", "node": "ip-10-0-1-114.ca-central-1.compute.internal"}
2021-12-17T22:29:54.425Z INFO controller.termination Deleted node {"commit": "870e2f6", "node": "ip-10-0-1-114.ca-central-1.compute.internal"}
Without workloads and nodes, we are left with our initial cluster, which Terraform will gladly destroy.
terraform destroy -var cluster_name=${CLUSTER_NAME} -var region=${AWS_DEFAULT_REGION}
This same demo with the Cluster Autoscaler would be marginally slower (a couple of minutes' difference, which depending on your workloads might or might not be crucial), but at a larger scale (think several services with hundreds of pods each) this speed difference by itself becomes a major advantage.
Depending on how you manage tenancy in your clusters, you could even deploy the Provisioner as part of your application through a Helm chart, or just have an easier time managing node groups in general.
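As a sketch of that idea (names and values are illustrative, not from our repo), an application chart could ship its own Provisioner alongside the app:

# templates/provisioner.yaml inside a hypothetical application chart
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: {{ .Release.Name }}-nodes
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  labels:
    app: {{ .Release.Name }}
  provider:
    instanceProfile: {{ .Values.karpenter.instanceProfile }} # assumed chart value
  ttlSecondsAfterEmpty: 30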
Karpenter still doesn't have a mechanism for removing underutilized nodes when their workloads can fit elsewhere, a feature present in the Cluster Autoscaler. This could possibly be handled by the Descheduler, but that can be a whole other blog post :)
The Cluster Autoscaler has been around for a good while and is beyond battle-tested, while Karpenter is relatively new and might be rough around the edges.
Karpenter only works on AWS right now, though it is designed so that other cloud providers can be supported.
Karpenter is extremely promising, and its pros will outweigh the cons in most cases. It is not an all-or-nothing solution either: you can run it in parallel with the Cluster Autoscaler and have the best of both worlds.
There is a lot we didn't cover here about Karpenter; take a look at our Related Links section at the bottom for some documentation and videos on it.
We are looking forward to seeing how this tool develops!
Fernando Battistella, Principal Architect at OpsGuru - Fernando has over two decades of experience in IT, with the last six years architecting cloud-native solutions for companies of all sizes. Specialized in Kubernetes and the Cloud Native ecosystem, he has helped multiple organizations design, build, migrate, operate and train their teams in cloud-native technologies and platforms.