I would appreciate any solutions or recommendations for this issue.
Context
I am currently using Terraform to set up my EKS cluster. At the place I work, we noticed that pods in the EKS managed node groups were over-utilising resources and stalling the kubelet and other system-critical processes. The solution was to set the max_pods, kube-reserved and system-reserved params on the EKS nodes so that these reservations are respected before pods are allocated.
Path to the Solution
The way to do this is to run the bootstrap.sh script on the EKS node at startup to override the defaults:
/etc/eks/bootstrap.sh ${cluster_name} \
  --kubelet-extra-args "--max-pods=40 --system-reserved cpu=500m,memory=500Mi,ephemeral-storage=1Gi --kube-reserved cpu=1000m,memory=1024Mi,ephemeral-storage=3Gi"
As I am doing this in Terraform, I need a way to inject this shell script into the EKS launch template, which will pass it through to the EC2 user data that executes at EKS node startup. I have also tested this bash script by running it manually on an EKS node and restarting the kubelet with systemctl restart kubelet. It worked without a problem: the node reacted as expected, limiting pods and allocatable resources.
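The manual check described above can be made repeatable with a small helper that tests whether a kubelet command line carries an expected flag. This is only a sketch: the function name kubelet_has_flag is hypothetical, and it assumes the flags appear in the kubelet process arguments, as they do when passed through --kubelet-extra-args.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the command line contains the flag
# as a whole argument (so --max-pods=4 does not match --max-pods=40).
kubelet_has_flag() {
  local cmdline="$1" flag="$2"
  # Pad both sides with spaces so only complete arguments match.
  printf ' %s ' "$cmdline" | grep -qF -- " $flag "
}

# On a node, the command line could be captured with something like:
#   cmdline="$(ps -o args= -C kubelet)"
#   kubelet_has_flag "$cmdline" "--max-pods=40" && echo "max-pods applied"
```

The same helper can be pointed at any of the reservation values, for example "cpu=500m,memory=500Mi,ephemeral-storage=1Gi".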
Problem
My problem is that my current Terraform script and template file are not making it into the launch template user data, and hence not into the EC2 user data, so the settings are not applied to the EKS managed nodes at launch.
A few ways I tried and confirmed this:
- Inspecting the launch template created by EKS for the node groups showed no indication of my script.
- Checking the cloud-init logs on the EKS node and looking for my commands; they were nowhere to be seen:
vi /var/log/cloud-init.log
vi /var/log/cloud-init-output.log
- Checking the kubelet process, its config file and the node's allocatable resources, which clearly showed that my reservations didn't take effect:
ps aux | grep kubelet
vi /etc/kubernetes/kubelet/kubelet-config.json
k get nodes node_name -o json | grep -i allocatable -A 15
"allocatable": {
"cpu": "7910m", # should have been 6500m, because the CPU reservation was 500m + 1000m
"ephemeral-storage": "95551679124",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "31295956Ki",
"pods": "58"
},
"capacity": {
"cpu": "8",
"ephemeral-storage": "104845292Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "32312788Ki",
"pods": "58"
},
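For reference, the expected CPU figure comes from subtracting both reservations from the node's capacity. A minimal sketch of that arithmetic, using a hypothetical helper name and the values from this post:

```shell
#!/usr/bin/env bash
# Hypothetical helper: expected allocatable CPU in millicores.
# Arguments: capacity in whole cores, then system-reserved and
# kube-reserved in millicores.
expected_allocatable_cpu() {
  local capacity_cores="$1" system_reserved_m="$2" kube_reserved_m="$3"
  echo $(( capacity_cores * 1000 - system_reserved_m - kube_reserved_m ))
}

# Values from the node above: 8 cores, 500m system-reserved, 1000m kube-reserved.
expected_allocatable_cpu 8 500 1000   # prints 6500
```

The node reports 7910m instead, so neither reservation from my script is being subtracted.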
Code
Terraform to set up EKS
Only keeping the bits I think can help in debugging.
module "eks" {
  # ... (other settings omitted)

  # EKS Managed Node Group(s)
  eks_managed_node_group_defaults = {
    instance_types = ["m6i.large", "m5.large", "m5n.large", "m5zn.large"]
    disk_size      = 100
  }

  eks_managed_node_groups = {
    ds-eks-ng1 = {
      min_size     = 2
      max_size     = 12
      desired_size = 2

      instance_types = ["m6i.2xlarge", "m7i.2xlarge", "m6a.2xlarge", "c6i.2xlarge"]
      capacity_type  = "SPOT"

      use_custom_launch_template = false
      create_launch_template     = false
      disk_size                  = 100

      labels = {
        "node-managed-by" = "eks-ng1"
      }

      iam_role_additional_policies = {
        AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
      }

      ami_id                     = "ami-xxxxxxxxxx" # hidden
      enable_bootstrap_user_data = true

      pre_bootstrap_user_data = templatefile("templates/bootstrap.sh.tpl", {
        cluster_name = local.cluster_name
      })
      post_bootstrap_user_data = templatefile("templates/bootstrap.sh.tpl", {
        cluster_name = local.cluster_name
      })
    }
  }

  tags = {
    "karpenter.sh/discovery" = local.cluster_name
  }
}
Bootstrap script
#!/bin/bash
set -xe
/etc/eks/bootstrap.sh ${cluster_name} \
  --kubelet-extra-args "--max-pods=40 --system-reserved cpu=500m,memory=500Mi,ephemeral-storage=1Gi --kube-reserved cpu=1000m,memory=1024Mi,ephemeral-storage=3Gi"
My attempt was to inject my bootstrap script into the launch template user data, but it did not work with my approach.
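Launch template user data is stored base64-encoded, so a quick way to check whether the script made it in is to decode it and search for a known string. The sketch below assumes GNU or BSD base64 is available; the function name user_data_mentions and the launch template ID placeholder are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the base64-encoded user data,
# once decoded, contains the given literal string.
user_data_mentions() {
  local encoded="$1" needle="$2"
  printf '%s' "$encoded" | base64 --decode | grep -qF -- "$needle"
}

# The encoded user data could be fetched with the AWS CLI, for example:
#   encoded="$(aws ec2 describe-launch-template-versions \
#     --launch-template-id <launch-template-id> --versions '$Latest' \
#     --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' \
#     --output text)"
#   user_data_mentions "$encoded" "--max-pods=40" || echo "script not in user data"
```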
References I used while trying to put together a solution (no luck yet):
https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1770#issuecomment-1227047342
https://github.com/terraform-aws-modules/terraform-aws-eks/blob/16f46db94b7158fd762d9133119206aaa7cf6d63/modules/eks-managed-node-group/variables.tf#L53
https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2059