kube-scheduler
Identifies the right node to place a pod on, based on the pod’s resource requirements, the worker nodes’ capacity, and any other policies or constraints (taints, tolerations, node affinity rules).
Scheduler Phases
Scheduling Queue (queueSort extension point) - Pods are sorted by the priorityClassName in their pod spec, which references a “kind: PriorityClass” resource. Higher priority values go first.
“PrioritySort” plugin
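A minimal sketch of how priorityClassName connects to a PriorityClass (the names and value below are invented for illustration):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority            # hypothetical class name
value: 1000000                   # higher value = sorted earlier in the queue
globalDefault: false
description: "For critical workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app             # hypothetical pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx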
Filtering (preFilter, filter, postFilter): Nodes (not pods) that cannot run the pod are filtered out.
“NodeResourcesFit” plugin. “NodeName” plugin (checks whether a nodeName is set in the pod spec and filters to that node). “NodeUnschedulable” plugin filters out nodes with spec.unschedulable: true (e.g. set by kubectl cordon).
Scoring (preScore, score, reserve): Nodes are scored with different weights, based on the free resources that would remain after placing the pod. The highest score wins. “NodeResourcesFit” (same plugin as in filtering). “ImageLocality” gives higher scores to nodes where the pod’s image is already present, so pods tend to land on a node that already has the image; it does not prevent scheduling on a node without the image if that is the best remaining option.
Binding (permit, preBind, bind, postBind): The pod is bound to the node with the highest score. “DefaultBinder” plugin.
We can customize the scheduler by attaching custom plugins at these extension points.
A scheduler can have multiple scheduling profiles.
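A sketch of what multiple profiles in one scheduler binary can look like; the second profile name is invented, and disabling plugins per extension point follows the KubeSchedulerConfiguration API:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: no-scoring-scheduler    # hypothetical second profile
  plugins:
    preScore:
      disabled:
      - name: '*'                        # disable all preScore plugins in this profile
    score:
      disabled:
      - name: '*'                        # disable all score plugins in this profile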
The kube-scheduler talks directly to the API Server. It does not talk to the kubelets
You can write your own scheduler, even in bash
Important Notes
- Continuously monitors the API server and notices new pods with no node assigned. The scheduler identifies the right node and communicates that back to the API server. The API server updates the info in the etcd cluster and passes it to the kubelet on the chosen node.
- Only responsible for which pod goes on which node. It doesn’t actually place the pod on the node. That’s the job of the kubelet. Kubelet creates the pod on the node.
- Which node a pod is placed on depends on criteria such as resource requirements and node capacity.
- A node might be dedicated to certain applications.
To dedicate a node to specific pods you must combine taints/tolerations and node affinity: you scare other pods away (taints) and make sure your pods are attracted only to that node (affinity).
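A minimal sketch of that combination, assuming a node named node01 that we both taint and label with color=blue (the names and label values are illustrative):

kubectl taint nodes node01 color=blue:NoSchedule
kubectl label nodes node01 color=blue

apiVersion: v1
kind: Pod
metadata:
  name: blue-app                  # hypothetical pod
spec:
  containers:
  - name: app
    image: nginx
  tolerations:                    # lets the pod onto the tainted node
  - key: "color"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
  affinity:                       # keeps the pod off every other node
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: color
            operator: In
            values:
            - blue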
Taints and Tolerations
Taints only restrict which pods a node accepts. They do not tell a pod to go only to a specific node.
A taint does not guarantee that a tolerated pod won’t be scheduled on other nodes.
By default, a taint is set on the master (control plane) node that prevents pods from being scheduled on that node.
kubectl describe node kubemaster | grep Taint
Taint - bug spray that keeps bugs from landing on a person; the bug is intolerant to the smell.
Each bug has its own toleration level to that particular taint, and some bugs are more tolerant than others.
People are nodes, bugs are pods. Taints and tolerations are used to set restrictions on which pods can be scheduled on a node.
K8s tries to place pods on the available nodes; with no restrictions, the scheduler balances them equally across nodes. Now assume node01 has dedicated resources. Taint the node (call the taint blue). By default, pods have no tolerations, so any taint causes all pods to skip that node. In other words, the node is poison, and only certain pods can tolerate the poison.
kubectl taint nodes node-name key=value:taint-effect
Taint effects: NoSchedule, PreferNoSchedule (try to avoid, no guarantees), NoExecute (new pods will not be scheduled, and existing pods without a matching toleration are evicted)
kubectl taint nodes node1 app=blue:NoSchedule
Tolerations are added to pods.
spec:
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "blue"
    effect: "NoSchedule"
When a taint with effect NoExecute is applied, new pods without a matching toleration are not scheduled on that node, and existing pods without the toleration are evicted. Adding a toleration to a pod never evicts it from other nodes; it only allows the pod to be scheduled onto (or keep running on) the tainted node.
There are two special cases: an empty key with operator Exists matches all keys, values and effects, which means the pod tolerates everything; an empty effect matches all effects for the given key. Certain taints are built in and added by the system: node.kubernetes.io/not-ready (NodeCondition Ready is false), unreachable, memory-pressure, disk-pressure, pid-pressure, network-unavailable, unschedulable.
node.cloudprovider.kubernetes.io/uninitialized: When the kubelet is started with an "external" cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes the node, the kubelet removes this taint.
The DaemonSet controller automatically adds tolerations (e.g. for not-ready and unreachable) to DaemonSet pods so they are not evicted when a node has problems.
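A sketch of tolerating one of these built-in NoExecute taints for a limited time; tolerationSeconds controls how long the pod may stay on the node after the taint appears (600 is an arbitrary value):

spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 600        # evict this pod 10 minutes after the node becomes unreachable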
Node Selector
In a default setup, any pod can go to any node. Pod C could end up on node 2 or 3, which might not be desired. We can set limitations on pods so they only run on particular nodes. The first option is a nodeSelector (simple and easy):
spec:
nodeSelector:
size: Large
^ Checks for labels attached to nodes.
To label a node:
kubectl label nodes <node-name> key=value
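For example, labeling a node and using the label in a full pod manifest (the node name, pod name and image are placeholders):

kubectl label nodes node01 size=Large

apiVersion: v1
kind: Pod
metadata:
  name: data-processor            # hypothetical pod
spec:
  containers:
  - name: app
    image: nginx
  nodeSelector:
    size: Large                   # must exactly match the node's label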
For more complicated things like “Medium or Blue but not Small” use Node Affinity
Node Affinity
You cannot express advanced OR/NOT selections with nodeSelector; Node Affinity can. With great power comes great complexity.
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: size
operator: In
values:
- Large
Various operators exist, like “In”, “NotIn”, or “Exists” (the label exists at all, regardless of value). Check the docs for the full list of operators.
What happens if node affinity cannot match the pod to any node, or if a node is relabeled later, depends on the type of node affinity:
- requiredDuringSchedulingIgnoredDuringExecution: the pod is only scheduled on a matching node (otherwise it stays pending); label changes after scheduling are ignored.
- preferredDuringSchedulingIgnoredDuringExecution: the scheduler prefers a matching node but places the pod elsewhere if none matches; label changes after scheduling are ignored.
- Planned: requiredDuringSchedulingRequiredDuringExecution, which would also evict running pods from nodes that no longer match.
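A sketch of the preferred form, which attaches a weight (1-100) to each term:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: size
            operator: In
            values:
            - Large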
Resource Requests
K8s assumes a pod, or a container inside a pod, requires 0.5 CPU and 256Mi of memory.
These are defined as LimitRanges per namespace:
apiVersion: v1
kind: LimitRange
metadata:
name: mem-limit-range # or cpu-limit-range
spec:
limits:
- default:
memory: 512Mi
defaultRequest:
memory: 256Mi
type: Container
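A sketch of the corresponding CPU LimitRange; the numbers mirror the course defaults and are assumptions:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:              # limit applied to containers that set none
      cpu: "1"
    defaultRequest:       # request applied to containers that set none
      cpu: "0.5"
    type: Container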
Scheduler will look at resource requests to identify a node that has a sufficient amount of resources available.
CONTAINER spec:
resources:
  requests:
    memory: "1Gi"
    cpu: 1
  limits:
    memory: "2Gi"
    cpu: 2
Limits and Requests are set PER container in the pod spec!!
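For example, a pod with two containers sets resources on each container independently (pod, container and image names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        memory: "1Gi"
        cpu: 1
      limits:
        memory: "2Gi"
        cpu: 2
  - name: log-agent               # hypothetical sidecar container
    image: busybox
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "256Mi"
        cpu: "200m"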
0.1 CPU = 100m; the lowest is 1m (m = milli). 1 CPU = 1 vCPU (1 AWS vCPU, 1 GCP core, 1 Azure core, or 1 hyperthread)
Memory:
“G” gigabyte (1,000,000,000 bytes) / “Gi” gibibyte (1,073,741,824 bytes); “M” megabyte / “Mi” mebibyte; “K” kilobyte (1,000 bytes) / “Ki” kibibyte (1,024 bytes)
By default (when a LimitRange like the one above is configured), containers are limited to 1 vCPU and 512Mi of memory. If you don’t like the defaults, you can change the limits.
Kubernetes throttles the CPU so a container does not go over its CPU limit. Kubernetes does allow a container to use more memory than its limit, but if this happens constantly the pod will be terminated (OOMKilled).
Multiple Schedulers
Kubernetes is highly extensible. You can write your own scheduler program and deploy it as the default scheduler, or as an additional scheduler. Each scheduler must have its own name; the default is “default-scheduler”.
scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler
You can re-use the default kube-scheduler with a new config file, or build a new scheduler.
Pass the config file to the scheduler with the --config command-line option.
To deploy a scheduler as a pod, create a pod definition file. During scheduler configuration, you need to specify “leaderElection” for when you have multiple copies of the scheduler running on different controller/master nodes. If multiple copies are running on different nodes, only one can be active at a time.
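A sketch of a config for an additional scheduler with leader election enabled; the lock object name is an assumption, while the leaderElection fields follow the KubeSchedulerConfiguration API:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler
leaderElection:
  leaderElect: true
  resourceNamespace: kube-system
  resourceName: lock-object-my-scheduler   # hypothetical lock object name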
To use a new scheduler, simply add schedulerName: <scheduler-name> to the pod spec.
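For example, assuming the my-scheduler name from above:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  schedulerName: my-scheduler     # ask the custom scheduler to place this pod
  containers:
  - name: nginx
    image: nginx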