Writing a Kubernetes Scheduler Plugin

Posted at — Sep 15, 2023

This post isn’t a tutorial on how to write a Kubernetes scheduler; it’s a reminder to myself that software isn’t magic and sometimes projects that sound really challenging are achievable with some persistence. I’m proud of this project, even though I eventually chose a different approach to the problem I was trying to address, and I hope that it’ll serve as a useful experience report for anyone thinking of building a scheduler plugin (or for the sig-scheduling maintainers). This post assumes some familiarity with Kubernetes pods, nodes, and the concept of scheduling.

Background

I previously worked on the Tekton Pipelines project, a CI/CD platform built on Kubernetes. Tekton users build CI/CD Pipelines composed of a graph of Tasks, and run these Pipelines via PipelineRuns. Under the hood, Tekton orchestrates PipelineRun execution by creating Kubernetes pods when each TaskRun in the PipelineRun is ready to execute. TaskRun outputs are written to Kubernetes PersistentVolumes for storage, and fed as inputs into subsequent TaskRuns.

Unfortunately, Tekton’s artifact storage was a leaky abstraction. Tekton delegated scheduling of TaskRun pods to the Kubernetes scheduler, meaning they could be assigned to any node, even if the TaskRuns needed to share artifacts via PersistentVolumeClaims. PersistentVolumes typically don’t support concurrent read/write access from multiple nodes. If TaskRuns that shared storage happened to be assigned to different nodes, or a TaskRun needed to access multiple PersistentVolumes that happened to be on separate nodes, users would experience frustrating issues such as unschedulable pods or TaskRuns running sequentially when they could have run in parallel. Our design proposal has additional technical details on this problem.

While the Kubernetes API provides several ways of controlling scheduling, there’s no supported way as of writing to forcibly run a group of pods on the same node. There’s an existing “coscheduling” scheduler plugin that aims to meet this need, so I decided to experiment with a similar plugin to see if it could be adapted to my use case.

How my plugin works

In a nutshell, the Kubernetes scheduler works as follows:

Pods ready for scheduling are added to a queue.
During a “scheduling cycle”, the scheduler pops a pod off the queue and selects a node for it.
During a “binding cycle”, the pod is assigned to that node.
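The three steps above can be sketched as a small loop. This is a toy model, not the real scheduler: the Pod and Node types, the Capacity field, and the "first node with room" rule are all simplifications I've assumed for illustration. The real scheduler filters and scores nodes through its plugins and runs binding asynchronously.

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the Kubernetes API types.
type Pod struct {
	Name     string
	NodeName string // empty until the pod is bound to a node
}

type Node struct {
	Name     string
	Capacity int // hypothetical: number of additional pods this node can hold
}

// selectNode models a scheduling cycle by picking the first node with room.
func selectNode(nodes []*Node) (*Node, bool) {
	for _, n := range nodes {
		if n.Capacity > 0 {
			return n, true
		}
	}
	return nil, false
}

// bind models a binding cycle by recording the decision on the pod.
func bind(p *Pod, n *Node) {
	p.NodeName = n.Name
	n.Capacity--
}

// scheduleAll pops pods off the queue in order, selects a node for each,
// and binds it. Pods that fit nowhere are returned as unschedulable.
func scheduleAll(queue []*Pod, nodes []*Node) []*Pod {
	var unschedulable []*Pod
	for len(queue) > 0 {
		pod := queue[0]
		queue = queue[1:]
		node, ok := selectNode(nodes)
		if !ok {
			unschedulable = append(unschedulable, pod)
			continue
		}
		bind(pod, node)
	}
	return unschedulable
}

func main() {
	nodes := []*Node{{Name: "node-a", Capacity: 1}, {Name: "node-b", Capacity: 1}}
	queue := []*Pod{{Name: "pod-1"}, {Name: "pod-2"}, {Name: "pod-3"}}
	pending := scheduleAll(queue, nodes)
	for _, p := range queue {
		fmt.Printf("%s -> %q\n", p.Name, p.NodeName)
	}
	fmt.Println("unschedulable:", len(pending))
}
```

With two single-slot nodes and three pods, the third pod is left unbound, which is roughly what an "unschedulable pod" looks like from the scheduler's point of view.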

Kubernetes has a scheduling framework that allows plugins to register at multiple points during the scheduling process. My plugin registered two extension points:

I first used a “PreFilter” extension point, which allows a scheduler to pre-process pod info at the beginning of a scheduling cycle and return a set of “candidate” nodes for filtering. When a new pod was ready for scheduling, my PreFilter plugin determined whether any node was already running pods associated with the same Tekton PipelineRun, and if so, returned that node as the only valid candidate.
Next, my scheduler used a “Filter” extension point during the scheduling cycle to determine which of the candidate nodes were suitable for running the pod. My Filter plugin optionally filtered out any nodes already running pods for other Tekton PipelineRuns, depending on its configuration.
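To make the two extension points concrete, here is a minimal sketch of the decision logic only. It is not my plugin's actual code and it does not use the scheduler framework's real interfaces, whose PreFilter and Filter methods have different signatures and return framework-specific result and status types. The Pod and Node types, the function names, and the isolate flag are hypothetical, and I'm assuming pods can be grouped by a tekton.dev/pipelineRun label.

```go
package main

import "fmt"

// pipelineRunLabel is assumed to identify which PipelineRun a pod belongs to.
const pipelineRunLabel = "tekton.dev/pipelineRun"

// Pod and Node are simplified stand-ins for the Kubernetes API types.
type Pod struct {
	Name   string
	Labels map[string]string
}

type Node struct {
	Name string
	Pods []Pod // pods already running on this node
}

// preFilter returns the candidate node names for pod. If a node already runs
// a pod from the same PipelineRun, that node is the only candidate. A nil
// result means "no restriction": every node remains a candidate.
func preFilter(pod Pod, nodes []Node) []string {
	run := pod.Labels[pipelineRunLabel]
	if run == "" {
		return nil
	}
	for _, n := range nodes {
		for _, p := range n.Pods {
			if p.Labels[pipelineRunLabel] == run {
				return []string{n.Name}
			}
		}
	}
	return nil
}

// filter reports whether node may run pod. When isolate is true, nodes that
// already run pods from a different PipelineRun are rejected.
func filter(pod Pod, node Node, isolate bool) bool {
	if !isolate {
		return true
	}
	run := pod.Labels[pipelineRunLabel]
	for _, p := range node.Pods {
		other := p.Labels[pipelineRunLabel]
		if other != "" && other != run {
			return false
		}
	}
	return true
}

func main() {
	nodes := []Node{
		{Name: "node-a", Pods: []Pod{{Name: "fetch", Labels: map[string]string{pipelineRunLabel: "build-1"}}}},
		{Name: "node-b", Pods: []Pod{{Name: "lint", Labels: map[string]string{pipelineRunLabel: "build-2"}}}},
		{Name: "node-c"},
	}

	// A second pod from build-1 is pinned to the node already running build-1.
	next := Pod{Name: "compile", Labels: map[string]string{pipelineRunLabel: "build-1"}}
	fmt.Println("candidates for compile:", preFilter(next, nodes))

	// The first pod of a new PipelineRun has no restriction from preFilter,
	// so filter decides which nodes are acceptable.
	fresh := Pod{Name: "fetch-2", Labels: map[string]string{pipelineRunLabel: "build-3"}}
	for _, n := range nodes {
		fmt.Println(n.Name, "accepts fetch-2:", filter(fresh, n, true))
	}
}
```

In this sketch the isolate flag plays the role of the plugin's configuration option: with it off, PipelineRuns may share nodes, and only the "stay with your own PipelineRun" rule from preFilter applies.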

This worked and was surprisingly simple, at fewer than 200 lines of code. The examples in the Kubernetes “scheduler-plugins” repo were very helpful.

Deploying the plugin on GKE

On local clusters, scheduler plugins can replace the default scheduler, or run in parallel with it as a second scheduler. However, I wanted to deploy my plugin on GKE, which doesn’t support replacing the default scheduler, so I had to use multiple schedulers. In addition to the official documentation on running multiple schedulers, several blog posts provided some starting examples of scheduler plugin configuration.

I created a deployment to run my second scheduler, and mounted its configuration via a configmap. Scheduler plugin configuration is poorly documented, so while the blog posts I found served as helpful starting examples, I wasn’t able to use them exactly as written. I found the following debugging strategies useful in creating a working deployment:

Inspecting events: Events are useful for ensuring that a given pod is handled by the correct scheduler. You can use kubectl get events --field-selector involvedObject.name=$POD_NAME or kubectl describe pod $POD_NAME to get events associated with a specific pod. For example, the following events are associated with a pod handled by the GKE default scheduler:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  52m   default-scheduler  Successfully assigned default/catalog-publish-trigger-tekton-upstream-28250970-tghlt to gke-dogfooding-default-pool-f62aa79c-94oa

Ensuring the correct plugins are installed: On startup, the scheduler logs the name of the profile it’s using, and the installed plugins. In the following example, the logs indicate that the scheduler name is “one-node-per-pipelineRun”, and in addition to all the plugins that are installed by default, there is a plugin installed called “OneNodePerPipelineRun”. (In this context, “multi-point” just means that the plugin may use multiple extension points in the scheduler framework.)

I0302 19:24:21.146841       1 configfile.go:105] "Using component config" config=<
        apiVersion: kubescheduler.config.k8s.io/v1
        ... # Truncated
        profiles:
        - pluginConfig:
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              kind: DefaultPreemptionArgs
              minCandidateNodesAbsolute: 100
              minCandidateNodesPercentage: 10
            name: DefaultPreemption
          ... # Truncated
          plugins:
            multiPoint:
              enabled:
              - name: PrioritySort
                weight: 0
                ... # Truncated
              - name: DefaultBinder
                weight: 0
              - name: OneNodePerPipelineRun
                weight: 0
            ... # Truncated
          schedulerName: one-node-per-pipelineRun

The scheduler’s logs tended to lead me down the wrong path at least as often as they led me down the right one, so they were of limited use as a debugging tool.

The difficulty of debugging, and the poor tools and documentation available for doing so, were among the major factors that led me to decide not to use this plugin.

I didn’t end up using it

This scheduler worked well as a prototype, but I eventually decided that the scheduler framework didn’t feel production-ready enough to use in, well… production, for several reasons:

As mentioned previously, it was very difficult to debug, with few helpful tools and poor documentation.
I found few, if any, examples of scheduler plugins used in production systems. None of the plugins in the Kubernetes “scheduler-plugins” repo were described as “stable” or used for “production workloads”, and the blog posts I found about the framework were mainly tutorials.
The scheduler framework package is part of the k8s.io/kubernetes module, which maintainers explicitly discourage using as a dependency.
Because my plugin was built with a specific version of Kubernetes, I wasn’t sure whether my plugin would work with other versions as well, or in clusters where nodes happened to be running different Kubernetes versions. Anecdotally, matching the Kubernetes version used in my plugin with the Kubernetes version running on my cluster resolved some hard-to-debug issues that crashed my scheduler.
Running multiple schedulers meant there were two separate scheduling queues, making it hard to predict which pods would be assigned to a node. For example, I tried using this plugin in an environment that used preemptible pods to trigger the cluster autoscaler. Since the preemptible pods used the default scheduler and the PipelineRun pods used my custom scheduler, the preemptible pods kept getting scheduled again on the newly created nodes, instead of the actual workloads that were supposed to preempt them!

In addition, a scheduler plugin would likely have been hard for Kubernetes cluster operators to install, and might need to be customized for different Kubernetes implementations.

What I did instead

Instead of using a scheduler plugin, we chose to build coscheduling logic directly into Tekton. When a PipelineRun is created, Tekton creates a balloon pod first: a placeholder, no-op pod intended to “anchor” the PipelineRun to the node. Tekton then adds inter-pod affinity to the other PipelineRun pods to ensure they are scheduled on the same node. The balloon pod is created via a statefulset, which is also responsible for managing the storage used by the PipelineRun. If you’re interested in the full design proposal, it’s available here.
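As a rough illustration of the inter-pod affinity piece, the sketch below builds the affinity stanza that a PipelineRun pod could carry so that it must land on the same node as its balloon pod. This is not Tekton's actual implementation. The struct types are hand-written stand-ins whose JSON field names mirror the Kubernetes pod affinity API, and the anchorLabel key and affinityForPipelineRun function are hypothetical names I've made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror the JSON shape of the pod affinity fields in a pod spec.
type LabelSelector struct {
	MatchLabels map[string]string `json:"matchLabels"`
}

type PodAffinityTerm struct {
	LabelSelector LabelSelector `json:"labelSelector"`
	TopologyKey   string        `json:"topologyKey"`
}

type PodAffinity struct {
	Required []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution"`
}

type Affinity struct {
	PodAffinity PodAffinity `json:"podAffinity"`
}

// anchorLabel is a hypothetical label assumed to be set on the balloon pod,
// with the PipelineRun name as its value.
const anchorLabel = "example.dev/anchor-for"

// affinityForPipelineRun returns an affinity that requires a pod to be placed
// in the same topology domain as the balloon pod for the named PipelineRun.
// Using the hostname topology key makes that domain a single node.
func affinityForPipelineRun(name string) Affinity {
	return Affinity{
		PodAffinity: PodAffinity{
			Required: []PodAffinityTerm{{
				LabelSelector: LabelSelector{
					MatchLabels: map[string]string{anchorLabel: name},
				},
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(affinityForPipelineRun("build-1"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The appeal of this approach is that it relies only on the default scheduler's built-in affinity handling, so there is no second scheduler to deploy or debug.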

Lee Bernick

Software Engineer, NYC