Lee Bernick

Software Engineer, NYC

Writing a Kubernetes Scheduler Plugin

Posted at — Sep 15, 2023

This post isn’t a tutorial on how to write a Kubernetes scheduler; it’s a reminder to myself that software isn’t magic and sometimes projects that sound really challenging are achievable with some persistence. I’m proud of this project, even though I eventually chose a different approach to the problem I was trying to address, and I hope that it’ll serve as a useful experience report for anyone thinking of building a scheduler plugin (or for the sig-scheduling maintainers). This post assumes some familiarity with Kubernetes pods, nodes, and the concept of scheduling.

Background

I previously worked on the Tekton Pipelines project, a CI/CD platform built on Kubernetes. Tekton users build CI/CD Pipelines composed of a graph of Tasks, and run these Pipelines via PipelineRuns. Under the hood, Tekton orchestrates PipelineRun execution by creating Kubernetes pods when each TaskRun in the PipelineRun is ready to execute. TaskRun outputs are written to Kubernetes PersistentVolumes for storage, and fed as inputs into subsequent TaskRuns.

Unfortunately, Tekton’s artifact storage was a leaky abstraction. Tekton delegated scheduling of TaskRun pods to the Kubernetes scheduler, meaning they could be assigned to any node, even if the TaskRuns needed to share artifacts via PersistentVolumeClaims. PersistentVolumes aren’t typically used for concurrent read/write usage across multiple nodes. If TaskRuns that shared storage happened to be assigned to different nodes, or a TaskRun needed to access multiple PersistentVolumes that happened to be on separate nodes, users would experience frustrating issues such as unschedulable pods or TaskRuns running sequentially when they could have run in parallel. Our design proposal has additional technical details on this problem.

While the Kubernetes API provides several ways of controlling scheduling, there’s no supported way as of this writing to force a group of pods to run on the same node. There’s an existing “coscheduling” scheduler plugin that aims to meet this need, so I decided to experiment with a similar plugin to see if it could be adapted to my use case.

How my plugin works

In a nutshell, the Kubernetes scheduler works as follows:

Kubernetes has a scheduling framework that allows plugins to register at multiple extension points during the scheduling process. My plugin registered at two of them:

  1. I first used a “PreFilter” extension point, which allows a scheduler to pre-process pod info at the beginning of a scheduling cycle and return a set of “candidate” nodes for filtering. When a new pod was ready for scheduling, my PreFilter plugin determined whether any node was already running pods associated with the same Tekton PipelineRun, and if so, returned that node as the only valid candidate.

  2. Next, my scheduler used a “Filter” extension point during the scheduling cycle to determine which of the candidate nodes were suitable for running the pod. My Filter plugin optionally filtered out any nodes already running pods for other Tekton PipelineRuns, depending on its configuration.

This worked and was surprisingly simple, at fewer than 200 lines of code. The examples in the Kubernetes “scheduler-plugins” repo were very helpful.

Deploying the plugin on GKE

On local clusters, scheduler plugins can replace the default scheduler, or run in parallel with it as a second scheduler. However, I wanted to deploy my plugin on GKE, which doesn’t support replacing the default scheduler, so I had to use multiple schedulers. In addition to the official documentation on running multiple schedulers, the following blog posts provided some starting examples of scheduler plugin configuration:

I created a deployment to run my second scheduler, and mounted its configuration via a configmap. Scheduler plugin configuration is poorly documented, so while the blog posts I found served as helpful starting examples, I wasn’t able to use them exactly as written. I found the following debugging strategies useful in creating a working deployment:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  52m   default-scheduler  Successfully assigned default/catalog-publish-trigger-tekton-upstream-28250970-tghlt to gke-dogfooding-default-pool-f62aa79c-94oa

I0302 19:24:21.146841       1 configfile.go:105] "Using component config" config=<
        apiVersion: kubescheduler.config.k8s.io/v1
        ... # Truncated
        profiles:
        - pluginConfig:
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              kind: DefaultPreemptionArgs
              minCandidateNodesAbsolute: 100
              minCandidateNodesPercentage: 10
            name: DefaultPreemption
          ... # Truncated
          plugins:
            multiPoint:
              enabled:
              - name: PrioritySort
                weight: 0
                ... # Truncated
              - name: DefaultBinder
                weight: 0
              - name: OneNodePerPipelineRun
                weight: 0
            ... # Truncated
          schedulerName: one-node-per-pipelineRun

The scheduler’s logs tended to lead me down the wrong path at least as often as they led me down the right one, so they were of limited use as a debugging tool.

The difficulty of debugging, along with the poor tools and documentation available for it, was one of the major factors that led me to decide not to use this plugin.

I didn’t end up using it

This scheduler worked well as a prototype, but I eventually decided that the scheduler framework didn’t feel production-ready enough to use in, well… production, for several reasons:

In addition, a scheduler plugin would likely have been hard for Kubernetes cluster operators to install, and might need to be customized for different Kubernetes implementations.

What I did instead

Instead of using a scheduler plugin, we chose to build coscheduling logic directly into Tekton. When a PipelineRun is created, Tekton creates a balloon pod first: a placeholder, no-op pod intended to “anchor” the PipelineRun to the node. Tekton then adds inter-pod affinity to the other PipelineRun pods to ensure they are scheduled on the same node. The balloon pod is created via a statefulset, which is also responsible for managing the storage used by the PipelineRun. If you’re interested in the full design proposal, it’s available here.