Daily DevOps Interview Questions Day #53

Daily Interview Questions for SRE and DevOps engineers

Jun 28, 2025

Question for the day:

Our company currently runs a microservices platform with many development teams pushing code multiple times a day. We run GitLab CI/CD runners in Kubernetes for different kinds of builds and deploys. We suddenly ran into a few scenarios:

GitLab worker pods getting evicted during large CI builds
CI jobs failing due to resource contention with production services
30-minute CI builds taking 2+ hours during peak development hours
GitLab runners failing halfway through due to OOM

What fixes would you suggest to help alleviate these issues?

Answer:

A junior/mid-level engineer response:

I would expect a junior or mid-level engineer to cover certain aspects:

Increase cluster size by adding more nodes
Set memory limits on runner pods to prevent OOM
Maybe increase CPU requests
Restart pods when they fail

This is a fine answer, and it addresses a lot of the issues, but it only fixes the short-term problems.
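As a rough illustration of the memory limit and CPU request items in the junior answer, a CI build pod might declare its resources like this. The pod name, image, and sizes are placeholders chosen for the sketch, not values from the scenario. Note that a memory limit caps usage rather than preventing OOM: a container that exceeds it is OOM-killed, so the limit has to be sized for the largest build.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ci-build-example        # hypothetical name
spec:
  containers:
  - name: build
    image: alpine:3.20          # placeholder image
    resources:
      requests:
        cpu: "1"                # what the scheduler reserves for the job
        memory: 2Gi
      limits:
        memory: 4Gi             # container is OOM-killed above this
```

Requests matter as much as limits here: under node memory pressure, pods using more than they requested are the first candidates for eviction.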

A senior-level engineer response:

I would expect a senior-level engineer to immediately recognize that this is a classic resource contention and workload isolation problem.

Immediate: Implement proper resource requests/limits as a stopgap
Design dedicated node pools with workload-specific characteristics
Leverage spot instances for ephemeral CI workloads
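The node side of a dedicated CI pool can be sketched as a label plus a taint on each node in the pool. The fragment below shows only the relevant fields of a Node object, and the node name is hypothetical. In practice these are usually set through the cloud provider's node pool settings or with `kubectl label` and `kubectl taint`, not by editing Node objects by hand.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ci-node-1               # hypothetical node name
  labels:
    workload-type: gitlab-ci    # lets CI pods select this pool
spec:
  taints:
  - key: ci-workload
    value: "true"
    effect: NoSchedule          # keeps pods without a matching toleration off the node
```

The taint keeps production services off the CI nodes, and the label lets CI pods target them. Both halves are needed for isolation in each direction.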

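The senior answer expects an engineer with in-depth knowledge of taints and tolerations, and of how to use node groups and node pools. For example, a CI job pod can be pinned to a dedicated, tainted node pool:

apiVersion: v1
kind: Pod
spec:
  # Only schedule onto nodes labeled for CI work
  nodeSelector:
    workload-type: gitlab-ci
  # Tolerate the taint that keeps other workloads off the CI nodes
  tolerations:
  - key: ci-workload
    operator: Equal
    value: "true"
    effect: NoSchedule

I would also expect a senior-level engineer to have:

A migration strategy that does not disrupt existing pipelines
A capacity plan based on development team growth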
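With GitLab's Kubernetes executor, job pods are created by the runner, so the node selector and toleration from the pod manifest above would normally be set in the runner configuration instead of in hand-written pods. A minimal sketch, assuming the GitLab Runner Helm chart's `runners.config` value and the executor's `node_selector` and `node_tolerations` settings. The namespace and resource sizes are placeholders, and the exact keys should be checked against the chart and runner version in use.

```yaml
# values.yaml for the GitLab Runner Helm chart (sketch, not a complete file)
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-ci"          # hypothetical namespace
        cpu_request = "1"                # placeholder sizes
        memory_request = "2Gi"
        memory_limit = "4Gi"
        [runners.kubernetes.node_selector]
          "workload-type" = "gitlab-ci"
        [runners.kubernetes.node_tolerations]
          "ci-workload=true" = "NoSchedule"
```

Rolling this out on a second runner registered with its own tag, and moving pipelines over job by job, is one way to migrate without disrupting existing pipelines.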

How would you approach this scenario? What other Kubernetes challenges have you faced in production? Share your thoughts in the comments.

DevOps Daily
