Daily DevOps Interview Questions Day #53
Daily Interview Questions for SRE and DevOps engineers
Question for the day:
Our company runs a microservices platform with many development teams pushing code multiple times a day. We run GitLab CI/CD runners in Kubernetes for different kinds of builds and deploys. We suddenly ran into a few problems:
GitLab worker pods getting evicted during large CI builds
CI jobs failing due to resource contention with production services
30-minute CI builds taking 2+ hours during peak development hours
GitLab Runners failing halfway through due to OOM
What fixes would you suggest to help alleviate these issues?
Answer:
A junior/mid level engineer response:
I would expect a junior or mid level engineer to cover certain aspects:
Increase cluster size by adding more nodes
Set memory limits on runner pods to prevent OOM
Maybe increase CPU requests
Restart pods when they fail
This is a fine answer, and it addresses a lot of the issues, but it mostly fixes the short-term problems.
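To make the requests and limits from the list above concrete, here is a minimal sketch of a build pod with both set. The pod name, image, and sizes are hypothetical placeholders, not values from our platform. With the GitLab Kubernetes executor the runner creates the job pods itself, so in practice the equivalent values would be set in the runner's configuration rather than in a hand-written manifest.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ci-build-example            # hypothetical name
spec:
  restartPolicy: Never              # a failed CI job should not restart in place
  containers:
    - name: build
      image: registry.example.com/ci/build:latest   # placeholder image
      resources:
        # Requests are what the scheduler reserves on the node.
        requests:
          cpu: "2"
          memory: 4Gi
        # Limits are the hard ceiling. Setting them equal to the requests
        # for both CPU and memory gives the pod the Guaranteed QoS class.
        limits:
          cpu: "2"
          memory: 4Gi
```

Note that a memory limit does not stop a build from running out of memory. The container is OOM-killed once it crosses the limit, which keeps the damage inside that one job instead of starving the node. Pods that stay within their requests are also less likely to be picked first when the kubelet evicts under node pressure.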
A senior level engineer response:
I would expect a senior level engineer to immediately recognize that this is a classic resource contention and workload isolation problem, and to propose the following:
Immediate: Implement proper resource requests/limits as a stopgap
Design dedicated node pools with workload-specific characteristics
Leverage spot instances for ephemeral CI workloads
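One way to apply the stopgap in the first item across every CI job, rather than pod by pod, is a LimitRange plus a ResourceQuota on the namespace the runners use. This is only a sketch: the `gitlab-ci` namespace, the object names, and all of the numbers are assumptions for illustration.

```yaml
# Default requests/limits for any container in the CI namespace
# that does not declare its own.
apiVersion: v1
kind: LimitRange
metadata:
  name: ci-job-defaults             # hypothetical name
  namespace: gitlab-ci              # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:               # applied as requests when none are set
        cpu: "1"
        memory: 2Gi
      default:                      # applied as limits when none are set
        cpu: "2"
        memory: 4Gi
      max:                          # no single container may ask for more
        memory: 8Gi
---
# Cap on what all CI pods in the namespace can claim in total.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ci-total-quota              # hypothetical name
  namespace: gitlab-ci
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 128Gi
    limits.memory: 256Gi
```

The quota bounds how much CI can claim in total, but CI pods can still land next to production services until the workloads are moved onto separate nodes.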
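Dedicated node pools call for an engineer with in-depth knowledge of taints and tolerations and of node groups and node pools. For example, a CI pod pinned to a dedicated pool might look like this:

```yaml
# Fragment: metadata and containers omitted
apiVersion: v1
kind: Pod
spec:
  # Only schedule onto nodes labeled for CI work
  nodeSelector:
    workload-type: gitlab-ci
  # Tolerate the taint that keeps other workloads off the CI nodes
  tolerations:
    - key: ci-workload
      value: "true"
      effect: NoSchedule
```

I would also expect a senior level engineer to have:
A migration strategy that does not disrupt existing pipelines
Capacity planning based on development team growth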
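The node selector and toleration in the pod spec above only work if the CI nodes carry the matching label and taint. As a sketch, this is how that looks on the Node object itself. The node name is hypothetical, and in a real cluster the label and taint would normally be applied through the node pool or node group settings (or with `kubectl label` and `kubectl taint`) rather than by editing Node manifests.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ci-pool-node-1              # hypothetical node name
  labels:
    workload-type: gitlab-ci        # matched by the pod's nodeSelector
spec:
  taints:
    # Repels every pod that lacks a matching toleration,
    # so production services stay off the CI pool.
    - key: ci-workload
      value: "true"
      effect: NoSchedule
```

The two halves do different jobs: the taint keeps other workloads off the CI nodes, while the node selector keeps CI jobs off the production nodes. Using only one of them leaves the isolation open in one direction.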
How would you approach this scenario? What other Kubernetes challenges have you faced in production? Share your thoughts in the comments.