Project outline:
We are looking for a site reliability engineer with experience in incident response. In this role, you will help Shipt understand where we can improve stability and reliability. The role focuses on the intersection of systems engineering and data science: building the tooling and culture needed to turn raw incident logs into actionable reliability strategies.
Skill requirements:
- Engineering background: 4+ years in SRE, DevOps, or systems engineering roles managing production environments at scale.
- Data proficiency: strong experience with SQL and data analysis.
- Coding skills: expertise in one or more programming languages such as Go, Java, Python, or C++.
- Observability expertise: deep understanding of alerting systems, distributed tracing, structured logging, and metrics collection.
- Systems design: experience with container orchestration (Kubernetes) and cloud infrastructure (GCP).
Experience requirements:
- Statistical mindset: experience applying statistical methods (e.g., outlier detection, regression analysis) to system performance data.
- The "human factor": a passion for resilience engineering and for understanding how human decision-making affects system reliability.
- Communication: ability to translate complex technical failures into clear, non-technical business impact reports.