Sr. Site Reliability Engineer
About the role
You will be the team’s go-to person for infrastructure, monitoring, and production health. You’ll manage Kubernetes-based systems, build and improve observability tooling, and use data to surface problems before they become incidents. When code changes are needed to make systems more observable, you’ll make them yourself.
What you'll do
Own and improve our monitoring, alerting, and observability systems
Build dashboards and metrics that give the team real insight into production health
Manage Kubernetes infrastructure: resource allocation, diagnostics, and keeping things running well
Query data with SQL to understand system behavior, spot trends, and investigate anomalies
Design alerting that is actionable and sustainable, without fatigue or noise
Use AI to accelerate incident response and root cause analysis, and find ways to improve observability workflows for the whole team
Instrument the application codebase to improve observability: structured logging, metrics, tracing, and error reporting
Fix bugs and contribute to the broader codebase as needed
Maintain and improve our infrastructure-as-code and deployment processes
Identify and document improvements that require deeper engineering work
Qualifications & technical skills
Strong SQL. You can write complex queries and use data to tell a stor