Job description:
5+ years of experience as an SRE (again, if we get one lead with 5+ years of experience, the other person can have 3+ years of experience but must be an SRE).
Will be responsible for developing and maintaining automation tools and processes to streamline infrastructure management, reduce manual tasks, and improve efficiency.
Primary focus will be capacity management and SDLC.
Capacity planning – demand forecasting, capacity planning across all application infra, and continuous resource optimization.
Infra management – maintain the prod environment, manage BCP, define acceptable downtime or failures, and ensure HA/resiliency.
Create and maintain SRE infra – tools, scripts, and integration with core eng platforms (Prometheus/Grafana etc.).
Observability and reporting – define metrics and thresholds, and build a framework to capture these metrics/trends and generate alerts.
Alert monitoring and governance – monitor the alerts generated by the observability framework and plan to address them as per the pre-agreed schedule.
Incident management – act as the bridge between the support and eng teams, improve incident response, and manage follow-ups and actions, working with both teams.
Performance optimization – work with the eng team to continuously provide feedback and optimize system performance.
Ensure SLO and SLI adherence as per the application team's guidelines.