What you'll do:
As a Lead Support Engineer for SRS Distribution, a wholly owned subsidiary of The Home Depot, you will be an integral part of our Applications Production Support and SRE team, serving as a lead with a DevOps background. You will be responsible for ensuring the stability, reliability, and performance of business-critical applications and services in production environments. This role oversees the L1/L2 support teams and manages daily operational activities while driving improvements in system reliability, workflow automation, and the effectiveness of monitoring and incident management processes.
The ideal candidate is hands-on, with technical expertise in Azure DevOps, incident and problem management, system reliability, observability and APM tools (Azure, New Relic), and continuous improvement processes.
Key responsibilities
1. Production support & operations
* Lead day-to-day application production support for mission-critical systems (L2/L3).
* Act as the primary escalation point for critical application issues and outages.
* Manage and coordinate incident response and root cause analysis (RCA), collaborating with development teams on problem resolution.
* Coordinate RCA and post-incident reviews to drive long-term stability.
* Ensure SLAs and uptime targets are consistently met across all supported applications.
* Work closely with development teams to deploy hotfixes, enhancements, and configuration changes safely.
2. Site reliability engineering (SRE)
* Implement and manage SRE practices such as error budgets, SLIs/SLOs, and proactive reliability improvements.
* Design and maintain monitoring, alerting, and logging frameworks to ensure proactive issue detection.
* Optimize application performance and scalability through observability and telemetry (APM, logs, metrics, traces).
* Design and implement monitoring dashboards and alerting using tools like Azure Monitor, Application Insights, Grafana, or Prometheus.
3. DevOps & Azure expertise
* Apply a good understanding of Azure DevOps and Azure application architecture.
* Partner with development teams to improve release automation, environment provisioning, and deployment reliability.
* Manage CI/CD pipelines and release processes using Azure DevOps.
* Collaborate with development teams to embed reliability and supportability into the SDLC.
4. Leadership & collaboration
* Lead and mentor a team of production support engineers.
* Establish operational runbooks, standard operating procedures (SOPs), and escalation matrices.
* Collaborate with cross-functional teams (development, QA, infrastructure, and business) to ensure end-to-end reliability.
* Communicate effectively with business stakeholders regarding system health, incident status, and performance metrics.
* Drive post-incident reviews and continuous improvement initiatives.
Requirements we look for:
* Strong experience with Azure DevOps, ServiceNow, and observability and APM tools such as New Relic and Azure Application Insights.
* Good understanding of Azure cloud services (App Services, AKS, Functions, Service Bus, SQL, etc.).
* Knowledge of CI/CD, infrastructure as code (IaC), and automation scripting (PowerShell, Bash, Python).
* Deep understanding of SRE principles and tools for monitoring, alerting, and logging.
* Strong grasp of incident management frameworks (ITIL, major incident processes).
Qualities that stir our souls (and make you stand out):
* Excellent problem-solving and analytical thinking.
* Strong communication and leadership capabilities.
* Ability to manage multiple stakeholders and prioritize under pressure.
* Proven experience leading a 24x7 support or on-call environment.
* Bachelor's degree in computer science, engineering, or related field (or equivalent experience).
* 8+ years of experience in application support, IT operations, or SRE roles.
* Microsoft Azure certifications (e.g., AZ-400, AZ-104, or AZ-305) are a plus.
* Experience with ITSM tools (ServiceNow, Jira Service Management).