Overview
About the team: The Platform Infrastructure Engineering team transforms complex infrastructure systems into simple, efficient, and reliable solutions, focusing on scalable operational methodologies to drive business impact and cost savings. Zillow Group Incident Management (ZGIM) drives best practices for change management, manages major incidents, and drives root cause analysis that improves product availability for Zillow customers so they can unlock life’s next chapter. The team works closely with software development engineers while supporting all users and brands. Your work with us will be highly visible throughout Zillow Group and have a significant impact on all parts of the business.
About the role: The Senior Incident Manager plays a key role in maintaining availability of Zillow services. Incident managers drive proactive readiness, change management, incident management, and root cause analysis. This requires working effectively in real time with highly technical engineers and business leaders. Incident managers identify and build processes, demonstrate calm under pressure, and adaptively work with all types of stakeholders.
In addition to a competitive base salary and benefits, this position is also eligible for equity awards based on factors such as experience, performance, and location.
This role has been categorized as a teleworker position. Teleworkers do not have a permanent corporate office workplace and, instead, work from a physical location of their choice, which must be identified to the company. Employees may live in any part of Mexico, but preferably in Mexico City, as we encourage attendance at occasional in-office events.
Get to know us: Zillow is reimagining real estate to make it easier to unlock life’s next chapter. As the most-visited real estate website in the United States, Zillow and its affiliates help movers find and win their home through digital solutions, first-class partners, and easier buying, selling, financing, and renting experiences.
Zillow Group is an equal opportunity employer committed to fostering an inclusive, innovative environment with the best employees. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, and gender identity. If you have a disability or special need that requires accommodation, please contact your recruiter directly. Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable state and local law.
Responsibilities
* Incident management: own live-site incident management during customer-impacting events, driving immediate engagement where needed and providing executive-facing updates.
* Lead cross-functional teams through detailed incident response in real time.
* Participate in an on-call rotation to ensure 24/7 major incident response.
* Change management: develop processes and influence product engineering teams to reduce incident occurrence and shorten detection and resolution times.
* Drive development to improve resilience through testing, deployment, observability, etc.
* Monitor various metrics to ensure SLA compliance and drive improvement where needed.
* Root cause analysis: drive analysis of major incidents through cross-functional teams to uncover root cause and contributing factors.
* Analyze trends to gain insight and drive improvements to products and processes.
* Lead problem review sessions and drive on-time completion of process steps.
* Process improvement: identify and initiate improvement in team processes, including developing standard operating procedures and giving and receiving cross-training.
* Translate technical concepts into business language to clarify issues and impact.
* Create reports and present data to drive understanding and improvement.
Qualifications
* You take initiative when you see an issue and are exhilarated by being in the middle of the action. You effectively communicate with both technical staff and executives under high pressure. You can take control of an urgent situation and coordinate multiple work streams without alienating anyone or missing a crucial input.
* BS/BA degree in computer science, information systems, or related discipline, or a minimum of 5 years’ related work experience.
* 3+ years of experience leading major incident war rooms during live incidents.
* Proficiency driving cross-functional decisions in ambiguous situations.
* Proficiency triaging multiple incoming issues and addressing them according to priority and severity.
* Sufficient technical background, ideally in networking, systems, and software development.
* Hands-on experience analyzing incidents, root causes, weaknesses, corrective actions, etc.
* Proficiency guiding and influencing technical teams and leaders on incident processes.
* Proficiency communicating technical updates to non-technical stakeholders.
* Business acumen to understand business issues and align communications strategies to outcomes accordingly.
* Experience creating Tableau dashboards from operational data.
* ITIL v3 Foundation certification is a plus.
* Proficiency with Google and Microsoft Office documents and co-authoring features.
* Proficiency working from home and virtually with distributed teams.
* Self-starter with a high degree of initiative in scaling programs to large organizations.