Incident and Problem Management
Position objective
This is a key role in ensuring IT services remain stable, resilient, and aligned with business priorities. Combining operational leadership with technical understanding, the role oversees major incident management, service reliability, and continuous improvement across the IT landscape. The Service Operations Manager acts as a bridge between technical engineering teams and business stakeholders, ensuring both rapid resolution and long‑term service health. The role works with stakeholders from multiple technical domains, such as network, cloud, infrastructure, ITSM, monitoring tools, and business applications.
Main responsibilities
* Incident management lead: manage critical/major incidents, conduct troubleshooting calls, and ensure services are recovered in the minimum time. Assess and articulate the business impact of technical issues to guide prioritization and stakeholder communication.
* Review high‑level and low‑level designs (HLDs/LLDs) and service documentation to understand dependencies, data flows, and resilience mechanisms.
* Actively contribute to troubleshooting discussions, proposing diagnostic steps or recovery options based on technical understanding.
* Keep stakeholders updated on recovery progress via incident notifications and record the incident timeline.
* Perform level‑1 troubleshooting for low‑priority tickets (P3/P4), following runbooks and leading escalated or long‑running P3 cases until completion.
* Regularly follow up on the backlog of low‑priority tickets, escalating internally when required to keep the backlog under control and resolve prioritized cases.
Problem management
Conduct root cause analysis meetings with the technical teams to identify and recommend actions that prevent incidents from recurring. Recommend improvements to internal ways of working to maximize the teams' effectiveness during incident troubleshooting and reduce the time to restore. Ensure permanent fixes are defined and implemented for known bugs and repetitive low‑priority incidents. Monitor and analyze operational metrics (availability, MTTR, SLA compliance, incident trends) to identify opportunities for improvement. Cultivate a problem management mindset and follow up on the recommendations with internal stakeholders until completion.
Change management
Review change requests, internally assess impact on smart er with the teams, and share approval feedback during the Change Advisory Board. Conduct post‑reviews for failed changes, agree on the actions required to prevent recurrence, and follow up with stakeholders to ensure completion.
Reporting and governance
Provide appropriate reporting and data to stakeholders to highlight the stability of the global IT systems. Participate in local and global governance, presenting information and ensuring feedback from stakeholders is received and applied to the service management processes. Work collaboratively with technical teams from global technology and local business units to ensure adherence to the general service management processes, and facilitate recurring meetings to discuss operational matters in the incident, problem, and change management areas.
Requirements
* Bachelor's degree in telecommunication engineering, computer science, electronics, or an equivalent technical field.
* Advanced English proficiency (mandatory). Advanced Spanish proficiency.
* 5+ years of experience in IT operations, service management, or technical support roles.
* Proven background in incident, problem, and change management, ideally in complex hybrid environments.
* Strong understanding of infrastructure, applications, and networks, with the ability to read architecture diagrams and understand component dependencies.
* Experience managing operations for business‑critical applications (e.g., mobile apps, client portals, or extranet platforms).
* Hands‑on experience with cloud platforms, such as the Azure portal, and their operational management.
* Knowledge of ITIL or similar frameworks; knowledge of cloud applications; experience with monitoring and observability platforms.
* Experience communicating with diverse stakeholders during incidents and recovery processes, translating technical issues into business impact for non‑technical audiences.
* Ability to cover on‑call shifts.