Incident and problem management
Position objective
The service operations manager plays a key role in ensuring IT services remain stable, resilient, and aligned with business priorities. Combining operational leadership with technical understanding, this role oversees major incident management, service reliability, and continuous improvement across the IT landscape. The service operations manager acts as a bridge between technical engineering teams and business stakeholders, ensuring both rapid resolution and long-term service health. The role works with stakeholders from multiple technical domains, such as network, cloud, infrastructure, ITSM, monitoring tools, and business applications.
Main responsibilities
Incident management
Lead critical and major incidents, manage the troubleshooting calls, and ensure services are recovered in the shortest possible time. Assess and articulate the business impact of technical issues to guide prioritization and stakeholder communication. Review high-level and low-level designs (HLDs/LLDs) and service documentation to understand dependencies, data flows, and resilience mechanisms. Actively contribute to troubleshooting discussions, proposing diagnostic steps or recovery options based on technical understanding. Keep all stakeholders updated on recovery progress via incident notifications, and record the incident timeline. Perform level 1 troubleshooting for low-priority tickets (P3/P4), based on acquired knowledge and the runbooks provided by the smart er teams, and lead escalated or long-running P3 cases until completion. Regularly follow up on the low-priority ticket backlog with stakeholders, and escalate internally when required to keep the backlog under control or to resolve prioritized cases.
Problem management
Conduct root cause analysis meetings with the technical teams to identify causes and provide recommendations that prevent incidents from recurring. Recommend improvements to internal ways of working to maximize the teams' effectiveness during incident troubleshooting and reduce the time to restore. Ensure permanent fixes are defined and implemented for known bugs and repetitive low-priority incidents. Monitor and analyze operational metrics (availability, MTTR, SLA compliance, incident trends) to identify opportunities for improvement. Cultivate a problem management mindset, and follow up on problem management recommendations with internal stakeholders until completion.
Change management
Review change requests, assess their impact on smart er internally with the teams, and share the approval feedback at the change advisory board. Conduct post-change reviews of failed changes, agree on the actions required to prevent recurrence, and follow up with stakeholders to ensure completion.
Reporting and governance
Provide appropriate reporting and data to stakeholders to highlight the stability of the global IT systems. Participate in local and global governance to present information and ensure that stakeholder feedback is received and applied to the service management processes. Work collaboratively with technical teams from global technology and local business units to ensure adherence to the global service management processes, and facilitate recurring meetings to discuss operational matters in the incident, problem, and change management areas.
Requirements
Bachelor's degree in telecommunication engineering, computer science, electronics, or an equivalent technical field. Advanced English proficiency (mandatory). Advanced Spanish proficiency. 5+ years of experience in IT operations, service management, or technical support roles. Proven background in incident, problem, and change management, ideally in complex hybrid environments. Strong understanding of infrastructure, applications, and networks, with the ability to read architecture diagrams and understand component dependencies. Experience managing operations for business-critical applications (such as mobile apps, client portals, or extranet platforms). Hands-on experience with cloud platforms (such as the Azure portal) and their operational management. Knowledge of ITIL or similar frameworks. Knowledge of cloud applications. Experience with monitoring and observability platforms. Experience communicating with diverse stakeholders during incidents and recovery processes, with the ability to translate technical issues into business impact for non-technical audiences. Ability to cover on-call shifts.