We are seeking a system reliability expert to ensure high system availability and performance. This role is critical for our business operations.
key responsibilities:
* monitor system performance and availability using various tools
* respond to incidents, analyze root causes, and implement corrective actions
* develop incident response playbooks and strategies
requirements:
* strong system administration background
* cloud computing experience (aws)
* scripting and programming skills (python, go, bash)
* familiarity with monitoring tools (prometheus, grafana, elk stack)
about our business:
we provide digital transformation services to leading organizations worldwide.