Senior site reliability engineer (middle/senior) id38916
AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us multiple Best Place to Work awards.
why join us
if you’re looking for a place to grow, make an impact, and work with people who care, we’d love to meet you!
what you will do
 * shift: monday – thursday 8am – 7pm pst (11am – 10pm est) with rotating on‑call;
 * on‑call shifts: every 6 weeks, one week as primary responder followed by one week as secondary responder;
 * manage alerts daily, check systems, and escalate issues as needed;
 * be part of a team that provides 24×7 on‑call support for critical saas events;
 * be available in case of emergencies when team members are not available or need help;
 * document issues and remediation steps;
 * proactively create appropriate monitors in the eks/k8s ecosystem;
 * deploy to eks/k8s cluster using terraform and helm;
 * learn and maintain existing infrastructure running under docker swarm;
 * improve existing infrastructure health by implementing checks and scripts to correct known issues;
 * maintain and develop deployment code;
 * automate manual tasks;
 * implement/integrate new technologies in our cloud infrastructure;
 * collaborate with other teams and departments to provide the highest level of support and assistance;
 * keep the customer at the forefront when planning deployments and updates, and consider the impact on them before making changes;
 * work closely on solutions with support, customer success, migration, and professional services teams to provide a best‑in‑class saas service to our customers;
 * perform rca and take necessary corrective actions to prevent the recurrence of issues;
 * create and assign alert‑related actions to the appropriate team after the investigation;
 * handle support requests for environment‑specific actions;
 * identify and provide automation requirements to improve rca.
must haves
 * 2+ years of professional experience;
 * experience working with datadog;
 * hands‑on experience as an aws cloud engineer;
 * working knowledge of eks/terraform/helm;
 * working experience with docker and docker swarm;
 * good understanding of aws iam roles and policies;
 * experience logging and monitoring aws resources using cloudwatch logs;
 * experience working in a linux environment;
 * proficient in bash and/or python scripting;
 * a strong understanding of web technologies such as rest apis;
 * working experience with monitoring solutions, such as grafana and prometheus;
 * excellent oral and written communication skills;
 * customer‑facing communication skills to effectively explain issues and rcas to customers;
 * experience in product/application support for saas‑based products;
 * understanding of apis, databases, systems architecture, and design;
 * experience designing, implementing, and operating in a devsecops environment;
 * ability to work independently as well as within a collaborative environment;
 * a technical aptitude with the desire to learn new and evolving technologies;
 * upper‑intermediate english level.
nice to haves
 * experience with gcp or azure;
 * certifications: aws certified devops engineer – professional or aws certified advanced networking – specialty.
perks and benefits
 * professional growth: accelerate your professional journey with mentorship, techtalks, and personalized growth roadmaps.
 * competitive compensation: we match your ever‑growing skills, talent, and contributions with competitive usd‑based compensation and budgets for education, fitness, and team activities.
 * a selection of exciting projects: join projects with modern solutions development and top‑tier clients that include fortune 500 enterprises and leading product brands.
 * flextime: tailor your schedule for an optimal work‑life balance, with the option of working from home or going to the office, whichever makes you happiest and most productive.
seniority level
 * mid‑senior level
employment type
 * full‑time
job function
 * it services and it consulting