Site Reliability Engineer: Nearshore

Guadalajara, Jal

ConsultNet Technology Services and Solutions

Posted on September 14

Description

Are you looking for a career that makes a positive difference in your life and also in the lives of learners and educators across the globe? Do you want to work with fun and social people in a positive and engaged virtual office environment?
We are hiring a Software Developer who will build and support reliable, high-capacity, and well-performing systems in support of our mission to reimagine learning for millions of students and learners worldwide. We call this work "Site Reliability Engineering".
As a Site Reliability Engineer, you will work in a small team accountable for telemetry, cost, security, performance, and reliability in AWS infrastructure. You will collaborate in a DevOps model with product development teams, designing, deploying, and managing automation tools that improve predictability and time to market while reducing cost. If you love to build developer tools and automation, know AWS services inside out, have complex distributed system experience, and like engineering software solutions to solve cloud-related problems, then you will thrive in this position.
Our stack
Code: Python, JavaScript, PHP, Node.js, Go
AI platform: OpenAI and Bedrock
RDBMS: PostgreSQL, MySQL
Cache: ElastiCache (Valkey/Redis/Memcached), DynamoDB
Containers: ECS & Docker
Cloud: AWS
Telemetry: New Relic (preferred), Datadog, CloudWatch
Build: GitHub Actions (preferred), Jenkins (nice to have), CircleCI (preferred), GitHub Enterprise and more
Run: PagerDuty, Exigence
Config management and provisioning: Puppet, Ansible (nice to have)
Web: Apache httpd, Nginx
Infrastructure as Code: Terraform (preferred), Serverless, CloudFormation
Your contributions
AI-Driven Automation & Agentic AI
Leverage agentic AI technologies and intelligent automation to augment SRE workflows, including anomaly detection, incident triage, root cause analysis, and automated remediation.
Innovate with AI-powered agents to reduce manual toil and improve system self-healing capabilities.
Continuously evaluate emerging AI/ML techniques to enhance operational efficiency and platform reliability.
Cloud Engineering
Hands-on design, analysis, development, and troubleshooting of highly distributed, large-scale production systems and event-driven, cloud-native, AI-powered services.
Ensure repeatability, traceability, and transparency of infrastructure automation (infrastructure-as-code, monitoring-as-code) tailored for AI/ML workloads.
Participate in continual learning of the AWS ecosystem, game day scenarios, and professional conferences, with a focus on AI infrastructure innovations.
Collaborate with development teams to architect scalable and resilient AI platform components within our software stack.
Actively monitor AWS costs and utilize cost optimization tools to maximize ROI while maintaining strict Service Level Objectives (SLOs) for AI workloads.
Observability Engineering
Own the reliability, uptime, security, cost efficiency, operations, capacity planning, resiliency, and performance analysis of AI platform services.
Define, monitor, and report on AI-specific service level indicators (SLIs) and service level objectives (SLOs) to ensure trustworthy AI service delivery.
Support on-call rotations for operational duties, focusing on rapid incident resolution and automation of recurring issues through Agentic AI and automation frameworks.
Maintain and enhance telemetry systems that provide deep visibility into AI model performance, system health, and business-impacting metrics.
Develop and enforce standard observability processes and tooling to promote sustainable operational excellence of AI systems.
DevSecOps
Promote healthy software development and deployment practices in AI/ML environments, including compliance with agile methodologies and continuous delivery pipelines.
Partner with Cybersecurity to develop and automate responses to emerging AI-specific security risks and vulnerabilities.
Systems Engineering
Collaborate with system administrators on middleware, network, storage, databases, and virtualization maintenance, especially in support of AI infrastructure needs.
Automate legacy on-premises system maintenance and facilitate smooth migration to cloud-native AI platforms.
Resiliency Engineering
Work with development teams to identify AI system failure modes and potential blast radius to reduce risk exposure.
Validate and continuously improve monitoring and observability configurations to capture AI system anomalies.
Coordinate failure injection testing, including chaos engineering experiments specific to AI workloads.
Observe and document steady-state production behavior, growth patterns, and resource usage of AI services.
Plan and forecast capacity needs to support AI platform growth, communicating trends to leadership and adapting scaling strategies for anticipated load increases.
Drive improvements to software and infrastructure to meet resiliency and availability goals for AI systems.
Performance Engineering
Enhance performance, availability, and scalability of AI platform infrastructure by troubleshooting servers, networks, hardware, and capacity bottlenecks.
Plan, execute, and report on performance tests targeting AI model inference latency, throughput, and scalability.
Tune systems for low latency and high throughput to meet AI service-level objectives.
Support load testing initiatives using tools such as AI workload simulators to validate platform robustness under realistic conditions.
Qualifications
Experience as a software engineer, with practical experience developing, debugging, and deploying enterprise applications, including AI/ML-powered services.
Proficiency with AI platforms and cloud AI services such as OpenAI APIs, AWS Bedrock, Anthropic Claude, and other large language model (LLM) frameworks.
Experience managing and deploying AI models at scale, with understanding of LLM inference, fine-tuning, and monitoring.
Experience with infrastructure automation technologies like Terraform to provision and manage scalable AI infrastructure.
Expertise in container/container-fleet orchestration technologies such as ECS, Kubernetes, or equivalent, including running AI workloads in containerized environments.
Versatility in troubleshooting diverse hosting technologies: web servers, application platforms, operating systems, networks, virtualization, storage, and databases.
Expertise with continuous deployment lifecycles (CI/CD), including automating AI model deployments and platform updates.
Cloud database operations and deployment experience (e.g., RDS MySQL/Postgres/Aurora) supporting AI application data needs.
Experience with application caching strategies and managing high-concurrency workloads typical in AI inference pipelines.
Proficiency with Lean/Agile deployment processes (blue/green, zero-downtime (ZDT), and canary deployments) and traffic routing (load balancers, DNS strategies) for AI services.
Familiarity with telemetry and observability SaaS systems such as New Relic, Datadog, or equivalent to monitor AI system health and performance.
Hands-on experience with New Relic products like APM, Synthetics, Infrastructure, Logs, Workloads, etc.
Strong problem-solving, root cause analysis, and systems engineering skills in complex, distributed AI environments.
Excellent communication skills and ability to design/manage escalation response plans for AI service incidents, emphasizing proactive, customer-focused, collaborative, and data-driven approaches.
Demonstrated expertise in building and managing highly scaled, reliable, and secure production infrastructure in cloud environments for AI platforms.
Nice to Have
Ability to translate effectively between development, operations, security, product, and management teams—a critical skill in cross-functional AI projects.
Comfortable in polyglot environments; conversational or fluent in 2-3 of the following: JavaScript/TypeScript, Python (especially popular for AI), PHP, Ruby, Golang, Java, Bash, Markdown, reStructuredText, HCL, JSON, YAML, and TOML.
Familiarity with AI-specific tooling and frameworks (e.g., MLflow, Kubeflow, Seldon, or similar) is a plus.
BS Degree in Computer Science, related technical field, or equivalent industry experience preferred.
Seniority level: Entry level
Employment type: Contract
Job function: Engineering and Information Technology
Industries: Insurance