We’re looking for a Senior DevOps Engineer who thrives in complex, cloud-native environments and is passionate about automation, scalability, and performance. You’ll play a key leadership role in building, deploying, and supporting production systems across Azure, Kubernetes, and a modern DevOps stack. This is a highly visible position where your ability to drive infrastructure best practices, mentor peers, and lead initiatives will directly impact the resilience and velocity of our engineering teams.
Key responsibilities
* Architect and evolve scalable CI/CD pipelines using GitHub Actions and Azure DevOps to support multi-service deployments.
* Own and operate production infrastructure in Azure Kubernetes Service (AKS), including capacity planning, secrets management, and rollout strategies.
* Implement and manage infrastructure as code with Terraform, promoting reuse, modularity, and collaboration across teams.
* Lead observability efforts using Datadog: define actionable SLOs/SLIs and set up monitors, APM instrumentation, and dashboards for different audiences (developers, SREs, leadership).
* Design secure and scalable messaging patterns using Azure Service Bus and Azure Event Hubs, ensuring message ordering, fault tolerance, and performance.
* Define and enforce DevSecOps best practices across environments, covering RBAC, audit logging, access controls, and automated policy enforcement.
* Mentor other DevOps and engineering team members; drive incident postmortems, root cause analysis, and process improvements.
* Collaborate closely with application developers, QA, product managers, and architecture teams to drive operational excellence across all stages of software delivery.
Tech skills: required
Git, GitHub (including GitOps, access control, hooks), GitHub Actions (multi-stage workflows, approval gates, rollbacks), Azure DevOps (YAML pipelines), Terraform (state management, modular design, secrets handling), Azure Kubernetes Service (AKS), Helm, Docker, Azure Service Bus (message ordering, dead-letter management), Azure Event Hubs (partitioning, scaling), Azure Resource Manager (ARM templates), Datadog (APM, monitors, SLO/SLI setup, custom dashboards), CI/CD best practices, monitoring and alerting strategies, RBAC in Azure, troubleshooting distributed systems, infrastructure as code (IaC)
Tech skills: nice to have
Azure Logic Apps, Azure Cognitive Search (index security and access), kube-proxy, Kubernetes networking and service discovery, Kubernetes Jobs and CronJobs, sidecar and init containers, Git pre-commit hooks (for sanity checks), GitHub organization-level administration, event-driven architecture patterns, secure secrets management (Azure Key Vault, GitHub secrets), incident response and RCA facilitation, scripting (Bash, PowerShell), programming experience (Python, Go, or Node.js for DevOps tooling), experience with monorepo setups (e.g., Nx)
Required experience & skills
* 6+ years of DevOps/SRE experience, including at least 2 years in a senior or lead capacity.
* Expert-level knowledge of Git and GitHub workflows, including automation, hooks, access control, and GitOps practices.
* Deep understanding of Kubernetes primitives (pods, jobs, services, ConfigMaps, volumes) and advanced concepts such as init/sidecar containers, traffic draining, and fault recovery.
* Strong experience building resilient CI/CD pipelines with conditional logic, approvals, and rollback mechanisms.
* Proficiency in managing Terraform state and modules in collaborative, multi-team environments.
* Demonstrated expertise in securing and operating Azure services, including Service Bus, Event Hubs, Logic Apps, and Cognitive Search.
* Hands-on experience with Datadog APM, monitors, SLOs, and logging integrations.
* Track record of designing for high availability, scalability, and observability in production cloud systems.
Preferred qualifications
* Experience leading migrations of legacy systems to Kubernetes or IaC.
* Proven ability to troubleshoot latency, reliability, or pipeline issues in distributed systems.
* Comfortable navigating large monorepos, trunk-based development, and service-oriented architectures.
* Familiarity with event-driven and fault-tolerant patterns at enterprise scale.