Objective of the role
The Senior Lead of SRE is responsible for driving the strategic direction, operational excellence, and continuous evolution of site reliability engineering practices across critical systems and services. This role leads a team of site reliability engineers and complex initiatives, ensuring high availability, scalability, and performance. The Senior Lead of SRE fosters cross-functional collaboration, anticipates future infrastructure needs, and aligns SRE practices with business and product priorities, while cultivating a culture of ownership, automation, and resilience across engineering teams.
Main responsibilities
* Build, lead, and inspire high-performing SRE teams, fostering a culture of operational ownership, engineering excellence, and continuous learning.
* Define and execute the strategic roadmap for SRE, integrating best practices in reliability, incident management, observability, and infrastructure automation in alignment with business and product goals.
* Elevate observability across the stack by designing and enforcing standards for telemetry, structured logging, distributed tracing, and service-level dashboards. Working with the engineering teams, ensure 100% coverage of business-critical systems with actionable metrics and alerting.
* Act as the technical escalation point for the most complex production issues, leading hands-on incident response and deep root cause analysis in large-scale, low-latency, event-driven architectures.
* Champion automation-first infrastructure practices, enforcing IaC, immutable deployments, and auto-remediation patterns that reduce manual intervention and accelerate delivery.
* Drive architectural and operational improvements through close partnership with product engineering, platform, security, and architecture teams. Proactively identify and mitigate systemic reliability risks and performance bottlenecks.
* Lead the definition, adoption, and review of SLIs, SLOs, and error budgets, ensuring they are embedded into engineering and product decision-making processes.
* Operationalize change management, chaos engineering, and DR strategies, validating readiness through frequent simulations and failover exercises.
* Mentor and develop SRE leads and senior engineers, scaling internal capabilities and reinforcing technical depth across the organization.
* Represent SRE on architecture boards and in business reviews, aligning engineering reliability strategies with company-wide objectives.
* Promote a culture of autonomy and proactive engineering, encouraging teams to own their services end-to-end with accountability and resilience thinking.
* Serve as a cultural leader within Spin, fostering psychological safety, ownership, and a sense of mission to serve millions of people across LATAM with secure, reliable financial technology.
Required knowledge and experience
* Bachelor’s degree in computer science, software engineering, or a related field (or equivalent experience).
* 10+ years of experience in SRE, DevOps, or software engineering roles, including at least 4 years in leadership roles.
* Strong experience leading distributed SRE or platform teams in complex, production-scale environments.
* Deep understanding of reliability engineering principles, cloud-native infrastructure on AWS, observability, and incident response.
* Hands-on experience with infrastructure as code, CI/CD pipelines, containers, and orchestration tools.
* Strong architectural and performance optimization skills across cloud and hybrid infrastructure.
* Demonstrated ability to influence and collaborate across engineering, product, and business teams.
* Familiarity with regulatory and security frameworks relevant to infrastructure reliability.
* Excellent communication and leadership skills, with experience presenting to senior stakeholders.
* Strategic thinking, systems-level problem solving, and a proactive approach to continuous improvement.