Role summary
As a Senior Site Reliability Engineer, Observability, you will focus on building and maintaining a mature observability environment that accelerates engineering teams while reducing their cognitive load.
This role enables engineers to continue building and supporting crucial products and services that have a profound impact on the industry.
Key responsibilities
* Build and orchestrate a modern OpenTelemetry (OTel) based observability platform
* Support multiple telemetry types, such as metrics, logs, and traces
* Define and support modern observability governance, and address observability problems at scale
* Ensure reliability, security, and performance exceed defined SLAs
* Work with engineers from across the company to help troubleshoot issues, deploy new products and services, and increase velocity while decreasing cognitive load
* Lead the design and deployment of monitoring/observability services that detect issues and alert the team when action is needed
* Ingest, aggregate, transform, and use data from a multitude of sources in a real-time data pipeline
* Oversee the availability, performance, and supportability of observability infrastructure
* Create processes around alert response operations and support the team to ensure reliable delivery of oracle data
* Recommend which metrics to collect so that alerts can be created with every new feature release
Requirements
* 7+ years of relevant professional experience
* Ability to develop software outside typical infrastructure requirements and configurations
* Experience programming in C, C++, Java, Python, Go, Perl, or Ruby
* Expert knowledge of all aspects of designing, developing, and managing large real-time systems
* Experience with monitoring and logging, including exporting metrics using Prometheus, building Grafana dashboards, and working with centralized logging solutions such as the ELK stack, Splunk, or the Grafana stack
* Experience with distributed systems and container orchestration, including maintaining or building Kubernetes clusters
* Strong communication skills, including giving and receiving constructive feedback and planning meetings and code reviews
Desired qualifications
* Excitement for blockchain, Web 3.0, and similar decentralized technologies
* Experience running infrastructure in the blockchain/Web3 space
* Ability to scale systems sustainably through mechanisms such as automation, and to evolve systems by pushing for changes that improve reliability and velocity
* Experience working remotely in a distributed team
* A strong desire to grow and challenge yourself
Tools and services
* AWS; Terraform/Terragrunt; Kubernetes, Calico, and Argo CD; Prometheus and Grafana; GitHub Actions; Packer
All roles are global and remote-based. Unless otherwise stated, we ask that you try to overlap some working hours with Eastern Standard Time (EST).