Senior Site Reliability Engineer, Observability
About the Role
You will build and operate a modern OpenTelemetry (OTel)-based observability platform that supports metrics, logs, and traces. You will design, deploy, and maintain monitoring and alerting systems, ingest and transform telemetry data, and ensure the availability, performance, and security of observability infrastructure. You will collaborate with engineers across the company to troubleshoot issues, deploy new services, and reduce cognitive load through automation and governance. You will create and improve alert response processes and recommend metrics to instrument new features, championing reliability and doing the work right the first time.
Requirements
- 7+ years of relevant professional experience
- Experience on DevOps, infrastructure, SRE, and/or platform teams
- Ability to develop software beyond typical infrastructure configurations
- Experience programming in C, C++, Java, Python, Go, Perl, or Ruby
- Expert knowledge designing, developing, and managing large real-time systems
- Experience with monitoring and logging tools, including Prometheus and Grafana
- Experience with centralized logging solutions such as ELK Stack, Splunk, or Grafana Stack
- Experience with distributed systems and container orchestration
- Experience maintaining or building Kubernetes clusters
- Strong communication skills and ability to give and receive constructive feedback
Responsibilities
- Build and orchestrate a modern OpenTelemetry-based observability platform
- Support metric, log, and trace telemetry types
- Define and enforce observability governance at scale
- Ensure reliability, security, and performance meet SLAs
- Troubleshoot issues and support engineers across the company
- Design and deploy monitoring and observability services
- Ingest, aggregate, transform, and utilize data in real-time pipelines
- Oversee availability, performance, and supportability of observability infrastructure
- Create processes for alert response operations
- Ensure sufficient metrics are collected for new feature releases
- Champion reliability and secure practices
