Job for Experts

Senior Site Reliability Engineer

About the Role

You will design, deploy, and operate mission-critical, multi-region distributed infrastructure with tested recovery strategies and measurable reliability objectives. You will own large-scale Infrastructure as Code using Terraform, build and maintain secure CI/CD pipelines, and operate Kubernetes clusters with GitOps. You will define SRE practices (SLOs, error budgets, observability), run disaster recovery tests, write runbooks, and take end-to-end ownership of incidents and reliability outcomes. You will collaborate with product, security, and engineering teams to translate requirements into secure, compliant platform services and continuously improve customer-facing deployments.

Requirements

  • 7+ years of experience in SRE, platform engineering, or infrastructure engineering, operating production distributed systems
  • Strong multi-cloud experience with AWS, GCP, or Azure, with SME-level depth in AWS or GCP
  • Proven experience running multi-region production systems, including disaster recovery testing, runbooks, and real incident ownership
  • Deep hands-on experience with Kubernetes at scale (EKS, GKE, AKS), including GitOps workflows and production-grade security controls
  • Extensive experience with Terraform-first Infrastructure as Code in large real-world environments
  • Strong security and compliance mindset, including Zero Trust principles, secrets management (Vault or cloud-native equivalents), and exposure to regulated environments (PCI, SOC 2, HIPAA, NIST)
  • Comfortable owning systems end to end with clear metrics and outcomes to show impact

Responsibilities

  • Design, build, and operate highly available multi-region distributed systems with clear recovery strategies and tested RTO/RPO
  • Define the reliability roadmap, platform architecture, and operational standards with SRE leadership
  • Own large-scale Infrastructure as Code using Terraform, including reusable modules and multi-account patterns
  • Operate and scale Kubernetes environments (EKS, GKE, AKS) using GitOps practices, ArgoCD, Helm, and strong RBAC and network policies
  • Build and maintain secure CI/CD pipelines, including blue/green and canary deployments, promotion and rollback strategies, and artifact integrity (SBOM, signing)
  • Define and improve SRE practices, including SLOs, error budgets, observability, and measurable reductions in MTTR and MTTA
  • Translate customer and business requirements into reliable, secure platform services in partnership with product and engineering
  • Contribute to operational support, runbooks, disaster recovery testing, and continuous improvement of customer-facing deployments

Skills