Senior Site Reliability Engineer
About the Role
You will design, deploy, and operate mission-critical, multi-region distributed infrastructure with tested recovery strategies and measurable reliability objectives. You will own large-scale Infrastructure as Code using Terraform, build and maintain secure CI/CD pipelines, and operate Kubernetes clusters with GitOps. You will define SRE practices (SLOs, error budgets, observability), run disaster recovery tests, write runbooks, and take end-to-end ownership of incidents and reliability outcomes. You will collaborate with product, security, and engineering teams to translate requirements into secure, compliant platform services and continuously improve customer-facing deployments.
Requirements
- 7+ years of experience in SRE, platform engineering, or infrastructure engineering, operating production distributed systems
- Strong multi-cloud experience with AWS, GCP, or Azure, with subject-matter-expert depth in AWS or GCP
- Proven experience running multi-region production systems, including disaster recovery testing, runbooks, and real incident ownership
- Deep hands-on experience with Kubernetes at scale (EKS, GKE, AKS), including GitOps workflows and production-grade security controls
- Extensive experience with Terraform-first Infrastructure as Code in large real-world environments
- Strong security and compliance mindset, including Zero Trust principles, secrets management (Vault or cloud-native equivalents), and exposure to regulated environments (PCI, SOC 2, HIPAA, NIST)
- Comfortable owning systems end to end with clear metrics and outcomes to show impact
Responsibilities
- Design, build, and operate highly available multi-region distributed systems with clear recovery strategies and tested RTO/RPO
- Define the reliability roadmap, platform architecture, and operational standards with SRE leadership
- Own large-scale Infrastructure as Code using Terraform, including reusable modules and multi-account patterns
- Operate and scale Kubernetes environments (EKS, GKE, AKS) using GitOps practices, ArgoCD, Helm, strong RBAC, and network policies
- Build and maintain secure CI/CD pipelines, including blue/green and canary deployments, promotion and rollback strategies, and artifact integrity (SBOM, signing)
- Define and improve SRE practices, including SLOs, error budgets, observability, and measurable reductions in MTTR and MTTA
- Translate customer and business requirements into reliable, secure platform services in partnership with product and engineering
- Contribute to operational support, runbooks, disaster recovery testing, and continuous improvement of customer-facing deployments
Skills
Zero Trust, PCI, System Upgrades, SBOM, Terragrunt, Hedera, Secret Management, Error Budget, MTTR, Runbooks, Signing, EVM, SOC 2, NIST, Solidity, Hardhat, ArgoCD, RBAC, SLO, EKS, GitOps, Terraform, Security, Infrastructure as Code, Observability, AWS, CI/CD, Azure, GCP, Helm, Compliance, Kubernetes, Smart Contract, Disaster Recovery, GKE, Terraform Modules, AKS, Blue/Green, Canary, MTTA, Secrets Management, HIPAA, Vault
