Senior Site Reliability Engineer

Senior Full Time Devops Jobs by Hashgraph

About the Role

You will design, deploy, and operate mission-critical, multi-region distributed infrastructure with tested recovery strategies and measurable reliability objectives. You will own large-scale Infrastructure as Code using Terraform, build and maintain secure CI/CD pipelines, and operate Kubernetes clusters with GitOps. You will define SRE practices (SLOs, error budgets, observability), run disaster recovery tests, write runbooks, and take end-to-end ownership of incidents and reliability outcomes. You will collaborate with product, security, and engineering teams to translate requirements into secure, compliant platform services and continuously improve customer-facing deployments.

Requirements

7+ years of experience in SRE, platform engineering, or infrastructure engineering, operating production distributed systems
Strong multi-cloud experience with AWS, GCP, or Azure, with SME-level depth in AWS or GCP
Proven experience running multi-region production systems, including disaster recovery testing, runbooks, and real incident ownership
Deep hands-on experience with Kubernetes at scale (EKS, GKE, AKS), including GitOps workflows and production-grade security controls
Extensive experience with Terraform-first Infrastructure as Code in large real-world environments
Strong security and compliance mindset, including Zero Trust principles, secrets management (Vault or cloud-native equivalents), and exposure to regulated environments (PCI, SOC 2, HIPAA, NIST)
Comfortable owning systems end to end with clear metrics and outcomes to show impact

Responsibilities

Design, build, and operate highly available multi-region distributed systems with clear recovery strategies and tested RTO/RPO
Define the reliability roadmap, platform architecture, and operational standards with SRE leadership
Own large-scale Infrastructure as Code using Terraform, including reusable modules and multi-account patterns
Operate and scale Kubernetes environments (EKS, GKE, AKS) using GitOps practices, ArgoCD, Helm, and strong RBAC and network policies
Build and maintain secure CI/CD pipelines, including blue/green and canary deployments, promotion and rollback strategies, and artifact integrity (SBOM signing)
Define and improve SRE practices, including SLOs, error budgets, observability, and measurable reductions in MTTR and MTTA
Translate customer and business requirements into reliable, secure platform services in partnership with product and engineering
Contribute to operational support, runbooks, disaster recovery testing, and continuous improvement of customer-facing deployments
