Site Reliability Engineer

About the Role

You will operate and scale production infrastructure, focusing on reliability, automation, and security. You will run and troubleshoot Kubernetes clusters, build declarative infrastructure with Terraform, and design CI/CD workflows using GitOps patterns (ArgoCD, ApplicationSets, GitHub Actions). You will implement and maintain observability using time-series metrics, logs, and dashboards (Prometheus, Loki, Mimir, Grafana, CloudWatch). You will diagnose networking and storage issues across distributed systems, automate operational workflows with Python, Go, or Bash, participate in an on-call rotation and incident response, and contribute to postmortems and threat models to improve system reliability and security.

Requirements

  • Experience operating production Kubernetes clusters
  • Experience with Terraform or similar infrastructure-as-code tools
  • Experience with GitOps and ArgoCD (ApplicationSets or similar patterns)
  • Experience designing CI/CD pipelines (GitHub Actions, CodeBuild, or similar)
  • Experience with observability tools (Prometheus, Loki, Mimir, Grafana, CloudWatch)
  • Proficiency in Linux and shell scripting
  • Programming experience in Python or Go
  • Experience with cloud platforms (AWS, GCP, or Azure)
  • Experience participating in on-call rotations and incident response
  • Experience designing secure infrastructure and contributing to threat models

Responsibilities

  • Operate production Kubernetes clusters and manage system components
  • Build and maintain declarative infrastructure using Terraform or similar tools
  • Design and manage CI/CD workflows for infrastructure and applications with GitOps tools
  • Implement and maintain observability using metrics, logs, and dashboards
  • Diagnose and troubleshoot networking and storage issues in distributed systems
  • Automate operational workflows using scripting or programming (Python, Go, Bash)
  • Participate in the on-call rotation, respond to incidents, and drive postmortems
  • Apply security-minded design, including least-privilege and threat modeling

Benefits

  • Remote-first global workforce and New York office
  • Annual company offsite and team onsites
  • Professional reimbursement program
  • Medical, dental & vision coverage (US and some other countries)
  • 401k retirement plan with company match (US only)
  • Wellness stipend
  • Home office setup / ergonomic equipment program

Site Reliability Engineer at Offchain Labs | JobStash