Site Reliability Engineer

Offchain Labs | United States (Remote) | Full Time | DevOps

About the Role

You will operate and scale production infrastructure, focusing on reliability, automation, and security. You will run and troubleshoot Kubernetes clusters, build declarative infrastructure with Terraform, and design CI/CD workflows using GitOps patterns (ArgoCD, ApplicationSets, GitHub Actions). You will implement and maintain observability using time-series metrics, logs, and dashboards (Prometheus, Loki, Mimir, Grafana, CloudWatch). You will diagnose networking and storage issues across distributed systems, automate operational workflows with Python, Go, or Bash, participate in an on-call rotation and incident response, and contribute to postmortems and threat models to improve system reliability and security.

Requirements

Experience operating production Kubernetes clusters
Experience with Terraform or similar infrastructure-as-code tools
Experience with GitOps and ArgoCD (ApplicationSets or similar patterns)
Experience designing CI/CD pipelines (GitHub Actions, CodeBuild, or similar)
Experience with observability tools (Prometheus, Loki, Mimir, Grafana, CloudWatch)
Proficiency in Linux and shell scripting
Programming experience in Python or Go
Experience with cloud platforms (AWS, GCP, or Azure)
Participation in on-call rotations and incident response
Experience designing secure infrastructure and contributing to threat models

Responsibilities

Operate production Kubernetes clusters and manage system components
Build and maintain declarative infrastructure using Terraform or similar tools
Design and manage CI/CD workflows for infrastructure and applications with GitOps tools
Implement and maintain observability using metrics, logs, and dashboards
Diagnose and troubleshoot networking and storage issues in distributed systems
Automate operational workflows using scripting or programming (Python, Go, Bash)
Participate in on-call rotation, respond to incidents, and drive postmortems
Apply security-minded design, including least-privilege and threat modeling

Benefits

Remote-first global workforce and New York office
Annual company offsite and team onsites
Professional reimbursement program
Medical, dental & vision coverage (US and some other countries)
401k retirement plan with company match (US only)
Wellness stipend
Home office setup / ergonomic equipment program

Skills

Loki, Incident Response, Go, Bash, ArgoCD, Linux, CloudWatch, Automation, GitOps, Terraform, Storage, Security, Networking, Observability, CI/CD, Grafana, Prometheus, Python, Kubernetes
