Site Reliability Engineer (SRE)

Senior DevOps role in the United States, posted by Fireblocks

About the Role

You will improve and establish monitoring, alerting, and observability for services and infrastructure. You will handle critical alerts and incidents, work with R&D to identify root causes, write RCAs, and define corrective actions. You will document runbooks and automate procedures using Python, Lambda, shell scripts, ArgoCD, and Ansible, and perform periodic on-call duties and emergency response.

Requirements

3+ years of experience as an SRE or infrastructure backend engineer in a SaaS environment
Proficiency in Python, JavaScript, and Bash
3+ years of experience with alerting and monitoring systems such as Datadog, Coralogix, Splunk, New Relic, or Prometheus
Experience with Linux systems from kernel to shell
Experience with cloud platforms such as AWS, Google Cloud, or Azure
Experience with configuration management tools such as Ansible, Chef, Puppet, or ArgoCD
Experience with Docker, Kubernetes, and Helm
Experience with source control systems such as Git, Bitbucket, GitLab, Phabricator, or Gerrit
Strong analytical and troubleshooting skills
Strong verbal and written communication skills

Responsibilities

Improve and establish monitoring, alerting, and observability for services and infrastructure
Handle critical alerts and incidents and coordinate resolution across teams
Research blockchain workflows to identify optimization opportunities and improve monitoring
Identify root causes for incidents, write RCAs, and define corrective actions
Document runbooks and automate procedures using Python, Lambda, shell scripts, ArgoCD, and Ansible
Perform periodic on-call duties and emergency response
Communicate and escalate issues to senior management and R&D

Skills

Node, Runbook, Coralogix, Nginx, New Relic, Ansible, Incident Response, Google Cloud, Bash, ArgoCD, Linux, Alerting, Automation, JavaScript, Monitoring, Puppet, Observability, AWS, Azure, Prometheus, Datadog, Git, Chef, Helm
