Site Reliability Engineer (SRE)

About the Role

You will establish and improve monitoring, alerting, and observability for services and infrastructure. You will handle critical alerts and incidents, work with R&D to identify root causes, write root cause analyses (RCAs), and define corrective actions. You will also document runbooks, automate procedures using Python, Lambda, shell scripts, ArgoCD, and Ansible, and take part in periodic on-call duties and emergency response.

Requirements

  • 3+ years of experience as an SRE or infrastructure backend engineer in a SaaS environment
  • Proficiency in Python, JavaScript, and Bash
  • 3+ years of experience with alerting and monitoring systems such as Datadog, Coralogix, Splunk, New Relic, or Prometheus
  • Experience with Linux systems from kernel to shell
  • Experience with cloud platforms such as AWS, Google Cloud, or Azure
  • Experience with configuration management tools such as Ansible, Chef, Puppet, or ArgoCD
  • Experience with Docker, Kubernetes, and Helm
  • Experience with source control systems such as Git, Bitbucket, GitLab, Phabricator, or Gerrit
  • Strong analytical and troubleshooting skills
  • Strong verbal and written communication skills

Responsibilities

  • Improve and establish monitoring, alerting, and observability for services and infrastructure
  • Handle critical alerts and incidents and coordinate resolution across teams
  • Research blockchain workflows to identify optimization opportunities and improve monitoring
  • Identify root causes for incidents, write RCAs, and define corrective actions
  • Document runbooks and automate procedures using Python, Lambda, shell scripts, ArgoCD, and Ansible
  • Perform periodic on-call duties and emergency response
  • Communicate and escalate issues to senior management and R&D

Site Reliability Engineer (SRE) at Fireblocks | JobStash