Site Reliability Engineer (SRE)
About the Role
You will establish and improve monitoring, alerting, and observability for services and infrastructure. You will handle critical alerts and incidents, work with R&D to identify root causes, write root cause analyses (RCAs), and define corrective actions. You will also document runbooks, automate procedures using Python, Lambda, shell scripts, ArgoCD, and Ansible, and take part in periodic on-call duties and emergency response.
Requirements
- 3+ years of experience as an SRE or infrastructure backend engineer in a SaaS environment
- Proficiency in Python, JavaScript, and Bash
- 3+ years of experience with alerting and monitoring systems such as Datadog, Coralogix, Splunk, New Relic, or Prometheus
- Experience with Linux systems from kernel to shell
- Experience with cloud platforms such as AWS, Google Cloud, or Azure
- Experience with configuration management and GitOps tools such as Ansible, Chef, Puppet, or ArgoCD
- Experience with Docker, Kubernetes, and Helm
- Experience with source control and code review tools such as Git, Bitbucket, GitLab, Phabricator, or Gerrit
- Strong analytical and troubleshooting skills
- Strong verbal and written communication skills
Responsibilities
- Establish and improve monitoring, alerting, and observability for services and infrastructure
- Handle critical alerts and incidents and coordinate resolution across teams
- Research blockchain workflows to identify optimization opportunities and improve monitoring
- Identify root causes for incidents, write RCAs, and define corrective actions
- Document runbooks and automate procedures using Python, Lambda, shell scripts, ArgoCD, and Ansible
- Perform periodic on-call duties and emergency response
- Communicate and escalate issues to senior management and R&D
