Site Reliability Engineer (DevOps)
About the Role
As a Site Reliability Engineer, you will proactively monitor production systems and applications for performance issues, outages, and anomalies, and you will develop and improve the monitoring, alerting, and observability stacks behind them. When incidents occur, you will troubleshoot them, perform root cause analysis, and resolve the underlying problems to minimize downtime. You will respond to service disruptions, handle technical requests from clients, and keep stakeholders informed of progress. Working with DevOps, R&D, and QA, you will implement fixes and permanent solutions, document resolutions and procedures, automate development and release processes, build tools that improve customer self-deployment, and deploy updates and fixes.
Requirements
- 5+ years of hands-on experience in DevOps, Site Reliability Engineering, Production Support, or other infrastructure roles
- Proficiency in Prometheus
- Proficiency in Grafana
- Familiarity with Docker
- Familiarity with Kubernetes
- Experience with AWS
- Experience with GCP
- Experience with bare-metal infrastructure
- Knowledge of PostgreSQL administration
- Experience with EVM clients
- Experience with RPC nodes
- Experience with Terraform
- Experience with Ansible
- Based in a US or LATAM time zone
Responsibilities
- Monitor production systems and applications for performance issues, outages, and anomalies
- Develop and improve monitoring and observability systems and infrastructure
- Set up and improve alerting
- Troubleshoot technical issues and find root causes
- Perform root cause analysis for production errors
- Respond to and resolve service disruptions and incidents
- Handle technical requests from clients and communicate progress
- Collaborate with DevOps, R&D, and QA to implement fixes and permanent solutions
- Log issues and document their resolutions and related procedures
- Recommend and implement process improvements
- Design procedures for system troubleshooting and maintenance
- Automate development and release processes
- Build tools to improve the customer self-deployment experience
- Deploy updates and fixes
Benefits
- Opportunities to attend Ethereum and project-based conferences and events
- Flexibility and autonomy in a decentralized work environment
- 100% remote role in US/LATAM time zones
