Senior SRE Engineer
About the Role
You will maintain and operate AWS infrastructure to ensure 24/7 stability and performance. You will monitor systems, handle incidents, and develop custom scripts to automate tasks. You will analyze and resolve platform issues, optimize architecture and performance, and support high availability, backups, and troubleshooting for applications and databases. You will collaborate with backend, product, and infrastructure teams on system design and deployments, document operations and incidents, support internal IT needs, and participate in the on-call rotation.
Requirements
- 5+ years of Linux system administration experience; 24/7 ops experience is a plus
- Hands-on experience with AWS services, including EC2, Lambda, Aurora, ElastiCache (Redis), CloudWatch, CloudFront, EKS, and IAM
- Proficient in scripting with Bash, Python, and Golang
- Experience with container orchestration using Kubernetes
- Experience with infrastructure-as-code and deployment configuration tools such as Terraform, Helm, and Kustomize
- Familiarity with CI/CD tools, including Jenkins, GitHub Actions, Argo Workflows, and Argo CD
- Experience with Airflow and DAG development
- Knowledge of scalable system design and related tools such as MongoDB, Kafka, load balancers, and message queues
- Solid understanding of information security best practices
- Strong problem-solving, communication, and teamwork skills; able to work independently under pressure
Responsibilities
- Maintain and operate AWS infrastructure to ensure 24/7 stability and performance
- Monitor systems using Zabbix and ELK, handle incidents, and develop custom scripts
- Analyze and resolve platform issues and optimize architecture and performance
- Support high availability, backup, and troubleshooting for applications and databases
- Collaborate with backend, product, and infrastructure teams on system design and deployment
- Document operations and incidents and support internal IT needs
- Participate in the on-call rotation
Skills
Information Security, Zabbix, Aurora, CloudFront, Kustomize, DAG, Argo Workflows, ElastiCache, Argo CD, Incident Response, Bash, Airflow, Jenkins, Linux, MongoDB, Lambda, EKS, CloudWatch, GitHub Actions, Terraform, ELK, Load Balancer, Monitoring, Infrastructure-as-Code, AWS, CI/CD, Backup, IAM, Helm, High Availability, Python, EC2, Kubernetes, Golang, Message Queue, Redis, Kafka, Scripting
