Senior MLOps Engineer – LLMOps
About the Role
You will build and scale infrastructure for large language models and agentic systems: creating CI/CD workflows for model training, evaluation, and deployment; automating model versioning and approval processes; and deploying scalable serving and observability stacks. You will monitor cost, latency, and performance, run offline and online evaluations, implement regression tests and human-in-the-loop workflows, and enable researchers with sandboxes, dashboards, and reproducible environments. You will also continuously evaluate and integrate state-of-the-art LLM tooling while enforcing reliability, compliance, and governance.
Requirements
- Ability to write high-quality, maintainable software, primarily in Python
- Strong background in containerization and orchestration such as Docker and Kubernetes
- Experience with infrastructure-as-code tools such as Terraform and with CI/CD deployment pipelines
- Experience with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
- Knowledge of MLOps best practices, including model versioning, rollback strategies, and drift detection
- Experience with scalable model- and agent-serving infrastructure such as vLLM, Triton, and BentoML
- Experience deploying and maintaining LLM and agentic workflows in production, including cost, latency, and performance monitoring
- Demonstrated ownership, pragmatic engineering, and measurable delivery
Responsibilities
- Build reusable CI/CD workflows for model training, evaluation, and deployment
- Automate model versioning, approval workflows, and compliance checks
- Build modular and scalable AI infrastructure, including vector databases, feature stores, and model registries
- Integrate observability tooling and implement monitoring and logging
- Embed AI models and agents into real-time applications and workflows
- Evaluate and integrate state-of-the-art AI tools and libraries
- Drive AI reliability, governance, and uptime
- Improve AI and ML model performance
- Ensure data accuracy, consistency, and reliability for training and inference
- Deploy infrastructure for offline and online evaluation, including regression testing and cost monitoring
- Implement human-in-the-loop workflows
- Provide sandboxes, dashboards, and reproducible environments for researchers
Benefits
- Paid time off (PTO)
- Holidays
- Parental leave
- Equity plan participation
- Remote-first work arrangement
Skills
Multi-Agent Architecture, Release Management, Data Pipeline Automation, Inference Optimization, OpenTelemetry, Vector Database, Feature Store, Experiment Tracking, LangChain, GitHub Actions, Terraform, Regression Testing, Observability, CI/CD, MLOps, Prometheus, Datadog, LLM, Langfuse, Model Versioning, Model Registry, LlamaIndex, vLLM, MLflow, BentoML, Triton, Drift Detection, Cost Monitoring, Human-in-the-Loop
