Staff MLOps Engineer – LLMOps

About the Role

You will build and scale the technical infrastructure for AI/ML systems, with a focus on LLMs and agentic workflows. You will create reusable CI/CD workflows for model training, evaluation, and deployment, and automate model versioning, approval workflows, and compliance checks. You will design and operate a modular AI infrastructure stack including vector databases, feature stores, model registries, and observability tooling. You will partner with engineering and data science to embed models and agents into real-time applications, evaluate and integrate state-of-the-art tools, and deploy production LLM and agentic workflows with monitoring for cost, latency, and performance. You will enable researchers to iterate quickly by providing sandboxes, dashboards, and reproducible environments, and ensure the data used for training and inference is accurate, consistent, and reliable.

Requirements

  • Proficiency writing high-quality, maintainable software, primarily in Python
  • Strong background in containerization and orchestration (Docker and Kubernetes)
  • Experience with infrastructure as code and deployment, including Terraform and CI/CD pipelines
  • Experience with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
  • Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
  • Experience building scalable model and agent serving infrastructure with tools such as vLLM, Triton, and BentoML
  • Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance, and capturing traces
  • Demonstrated ownership, pragmatism, and ability to balance infrastructure elegance with iterative delivery

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Build modular and scalable AI infrastructure, including vector databases, feature stores, and model registries
  • Partner with engineering and data science to embed AI models and agents into real-time applications
  • Continuously evaluate and integrate state-of-the-art AI tools
  • Drive AI reliability, governance, and uptime
  • Improve AI and ML model performance and ensure data accuracy, consistency, and reliability
  • Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
  • Provide sandboxes, dashboards, and reproducible environments to enable rapid research iteration

Benefits

  • Eligibility to participate in TRM’s equity plan

Staff MLOps Engineer – LLMOps at TRM Labs | JobStash