Staff MLOps Engineer – LLMOps

Lead
Salary: 220K - 240K
United States
AI Jobs by TRM Labs

About the Role

You will build and scale the technical infrastructure for AI/ML systems with a focus on LLMs and agentic workflows. You will create reusable CI/CD workflows for model training, evaluation, and deployment and automate model versioning, approval workflows, and compliance checks. You will design and operate a modular AI infrastructure stack including vector databases, feature stores, model registries, and observability tooling. You will partner with engineering and data science to embed models and agents into real-time applications, evaluate and integrate state-of-the-art tools, and deploy production LLM and agentic workflows with monitoring for cost, latency, and performance. You will enable researchers to iterate quickly by providing sandboxes, dashboards, and reproducible environments, and ensure data accuracy, consistency, and reliability for better training and inference.

Requirements

Proficiency writing high-quality, maintainable software, primarily in Python
Strong background in containerization and orchestration (Docker and Kubernetes)
Experience with infrastructure as code and deployment, including Terraform and CI/CD pipelines
Experience with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
Experience building scalable model and agent serving infrastructure with tools such as vLLM, Triton, and BentoML
Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance, and capturing traces
Demonstrated ownership, pragmatism, and ability to balance infrastructure elegance with iterative delivery

Responsibilities

Build reusable CI/CD workflows for model training, evaluation, and deployment
Automate model versioning, approval workflows, and compliance checks
Build modular and scalable AI infrastructure, including vector databases, feature stores, and model registries
Partner with engineering and data science to embed AI models and agents into real-time applications
Continuously evaluate and integrate state-of-the-art AI tools
Drive AI reliability, governance, and uptime
Improve AI and ML model performance and ensure data accuracy, consistency, and reliability
Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
Provide sandboxes, dashboards, and reproducible environments to enable rapid research iteration

Benefits

Eligibility to participate in TRM’s equity plan
