Senior ML Engineer – ML/Inference

About the Role

You will lead the deployment, optimization, and lifecycle management of ML models for production inference and agentic systems. You will design and build scalable inference pipelines, implement model serving infrastructure for low latency and high throughput, and tune models using quantization, pruning, and distillation. You will develop and maintain RAG systems and vector database configurations, create CI/CD workflows for model updates, monitor model performance in production, detect drift, and improve reliability and explainability. You will benchmark and evaluate large language and multimodal models and integrate model APIs and agentic frameworks into customer-facing systems.

Requirements

  • 5+ years of experience in applied ML or ML infrastructure engineering
  • Expertise in model serving and inference optimization (e.g., TensorRT, ONNX, vLLM, Triton, DeepSpeed)
  • Proficiency in Python and experience building APIs and pipelines with FastAPI, PyTorch, and Hugging Face tooling
  • Experience designing and tuning RAG systems and working with vector databases (Milvus, Weaviate, LanceDB, pgvector)
  • Solid foundation in MLOps practices, including versioning (MLflow, DVC), orchestration (Airflow, Kubeflow), and monitoring (Prometheus, Grafana, Sentry)
  • Familiarity with distributed compute systems (Kubernetes, Ray, Slurm) and cloud ML stacks (AWS SageMaker, GCP Vertex AI, Azure ML)
  • Understanding of prompt engineering, agentic frameworks, and LLM evaluation
  • Experience with quantization, pruning, and distillation techniques
  • Strong collaboration, documentation, and cross-functional communication skills
  • Preferred: background in HPC, ML infrastructure, or regulated or energy-aware environments

Responsibilities

  • Own the end-to-end lifecycle of ML models from training artifacts to production inference
  • Design, build, and maintain scalable inference pipelines using orchestration frameworks
  • Implement and optimize model serving infrastructure for latency, throughput, and cost
  • Develop and tune Retrieval-Augmented Generation systems and vector database configurations
  • Integrate model APIs and agentic workflows into customer-facing systems
  • Evaluate, benchmark, and optimize large language and multimodal models
  • Design CI/CD workflows for ML systems, ensuring reproducibility and observability
  • Develop internal tools for dataset management, feature stores, and evaluation pipelines
  • Monitor production model performance, detect drift, and drive reliability improvements
  • Explore and integrate emerging agentic and orchestration frameworks
