NAVA Software Solutions is looking for a Machine Learning Operations Engineer.
Details:
Machine Learning Operations (MLOps) Engineer, AWS (with LLM Focus)
Location: Remote work
Duration: 12 months
Responsibilities:
- LLM-Optimized MLOps Infrastructure: Design and implement MLOps infrastructure on AWS tailored for LLMs, leveraging services like SageMaker, EC2 (with GPU instances), S3, ECS/EKS, Lambda, and more.
- LLM Deployment Pipelines: Build and manage CI/CD pipelines specifically for LLM deployment, addressing unique challenges like model size, inference optimization, and versioning.
- LLMOps Practices: Implement LLMOps best practices for monitoring model performance, drift detection, prompt management, and feedback loops for continuous improvement.
- RESTful API Development: Design and develop RESTful APIs to expose LLM capabilities to other applications and services, ensuring scalability, security, and optimal performance.
- Model Optimization: Apply techniques like quantization, distillation, and pruning to optimize LLMs for efficient inference on AWS infrastructure.
- Monitoring and Observability: Establish comprehensive monitoring and alerting mechanisms to track LLM performance, latency, resource utilization, and potential biases.
- Prompt Engineering and Management: Develop strategies for prompt engineering and management to enhance LLM outputs and ensure consistency and safety.
- Collaboration: Work closely with data scientists, researchers, and software engineers to integrate LLMs into production systems effectively.
- Cost Optimization: Continuously optimize LLMOps processes and infrastructure for cost efficiency while maintaining high performance and reliability.
Qualifications:
- Experience: 3 years of experience in MLOps or a related field, with hands-on experience in deploying and managing LLMs.
- AWS Expertise: Strong proficiency in AWS services relevant to MLOps and LLMs, including SageMaker, EC2 (with GPU instances), S3, ECS/EKS, Lambda, and API Gateway.
- LLM Knowledge: Deep understanding of LLM architectures (e.g., Transformers), training techniques, and inference optimization strategies.
- Programming Skills: Proficiency in Python and experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation), REST API frameworks (e.g., Flask, FastAPI), and LLM libraries (e.g., Hugging Face Transformers).
- Monitoring: Familiarity with monitoring and logging tools for LLMs, such as Prometheus, Grafana, and CloudWatch.
- Containerization: Experience with Docker and container orchestration (e.g., Kubernetes, ECS) for LLM deployment.
- Problem Solving: Excellent problem-solving and troubleshooting skills in the context of LLMs and MLOps.
- Communication: Strong communication and collaboration skills to work effectively with cross-functional teams.