Title: LLMOps Engineer
Description & Requirements
Position Summary
The LLMOps (Large Language Model Operations) Engineer will play a pivotal role in building and maintaining the infrastructure and pipelines for our cutting-edge Generative AI applications, establishing efficient and scalable systems for LLM research, evaluation, training, and fine-tuning. The engineer will be responsible for managing and optimizing large language models (LLMs) across various platforms. This position is uniquely tailored for those who excel in crafting pipelines, cloud infrastructure, environments, and workflows. Your expertise in automating and streamlining the ML lifecycle will be instrumental in ensuring the efficiency, scalability, and reliability of our Generative AI models and associated platform. The LLMOps Engineer's expertise will ensure the smooth deployment, maintenance, and performance of these AI platforms and powerful large language models.
You will follow Site Reliability Engineering & MLOps principles and will be encouraged to contribute your own best practices and ideas to our ways of working.
Reporting to the Head of Cloud Native Operations, you will be an experienced thought leader, comfortable engaging senior managers and technologists. You will engage with clients, display technical leadership, and guide the creation of efficient and complex products/solutions.
Key Responsibilities
Technical & Architectural Leadership
Contribute to the technical delivery of projects, ensuring a high quality of work that adheres to best practices, brings innovative approaches, and meets client expectations. Project types include (but are not limited to) the following:
o Solution architecture, proofs of concept (PoCs), MVPs, and the design, development, and implementation of ML/LLM pipelines for generative AI models, encompassing data ingestion, preprocessing, training, deployment, and monitoring.
o Automate ML tasks across the model lifecycle.
Contribute to thought leadership across the Cloud Native domain, with an expert understanding of advanced AI solutions using Large Language Model (LLM) and Natural Language Processing (NLP) techniques and partner technologies.
Collaborate with cross-functional teams to integrate LLM and NLP technologies into existing systems.
Ensure the highest levels of security and compliance are maintained in all ML and LLM operations.
Stay abreast of the latest developments in ML and LLM technologies and methodologies, integrating these innovations to enhance operational efficiency and model effectiveness.
Collaborate with global peers from partner ecosystems on joint technical projects. This partner ecosystem includes Google, Microsoft, AWS, IBM, Red Hat, Intel, Cisco, and Dell/VMware, among others.
Service Delivery
Provide a hands-on technical contribution. Create scalable infrastructure to support enterprise loads (distributed GPU compute, foundation models, orchestration across multiple cloud vendors, etc.).
Ensure reliable and efficient platform operations.
Apply data science, machine learning, deep learning, and natural language processing methods to analyse, process, and improve the model's data and performance.
Create and optimize prompts and queries for retrieval-augmented generation, and apply prompt engineering techniques, to enhance the model's capabilities and user experience with respect to operations and associated platforms.
Client-facing influence and guidance: engage in consultative client discussions and perform a Trusted Advisor role.
Provide effective support to Sales and Delivery teams.
Support sales pursuits and enable revenue growth.
Define the modernization strategy for the client platform and associated IT practices, create solution architecture, and provide oversight of the client journey.
Innovation & Initiative
Always maintain hands-on technical credibility, stay ahead of the industry, and be prepared to show and lead the way forward for others.
Engage in technical innovation and support our position as an industry leader.
Actively contribute to sponsorship of leading industry bodies such as the CNCF and Linux Foundation.
Contribute to thought leadership by writing whitepapers and blogs and speaking at industry events.
Be a trusted, knowledgeable internal innovator, driving success across our global workforce.
Client Relationships
Advise on best practices related to platform and operations engineering and cloud native operations, run client briefings and workshops, and engage technical leaders in a strategic dialogue.
Develop and maintain strong relationships with client stakeholders.
Perform a Trusted Advisor role.
Contribute to technical projects with a strong focus on technical excellence and on-time delivery.
Mandatory Skills & Experience
Expertise in designing and optimizing machine-learning operations, with a preference for LLMOps.
Proficient in Data Science, Machine Learning, Python, SQL, and Linux/Unix shell scripting.
Experience with Large Language Models and Natural Language Processing (NLP), including researching, training, and fine-tuning LLMs. Contribute to fine-tuning Transformer models for optimal performance in NLP tasks, if required.
Implement and maintain automated testing and deployment processes for machine learning models in an LLMOps context.
Implement version control, CI/CD pipelines, and containerization techniques to streamline ML and LLM workflows.
Develop and maintain robust monitoring and alerting systems for generative AI models, ensuring proactive identification and resolution of issues.
Research or engineering experience in deep learning with one or more of the following: generative models, segmentation, object detection, classification, model optimisations.
Experience implementing RAG frameworks as part of available-ready products.
Experience in setting up infrastructure for the latest technologies, such as Kubernetes, serverless, containers, and microservices.
Experience in scripting/programming to automate deployments and testing, having worked with tools like Terraform and Ansible and with languages and formats such as Python, Bash, and YAML.
Experience with CI/CD open-source and enterprise tool sets such as Argo CD and Jenkins (others like Jenkins X, CircleCI, Tekton, Travis, and Concourse are an advantage).
Experience with the GitHub/DevOps Lifecycle
Experience with observability solutions (Prometheus, EFK stack, ELK stack, Grafana, Dynatrace, AppDynamics).
Experience in at least one cloud platform, for example Azure, AWS, or GCP.
Significant experience with microservices-based, container-based, or similar modern approaches to applications and workloads.
You have exemplary verbal and written communication skills (English). Able to interact and influence at the highest level, you will be a confident presenter and speaker, able to command the respect of your audience.
Desired Skills & Experience
Bachelor-level technical degree or equivalent experience; Computer Science, Data Science, or Engineering background preferred; Master's degree desired.
Experience in LLMOps or related areas such as DevOps, data engineering, or ML infrastructure.
Hands-on experience in deploying and managing machine learning and large language model pipelines on cloud platforms (e.g. AWS, Azure) for ML workloads.
Familiar with data science, machine learning, deep learning, and natural language processing concepts, tools, and libraries such as Python, TensorFlow, PyTorch, and NLTK.
Experience in using retrieval-augmented generation and prompt engineering techniques to improve the model's quality and diversity and to improve operational efficiency. Proven experience in developing and fine-tuning Large Language Models (LLMs).
Stay up-to-date with the latest advancements in Generative AI, conduct research, and explore innovative techniques to improve model quality and efficiency.
The perfect candidate will already be working within a System Integrator, Consulting, or Enterprise organisation, with 8 years of experience in a technical role within the Cloud domain.
Deep understanding of core practices including SRE, Agile, Scrum, XP, and Domain-Driven Design. Familiarity with the CNCF open-source community.
Enjoy working in a fast-paced and dynamic environment, using the latest technologies.
sql,linux/unix shell scripting,nlp techniques,azure,generative ai applications,gcp,cloud platforms (aws/azure/gcp),infrastructure setup for latest technology,scripting and programming,researching, training, and fine-tuning llms,monitoring and alerting systems,aws,deep learning experience,cd,ci,ml/llm pipelines,security and compliance,version control,data science,observability solutions,linux,large language models,ci/cd pipelines,verbal and written communication skills,deep learning,llmops,machine learning,ml lifecycle automation,automated testing and deployment processes for machine learning models,python,containerization techniques,kubernetes,unix,ci/cd opensource and enterprise tool sets,cloud infrastructure