Responsibilities:
- Design and implement scalable, optimized data pipelines for (pre)processing and ETL for machine learning models;
- Develop and maintain conceptual and logical data models using the client's data modeling guidelines;
- Document and maintain the business glossary in the enterprise data catalog solution;
- Evaluate business data models and physical data models for variances and discrepancies;
- Support the project team in adopting business data models;
- Guide the project team in mapping physical data models to the business glossary.
Knowledge/Experience:
For senior experience:
- Hands-on experience with technologies and frameworks used in ML, such as scikit-learn, MLflow, and TensorFlow;
- Experience building complex data pipelines, e.g. ETL;
- Experience working with cloud data platforms (e.g. GCP);
- Understanding of code management repositories such as Git/SVN;
- Familiarity with software engineering practices such as versioning, testing, documentation, and code review;
- Experience with Apache Airflow (see the DAG sketch after this list);
- Experience setting up and troubleshooting both SQL and NoSQL databases;
- Experience with monitoring and observability (ELK stack);
- Deployment and provisioning with automation tools, e.g. Docker, Kubernetes, OpenShift, CI/CD;
- Knowledge of MLOps architecture and practices;
- Relevant work experience in ML projects;
- Knowledge of data manipulation and transformation, e.g. SQL.
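
As a purely illustrative sketch of the pipeline and orchestration skills listed above, the following minimal Airflow DAG schedules a hypothetical scikit-learn preprocessing step. The DAG name, schedule, and file paths are placeholders, and it assumes Airflow 2.4+ with pandas and scikit-learn installed.

```python
# Illustrative only: a minimal Airflow DAG orchestrating one ML preprocessing task.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from sklearn.preprocessing import StandardScaler


def preprocess(input_path: str, output_path: str) -> None:
    """Scale the numeric feature columns and write the result for model training."""
    df = pd.read_csv(input_path)
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    df.to_csv(output_path, index=False)


with DAG(
    dag_id="ml_preprocessing",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # Airflow >= 2.4 scheduling syntax
    catchup=False,
) as dag:
    PythonOperator(
        task_id="preprocess_features",
        python_callable=preprocess,
        op_kwargs={
            "input_path": "/data/raw/features.csv",         # placeholder path
            "output_path": "/data/processed/features.csv",  # placeholder path
        },
    )
```

In a real project this DAG would typically sit alongside further tasks (training, evaluation, model registration via MLflow) and be version-controlled and code-reviewed like any other software artifact.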
For mid-level experience:
- Design and Develop Data Pipelines: Create efficient and scalable data pipelines using GCP services such as Dataflow (Apache Beam), Dataproc (Apache Spark), and Pub/Sub (see the pipeline sketch after this list);
- Data Storage Solutions: Implement and manage data storage solutions using GCP services such as BigQuery, Cloud Storage, and Cloud SQL;
- Data Analysis and Reporting: Optimize SQL queries for data analysis and reporting in BigQuery.
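
As an equally illustrative sketch for the GCP items above, the following minimal Apache Beam pipeline (of the kind typically run on Dataflow) reads JSON messages from Pub/Sub and appends them to a BigQuery table. The project, region, topic, and table names are placeholders, and the target table is assumed to already exist with a schema matching the messages.

```python
# Illustrative only: a minimal streaming Beam pipeline from Pub/Sub to BigQuery.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    project="my-gcp-project",   # placeholder project
    region="europe-west1",      # placeholder region
    runner="DataflowRunner",    # use "DirectRunner" for local testing
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-gcp-project/topics/events"   # placeholder topic
        )
        | "ParseJson" >> beam.Map(json.loads)               # one dict per message
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-gcp-project:analytics.events",        # placeholder table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```

For the analysis and reporting side, the resulting BigQuery table would then be queried with SQL tuned for cost and performance (partition filters, selecting only needed columns, avoiding SELECT *).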