Title: AI/ML Data Engineer
Location: Remote
Duration: 1 year
We are migrating our Clinical Trial Management System (CTMS) and are seeking a skilled AI/ML Data Engineer to join our team. As an AI/ML Data Engineer, you will be responsible for developing and managing a new OCR (Optical Character Recognition) and document classification model. Our existing model, which processes around 200 studies and 75,000 documents annually, has seen a significant decline in accuracy. You will play a crucial role in rebuilding this model to adapt to new document templates and structures, ensuring the accuracy and efficiency of our document processing workflow.
Key Responsibilities:
- Develop and Implement New OCR and Classification Models:
- Rebuild the OCR and document classification model to achieve high accuracy and reduce manual intervention.
- Use NLP techniques to enhance the model's performance.
- Ensure the model can effectively classify documents as valid or invalid and flag those that require manual review.
- Data Analysis and Model Evaluation:
- Analyze existing data and model performance to identify areas of improvement.
- Conduct rigorous testing and validation of the new model to ensure it meets the required accuracy standards.
- Collaboration and Integration:
- Work closely with the data engineering team to integrate the new model into the CTMS migration.
- Collaborate with cross-functional teams to understand document structures and ensure the model aligns with business needs.
- Continuous Improvement:
- Monitor the model's performance post-deployment and make necessary adjustments to maintain high accuracy.
- Stay updated with the latest advancements in OCR, NLP, and machine learning to continually improve the model.
Qualifications:
- Educational Background:
- Bachelor's or Master's degree in Computer Science, Data Science, Machine Learning, or a related field.
- Experience:
- Proven experience in developing and deploying OCR and NLP models.
- Experience with machine learning frameworks and libraries.
- Prior experience in handling large volumes of documents and ensuring data quality.
- Technical Skills:
- Hands-on development with the Microsoft stack (Azure Data Factory, SSIS, Databricks) as well as Python and SQL.
- Proficiency in Python and relevant libraries.
- Strong understanding of data preprocessing, feature extraction, and model evaluation.
- Experience with database management and integration.
- Soft Skills:
- Ability to work independently and collaboratively in a fast-paced environment.
- Excellent problem-solving skills and attention to detail.
- Strong communication skills to collaborate with team members and stakeholders.
- Highly communicative and inquisitive.
- Driven, with a high level of self-motivation.
- Extreme accountability and ownership.
- Hands-on executors rather than theorists.
- Critical thinking.
- Experience in the healthcare or clinical research industry.
- Familiarity with CTMS and document management systems.
- Knowledge of cloud platforms and DevOps practices.
- Experience with validated systems.
Additional Details:
Recruiters: Here are the most recent notes from the manager on the role and what he is looking for:
- Current OCR and classification process exists in Python, using OCR/NLP
- Accuracy was once as high as 98% but is now below 60%, requiring more manual intervention
- Existing process covers 200 studies and around 75k documents per year
- Need a data scientist to build a new OCR and classification model as part of the CTMS migration
- Classifying documents as valid or not
- Does a document look like the others? If not, manually review.
- Document templates and structures have changed since the original model was built, so it needs to be rebuilt
- Primary responsibility is to rebuild and manage the module
- Data engineering work is being done by other members of the team, but this person will need to work with them