We are seeking a skilled Data Engineer to join our Data Science team. The ideal candidate will be responsible for designing, building, and maintaining scalable data pipelines and infrastructure to support data analytics, machine learning, and Retrieval-Augmented Generation (RAG) Large Language Model (LLM) workflows. This role requires a strong technical background, excellent problem-solving skills, and the ability to work collaboratively with data scientists, analysts, and other stakeholders.
Key Responsibilities:
- Data Pipeline Development:
- Design, develop, and maintain robust and scalable ETL (Extract, Transform, Load) processes.
- Ensure data is collected, processed, and stored efficiently and accurately.
- Data Integration:
- Integrate data from various sources, including databases, APIs, and third-party data providers.
- Ensure data consistency and integrity across different systems.
- RAG LLM Workflows:
- Develop and maintain data pipelines tailored to Retrieval-Augmented Generation (RAG) LLM workflows.
- Ensure efficient data retrieval and augmentation processes to support LLM training and inference.
- Collaborate with data scientists to optimize data pipelines for LLM performance and accuracy.
- Semantic/Ontology Data Layers:
- Develop and maintain semantic and ontology data layers to enhance data integration and retrieval.
- Ensure data is semantically enriched to support advanced analytics and machine learning models.
- Collaboration:
- Work closely with data scientists, analysts, and other stakeholders to understand data requirements and deliver solutions.
- Provide technical support and guidance on data-related issues.
- Data Quality and Governance:
- Implement data quality checks and validation processes to ensure data accuracy and reliability.
- Adhere to data governance policies and best practices.
- Performance Optimization:
- Monitor and optimize the performance of data pipelines and infrastructure.
- Troubleshoot and resolve data-related issues in a timely manner.
- Support for Analysis:
- Support short-term, ad hoc analysis by providing quick and reliable data access.
- Contribute to longer-term goals by developing scalable and maintainable data solutions.
- Documentation:
- Maintain comprehensive documentation of data pipelines, processes, and infrastructure.
- Ensure knowledge transfer and continuity within the team.
Technical Requirements:
- Education and Experience:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 3 years of experience in data engineering or a related role.
- Technical Skills:
- Proficiency in Python (mandatory).
- Experience with other programming languages such as Java or Scala is a plus.
- Experience with SQL and NoSQL databases (e.g., MySQL, PostgreSQL, MongoDB).
- Familiarity with big data technologies (e.g., Hadoop, Spark, Kafka).
- Experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and their data services.
- RAG LLM Skills:
- Experience with data pipelines for LLM workflows, including data retrieval and augmentation.
- Familiarity with natural language processing (NLP) techniques and tools.
- Understanding of LLM architectures and their data requirements.
- Semantic/Ontology Data Layers:
- Familiarity with semantic and ontology data layers and their application in data integration and retrieval.
- Tools and Frameworks:
- Experience with ETL tools and frameworks (e.g., Apache NiFi, Airflow, Talend).
- Familiarity with data visualization tools (e.g., Tableau, Power BI) is a plus.
- Soft Skills:
- Strong analytical and problem-solving skills.
- Excellent communication and collaboration abilities.
- Ability to work in a fast-paced, dynamic environment.
Preferred Qualifications:
- Experience with machine learning and data science workflows.
- Knowledge of data governance and compliance standards.
- Certification in cloud platforms or data engineering.
Location Requirement: Must be local to Cincinnati.