Who we are
Artmac Soft is a technology consulting and service-oriented IT company dedicated to providing innovative technology solutions and services to customers.
Job Description:
Job Title : Lead Data Engineer
Job Type : W2
Experience : 5-15 Years
Location : Austin, Texas
We are looking for a Lead Data Engineer who will be responsible for implementing and scaling data collection, storage, processing, and filtering for fine-tuning large language models (LLMs) within Conversational Engineering. This role is critical for enabling cutting-edge research, safety systems, and product development.
Responsibilities:
- Experience as a data engineer with a strong background in designing and building large-scale data pipelines.
- Must have knowledge of Python, SQL, big data, PySpark, and data engineering.
- Have extensive experience with cloud platforms such as AWS, Google Cloud, or Azure for data storage, processing, and management.
- Have hands-on experience with ETL orchestration tools such as Apache Airflow, Dagster, or Prefect for managing complex data workflows.
- Possess deep expertise in distributed computing frameworks such as Apache Spark, Hadoop, or Flink, and have hands-on experience optimizing data processing at scale.
- Are proficient in programming languages commonly used in data engineering, such as Python, and have a solid understanding of data structures and algorithms.
- Are well-versed in various data storage technologies, including distributed file systems (e.g., HDFS, S3), databases (e.g., Cassandra, HBase), and data warehouses (e.g., Redshift, BigQuery).
- Possess knowledge of natural language processing (NLP) techniques and have worked with text data preprocessing, normalization, and feature extraction.
- Are passionate about staying up to date with the latest advancements in data engineering and NLP, and are eager to apply innovative techniques to solve challenging problems.
- Design, build, and manage scalable data pipelines for collecting, storing, processing, and filtering large volumes of text data for fine-tuning LLMs.
- Develop and optimize data storage architectures to handle the massive scale of data required for training state-of-the-art language models.
- Implement efficient data preprocessing, cleaning, and feature extraction techniques to ensure high-quality data for model training.
- Collaborate with machine learning engineers and researchers to understand their data requirements and provide tailored solutions for LLM fine-tuning.
- Design and implement robust, fault-tolerant systems for data ingestion, processing, and delivery.
- Optimize data pipelines for performance, scalability, and cost-efficiency, leveraging distributed computing frameworks and cloud platforms.
- Ensure the security, privacy, and compliance of data according to industry best practices and regulatory requirements.
Qualification:
- Bachelor's degree, or an equivalent combination of education and experience