The Principal Data Engineer is responsible for driving the design, development, and implementation of Ambry's data infrastructure and solutions. This role will play a pivotal part in building and maintaining scalable, reliable, and efficient data pipelines, data warehouses, and data lakes. The Principal Data Engineer will collaborate closely with data architects, scientists, and analysts to ensure that data is accessible, secure, and aligned with business objectives. As a Principal Data Engineer at Ambry, you'll approach tasks with a customer-based, cloud-first mindset to support and enhance various data platform products, including Ambry's data lakes, streams, and warehouses. This role will be primarily responsible for building, monitoring, and operationalizing our data streams, which are hydrated via change data capture (CDC) from a suite of 20 on-prem and cloud databases.
Essential Functions
* Build Kafka connectors to sync updates from source data stores
* Build partitioned Kafka topics to sync updates to destination data marts
* Build multiplexed data analytics workloads using Apache Flink to monitor streaming metrics and perform real-time data transformations
* Build dashboards using Datadog and CloudWatch to monitor system health and support users
* Build opinionated but accommodating schema registries that ensure data governance
* Work closely with your West Coast-based scrum team to submit and review PRs daily, maintain documentation and backlogs, validate builds across multiple environments, and deploy at a 2-4 week sprint cadence
* Design reasonable database schemas with query access patterns in mind from the outset
* Build and maintain CI/CD pipelines using infrastructure as code
* Iteratively migrate on-prem ETL jobs written in PHP to AWS Flink and Glue processes
* Partner with QA Engineers to build automated test suites
* Partner with end users to resolve service disruptions and evangelize our data product offerings
* Vigilantly oversee data quality and alert upstream data producers of any disparities, latency, or defects
* Develop and maintain the overall data platform architecture strategy, roadmap, and implementation plans to support the company's data-driven initiatives and business objectives.
* Design and implement scalable, secure, and high-performance data architectures, including data warehouses, data lakes, and data pipelines, leveraging both on-premises and cloud technologies.
* Establish data governance policies, standards, and best practices for data management, data quality, data security, and data privacy across the organization.
* Lead the development and implementation of real-time data streaming solutions, including event-driven architectures and data ingestion, transformation, and consumption, using technologies like Apache Kafka, Apache Flink, and Amazon Managed Streaming for Apache Kafka (MSK).
* Oversee the creation and maintenance of Business Intelligence (BI) platforms, data visualization tools, and self-service analytics capabilities to enable data-driven decision-making across the organization.
* Lead and manage a team of data engineers, database administrators, and data analysts, fostering their professional growth, promoting best practices, and ensuring adherence to organizational standards and processes.
* Other duties as assigned
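To give candidates a concrete feel for the CDC-driven stream work described above, here is a minimal, self-contained sketch of handling a Debezium-style change event — the kind of record the Kafka connectors in this role would emit. The envelope fields ("before", "after", "op") follow Debezium's documented event format, but the database, table, and column names are hypothetical:

```python
import json

# Hypothetical update event in the Debezium change-event envelope.
SAMPLE_EVENT = json.dumps({
    "payload": {
        "before": {"id": 42, "status": "pending"},
        "after": {"id": 42, "status": "complete"},
        "op": "u",  # c = create, u = update, d = delete
        "source": {"db": "orders", "table": "order_status", "ts_ms": 1700000000000},
    }
})

def changed_fields(raw_event: str) -> dict:
    """Return only the columns whose values changed in a CDC event."""
    payload = json.loads(raw_event)["payload"]
    before = payload.get("before") or {}
    after = payload.get("after") or {}
    if payload["op"] == "c":   # insert: every column is new
        return after
    if payload["op"] == "d":   # delete: nothing to forward downstream
        return {}
    return {k: v for k, v in after.items() if before.get(k) != v}

print(changed_fields(SAMPLE_EVENT))  # only the columns that actually changed
```

In production this filtering would run inside a Flink job consuming from a partitioned Kafka topic rather than over a local JSON string; the sketch only illustrates the shape of the data the role works with.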
Qualifications
* Basic understanding of genomic concepts and terminology
* Experience with PyFlink
* Experience with AWS Kinesis
* Willing to work PST hours, either 8:00 AM-5:00 PM or 9:00 AM-6:00 PM
* Strong familiarity with any combination of our tech stacks, in order of importance: Apache Kafka (MSK flavor preferred), Debezium, Python, Apache Flink or PySpark Streaming, MySQL (RDS flavors preferred), Python CDK or Terraform, Athena, Glue, Lambda, AppFlow, HANA/4, PHP, Redis, Docker, JavaScript
* Experience building data APIs and offering Data as a Service
* Experience integrating with SaaS platforms such as SAP and Salesforce
* Experience with, or willingness to learn, PHP MVC frameworks such as Symfony
* Experience with Atlassian products, i.e., Jira, Confluence, Bamboo
* Experience with system diagramming tools such as Miro, Lucidchart, or Visio
* 6 years' experience working with professional scrum teams and/or equivalent schooling
* 4 years' experience using Git version control
* 3 years' experience designing and indexing relational databases
* 2 years' experience building and operationalizing real-time data
* Bachelor's or master's degree in computer, data, math, or life sciences, or equivalent work experience