Data Engineer (Python, PySpark, Apache Airflow, NoSQL)
Location: Bengaluru, Karnataka, India (Onsite)
Experience: 3–5 years
Responsibilities:
Build, optimize, and maintain scalable ETL pipelines for data ingestion and processing.
Develop and manage workflows using Apache Airflow for scheduling and orchestrating tasks.
Work with distributed computing technologies (PySpark) to handle large-scale datasets.
Design and implement data architectures that scale with growing business needs.
Implement data lake and data warehousing solutions using both structured and unstructured data.
Collaborate with data scientists and analytics teams to ensure data quality and availability.
Optimize existing data models and pipelines for performance and scalability.
Use NoSQL databases (e.g., MongoDB, Cassandra) for large, scalable data storage solutions.
Ensure high data integrity, security, and quality through monitoring and validation processes.
Write clear documentation and maintain data engineering best practices.
Skills & Qualifications:
Strong proficiency in Python, PySpark, and SQL.
Experience working with Apache Airflow for orchestration.
Hands-on experience with distributed computing and big data tools (PySpark, Hadoop).
Familiarity with cloud platforms (AWS, GCP) and tools such as S3, EMR, and Lambda.
Experience with NoSQL databases (e.g., MongoDB, Cassandra) and relational databases.
Strong understanding of data warehousing concepts, ETL processes, and data lake architecture.
Experience with data pipeline monitoring, logging, and alerting.
Strong knowledge of Docker and containerized environments.
Familiarity with DevOps and CI/CD practices for data engineering.
Excellent problem-solving, communication, and teamwork skills.
About the Company:
CuberaTech, founded in 2020, is a data company revolutionizing Big Data Analytics through a data value-share paradigm in which users entrust their data to us. Our deployment of deep learning techniques enables us to harness this data, making us a source of the richest zero-party data. By stitching together all the relevant pieces of data from zero-, first-, and second-party sources, we enable advertisers to define and create custom audiences that maximize programmatic ROAS.