Overview
The PySpark Databricks engineer plays a crucial role in our organization, developing and implementing data processing pipelines with PySpark on the Databricks platform to ensure efficient and scalable data processing and analysis.
Key responsibilities
- Design and develop data processing pipelines using PySpark and Databricks (a minimal pipeline sketch follows this list).
- Optimize and tune the performance of Spark jobs and clusters (see the tuning sketch after this list).
- Implement data transformations and aggregations for analysis and reporting.
- Collaborate with data engineers and data scientists to integrate data from various sources into the pipelines.
- Develop and maintain ETL processes to extract, transform, and load data into the data lake.
- Monitor and troubleshoot data processing jobs to ensure reliability and stability.
- Conduct code reviews and provide technical guidance to junior team members.
- Implement best practices for data pipeline development and documentation.
- Work with cross-functional teams to gather and analyze requirements for data processing.
- Participate in the evaluation and implementation of new technologies related to big data processing.
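To give a concrete sense of the pipeline work described above, here is a minimal sketch assuming a raw JSON event feed landing in cloud storage; the paths, column names (event_ts, event_type), and target table are hypothetical placeholders, not part of any actual codebase:

```python
# Sketch of an extract-transform-aggregate-load pipeline in PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-event-rollup").getOrCreate()

# Extract: read raw JSON events from cloud storage (hypothetical path).
raw = spark.read.json("/mnt/raw/events/")

# Transform: derive a date column and drop malformed rows.
events = (
    raw.withColumn("event_date", F.to_date("event_ts"))
       .filter(F.col("event_date").isNotNull())
)

# Aggregate: daily event counts per type, for analysis and reporting.
daily = events.groupBy("event_date", "event_type").agg(
    F.count("*").alias("event_count")
)

# Load: persist the rollup as a Delta table, Databricks' default format.
daily.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_events")
```

On Databricks a SparkSession is created for you; the builder call above simply keeps the sketch runnable outside the platform as well.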
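Performance tuning in this role typically means adjusting shuffle parallelism, join strategies, and caching. The sketch below shows these levers; the configuration values and table names are illustrative assumptions, not recommendations:

```python
# Common Spark tuning levers; values here are illustrative, not prescriptive.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

# Let adaptive query execution coalesce shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Size baseline shuffle parallelism to the cluster instead of the default 200.
spark.conf.set("spark.sql.shuffle.partitions", "64")

facts = spark.read.table("analytics.daily_events")   # hypothetical tables
dims = spark.read.table("analytics.event_types")

# Broadcast the small dimension table to turn a shuffle join into a map-side join.
joined = facts.join(F.broadcast(dims), "event_type")

# Cache results that several downstream queries reuse, then materialize the cache.
joined.cache()
joined.count()
```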
Required qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Proven experience in developing data processing pipelines using PySpark and Databricks.
- Strong proficiency in the Python programming language.
- Hands-on experience with big data technologies and distributed computing.
- Expertise in optimizing Spark jobs and clusters for performance.
- Knowledge of ETL processes and data warehousing concepts.
- Experience with data integration and data modeling.
- Ability to troubleshoot and debug complex data processing issues.
- Excellent understanding of SQL and NoSQL databases.
- Strong communication and collaboration skills to work effectively in a team environment.
- Certifications in PySpark, Databricks, or related technologies are a plus.
- Familiarity with cloud platforms such as AWS, Azure, or Google Cloud is beneficial.