Posted: 22 January
Big Data Engineer (PySpark/Spark) – REF 114 – 01
Company
Xogito
Xogito is a global B2B digital services provider specializing in data management, real-time bidding, e-commerce, and advertising technology solutions.
Remote Hiring Policy:
Xogito supports a fully remote work environment and hires globally, welcoming applications from all regions without location restrictions.
Job Type
Full-time
Allowed Applicant Locations
Denmark, Worldwide
Job Description
Purpose of the Role
We are seeking a skilled PySpark/Spark Developer to join our dynamic team and contribute to the design and implementation of data-driven solutions. You will be responsible for developing and optimizing distributed data processing pipelines, enabling large-scale data analytics, and ensuring the efficient handling of big data. If you are passionate about working with cutting-edge technologies in a fast-paced environment, this role is for you.
Duties and Responsibilities
- Design, develop, and maintain data pipelines using PySpark and Apache Spark to process and transform large-scale datasets efficiently (a minimal illustrative sketch follows this list).
- Collaborate with data scientists, analysts, and engineers to understand data requirements and translate them into scalable solutions.
- Optimize Spark jobs for performance and scalability in distributed environments.
- Build and deploy big data solutions in cloud environments (e.g., AWS, Azure, GCP) using services such as EMR, Databricks, or similar.
- Implement solutions for real-time data streaming using Spark Streaming or similar frameworks.
- Develop and maintain data models, ensuring data integrity and consistency.
- Troubleshoot and debug issues in existing pipelines, ensuring high reliability and availability of systems.
- Stay updated with the latest trends and advancements in the big data ecosystem.
- Document technical solutions, data flows, and pipeline architecture to ensure knowledge sharing.
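For illustration only, a minimal sketch of the kind of PySpark batch pipeline described above. The paths, column names, and aggregation are hypothetical placeholders, not an actual Xogito pipeline.

```python
# Illustrative PySpark batch pipeline; paths and column names
# (events.json, user_id, amount, event_ts) are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-pipeline").getOrCreate()

# Read raw JSON events from a (hypothetical) landing location.
events = spark.read.json("s3://example-bucket/raw/events.json")

# A simple transformation: aggregate spend per user per day.
daily_spend = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("user_id", "event_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# Write the result as date-partitioned Parquet for downstream analytics.
(
    daily_spend
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/curated/daily_spend")
)

spark.stop()
```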
Required Experience & Knowledge
- 3+ years of experience working with Apache Spark and PySpark in production environments.
- Proficiency in Python (for PySpark), with a strong understanding of data structures and algorithms.
- Solid experience with distributed data processing frameworks and handling large datasets.
- Familiarity with cloud services like AWS (e.g., S3, EMR, Glue), Azure (e.g., Databricks, Synapse), or GCP (e.g., Dataflow, BigQuery).
- Experience with the Hadoop ecosystem (e.g., HDFS, Hive, or HBase).
- Knowledge of real-time data streaming technologies such as Kafka and stream-processing frameworks like Spark Streaming.
- Proficiency in working with structured and unstructured data formats such as JSON, Parquet, and Avro.
- Understanding of data lake architectures, data partitioning, and schema evolution (illustrated in the brief sketch after this list).
- Hands-on experience with version control systems (e.g., Git) and CI/CD pipelines.
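As a brief illustration of the data lake points above: reading date-partitioned Parquet with schema merging and relying on partition pruning. The bucket path, partition column, and filter value are assumptions made for the sketch.

```python
# Illustrative sketch only: partitioned Parquet with basic schema evolution.
# The path, partition column (event_date), and filter date are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-read-example").getOrCreate()

# mergeSchema reconciles columns that were added to the dataset over time.
sales = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("s3://example-bucket/curated/sales")
)

# Filtering on the partition column lets Spark prune partitions instead of
# scanning the whole dataset.
recent = sales.where("event_date >= '2024-01-01'")
recent.show(5, truncate=False)

spark.stop()
```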
Skills and Attributes
- Strong analytical and problem-solving abilities, with attention to detail.
- Excellent collaboration and communication skills to work in cross-functional teams.
- Ability to adapt quickly to new technologies and a fast-paced work environment.
- High level of ownership and accountability for deliverables.
Required Education & Qualifications
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field (or equivalent practical experience).
- Advanced level of spoken and written English.
- Relevant certifications in big data technologies, cloud platforms, or Spark are a plus.