Posted on: 22 January

Big Data Engineer (PySpark/Spark) – REF 114 – 01

Company

Xogito

Xogito is a global B2B digital services provider specializing in data management, real-time bidding, e-commerce, and advertising technology solutions.

Remote Hiring Policy

Xogito supports a fully remote work environment and hires globally; applications are welcome from any region, with no location restrictions.

Job Type

Full-time

Allowed Applicant Locations

Worldwide (including Brazil)

Job Description

Purpose of the Role

We are seeking a skilled PySpark/Spark Developer to join our dynamic team and contribute to the design and implementation of data-driven solutions. You will be responsible for developing and optimizing distributed data processing pipelines, enabling large-scale data analytics, and ensuring the efficient handling of big data. If you are passionate about working with cutting-edge technologies in a fast-paced environment, this role is for you.


Duties and Responsibilities

  • Design, develop, and maintain data pipelines using PySpark and Apache Spark to process and transform large-scale datasets efficiently (a minimal pipeline sketch follows this list).
  • Collaborate with data scientists, analysts, and engineers to understand data requirements and translate them into scalable solutions.
  • Optimize Spark jobs for performance and scalability in distributed environments.
  • Build and deploy big data solutions in cloud environments (e.g., AWS, Azure, GCP) using services such as EMR or Databricks.
  • Implement solutions for real-time data streaming using Spark Streaming or similar frameworks.
  • Develop and maintain data models, ensuring data integrity and consistency.
  • Troubleshoot and debug issues in existing pipelines, ensuring high reliability and availability of systems.
  • Stay updated with the latest trends and advancements in the big data ecosystem.
  • Document technical solutions, data flows, and pipeline architecture to ensure knowledge sharing.

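To give a concrete sense of the day-to-day work behind these responsibilities, below is a minimal, hypothetical PySpark batch pipeline: read raw JSON, derive a daily aggregate, and write partitioned Parquet. The bucket paths, column names, and partition key are illustrative assumptions, not Xogito specifics.

```python
# Minimal sketch of a PySpark batch pipeline (illustrative only).
# Bucket paths, column names, and the partition key are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("example-events-pipeline")
    .getOrCreate()
)

# Read raw JSON events from object storage (e.g., S3 on EMR).
events = spark.read.json("s3://example-bucket/raw/events/")

# Transform: derive an event date and count events per user per day.
daily_counts = (
    events
    .withColumn("event_date", F.to_date(F.col("event_ts")))
    .groupBy("user_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)

# Write partitioned Parquet so downstream jobs can prune by date.
(
    daily_counts.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/curated/daily_counts/")
)
```

Partitioning the output by date is a common design choice in pipelines like this: it keeps file sizes manageable and lets Spark skip irrelevant partitions at read time.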

Required Experience & Knowledge

  • 3+ years of experience working with Apache Spark and PySpark in production environments.
  • Proficiency in Python (for PySpark), with a strong understanding of data structures and algorithms.
  • Solid experience with distributed data processing frameworks and handling large datasets.
  • Familiarity with cloud services like AWS (e.g., S3, EMR, Glue), Azure (e.g., Databricks, Synapse), or GCP (e.g., Dataflow, BigQuery).
  • Experience with the Hadoop ecosystem (e.g., HDFS, Hive, or HBase).
  • Knowledge of real-time data processing frameworks such as Kafka or Spark Streaming (see the streaming sketch after this list).
  • Proficiency in working with structured and unstructured data formats such as JSON, Parquet, and Avro.
  • Understanding of data lake architectures, data partitioning, and schema evolution.
  • Hands-on experience with version control systems (e.g., Git) and CI/CD pipelines.

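For the real-time items above, here is a minimal sketch of Kafka ingestion with Spark Structured Streaming. The broker address, topic name, event schema, and paths are assumptions for illustration; running it also requires the spark-sql-kafka connector package on the classpath.

```python
# Minimal sketch of Kafka ingestion with Spark Structured Streaming.
# Broker, topic, schema, and paths are hypothetical; requires the
# spark-sql-kafka-0-10 connector package to be available.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("example-kafka-stream").getOrCreate()

# Expected shape of each JSON event on the topic (an assumption).
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_ts", LongType()),
])

# Subscribe to a Kafka topic; Kafka delivers the payload as bytes in `value`.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Parse the JSON payload into typed columns.
parsed = raw.select(
    F.from_json(F.col("value").cast("string"), schema).alias("e")
).select("e.*")

# Sink micro-batches to Parquet, with checkpointing for fault tolerance.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/stream/events/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .start()
)
query.awaitTermination()
```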

Skills and Attributes

  • Strong analytical and problem-solving abilities, with attention to detail.
  • Excellent collaboration and communication skills to work in cross-functional teams.
  • Ability to adapt quickly to new technologies and a fast-paced work environment.
  • High level of ownership and accountability for deliverables.


Required Education & Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field (or equivalent practical experience).
  • Advanced level of spoken and written English.
  • Relevant certifications in big data technologies, cloud platforms, or Spark are a plus.


Apply Here