Freelancer profile translated to English.

Description

Senior Data Engineer specializing in AWS cloud architectures, with strong expertise in designing scalable, event-driven data platforms. I focus on building serverless event-driven architectures (EventBridge, SQS, Lambda) and on orchestrating complex, distributed processing with Step Functions (SFN).

As a Databricks expert, I design and industrialize high-performance PySpark batch pipelines (ELT/ETL), optimized for processing very large data volumes in Lakehouse environments (Delta Lake). Experienced with the AWS ecosystem (EMR, S3, DynamoDB, Redshift, MWAA), I build robust, automated, multi-environment end-to-end data solutions, with a particular focus on performance, reliability, and scalability.

Languages

Chinese
Native or bilingual
French
Fluent
English
Fluent

Workplace preferences

Can work on-site

Paris (up to 30 km)

ENGIE
Data Engineer
ENERGY AND UTILITIES
August 2024 - Present (1 year and 10 months)
Paris, France
Billing: orchestration of the billing system for the BSH+, BSH, BSMA, and 100SPOT offers

- Design and implementation of a complete AWS infrastructure with Terraform, managing a multi-service architecture (Databricks Workflows, Lambda, API Gateway, EventBridge, DynamoDB, S3, Step Functions, SQS, KMS, CloudWatch…)
- Building large-scale ETL pipelines on Databricks with PySpark for billing data processing
- Implementation of an event-driven architecture (EventBridge + SQS + Lambda) to decouple and orchestrate the billing system components
- Development of a serverless data distribution layer on DynamoDB for high-concurrency access
- Design, development, and deployment of RESTful APIs via API Gateway and Lambda, exposing normalized data to other billing system components
- Setup of a multi-environment CI/CD pipeline (dev/staging/preprod/prod) with GitHub Actions, ensuring reliable and repeatable deployments
Spark, Python, Databricks, AWS, Event-driven architecture

Dalkia
Data Solutions Architect
ENERGY AND UTILITIES
July 2023 - July 2024 (1 year)
Paris, France
- Design of the target architecture for IoT data: Definition of a Lakehouse on AWS for sensor streams (temperature, pressure). Specification of differentiated ingestion modes (initial, incremental, replay) via Spark on EMR, structured storage in the S3 *standardized* layer, deduplication based on Kafka offsets, and hourly partitioning. Writing the technical design document detailing the layers (*raw* → *standardized*), S3 buckets, and IAM roles.
- Data warehouse governance and industrialization: Comparative audit of Redshift Provisioned (for scheduled ETLs) vs. Redshift Serverless (for business self-service). Writing a technical design document detailing the governance strategy: fine-grained access control (users, roles, IAM policies), manual Workload Management (WLM) configuration, and a transactional merge mechanism to ensure historical integrity during incremental updates or replays.
- Project support and technical alignment: Facilitating workshops with Dev, PO, Enterprise Architecture, and Business teams to translate needs into technical specifications. Solution validation via PoCs (PySpark, Airflow) and design of generic Airflow DAGs with locking to prevent concurrent runs.
Cloud, AWS, PySpark, Python, Apache Kafka, Amazon Redshift

Education Zhixing
Big Data Engineer
EDUCATION AND E-LEARNING
February 2022 - May 2023 (1 year and 3 months)
Shanghai, China
- Design and deployment of a data warehouse from scratch: Layered modeling (ODS, DIM, DWD/DWM/DWS) to centralize business data (visits, intentions, registrations, attendance). Management of slowly changing dimensions (SCD Type 2 via "zipper" tables) to ensure historical consistency. Development of 30+ tables and 10+ key metrics (conversion rate, retention, attendance), with daily incremental ingestion (~16 GB/day) automated via Airflow.
- Implementation of a real-time recommendation system: Kafka → Spark Structured Streaming pipeline to analyze student responses in micro-batches. Dynamic calculation of metrics (top questions by subject/level) and generation of personalized recommendations via a Spark MLlib ALS (collaborative filtering) model. Results served from MySQL to the web and BI teams.
- Optimization of the Big Data platform (Cloudera Hadoop): Advanced tuning of Hive (partitioning, vectorization, map joins, skew management) and Spark (repartitioning, memory tuning) to process 300k records/day/table without out-of-memory (OOM) errors. Automation of full/incremental ETLs (Sqoop, PySpark, Shell) on a 10-node cluster (200 TB raw).
Spark, Kafka, Cloudera Hadoop, Airflow, Python