Freelancer profile translated to English.

Description

Currently a Staff Engineer at Aiven and formerly an SRE at Datadog and Criteo, I specialize in the scalability and resilience of high-throughput data infrastructure. I help CTOs and engineering teams overcome their distributed architecture challenges. I work part-time (evenings, weekends, asynchronously).

🎯 My Services

• Architecture Audit & Diagnosis: Analysis of your architecture, practices, and infrastructure. Delivery of an action plan (Target Architecture, Quick-wins) to ensure reliability and enable scaling.

• Advisory / Fractional Staff: Regular asynchronous support. Review of your RFCs, validation of technological choices, mentoring, and support for complex decisions.

📊 Track Record

• Aiven: Re-architecture of a 2000-Kafka-cluster orchestrator (alerts reduced by 5x across 100+ regions). Creation of an on-demand monitoring and billing pipeline (eBPF/Vector/ClickHouse) processing traffic from 150k+ servers.

• Datadog: Deployment of 200+ Kafka/Cassandra clusters. Creation of the internal Kubernetes framework (CDK8s), replacing Helm, and migration of all stateful infrastructure with zero downtime.

• Criteo: Cross-DC Kafka replication (petabytes/day with zero downtime). Industrialization of 40k servers via Chef for multi-tenant container orchestration systems. Design and implementation of a self-diagnosing system for container crashes.

🛠 Technical Stack

• Kafka, Kubernetes (CDK8s, Helm), Terraform, Datadog, Prometheus, AWS, GCP.

• Go, Python, Java, Bash.

• Reliability, Production Readiness, SLI/SLO, Capacity Planning, Chaos Engineering.

Languages

French
Native or bilingual
English
Fluent

Workplace preferences

Remote only

Primarily works remotely

Aiven
Staff Software Engineer
March 2022 - Today (4 years and 3 months)
Lyon, France
Technical Leadership & Cross-Org Influence
- Acted as technical advisor for the streaming organization, aligning engineering execution with product and business OKRs. Unblocked stalled initiatives, guided teams through complex delivery challenges, and provided architectural direction on company-wide initiatives. Partnered with leadership to translate product strategy into actionable, distributed-systems-ready technical goals.

Scalable Billing & High-Throughput Data Systems
- Drove the design and rollout of a pay-as-you-consume billing platform, reducing revenue leakage. Built a multi-cloud network monitoring pipeline (Vector, Kafka, ClickHouse) classifying traffic of 100k+ servers with 60-second resolution and sustaining millions of daily events.

Site Reliability & Operational Resilience
- Redefined the SRE operating model by introducing a domain-oriented SME structure, improving parallelism, reducing context switching, and strengthening collaboration across product and infrastructure.

Performance Optimization, Resiliency & Cost Efficiency
- Engineered optimizations across internal platforms and customer-facing workloads:
- Re-architected the Kafka-backed scheduling system, cutting alerts by 5× across 100+ regions and minimizing downtime for thousands of clusters.

- Doubled network throughput for customer workloads by exploiting AWS EBS internals with LVM, unlocking performance gains without additional cost.

Codebase Modernization & Velocity
- Re-designed core data placement logic into a modular, testable architecture. Increased coverage 3×, accelerated feature delivery, and enabled faster onboarding for new engineers, improving team velocity and ownership of critical systems.

Core Stack & Practices: Python, Kafka, Prometheus, Grafana, Zookeeper, ClickHouse, Vector, OpenSearch, AWS, GCP, Distributed systems, Observability, On-call, Incident Management, SLI/SLO, Performance tuning.
Team Leadership, Apache Kafka, Python, Software Architecture, Distributed Architecture
Datadog
Site Reliability Engineer
DIGITAL AND IT
January 2020 - March 2022 (2 years and 2 months)
Lyon, France
Built a new Kubernetes framework with CDK8s and created tooling that enabled a smooth migration from Helm with zero downtime; adopted company-wide to manage hundreds of services.

Cut infrastructure deployment time from 3–4 hours to under 10 minutes by rolling out Terraform-based continuous deployment integrated with GitLab CI.

Migrated 40+ applications from a global Chef-managed Kafka cluster to dedicated, Kubernetes-hosted Kafka clusters with no downtime, using custom mirroring strategies tailored to business SLAs.

Deployed and managed 200+ Kafka, Cassandra, and Postgres clusters across four global datacenters, all orchestrated via Kubernetes to improve reliability and consistency.

Standardized deployment and coding practices across dozens of services by introducing shared libraries, reducing configuration drift and simplifying maintenance.

Implemented authentication for all Go and Python services interacting with Kafka and Cassandra, closing critical security gaps across the platform.

Automated 8 recurring team operations using a Temporal-based workflow engine, freeing engineers to focus on higher-value projects.

Established the first embedded SRE model in the Alerting platform, scaling reliability practices across teams and improving on-call outcomes.

Provided expertise in observability, capacity planning, and deployment strategies, shaping reliability culture within the Alerting org and reducing incidents linked to misconfigured rollouts.
Observability, Site Reliability Engineering, Golang, Apache Kafka, Incident Management
Criteo
Site Reliability Engineer
DIGITAL AND IT
March 2018 - January 2020 (1 year and 10 months)
Paris, France
Implemented an in-house solution for cross-datacenter replication of Kafka clusters, handling petabytes of data per day.

Designed Spark-based algorithms to evaluate the storage cost of fields in data schemas, allowing our team to shrink our largest dataset by 15%.

Deployed dozens of Kafka clusters with Chef, along with hundreds of topics and thousands of Protobuf schemas, on a daily basis with no downtime.

Automated and industrialized an orchestration platform built on Mesos and Consul.

Managed configuration for ~40k physical machines across 8 datacenters with Chef.

Reduced time-to-diagnose by developing a service that autonomously diagnoses application and infrastructure failures, based on Prometheus, Elasticsearch, and Grafana.

Worked in depth on resource isolation (mainly CPU and network) to improve fairness and efficiency and to reduce noisy-neighbor issues on the Criteo platform.

Tuned Linux kernel parameters and mechanisms to improve performance for critical latency-sensitive applications.
Docker, Site Reliability Engineering, Automation, Monitoring, Apache Kafka